Consider the following task: We have a file with a list of urls (urls.txt).
1. First, we want to determine the HTTP response status code of each URL (for example, 200 for success, 301 for a permanent redirect, 404 for not found).
2. Then, for the requests that were successful, we need to take the status code and the title and save them in a CSV file.
I will use the following libraries:
- csv, the standard-library module for reading and writing CSV files
- Requests, an elegant and simple HTTP library for Python, built for human beings
- re, the standard-library module for regular expression operations
How to get the HTTP response status codes with Python and requests
    import requests

    url = "..."
    resp = requests.get(url, allow_redirects=False)
    code = resp.status_code
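By default, requests.get() follows redirects and reports the status code of the final response. Set allow_redirects=False if you want the status code of the URL itself (for example 301 or 302) instead of letting requests handle the redirection.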
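To make the numeric codes easier to read while testing, the standard library's http.HTTPStatus enum can translate a code into its reason phrase. The sketch below is only an illustration: the helper name describe_status is hypothetical and not part of requests. With allow_redirects=False, a redirecting URL shows up here with its own 3xx code.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a readable label such as '301 Moved Permanently (redirect)'."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes that are not registered in the HTTPStatus enum
        phrase = "Unknown"
    label = f"{code} {phrase}"
    if 300 <= code < 400:
        label += " (redirect)"
    return label


for code in (200, 301, 404):
    print(describe_status(code))
```

In the script, the value passed in would be resp.status_code.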
How to get the page title with requests?
I will use the re.findall() method, which will return all non-overlapping matches of pattern in string, as a list of strings or tuples. The string is scanned left-to-right, and matches are returned in the order found. Also, the empty matches are included in the result.
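    import re

    import requests

    url = "..."
    resp = requests.get(url, allow_redirects=False)
    html = resp.text
    # findall returns a list of the captured groups; keep the first match, if any
    titles = re.findall('<title>(.*)</title>', html)
    title = titles[0] if titles else ''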
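The pattern '<title>(.*)</title>' only matches a lowercase tag without attributes whose content sits on a single line. A slightly more tolerant variant could look like the sketch below. The names TITLE_RE and extract_title are hypothetical, and a regular expression is still only an approximation of real HTML parsing.

```python
import re

# Non-greedy match; DOTALL lets '.' span newlines, IGNORECASE accepts <TITLE>
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html):
    """Return the first page title with whitespace collapsed, or '' if none."""
    match = TITLE_RE.search(html)
    if match is None:
        return ""
    # split() + join() collapses newlines and repeated spaces into single spaces
    return " ".join(match.group(1).split())


sample = "<html><head><TITLE>\n  Hello, world\n</TITLE></head></html>"
print(extract_title(sample))  # Hello, world
```

Returning an empty string for pages without a title keeps the CSV row at a fixed number of columns.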
How to write in the CSV file?
I will use csv.writer(csvfile, dialect='excel', **fmtparams), which returns a writer object responsible for converting the data into delimited strings on the given file-like object. csvfile can be any object with a write() method. If csvfile is a file object, it should be opened with newline=''.
Be careful if the title contains commas: written unquoted, they would break the columns of the CSV file.
Such fields have to be quoted, which csv.writer does automatically with its default quoting=csv.QUOTE_MINIMAL setting.
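A small experiment shows what csv.writer does with a title that contains commas and quotes. This sketch writes to an in-memory io.StringIO buffer instead of a real file, and the URL and title are made-up sample values.

```python
import csv
import io

row = ["https://example.com", 200, 'Hello, "big" world']

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")  # default delimiter is ','
writer.writerow(row)
line = buffer.getvalue()
print(line, end="")
# https://example.com,200,"Hello, ""big"" world"

# Reading the line back restores the title as a single column
parsed = next(csv.reader(io.StringIO(line)))
print(parsed)
# ['https://example.com', '200', 'Hello, "big" world']
```

The writer wraps the title in quotes and doubles the embedded quote characters, so the reader gets three columns back. Note that the status code comes back as a string.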
How to escape commas in a CSV file?
See my solution below:
    import csv
    import re

    import requests

    # newline='' lets the csv module control the line endings itself
    with open('urls.csv', 'w', newline='') as df, open('urls.txt') as fhand:
        writer = csv.writer(df, delimiter=';', lineterminator='\n')
        for line in fhand:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            resp = requests.get(line, allow_redirects=False)
            code = resp.status_code
            print(line, " - ", code)
            if code == 200:
                # Keep the first <title> match, or an empty string if there is none
                titles = re.findall('<title>(.*)</title>', resp.text)
                title = titles[0] if titles else ''
                # No manual quoting here: csv.writer quotes any field that contains
                # the delimiter, and would escape quotes that were added by hand
                writer.writerow([line, code, title])
If you have another solution I would appreciate you posting it in the comments, thanks!