How to extract titles from web pages using Python and save them in a CSV file

Consider the following task: we have a file with a list of URLs (urls.txt).
1. First, we want to determine the HTTP response status code for each URL.

2. Then, for the requests that were successful, we need to take the status code and the title and save them in a CSV file.

I will use the following libraries:

  • csv, the standard library module for reading and writing CSV files
  • Requests, an elegant and simple HTTP library for Python
  • re, the standard library module for regular expression operations

How to get the HTTP response status codes with Python and requests

import requests

url = "..."
resp = requests.get(url, allow_redirects=False)
code = resp.status_code

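By default, requests follows redirects and reports the status code of the final response. Set allow_redirects=False if you want the status code of the URL exactly as requested (for example 301 or 302) instead.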
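
To make the status codes easier to read in a report, they can be grouped by class. Here is a minimal sketch that uses only the standard library; describe_status is a hypothetical helper name of mine, not part of requests. With allow_redirects=False, a redirecting URL would show up in the "redirect" group rather than with the status of the final page.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a label such as '301 Moved Permanently (redirect)'."""
    try:
        # HTTPStatus knows the standard reason phrases
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Non-standard codes are not in the enum
        phrase = "Unknown"

    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "other"
    return f"{code} {phrase} ({kind})"


print(describe_status(200))
print(describe_status(301))
```

You could call describe_status(resp.status_code) right after the request and print or store the result.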

How to get the page title with requests?

I will use the re.findall() method, which will return all non-overlapping matches of pattern in string, as a list of strings or tuples. The string is scanned left-to-right, and matches are returned in the order found. Also, the empty matches are included in the result.

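import re
import requests

url = "..."
resp = requests.get(url, allow_redirects=False)
html = resp.text
# findall returns a list of captured groups; it is empty if no <title> is found
titles = re.findall('<title>(.*)</title>', html)
title = titles[0] if titles else ''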
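
The simple pattern above misses titles that span several lines, use uppercase tags, or carry attributes. Here is a slightly more tolerant sketch, still based on regular expressions; extract_title and TITLE_RE are hypothetical names, and a real HTML parser would likely be more robust on messy pages.

```python
import html
import re

# Case-insensitive, allows attributes on <title>, and lets "." match newlines
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(page_source):
    """Return the text of the first <title> element, or '' if there is none."""
    match = TITLE_RE.search(page_source)
    if match is None:
        return ""
    # Collapse runs of whitespace, then decode entities such as &amp;
    return html.unescape(" ".join(match.group(1).split()))


sample = "<html><head><TITLE>\n Hello,  world &amp; more\n</TITLE></head></html>"
print(extract_title(sample))
```

Returning an empty string when nothing matches avoids the IndexError that title[0] raises on a page without a title.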

How to write in the CSV file?

I will use the csv.writer(csvfile, dialect='excel', **fmtparams) function, which returns a writer object responsible for converting the data into delimited strings on the given file-like object. csvfile can be any object with a write() method. If csvfile is a file object, it should be opened with newline=''.

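You have to be careful if the title contains the delimiter (a comma by default), because an unquoted delimiter would break the CSV columns.

Fortunately, csv.writer handles this for you: with the default quoting (csv.QUOTE_MINIMAL), any field that contains the delimiter, a quote character, or a line break is wrapped in double quotes automatically. Adding quotes by hand is not needed; the writer would escape them, so they would end up as literal characters in the title.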
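
To check how csv.writer treats delimiters inside a field, here is a small sketch that writes to an in-memory buffer instead of a file. rows_to_csv is a hypothetical helper name and the URLs are placeholders.

```python
import csv
import io


def rows_to_csv(rows, delimiter=","):
    """Serialize rows to CSV text using csv.writer's default quoting."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


text = rows_to_csv([
    ["https://example.com", 200, "Hello, world"],
    ["https://example.org", 200, 'Say "hi"'],
])
print(text)

# Reading the text back recovers the original field values (as strings)
parsed = list(csv.reader(io.StringIO(text)))
```

The title with a comma comes out wrapped in double quotes and embedded quotes are doubled, so csv.reader returns the original values.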

How to escape commas in a CSV file?

See my solution below:

import csv
import re

import requests

# newline='' lets the csv module control line endings itself
with open('urls.csv', 'w', newline='', encoding='utf-8') as df, open('urls.txt') as fhand:

    writer = csv.writer(df, delimiter=';', lineterminator='\n')

    for line in fhand:
        line = line.rstrip()
        if not line:
            continue  # skip blank lines in urls.txt
        resp = requests.get(line, allow_redirects=False)
        code = resp.status_code
        if code == 200:
            html = resp.text
            try:
                title = re.findall('<title>(.*)</title>', html)
                title = str(title[0])
            except IndexError:
                # No <title> element was found on the page
                title = ''
            print(line, " - ", code)
            # csv.writer quotes the title automatically if it contains the delimiter
            writer.writerow([line, code, title])

If you have another solution I would appreciate you posting it in the comments, thanks!
