Let's go web scraping with Python and BeautifulSoup! Here, BeautifulSoup will extract structured data from HTML and its CSS code. For confidentiality reasons, we will not name the parsed website, but many are built the same way: pages containing records from a database use their IDs in the URL. So you will have to adapt the code below to your own website and purposes.
We will use a fairly complex website as an example, one where the IDs to discover are listed in a large pagination.
First, we will get the links to the articles (with the IDs in the URLs) from all pages of the pagination, then we will store the data in a CSV file if it is of interest (we are looking for email addresses). So we will write a loop inside a loop, with a condition, as outlined in the sketch below. If the website you want to parse is simpler (without pagination), you will not need the first loop.
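Schematically, the structure we are going to build looks like this. This is only an outline with dummy placeholder values; the real code follows step by step in this article:

# Outline of the approach (dummy values; each step is detailed below)
for page_number in range(1, 3):      # loop 1: every page of the pagination
    article_ids = ['111', '222']     # in the real script: IDs parsed from the search page
    for article_id in article_ids:   # loop 2: every article found on that page
        email_found = None           # in the real script: a RegEx run on the article HTML
        if email_found:              # condition: only keep articles containing an email
            print(email_found)       # in the real script: write the email to the CSV file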
Libraries and receptacle
We start by importing some necessary libraries and creating the receptacle file for our data.
from bs4 import BeautifulSoup
import requests
import re
import csv

# CSV file to receive data
out_file = open('C:/python_projects/webscraping/email.csv', 'w', encoding='cp1252')
out_file.write("email" + "\n")
requests will serve to get the HTML content, and BeautifulSoup will organize that content. re stands for regular expressions; we will use it later to run a RegEx that recognizes email addresses in the HTML content. And csv is imported to handle the CSV output that will receive our data.
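As a quick preview of the RegEx step, here is a minimal, self-contained example of the kind of pattern we will use at the end. The sample string and the address in it are made up:

import re

# Hypothetical HTML fragment containing an email address
sample = '<span class="contact">Write to jane.doe@example.org for details</span>'
emails = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", sample)
print(emails)  # ['jane.doe@example.org']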
Pagination
Now let's take a look at the shape of the site itself. The emails we want are in articles reachable through a pagination. The URL looks like this:
https://www.website.com/?term=[a query...]&sort=&page=[a pagination number]
The first results come from a search page containing links to the articles. Examining this search page, we see that the pagination goes up to 500 pages.
So we have to iterate through the pagination from the first to the last page. Let's build our URLs by concatenating the beginning of the search URL with the pagination number, all inside a loop.
# Prepare URL
urlpage_prefix = 'https://www.website.com/?term=[a query...]&sort=&page='
urlpage_suffix = 0

# Get the page id (pages 1 to 500)
while urlpage_suffix < 500:
    urlpage_suffix += 1
    # URL build
    page = requests.get(urlpage_prefix + str(urlpage_suffix))
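A note on this design choice: the same iteration can be written with for and range, which avoids managing the counter by hand. Here is a sketch; the URL is still a placeholder, and the status check is an addition of this sketch, not part of the original script:

# Equivalent loop with for/range (placeholder URL, adapt to your site)
urlpage_prefix = 'https://www.website.com/?term=[a query...]&sort=&page='

for page_number in range(1, 501):  # pages 1 to 500
    page = requests.get(urlpage_prefix + str(page_number))
    if page.status_code != 200:    # skip pages that failed to load
        continue
    # ... parse the page here, as in the next step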
First parsing
Then we have to get the article IDs from all the blocks of each results page. In this first loop, we catch the blocks through one of their attributes (a CSS class).
# Parsing
soup1 = BeautifulSoup(page.content, 'html.parser')
a_CSS_class = soup1.find_all(attrs={'class': 'a_CSS_class'})
In a new loop, we find the ID of each article and use it to build a new URL pointing to the article itself. Indeed, in our example, the final URL of an article looks like this:
https://www.website.com/[an ID]
for x in a_CSS_class:
    # New URL build
    ArticleId = x.get('data-article-id')
    urlArticle_prefix = 'https://www.website.com/'
    Article = requests.get(urlArticle_prefix + str(ArticleId))
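To see the attribute lookup in isolation, here is a small self-contained example on a hypothetical result block. The class name and the data-article-id attribute are the same placeholders as above; the real markup of your website will differ:

from bs4 import BeautifulSoup

# Hypothetical HTML for one result block of the search page
html = '<div class="a_CSS_class" data-article-id="12345">An article title</div>'
soup = BeautifulSoup(html, 'html.parser')

for block in soup.find_all(attrs={'class': 'a_CSS_class'}):
    print(block.get('data-article-id'))  # prints: 12345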
Second parsing
Now we parse the page at this new URL, catching another CSS class, the one that contains the email.
# New parsing
soup2 = BeautifulSoup(Article.content, 'html.parser')
another_CSS_class = soup2.find_all(attrs={'class': 'another_CSS_class'})
RegEx, cleaning and insertion
All that's left is to check whether our CSS class contains an email (with a RegEx run by re), clean the line containing the email (a few replace calls) and insert it into our CSV file.
emails = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", str(another_CSS_class))
if emails:
    line = '{}\n'.format(emails)
    newLine = line.replace("'", "").replace("[", "").replace("]", "")
    out_file.write(newLine)

# Close the file once all pages have been processed
out_file.close()
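Since csv is already imported, an arguably cleaner alternative is to let csv.writer write one email per row instead of cleaning the list with replace. This is just a sketch, with a made-up file name and a sample string standing in for str(another_CSS_class):

import csv
import re

# Sketch of an alternative output step: one email per row via csv.writer
sample = '<p>contact: john@example.com, mary@example.net</p>'  # stand-in for str(another_CSS_class)
with open('emails_alt.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['email'])
    for email in re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", sample):
        writer.writerow([email])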