Let's go web scraping with Python and BeautifulSoup! Here, BeautifulSoup will extract structured data from HTML and its CSS code. For confidentiality reasons, we will not name the parsed website, but many are built the same way: pages containing records from a database use their IDs in the URL. So you will have to adapt the code below to your own website and purposes.
We will use a fairly complex website as an example, one where the IDs to discover are listed in a large pagination.
First, we will get the links to the articles (with the IDs in the URLs) from all pages of the pagination, then we will store the data in a CSV file if it is of interest (we are looking for email addresses). So we will write a loop inside a loop, with a condition, as outlined in the sketch below. If the website you want to parse is simpler (without pagination), you will not need the first loop.
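Schematically, the structure we are going to build looks like this. This is only an outline with dummy placeholder values; the real code follows step by step in this article:

# Outline of the approach (dummy values; each step is detailed below)
for page_number in range(1, 3):      # loop 1: every page of the pagination
    article_ids = ['111', '222']     # in the real script: IDs parsed from the search page
    for article_id in article_ids:   # loop 2: every article found on that page
        email_found = None           # in the real script: a RegEx run on the article HTML
        if email_found:              # condition: only keep articles containing an email
            print(email_found)       # in the real script: write the email to the CSV file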
Libraries and receptacle
We start by importing some necessary libraries and creating the receptacle file for our data.
from bs4 import BeautifulSoup
import requests
import re
import csv

# CSV file to receive data
out_file = open('C:/python_projects/webscraping/email.csv', 'w', encoding='cp1252')
out_file.write("email" + "\n")
requests will serve to get the HTML content, and BeautifulSoup will organize that content. re stands for regular expressions; we will use it later to run a RegEx that recognizes email addresses in the HTML content. And csv is imported to handle the CSV output that will receive our data.
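As a quick preview of the RegEx step, here is a minimal, self-contained example of the kind of pattern we will use at the end. The sample string and the address in it are made up:

import re

# Hypothetical HTML fragment containing an email address
sample = '<span class="contact">Write to jane.doe@example.org for details</span>'
emails = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", sample)
print(emails)  # ['jane.doe@example.org']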
Pagination
Now let's take a look at the shape of the site itself. The emails we want are in articles reachable through a pagination. The URL looks like this:
https://www.website.com/?term=[a query...]&sort=&page=[a pagination number]
The first results come from a search page containing links to the articles. Examining this search page, we see that the pagination goes up to 500 pages.
So we have to iterate through the pagination from the first to the last page. Let's build our URLs by concatenating the beginning of the search URL with the pagination number, all inside a loop.
# Prepare URL
urlpage_prefix = 'https://www.website.com/?term=[a query...]&sort=&page='
urlpage_suffix = 0

# Get the page id (pages 1 to 500)
while urlpage_suffix < 500:
    urlpage_suffix += 1
    # URL build
    page = requests.get(urlpage_prefix + str(urlpage_suffix))
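A note on this design choice: the same iteration can be written with for and range, which avoids managing the counter by hand. Here is a sketch; the URL is still a placeholder, and the status check is an addition of this sketch, not part of the original script:

# Equivalent loop with for/range (placeholder URL, adapt to your site)
urlpage_prefix = 'https://www.website.com/?term=[a query...]&sort=&page='

for page_number in range(1, 501):  # pages 1 to 500
    page = requests.get(urlpage_prefix + str(page_number))
    if page.status_code != 200:    # skip pages that failed to load
        continue
    # ... parse the page here, as in the next step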
First parsing
Then we have to get the article IDs from all the blocks of each results page. In this first loop, we catch the blocks through one of their attributes (a CSS class).
# Parsing
soup1 = BeautifulSoup(page.content, 'html.parser')
a_CSS_class = soup1.find_all(attrs={'class': 'a_CSS_class'})
In a new loop, we find the ID of each article and use it to build a new URL pointing to the article itself. Indeed, in our example, the final URL of an article looks like this:
https://www.website.com/[an ID]
for x in a_CSS_class:
    # New URL build
    ArticleId = x.get('data-article-id')
    urlArticle_prefix = 'https://www.website.com/'
    Article = requests.get(urlArticle_prefix + str(ArticleId))
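To see the attribute lookup in isolation, here is a small self-contained example on a hypothetical result block. The class name and the data-article-id attribute are the same placeholders as above; the real markup of your website will differ:

from bs4 import BeautifulSoup

# Hypothetical HTML for one result block of the search page
html = '<div class="a_CSS_class" data-article-id="12345">An article title</div>'
soup = BeautifulSoup(html, 'html.parser')

for block in soup.find_all(attrs={'class': 'a_CSS_class'}):
    print(block.get('data-article-id'))  # prints: 12345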
Second parsing
Now we parse the page at this new URL, catching another CSS class, the one that contains the email.
# New parsing
soup2 = BeautifulSoup(Article.content, 'html.parser')
another_CSS_class = soup2.find_all(attrs={'class': 'another_CSS_class'})
RegEx, cleaning and insertion
All that's left is to check whether our CSS class contains an email (with a RegEx run by re), clean the line containing the email (a few replace calls) and insert it into our CSV file.
emails = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", str(another_CSS_class))
if emails:
    line = '{}\n'.format(emails)
    newLine = line.replace("'", "").replace("[", "").replace("]", "")
    out_file.write(newLine)

# Close the file once all pages have been processed
out_file.close()
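Since csv is already imported, an arguably cleaner alternative is to let csv.writer write one email per row instead of cleaning the list with replace. This is just a sketch, with a made-up file name and a sample string standing in for str(another_CSS_class):

import csv
import re

# Sketch of an alternative output step: one email per row via csv.writer
sample = '<p>contact: john@example.com, mary@example.net</p>'  # stand-in for str(another_CSS_class)
with open('emails_alt.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['email'])
    for email in re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", sample):
        writer.writerow([email])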