
Web Scraping with Python using BeautifulSoup

How to parse and extract data from HTML documents in simple steps

Rafael Bastos · Published in Analytics Vidhya · Jul 14, 2020

One of the main concerns when starting a new project is how to obtain the data we’ll be working with. Companies like Airbnb and Twitter, for instance, simplify this task by providing APIs, so we can compile information in an organized way. On other occasions, we can download a structured dataset that is already cleaned and ready to use, as in some Kaggle competitions. More often than not, however, we’ll need to explore the web ourselves to find and extract the data we want.

That’s when web scraping comes in handy. The idea is to extract information from a website and convert it into a format suitable for analysis. While there are several tools available for this purpose, in this article we’ll be using BeautifulSoup, a Python library designed to easily pull data out of HTML and XML files.

Here, we’ll visit Wikipedia’s List of best-selling books page, which contains several such lists, and extract the second table, covering books with between 50 million and 100 million copies sold.

Libraries needed

We only need two packages to handle the HTML file. We’ll also be using pandas to create a data frame from the extracted data:

  • requests - Allows us to send HTTP requests and download the HTML code from the webpage;
  • beautifulsoup4 (imported as bs4) - Used to pull data out of the raw HTML file;
  • pandas - Python library for data manipulation. We'll use it to create our data frame.
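A minimal setup sketch, assuming all three packages are already installed (e.g. with pip install requests beautifulsoup4 pandas):

    # Imports used throughout this article
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd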

Extracting the HTML file

To extract the raw HTML file, we simply pass the website URL into the requests.get() function.
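A sketch of that step; the URL below is the Wikipedia page described above (the original article linked it directly):

    # Download the page and keep the raw HTML as a string
    url = 'https://en.wikipedia.org/wiki/List_of_best-selling_books'
    html_text = requests.get(url).text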

We now have an unstructured string containing the HTML code extracted from the URL we passed.

Let’s take a look:

HTML code extracted from the web page

The way requests delivers the HTML output is quite messy for analysis. That’s where BeautifulSoup can help.

Creating a BeautifulSoup object

Now, we can start to work with BeautifulSoup. Let’s generate a BeautifulSoup object called soup, passing the html_text string created above.
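A sketch of that step, using Python’s built-in html.parser:

    # Parse the raw HTML string into a navigable BeautifulSoup object
    soup = BeautifulSoup(html_text, 'html.parser')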

Next, we can use the prettify() method to display the object in a structured format.
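For instance:

    # Print the parse tree with one tag per line, indented by nesting depth
    print(soup.prettify())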

Notice below how the formatted output is easier to read and work with than the raw html_text string we first generated.

Structured BeautifulSoup object

Inspecting the Wikipedia page

Back on the Wikipedia page, let’s inspect the page’s elements. (On Windows, press Ctrl + Shift + I; on a Mac, press Cmd + Opt + I.)

Inspecting the web page’s elements

Notice that all tables have a class of wikitable sortable. We can take advantage of that to select all tables in the HTML file.

Extracting the table

We save the tables in a variable called wiki_tables, using the find_all() method to search for all HTML table tags with a class of wikitable sortable.
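A sketch of that call:

    # Find every <table> tag whose class attribute is 'wikitable sortable'
    wiki_tables = soup.find_all('table', {'class': 'wikitable sortable'})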

As we want the second table on the page (Between 50 million and 100 million copies), let’s narrow down our search to the second wiki_tables element. Let's also extract each row tr in that table.
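A sketch of those two steps (the names table and rows are illustrative):

    # The second table lists books with 50 to 100 million copies sold
    table = wiki_tables[1]

    # Collect every row (<tr>) of that table
    rows = table.find_all('tr')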

Now, we’ll create an empty list called table_list, and append the elements of each table cell td into table_list.
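A sketch of that loop; the if check is one way to skip the header row, which uses th cells rather than td:

    # Build a list of rows, each row being a list of its cells' text
    table_list = []
    for row in rows:
        cells = row.find_all('td')
        if cells:  # header rows contain only <th> cells, so they are skipped
            table_list.append([cell.get_text(strip=True) for cell in cells])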

We have successfully extracted that second table from the website into a list and we’re all set up to start analyzing the data.

Creating a pandas DataFrame

Finally, we can simply convert the list into a Pandas DataFrame to visualize the data we extracted from Wikipedia.
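A sketch of that final step; the column names are assumptions based on the layout of the Wikipedia table at the time of writing and may have changed since:

    # Column names matching the Wikipedia table's header (assumed)
    columns = ['Book', 'Author(s)', 'Original language',
               'First published', 'Approximate sales']

    df = pd.DataFrame(table_list, columns=columns)
    df.head()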

Pandas DataFrame created from the Wikipedia table

That’s it! With a few steps and some lines of code, we now have a data frame extracted from an HTML table, ready for analysis. Well, there are still some adjustments that could be made, such as removing the square-bracketed reference markers in the approximate sales column, but the web scraping is done!

For the full code, please refer to the notebook.
