Intro
In the era of data science, it is common to collect data from websites for analytics purposes.
Python is one of the most commonly used programming languages for data science projects. Using Python with BeautifulSoup makes scraping easier. Knowing how to scrape pages will save you time and money.

Prerequisites

  1. Basics of Python programming (Python 3.x).
  2. Basics of HTML tags.

Installing required modules
First things first: assuming Python 3.x is already installed on your system, you need to install the requests HTTP library and the beautifulsoup4 module.

Install requests and beautifulsoup4

$ pip install requests
$ pip install beautifulsoup4

Collecting web page data

Now we are ready to go. In this tutorial, our goal is to get the list of presidents of the United States from this Wikipedia page.
Go to this link, right-click on the table containing the information about the United States presidents, and then click Inspect to inspect the page (I am using Chrome; other browsers have a similar option).

[Screenshot: inspecting the presidents table in Chrome]

The table content is inside a table tag with the class wikitable (see the image below). We will need this information to extract the data of interest.
[Screenshot: the table tag with class wikitable in the page inspector]

Import the installed modules

import requests
from bs4 import BeautifulSoup

To get the data from the web page, we will use the get() method of the requests library.

url = "https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States"
page = requests.get(url)

It is always good to check the HTTP response status code.

print(page.status_code)   # This should print 200
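
A status code other than 200 usually means there is nothing useful to parse, so it can help to make the check explicit. Here is a minimal sketch of one way to do that. The helper name fetch_html and the 10-second timeout are my own choices for illustration, not something the rest of this tutorial depends on.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the response body as bytes, or None if the request fails.

    The name fetch_html and the default timeout are illustrative choices.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
        response.raise_for_status()
    except requests.RequestException as exc:
        # Covers connection errors, timeouts, bad URLs and HTTP errors
        print(f"Request failed: {exc}")
        return None
    return response.content
```

With this helper, fetch_html(url) would give back the same bytes as page.content when the request succeeds, and None otherwise.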

Now that we have collected the data from the web page, let's see what we got.

print(page.content)

The above code will display the HTTP response body.
This data can be viewed in a more readable format using BeautifulSoup's prettify() method. For this, we will create a BeautifulSoup object and call its prettify() method.

soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

This will print the data in a format like the one we saw when we inspected the web page.

<table class="wikitable" style="text-align:center;">
      <tbody>
       <tr>
        <th colspan="9">
         <span style="margin:0; font-size:90%; white-space:nowrap;">
          <span class="legend-text" style="border:1px solid #AAAAAA; padding:1px .6em; background-color:#DDDDDD; color:black; font-size:95%; line-height:1.25; text-align:center;">
          </span>
          <a href="/wiki/Independent_politician" title="Independent politician">
           Unaffiliated
          </a>
          (2)
         </span>
         <span style="margin:0; font-size:90%; white-space:nowrap;">
         ...
         ...

So far we know that our table is in a table tag with the class wikitable. So, first we will extract the table using the find method of the BeautifulSoup object. This method returns a bs4 Tag object for the first match (or None if nothing matches).

table = soup.find('table', class_='wikitable')
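
To get a feel for what find returns without hitting the network, you can try it on a small piece of HTML. The snippet below is a sketch that uses a made-up HTML string as a stand-in for the real page; the class names infobox and does-not-exist are only there for the example.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the real page (made up for this example)
html = """
<table class="wikitable sortable">
  <tr><td><b><a href="/wiki/George_Washington" title="George Washington">George Washington</a></b></td></tr>
</table>
<table class="infobox"><tr><td>Other data</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ matches any one of the classes on the tag
table = soup.find("table", class_="wikitable")
print(type(table))      # a bs4 Tag object
print(table["class"])   # list of all classes on the matched tag

# find returns None when nothing matches
missing = soup.find("table", class_="does-not-exist")
print(missing)
```

Because find can return None, it is worth checking the result before calling other methods on it.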

This tag has many nested tags, but we only need the president's name, which is the link text of the a tag inside each b tag (the b tags are nested inside the table tag). For that, we need to find all the b tags under the table tag and then find the a tag under each of them. We will use the find_all method and iterate over each b tag to get its a tag.

for link in table.find_all('b'):
    name = link.find('a')
    print(name)  

This will print the a tag found inside each b tag.

<a href="/wiki/George_Washington" title="George Washington">George Washington</a>
<a href="/wiki/John_Adams" title="John Adams">John Adams</a>
<a href="/wiki/Thomas_Jefferson" title="Thomas Jefferson">Thomas Jefferson</a>
<a href="/wiki/James_Madison" title="James Madison">James Madison</a>
<a href="/wiki/James_Monroe" title="James Monroe">James Monroe</a>
...
...
<a href="/wiki/Barack_Obama" title="Barack Obama">Barack Obama</a>
<a href="/wiki/Donald_Trump" title="Donald Trump">Donald Trump</a>
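
One thing to watch for in the loop above: find returns None when a b tag has no a tag inside it, so a bold cell without a link would show up as None. The following is a small sketch of how that case could be skipped. The HTML string, including the "Vacant" row, is invented for the example.

```python
from bs4 import BeautifulSoup

# Made-up sample: the middle row has a <b> tag without a link
html = """
<table class="wikitable">
  <tr><td><b><a href="/wiki/George_Washington" title="George Washington">George Washington</a></b></td></tr>
  <tr><td><b>Vacant</b></td></tr>
  <tr><td><b><a href="/wiki/John_Adams" title="John Adams">John Adams</a></b></td></tr>
</table>
"""

table = BeautifulSoup(html, "html.parser").find("table", class_="wikitable")

links = []
for bold in table.find_all("b"):
    anchor = bold.find("a")
    if anchor is None:  # this <b> has no link inside it, so skip it
        continue
    links.append(anchor)

print(len(links))
```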

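The name can be extracted from each a tag using the get_text() method, which returns the text between the opening and closing tags. No argument is needed here: the optional argument of get_text() is a separator string, not an attribute name. So, modifying the above code snippet:

for link in table.find_all('b'):
    name = link.find('a')
    print(name.get_text())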

and here is the desired result

George Washington
John Adams
Thomas Jefferson
James Madison
James Monroe
...
...
Barack Obama
Donald Trump
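
On this page the link text and the title attribute happen to be the same, which makes it easy to mix them up. Here is a short sketch showing the difference between the two, using an invented a tag where they differ on purpose.

```python
from bs4 import BeautifulSoup

# Made-up tag where the link text and the title attribute differ
html = '<b><a href="/wiki/John_Adams" title="John Adams">J. Adams</a></b>'
anchor = BeautifulSoup(html, "html.parser").find("a")

link_text = anchor.get_text()     # text between <a> and </a>
title_attr = anchor.get("title")  # attribute value, or None if it is missing
href = anchor["href"]             # raises KeyError if the attribute is missing

print(link_text)   # J. Adams
print(title_attr)  # John Adams
print(href)        # /wiki/John_Adams

# The first argument of get_text() is a separator placed between text pieces
joined = BeautifulSoup("<p>a<b>b</b></p>", "html.parser").get_text("|")
print(joined)      # a|b
```

So get_text() is the right call for the visible name, and anchor.get("title") is the one to use if you want the attribute instead.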

Putting it all together

import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
table = soup.find('table', class_='wikitable')
print(type(table))
for link in table.find_all('b'):
    name = link.find('a')
    print(name.get_text())

We have successfully scraped a web page in 10 lines of Python code! Bingo!
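
If you want to reuse the names instead of printing them, one option is to separate the parsing from the download and return a list. This is only a sketch under my own assumptions: the function name extract_names and the sample HTML are made up for illustration, and the real page layout may change over time.

```python
from bs4 import BeautifulSoup


def extract_names(html):
    """Return the link text of each <b><a> pair in the first wikitable.

    extract_names is an illustrative helper, not part of bs4.
    """
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    if table is None:  # no matching table on the page
        return []
    names = []
    for bold in table.find_all("b"):
        anchor = bold.find("a")
        if anchor is not None:
            names.append(anchor.get_text(strip=True))
    return names


# Made-up sample so the function can be tried without a network request
sample = """
<table class="wikitable">
  <tr><td><b><a href="/wiki/George_Washington">George Washington</a></b></td></tr>
  <tr><td><b><a href="/wiki/John_Adams"> John Adams </a></b></td></tr>
</table>
"""
print(extract_names(sample))
```

To run it against the live page, you would pass page.content from the requests.get(url) call shown earlier.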

Leave your feedback in the comment box. Let me know if you have any questions or run into any difficulty with this tutorial.


