This tutorial will walk you through scraping data from a table on Wikipedia.
The page we will be scraping is List of countries and dependencies by population. You can visit the link to get a feel for how the page looks. The table with the data to be scraped is shown below –
BeautifulSoup – A library for pulling data out of HTML and XML files.
Requests – A library for making HTTP requests in Python.
It is assumed that you already have Python 3 installed on your machine; if not, follow here to install Python 3 on macOS.
Run the commands below to install the requests and beautifulsoup4 libraries:
pip install requests
pip install beautifulsoup4
Scrape the data
Navigate to a directory on your machine and create a new file named main.py.
In main.py, add the following code:
import csv
import requests
from bs4 import BeautifulSoup


def scrape_data(url):

    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.content, 'html.parser')

    table = soup.find_all('table')[1]

    rows = table.select('tbody > tr')

    header = [th.text.rstrip() for th in rows[0].find_all('th')]

    with open('output.csv', 'w', newline='') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(header)
        for row in rows[1:]:
            data = [td.text.rstrip() for td in row.find_all('td')]
            writer.writerow(data)


if __name__ == "__main__":
    url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
    scrape_data(url)
Let us break this apart and see how it works line by line.
Lines 1 – 3 Import all the packages needed to run the application.
Line 6 We define the function scrape_data, which takes a url as its only parameter.
Line 8 We make a GET request to the url using the get method of the requests library.
When making HTTP requests with the requests library, it is important to set a timeout in case the server does not respond in a timely manner. This prevents your program from hanging indefinitely while waiting for a response from the server.
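To make the request more robust, you can also catch the errors a timeout (or any other network failure) raises. Here is a minimal sketch; the fetch helper name is my own, not part of the requests library:

```python
import requests

def fetch(url, timeout=10):
    """Return the raw response body, or None if the request fails or times out."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # raise for 4xx/5xx status codes
        return response.content
    except requests.exceptions.RequestException:
        # RequestException covers timeouts, connection errors,
        # and the HTTP errors raised by raise_for_status()
        return None

# A request to a non-routable address fails fast instead of hanging:
print(fetch("http://10.255.255.1", timeout=0.5))  # None
```

Without the timeout argument, a request to an unresponsive server could block the program for a very long time.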
Line 9 We create a BeautifulSoup tree structure from the content of the server's response. This object is easy to navigate and search through.
Line 11 We search through the BeautifulSoup object soup using its find_all method, which returns every HTML tag in the tree structure that matches the filter/search term. We then take the second table in the resulting list (index 1), which contains the data we want.
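To see how find_all behaves, here is a tiny made-up snippet standing in for the real page (whose markup is much more complex):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two tables, of which only the second holds data
html = """
<table><tr><td>layout</td></tr></table>
<table><tr><td>population data</td></tr></table>
"""
soup = BeautifulSoup(html, 'html.parser')
tables = soup.find_all('table')   # a list of every <table> tag in the document
print(len(tables))                # 2
second = tables[1]                # index 1 selects the second table
print(second.td.text)             # population data
```

Because find_all returns a plain list, ordinary indexing picks out the table you want.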
Line 13 This line selects all the tr elements in the table whose parent is a tbody element. The tr elements represent the table rows.
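The select method accepts CSS selectors, where "tbody > tr" means "tr elements that are direct children of a tbody". A small sketch with made-up markup:

```python
from bs4 import BeautifulSoup

# A minimal table with an explicit <tbody> (illustrative markup)
html = """
<table>
  <tbody>
    <tr><th>Country</th><th>Population</th></tr>
    <tr><td>China</td><td>1,400,000,000</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table')
rows = table.select('tbody > tr')  # CSS child selector: <tr> directly under <tbody>
print(len(rows))  # 2 (one header row, one data row)
```

The first element of rows is the header row; the rest are data rows, which is exactly the split the script relies on.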
Line 15 The first row usually contains the header cells. We search through the first row in the rows list to get the text values of all th elements in that row. We also remove any trailing whitespace from the text using the rstrip Python string method.
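Header cells in raw markup often end with stray spaces or newlines, which is why rstrip matters. A quick illustration on a made-up row:

```python
from bs4 import BeautifulSoup

# Trailing whitespace and newlines are common in scraped header cells
html = "<tr><th>Rank </th><th>Country\n</th><th>Population</th></tr>"
row = BeautifulSoup(html, 'html.parser')
header = [th.text.rstrip() for th in row.find_all('th')]
print(header)  # ['Rank', 'Country', 'Population']
```

Without rstrip, the CSV header would contain values like "Country\n", which complicates any later processing.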
Line 17 – 22 This opens a file and creates a new file object. The w mode ensures the file is open for writing. First we write the header row, then we loop through the remaining rows (skipping the first, header row), extract the data contained in each, and write it to the file object.
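The csv-writing steps above can be sketched in isolation; the figures below are illustrative placeholders, not live data:

```python
import csv

header = ['Rank', 'Country', 'Population']
rows = [
    ['1', 'India', '1,400,000,000'],
    ['2', 'China', '1,400,000,000'],
]

# newline='' is recommended for the csv module; it avoids blank lines on Windows
with open('example.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(header)   # write a single row
    writer.writerows(rows)    # write many rows at once

# Read the file back to confirm what was written
with open('example.csv', newline='') as csv_file:
    print(list(csv.reader(csv_file)))
```

Note that csv.writer automatically quotes fields containing commas (such as the population figures), so they survive a round trip through csv.reader intact.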
Line 25 – 27 We check that the module is being run as the main program and call the function scrape_data with the specified url to scrape the data.
On the terminal, run the command below to scrape the data:

python main.py
An output file named output.csv containing the data should be produced in the root folder.
Before you begin scraping data from any website, be sure to study the HTML markup/content of the page to determine where the data you want is located.
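One way to do that study programmatically is to list every table on a page along with its caption, so you can pick the right index. A sketch, using a tiny made-up snippet in place of a full page (list_tables is my own helper name):

```python
from bs4 import BeautifulSoup

def list_tables(html):
    """Return (index, caption) pairs for every table in the markup."""
    soup = BeautifulSoup(html, 'html.parser')
    summary = []
    for i, table in enumerate(soup.find_all('table')):
        # Tag.caption accesses the table's <caption> child, if any
        caption = table.caption.text.strip() if table.caption else '(no caption)'
        summary.append((i, caption))
    return summary

# Hypothetical markup standing in for a full page:
html = ("<table><caption>Navigation</caption></table>"
        "<table><caption>Population by country</caption></table>")
print(list_tables(html))  # [(0, 'Navigation'), (1, 'Population by country')]
```

Running a helper like this against the live page would tell you which index to pass when selecting the table to scrape.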
If you have any questions or comments, please add them to the comments section below.