Web scrapers are a great way to collect data for projects. In this example I will use the Scrapy framework to create a web scraper that collects the image URLs of products returned by a search for "headphones" on amazon.com.
To start, let's check whether the Scrapy library is ready to go. Open the terminal on your macOS device and type:
$ scrapy version
At the time of this post I am using version 1.5.1, so you should see output similar to this:
Scrapy 1.5.1
If you do not have it installed yet, you can use pip to install it. Here is how:
$ pip install Scrapy
Time to get to the coding. Scrapy uses a terminal command to create a project, which sets you up with a set of starter files. Navigate to the directory where you want to save the project, then start the project by typing this line in the terminal:
$ scrapy startproject headphones
Scrapy will create the directory with contents like this:
headphones/
    scrapy.cfg
    headphones/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
We will create our first file under the spiders directory. I will name it
headphone_spider.py. After creating the new file, your directory structure should look like this:
headphones/
    scrapy.cfg
    headphones/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            headphone_spider.py
The first piece of code will import scrapy and create the class that will scrape the web for us.
import scrapy  # adding scrapy to our file

class HeadphonesSpider(scrapy.Spider):  # our class inherits from scrapy.Spider
    name = "headphones"  # we will name this as headphones and we will need it later on
name = "headphones" is very important here. When we will run our spider form terminal, you will use the name of the spider to start crawling the web. The name should be as clear as possible since you may have multiple spiders later on.
Now it is time for the first function in our class! Web scraping has two important parts. The first part is sending a request to the website(s) we will scrape. We will name our function start_requests; in it we define a list of URLs we want to visit and send requests to them.
def start_requests(self):
    urls = [  # list to enter our urls
        'https://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=headphones&rh=i%3Aaps%2Ck%3Aheadphones&ajr=2',
    ]
    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse)  # we will explain the callback soon
yield acts like the return keyword, but instead of returning a single value it creates a generator. Generators are useful when you iterate over a list of items once and will not use them again: each item is processed once and then forgotten.
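To see the difference outside of Scrapy, here is a minimal, hypothetical sketch (the function name and URLs are made up for illustration) of a generator producing values one at a time:

```python
# A generator function: each value is produced lazily with yield,
# instead of building the whole list up front with return.
def make_requests(urls):
    for url in urls:
        yield "requesting " + url  # produced one at a time, then forgotten

gen = make_requests(["page1", "page2"])
print(next(gen))  # -> requesting page1
print(next(gen))  # -> requesting page2
```

This is the same pattern start_requests uses: Scrapy pulls requests from the generator one by one rather than receiving them all at once.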
The keyword argument callback is used to call another function when there is a response to the request. After scrapy.Request(url=url, callback=self.parse) completes, it calls the second function in our class, which will be named parse. Here it is:
def parse(self, response):
    img_urls = response.css('img::attr(src)').extract()
    with open('urls.txt', 'w') as f:
        for u in img_urls:
            f.write(u + "\n")
In this function the first line is where we use Scrapy's selectors. We take the response generated from the scrapy.Request() function as a parameter and call .css() on it to select the data we are looking for. Since we are looking for images, we could start with .css('img'), but this would give us entire <img> tags, which is not quite what we need. Image tags in HTML have a src attribute, so we use img::attr(src) to select the source of each image. Now we have the selector object! The only thing left is to extract the data we found by adding the .extract() function to the end. After all this, I write all of the urls to a text file.
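To make the selector's behavior concrete without running Scrapy, here is a hypothetical stand-in built only on Python's standard library: it collects the src attribute of every <img> tag, which is conceptually what response.css('img::attr(src)').extract() returns as a list of strings (the sample HTML is made up).

```python
from html.parser import HTMLParser

# A stand-in for the Scrapy selector: gather the src attribute
# of every <img> tag into a list of strings.
class ImgSrcCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.img_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src":
                    self.img_urls.append(value)

html = '<div><img src="a.jpg"><p>text</p><img src="b.png"></div>'
collector = ImgSrcCollector()
collector.feed(html)
print(collector.img_urls)  # -> ['a.jpg', 'b.png']
```

Scrapy's selectors do much more than this, of course, but the output shape is the same: a plain list of strings ready to be written to a file.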
Here is the final version of our spider:
import scrapy

class HeadphonesSpider(scrapy.Spider):

    name = "headphones"

    def start_requests(self):
        urls = [
            'https://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=headphones&rh=i%3Aaps%2Ck%3Aheadphones&ajr=2',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        img_urls = response.css('img::attr(src)').extract()
        with open('urls.txt', 'w') as f:
            for u in img_urls:
                f.write(u + "\n")
To run your spider, go to the top-level directory of the project and type this in the terminal:
$ scrapy crawl headphones
headphones is the name we set in the class, and the spider will start crawling!
This was the first project I had done with Scrapy. I used the official Scrapy tutorial page to guide myself. In another post I will show how to allow your spider to move on to other pages and extract data from those pages as well.