Scrapers are a great way to collect data for projects. In this example I will use the Scrapy framework to create a web scraper that collects the image links from the results page when you search for “headphones” on amazon.com.

To start with, let’s check that the library is installed. Open the terminal on your macOS device and type:

$ scrapy version

At the time of this post I am using version 1.5.1. You should get output similar to this:

Scrapy 1.5.1

If you do not have it installed yet, you can use pip to install it. Here is how to do it:

$ pip install Scrapy
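
If you would rather check from inside Python than from the shell, the standard library can report the installed version of a package. This is a minimal sketch, assuming Python 3.8 or newer (where importlib.metadata is available); the helper name installed_version is my own and not part of Scrapy.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("Scrapy")
    if found is None:
        print("Scrapy is not installed; run: pip install Scrapy")
    else:
        print("Scrapy", found)
```

Returning None instead of raising keeps the check easy to use in a setup script that decides whether to install anything.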

Time to get coding. Scrapy provides a terminal command that creates a project and sets you up with a starter set of files. Navigate to the directory where you want to save the project and type this line in the terminal:

$ scrapy startproject headphones

Scrapy will create the directory with contents like this:

headphones/
    scrapy.cfg    
    headphones/
        __init__.py
        items.py          
        middlewares.py    
        pipelines.py      
        settings.py       
        spiders/          
            __init__.py

We will create our first file under the spiders directory. I will name it headphone_spider.py. After creating the new file, your directory structure should look like this:

headphones/
    scrapy.cfg    
    headphones/
        __init__.py
        items.py          
        middlewares.py    
        pipelines.py      
        settings.py       
        spiders/          
            __init__.py
            headphone_spider.py

Our first step is to import scrapy and create a class that will scrape the web for us.

import scrapy    # adding scrapy to our file

class HeadphonesSpider(scrapy.Spider):   # our class inherits from scrapy.Spider

    name = "headphones"   # we will name this as headphones and we will need it later on

The line name = "headphones" is very important here. When we run our spider from the terminal, we will use the name of the spider to start crawling the web. The name should be as clear as possible since you may have multiple spiders later on.

Now it is time for the first function in our class! Web scraping has two important parts. The first part is to send a request to the website(s) we want to scrape. We will name our function start_requests, define a list of urls that we want to visit, and send a request to each of them.

def start_requests(self):
    urls = []  # list to enter our urls

    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse)  # we will explain the callback soon

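The keyword yield turns start_requests into a generator function. Like return, it hands a value back to the caller, but the function pauses instead of ending and picks up from the same spot when the next value is requested. Generators are useful when you need to go through a series of items once and will not use them again. They will be processed once and then forgotten.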
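
To see the "processed once" behaviour on its own, here is a small example in plain Python with no Scrapy involved. The function name countdown is made up for illustration.

```python
def countdown(start):
    """Yield start, start - 1, ..., 1, one value at a time."""
    while start > 0:
        yield start  # pause here and hand one value to the caller
        start -= 1


numbers = countdown(3)       # nothing runs yet; this only creates the generator
first_pass = list(numbers)   # consumes every value
second_pass = list(numbers)  # the generator is exhausted, so this is empty

print(first_pass)   # [3, 2, 1]
print(second_pass)  # []
```

The second list is empty because a generator does not rewind. To go through the values again you call countdown(3) again.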

The keyword argument callback tells Scrapy which function to call once the response to a request arrives. When the page requested by scrapy.Request(url=url, callback=self.parse) has been downloaded, Scrapy calls the second function in our class, which we will name parse, and passes it the response. Here it is:

def parse(self, response):
    img_urls = response.css('img::attr(src)').extract()
    with open('urls.txt', 'w') as f:
        for u in img_urls:
            f.write(u + "\n")  # one url per line

In this function, the first line is where we use Scrapy’s selectors. Scrapy passes the response to the request we made with scrapy.Request() into parse as a parameter. We use .css() as the selector to pick out the data we are looking for. Since we are looking for images we could enter .css('img'), but this would give us the whole <img> tags, which is more than we need. Image tags in HTML have a src attribute, so we add ::attr(src) to select only the source of each image. So now we have the selector object! The only thing left is to extract the data we found by adding the .extract() function to the end, which returns a list of strings. After all this I write the urls to a text file, one per line.
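
If you want to see what kind of list this produces without running a crawl, the sketch below does the same job on a small HTML string using only Python’s standard library. Note that this is a stand-in for illustration and not Scrapy’s selector machinery: ImgSrcCollector and sample_html are names I made up, and the HTML is invented.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)


sample_html = """
<div>
  <img src="https://example.com/a.jpg" alt="first">
  <p>No image here</p>
  <img src="/images/b.png">
  <img alt="no source">
</div>
"""

collector = ImgSrcCollector()
collector.feed(sample_html)
print(collector.sources)  # ['https://example.com/a.jpg', '/images/b.png']
```

Two things are worth noticing: an <img> tag without a src contributes nothing, and relative paths such as /images/b.png come back exactly as they are written in the page.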


Here is the final version of our headphone_spider.py file.

import scrapy

class HeadphonesSpider(scrapy.Spider):

    name = "headphones"

    def start_requests(self):
        urls = [
        'https://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=headphones&rh=i%3Aaps%2Ck%3Aheadphones&ajr=2',
        ]

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        img_urls = response.css('img::attr(src)').extract()
        with open('urls.txt', 'w') as f:
            for u in img_urls:
                f.write(u + "\n")

To run your spider, go to the top-level directory of the project (the one that contains scrapy.cfg) and in the terminal type:

$ scrapy crawl headphones

Here headphones is the name we set in the spider class. Run the command and it will start crawling!
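
What happens during the crawl, the hand-off from each request to its callback, can be pictured with a toy model in plain Python. This is only a sketch of the idea and not how Scrapy is implemented: FakeRequest, fake_download, and run are made-up names, nothing is fetched from the network, and real Scrapy handles requests asynchronously rather than one after another.

```python
class FakeRequest:
    """Stand-in for scrapy.Request: a URL plus the function to call with the response."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


def fake_download(url):
    """Pretend to fetch a page; returns a canned string instead of real HTML."""
    return "<html>page for {}</html>".format(url)


def start_requests(urls, callback):
    """Yield one request per URL, like the spider method in this post."""
    for url in urls:
        yield FakeRequest(url=url, callback=callback)


def run(requests):
    """Toy engine: 'download' each request, then pass the response to its callback."""
    for request in requests:
        response = fake_download(request.url)
        request.callback(response)


seen = []  # the callback here simply records each response it receives
run(start_requests(["http://example.com/a", "http://example.com/b"], seen.append))
print(seen)
```

The point of the model is that start_requests never calls the callback itself. It only describes what to fetch and which function should receive the result, and the engine does the rest.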

This was the first project I did with Scrapy. I used the tutorial page to guide myself. In another post I will show how to allow your spider to skip ahead to other pages and extract data from those pages as well.


