Python Spyder - I Crawl and Scrape!

A step-by-step guide to building a bot in Python, using Scrapy and Spyder, that crawls and scrapes websites.

A. Preface

As part of my data science capstone project, I had to get data from website links. I started my search and found numerous ways to scrape data off the web; Beautiful Soup and Scrapy stood out as the tools for scraping a website. Thanks to the open-source community and its documentation, the web has enough information, and all I had to do was adapt it to what I wanted.

The next concern to address was how to scrape data from around 5,000 web pages! That is when I came across spiders, and a Scrapy spider was what I settled on. The information was scattered across the web, but this particular repository on GitHub did help me.

Credits:

  • https://github.com/scrapy/quotesbot

Bot I created to crawl and scrape:

  • https://github.com/roshanlulu/bots

B. Step-by-step guide on how to create a bot for yourself.

1. Create the bot/spider framework in just one line! It's that simple!

  • TODO: Run from the command line: scrapy startproject <projname>
  • NOTES: A set of project files/folders will be created in that location.
    • scrapy.cfg: the project configuration file
    • <projname>/: the project's Python module; you'll later import your code from here.
    • <projname>/items.py: the project's items file.
    • <projname>/pipelines.py: the project's pipelines file.
    • <projname>/settings.py: the project's settings file.
    • <projname>/spiders/: a directory where you'll later put your spiders.
  • The project is created at the path where you run the command; a sketch of the resulting layout follows below.
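For reference, here is roughly what scrapy startproject generates. The project name mybot is a placeholder I am using for illustration throughout this guide; it is not from the original post, and newer Scrapy versions may add a file or two:

scrapy startproject mybot

mybot/
    scrapy.cfg
    mybot/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py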

2. Update settings file

  • TODO: Add this line to the settings.py file: DOWNLOAD_HANDLERS = {'s3': None}
  • NOTES: I am yet to dig into the details, but this disables Scrapy's S3 download handler, which avoids boto-related errors at startup when you are not actually fetching anything from s3:// URLs. See the sketch below.
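As a sketch, the relevant part of settings.py would look like this, again assuming the placeholder project name mybot (the BOT_NAME and module lines are generated for you in step 1):

# settings.py

BOT_NAME = 'mybot'

SPIDER_MODULES = ['mybot.spiders']
NEWSPIDER_MODULE = 'mybot.spiders'

# Disable the S3 download handler; this spider never fetches s3:// URLs,
# and turning the handler off avoids boto-related errors on some setups
DOWNLOAD_HANDLERS = {
    's3': None,
}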

3. Create the spider

  • TODO: Create your spider file, Spiderfile.py, inside the spiders folder [from step 1]; see the path sketch below.
  • NOTES: Spider sounds fancy and complicated. But a spider is just another .py file.
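Concretely, assuming the placeholder project name mybot, the new file sits next to the generated __init__.py:

mybot/mybot/spiders/Spiderfile.py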

4. Define the spider class

  • TODO: Add a spider class inside 'Spiderfile.py' from Step 3
  • Code/Pseudocode:
# import libraries
import scrapy


class GiveAnyClassName(scrapy.Spider):
    # Naming ceremony of the spider: this is the name you will use
    # with 'scrapy list' and 'scrapy crawl'
    name = 'thats_the_spiders_name'
    # Provide the list of URLs that you want to parse through
    # (placeholder URL; replace it with your own pages)
    start_urls = ['https://example.com/page-1']

    # Define your parse function inside your class; Scrapy calls it
    # with the response for each URL in start_urls
    def parse(self, response):
        # These XPaths are specific to the pages I scraped: each <b>
        # holds a field label and each <td><font> holds its value
        yield {
            response.xpath('//b/text()').extract()[0]: response.xpath('//td/font/text()').extract()[0],
            response.xpath('//b/text()').extract()[1]: response.xpath('//td/font/text()').extract()[1],
        }

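One caveat: extract()[0] raises an IndexError on any page where the XPath matches nothing. A slightly more defensive variant of the parse method above (my tweak, not from the original post) pairs up whatever labels and values it finds:

    def parse(self, response):
        # extract() returns a (possibly empty) list of all matches
        labels = response.xpath('//b/text()').extract()
        values = response.xpath('//td/font/text()').extract()
        # zip() stops at the shorter list, so a missing element
        # yields fewer pairs instead of raising an IndexError
        yield dict(zip(labels, values))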
5. List the spiders you have from the command line

  • TODO: Run the command 'scrapy list' from your bot path. If you can see your spider from step 4, yay!
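If everything is wired up correctly, the output is just the name defined in step 4:

scrapy list
thats_the_spiders_name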

6. Command to kickstart your spider to start crawling

  • TODO: Your spider is ready to crawl the web.
    • Command to crawl using your spider and write output to command line
      • scrapy crawl <spider-name>
    • Command to crawl using your spider and write the output to a filename
      • scrapy crawl <spider-name> -o <filename.json/.csv>
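For example, to run the spider from step 4 and collect its items into a JSON file (output.json is an arbitrary name of my choosing):

scrapy crawl thats_the_spiders_name -o output.json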
Written on May 30, 2017