Python Spider - I Crawl and Scrape!
Step-by-step guide to building a bot in Python that crawls and scrapes websites using Scrapy spiders.
A. Preface
As part of my data science capstone project, I had to get data from website links. I started my search and found numerous ways to scrape data off the web, including Beautiful Soup and Scrapy. Thanks to the open-source community and its documentation, the web has enough information; all I had to do was adapt it to what I wanted.
The next concern to address was how to scrape data from around 5,000 web pages! That is when I came across spiders, and a Scrapy spider was what I found. The information was scattered across the web, but this particular project on GitHub did help me.
Credits:
https://github.com/scrapy/quotesbot
Bot I created to crawl and scrape:
https://github.com/roshanlulu/bots
B. Step by step guide on how to create a bot for yourself.
1. Create the bot/spider framework in just one line! It's that simple!
- TODO:
Run from the command line: scrapy startproject <projname>
- NOTES: A set of project files/folders will be created in that location:
- scrapy.cfg: the project configuration file
- <projname>/: the project's Python module; you'll later import your code from here
- <projname>/items.py: the project's items file
- <projname>/pipelines.py: the project's pipelines file
- <projname>/settings.py: the project's settings file
- <projname>/spiders/: a directory where you'll later put your spiders
- Your project path is now created.
2. Update settings file
- TODO:
Add this line to the settings.py file: DOWNLOAD_HANDLERS = {'s3': None}
- NOTES: I have yet to check why we do this.
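For reference, here is the line as it would sit in settings.py. Mapping the 's3' URL scheme to None disables Scrapy's built-in S3 download handler (which can otherwise raise errors when S3 support isn't configured); all other schemes keep their default handlers:

```python
# settings.py
# Disable the built-in download handler for the 's3' scheme.
# Other schemes (http, https, ftp, ...) keep their defaults.
DOWNLOAD_HANDLERS = {'s3': None}
```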
3. Create the spider
- TODO:
Create your spider file, Spiderfile.py, inside the spiders/ folder [from step 1]
- NOTES: Spider sounds fancy and complicated. But a spider is just another .py file.
4. Define the spider class
- TODO:
Add a spider class inside 'Spiderfile.py' from step 3
- Code/Pseudocode:
5. To list out the spiders you have from command line
- TODO:
Run the command 'scrapy list' from your bot path
- NOTES: If your spider from step 4 shows up in the list, yay!
6. Command to kickstart your spider to start crawling
- TODO: Your spider is ready to crawl the web.
- Command to crawl using your spider and write output to command line
scrapy crawl <spider-name>
- Command to crawl using your spider and write the output to a file
scrapy crawl <spider-name> -o <filename.json/.csv>