Web scraping with Scrapy
You will need three components for web scraping:
- a tool to GET files from the web,
- a tool to figure out how to process these files, and
- a tool to do the actual processing.
The one in the middle is where we, the humans, come in. The Chrome developer tools (or whatever they are called in your browser of choice) are our friends here. We can use the `Elements` tab to figure out the structure of a page and the identifiers we need to navigate that structure.
You can use Beautiful Soup to automate the processing, combined with `requests` or the standard-library `urllib` to do the file transfers. But Scrapy provides an all-in-one package.
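To make that concrete, here is a minimal sketch of the do-it-yourself combination: `requests` fetches the page and Beautiful Soup processes it. The URL and the CSS selector are placeholders; substitute whatever the `Elements` tab shows you for your own target.

```python
# Sketch of the requests + Beautiful Soup approach.
# The URL and the selector below are placeholders for illustration.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/books")  # hypothetical page
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
for link in soup.select("h3 a"):  # selector found via the Elements tab
    print(link.get_text(strip=True), link.get("href"))
```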
Scrapy
There are two ways to use Scrapy: the hard way and the quick-'n-dirty way.
- For the hard way you let Scrapy generate a project for you with the `startproject` command. This gives you all the bells and whistles you need for extensive web scraping, including items, middleware and pipelines, and it lets you write a whole nest of related (or unrelated) spiders and deploy them to the cloud. But for most use cases this is far more than is needed.
- The quick-'n-dirty way is to simply write a spider and run it with the `runspider` command; a minimal example follows this list. This gives you just one simple spider, but I find that in most cases this is sufficient.
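As a sketch of the quick-'n-dirty way, the spider below is a single self-contained file you can run with `runspider`. It targets quotes.toscrape.com, the demo site used in Scrapy's own tutorial, so the selectors are specific to that page.

```python
# quotes_spider.py -- one spider, no project scaffolding required.
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # yield one item per quote block on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # follow the pagination link until there is none
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Run it with `scrapy runspider quotes_spider.py -o quotes.json` and the scraped items end up in `quotes.json`. The hard way would instead start with `scrapy startproject projectname` and put a spider like this inside the generated `spiders/` directory.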
The `scrapy shell` command allows you to research the page and experiment with selectors.
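A typical session, again against the tutorial's demo site, looks roughly like this (Scrapy prints its log and then drops you into a Python prompt; the exact output depends on the page):

```
$ scrapy shell "https://quotes.toscrape.com/"
>>> response.status                       # did the GET succeed?
200
>>> response.css("title::text").get()     # try a selector interactively
'Quotes to Scrape'
>>> response.css("div.quote span.text::text").getall()   # all quote texts on the page
```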
The `scrapy view` command (available both in the Scrapy shell and from the command line as a parameter to the `scrapy` command) opens a page in your browser as seen by Scrapy. This keeps you from being fooled by differences between how your browser GETs a page and how Scrapy sees it.
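Both forms look like this; the URL is again the demo site used above:

```
$ scrapy view "https://quotes.toscrape.com/"   # from the command line
>>> view(response)                             # from inside the scrapy shell
```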