2024-07-11
Scrapy is a Python framework for crawling websites and extracting structured data. Its architecture is built from the following core components:
1. Engine
– The engine is the core of Scrapy: it controls the data flow between all the other components and triggers events as actions occur. It manages the requests sent by Spiders and the responses received from the downloader, and routes the Items generated by Spiders to the Item Pipelines. The engine is the driving force behind Scrapy's operation.
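In practice the engine is never called directly; it is created and driven whenever a crawl starts. A minimal sketch of launching a crawl from a script, assuming an illustrative DemoSpider and URL (CrawlerProcess and its methods are Scrapy's real API):

import scrapy
from scrapy.crawler import CrawlerProcess

class DemoSpider(scrapy.Spider):
    # minimal spider so the engine has something to drive
    name = "demo"
    start_urls = ["https://example.com"]

    def parse(self, response):
        yield {"title": response.css("title::text").get()}

# CrawlerProcess builds the engine internally and runs the event loop
process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
process.crawl(DemoSpider)
process.start()  # blocks until the engine has drained all pending requests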
2. Scheduler
– The scheduler receives requests from the engine and enqueues them according to configurable policies (such as priority or crawl depth). When the engine needs a new request, the scheduler dequeues one and returns it to the engine, ensuring that requests are processed in an orderly fashion.
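Request ordering can be influenced from spider code. A brief sketch, assuming illustrative URLs; the priority keyword (higher values are dequeued earlier) and the DEPTH_PRIORITY setting are real Scrapy features:

import scrapy

class PrioritySpider(scrapy.Spider):
    name = "priority_demo"  # illustrative name
    # a positive DEPTH_PRIORITY lowers the priority of deeper pages,
    # nudging the scheduler toward breadth-first order
    custom_settings = {"DEPTH_PRIORITY": 1}

    def start_requests(self):
        # the scheduler dequeues higher-priority requests first
        yield scrapy.Request("https://example.com/important", priority=10)
        yield scrapy.Request("https://example.com/ordinary", priority=0)

    def parse(self, response):
        self.logger.info("fetched %s", response.url)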
3. Downloader
– The downloader fetches web page content for the requests it receives from the engine. It communicates with web servers over HTTP(S) and returns the downloaded page content to the engine as a response. The downloader is the component through which Scrapy actually obtains web page data.
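The downloader is configured through project settings rather than code. A sketch of commonly used, real Scrapy settings in a project's settings.py (the values shown are illustrative):

# settings.py (excerpt): throttling and limits applied by the downloader
DOWNLOAD_DELAY = 0.5                 # seconds between requests to the same site
CONCURRENT_REQUESTS = 16             # global cap on in-flight requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # per-domain cap
DOWNLOAD_TIMEOUT = 30                # give up on a response after 30 seconds
RETRY_ENABLED = True                 # retry failed downloads via the retry middleware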
4. Spiders
– Spiders are the components in which crawling logic and page parsing are defined. They generate the initial requests, process the responses returned by the downloader, and from those responses either extract the required data (Items) or generate new Requests for further crawling.
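A minimal spider sketch; quotes.toscrape.com is the public demo site used in Scrapy's own tutorial, and the CSS selectors assume its markup:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # extract one item per quote block on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # follow pagination: the new Request goes back to the scheduler
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)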
5. Item Pipelines
– Item Pipelines process the Items extracted by Spiders. They can perform tasks such as cleaning data, validating data integrity, and storing data in a database or file. By chaining multiple pipelines, data can be processed flexibly to meet different needs.
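A sketch of a cleaning-and-validation pipeline; the field name "text" and the module path in the comment are illustrative, while ItemAdapter, DropItem, and the process_item hook are Scrapy's actual API:

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

class CleanTextPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        text = adapter.get("text")  # "text" is an illustrative field name
        if not text:
            raise DropItem("missing text field")  # discard invalid items
        adapter["text"] = text.strip()  # simple cleaning step
        return item  # hand the item to the next pipeline in the chain

# Enable it in settings.py; the number controls order (lower runs first):
# ITEM_PIPELINES = {"myproject.pipelines.CleanTextPipeline": 300}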
6. Downloader Middlewares
– Downloader middleware sits between the Scrapy engine and the downloader and processes requests and responses as they pass through. It can modify requests (for example, adding request headers or setting up proxies) or responses (for example, handling compression or redirection) to control how Scrapy interacts with a website. Middleware is an important mechanism for extending Scrapy's functionality.
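A sketch of a downloader middleware that adds a header and a proxy; the class name, header value, and proxy address are illustrative, while process_request and its return convention are Scrapy's real hook:

class CustomHeadersProxyMiddleware:
    def process_request(self, request, spider):
        # add a request header before the downloader sends the request
        request.headers.setdefault("User-Agent", "my-crawler/1.0")
        # route the request through a proxy (address is illustrative)
        request.meta["proxy"] = "http://127.0.0.1:8080"
        return None  # None means: continue down the middleware chain

# Enable it in settings.py (module path and order value are illustrative):
# DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.CustomHeadersProxyMiddleware": 543}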
7. Spider Middlewares
– Spider middleware sits between the Scrapy engine and the Spiders and processes Spider input (i.e. responses) and output (i.e. Items and new Requests). It can modify or discard responses, handle exceptions, and modify or discard the Items and Requests generated by Spiders. Spider middleware provides a way to insert custom behavior into Spider execution.
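A sketch of a spider middleware that filters Spider output; process_spider_output is Scrapy's real hook, and the emptiness rule is an illustrative example:

import scrapy

class DropEmptyItemsMiddleware:
    def process_spider_output(self, response, result, spider):
        # result is the iterable of Items and Requests the Spider produced
        for obj in result:
            if isinstance(obj, dict) and not any(obj.values()):
                continue  # drop dict items whose fields are all empty
            yield obj  # pass Requests and valid Items through

# Enable it in settings.py (module path and order value are illustrative):
# SPIDER_MIDDLEWARES = {"myproject.middlewares.DropEmptyItemsMiddleware": 543}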
[Figure: Scrapy architecture diagram illustrating the data flow between the components]
Scrapy can be installed with pip:
pip install scrapy
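After installation, a project skeleton can be generated with Scrapy's command-line tools (the project and spider names below are illustrative):

scrapy startproject myproject
cd myproject
scrapy genspider quotes quotes.toscrape.com  # creates a spider stub
scrapy crawl quotes  # runs it: engine, scheduler, and downloader all kick in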