2024-07-12
Let me happily replay the lead in a beautiful story,
play the part of your old lover who once shared your dreams,
become the lover who sheds no tears for love,
and pretend to play the same role as before.
Though you would never know, I am still alone in the dead of night,
silently putting on your sweater to feel closer to you.
🎵 陈慧娴 (Priscilla Chan), 《傻女》 ("Silly Girl")
Scrapy is a powerful crawler framework. By using middleware, users can customize and extend the behavior of the crawler. Middleware provides a mechanism for pre-processing and post-processing requests and responses, allowing users to enhance the functionality of the crawler without modifying the core code.
In Scrapy, the order in which middlewares run is determined by their "priority". Understanding and correctly setting middleware priorities is essential to building efficient and maintainable crawlers.
Middleware is a type of hook in Scrapy that lets users run custom code while requests and responses are processed. Middleware falls into two categories: downloader middleware, which sits between the engine and the downloader, and spider middleware, which sits between the engine and the spider.
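As a rough sketch of what each category looks like (the class names are placeholders; the method signatures are Scrapy's standard middleware hooks):

class CustomDownloaderMiddleware:
    # Downloader middleware: sits between the engine and the downloader.
    def process_request(self, request, spider):
        return None  # None = continue processing the request normally

    def process_response(self, request, response, spider):
        return response  # pass the response on toward the engine

class CustomSpiderMiddleware:
    # Spider middleware: sits between the engine and the spider.
    def process_spider_input(self, response, spider):
        return None  # None = pass the response on to the spider

    def process_spider_output(self, response, result, spider):
        yield from result  # forward the spider's items and requests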
A middleware's priority determines the order in which it runs. Scrapy represents priority as an integer: the smaller the value, the closer the middleware is to the engine, and the earlier it runs on outgoing requests. Downloader middleware configuration looks like this:
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomDownloaderMiddleware': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 400,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 500,
}
In the above configuration:
CustomDownloaderMiddleware has a priority of 543
UserAgentMiddleware has a priority of 400
RetryMiddleware has a priority of 500
The execution order for outgoing requests is:
UserAgentMiddleware(400)
RetryMiddleware(500)
CustomDownloaderMiddleware(543)
When a request leaves the engine, it passes through the lower-priority middleware first and reaches the downloader last; when a response comes back from the downloader, it passes through the higher-priority middleware first and reaches the engine last.
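One way to see this ordering in practice is to log from the custom middleware and watch the crawl output. A minimal sketch, assuming the priorities configured above (the log messages are illustrative):

class CustomDownloaderMiddleware:
    def process_request(self, request, spider):
        # Priority 543: runs after UserAgentMiddleware (400) and
        # RetryMiddleware (500) have already seen the outgoing request.
        spider.logger.debug("request out: %s", request.url)
        return None  # None = keep processing the request normally

    def process_response(self, request, response, spider):
        # Responses travel in descending priority, so this runs
        # before RetryMiddleware (500) sees the response.
        spider.logger.debug("response in: %s %s", response.status, response.url)
        return response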
Spider middleware sits between the engine and the spider. Here is an example configuration:
SPIDER_MIDDLEWARES = {
    'myproject.middlewares.CustomSpiderMiddleware': 543,
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
}
In the above configuration:
CustomSpiderMiddleware has a priority of 543
HttpErrorMiddleware has a priority of 50
OffsiteMiddleware has a priority of 500
The execution order for responses passed to the spider is:
HttpErrorMiddleware(50)
OffsiteMiddleware(500)
CustomSpiderMiddleware(543)
When a response is handed from the engine to the spider, it passes through the lower-priority middleware first (process_spider_input); when the spider's output of items and requests travels back toward the engine, it passes through the higher-priority middleware first (process_spider_output).
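A matching sketch for the spider side, again assuming the priorities configured above (the method signatures are Scrapy's standard spider-middleware interface):

class CustomSpiderMiddleware:
    def process_spider_input(self, response, spider):
        # Priority 543: runs after HttpErrorMiddleware (50) and
        # OffsiteMiddleware (500) on responses headed to the spider.
        return None  # None = pass the response on to the spider

    def process_spider_output(self, response, result, spider):
        # Spider output travels back in descending priority, so this
        # middleware sees the yielded items and requests first.
        for item_or_request in result:
            yield item_or_request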
To set a middleware's priority, define the corresponding dictionary in the project's settings.py, mapping each middleware's import path to its priority. For example:
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomDownloaderMiddleware': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 400,
}

SPIDER_MIDDLEWARES = {
    'myproject.middlewares.CustomSpiderMiddleware': 543,
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
}
In this example, we define a downloader middleware, CustomDownloaderMiddleware, and a spider middleware, CustomSpiderMiddleware, and give both a priority of 543.
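The same dictionaries can also switch a middleware off: assigning None instead of a number disables it, including the built-in ones. For example:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomDownloaderMiddleware': 543,
    # None disables a middleware entirely, built-in or custom.
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}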
Scrapy ships with many built-in middlewares, each registered with a default priority (note that the example configurations above override some of these). Common downloader middlewares and their defaults:
UserAgentMiddleware: 500
RetryMiddleware: 550
RedirectMiddleware: 600
CookiesMiddleware: 700
For spider middleware, common ones include:
HttpErrorMiddleware: 50
OffsiteMiddleware: 500
RefererMiddleware: 700
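These defaults matter because your settings are merged with Scrapy's built-in DOWNLOADER_MIDDLEWARES_BASE and SPIDER_MIDDLEWARES_BASE dictionaries, so the number you pick positions your middleware relative to them. For instance, to run a hypothetical header-stamping middleware before UserAgentMiddleware touches outgoing requests, choose a value below 500:

DOWNLOADER_MIDDLEWARES = {
    # 450 < 500, so this runs before UserAgentMiddleware on requests
    # (and after it on responses).
    'myproject.middlewares.CustomHeadersMiddleware': 450,
}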
Middleware is a powerful feature of the Scrapy framework. By setting middleware priorities correctly, you can finely control how requests and responses are processed. Understanding the priority rules will help you build a more flexible and efficient crawler system.