【Scrapy】Scrapy middleware level setting rules

2024-07-12

Scrapy is a powerful crawler framework. By using middleware, users can customize and extend the behavior of the crawler. Middleware provides a mechanism for pre-processing and post-processing requests and responses, allowing users to enhance the functionality of the crawler without modifying the core code.

In Scrapy, the order in which middleware is executed is determined by an integer that this article calls the "level" (Scrapy's documentation calls it the order). Understanding and correctly setting the level of each middleware is essential to building efficient and maintainable crawlers.

What is middleware?

Middleware is a type of hook in Scrapy that allows users to execute custom code when processing requests and responses. Middleware is divided into two categories:

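Downloader Middleware: processes requests on their way from the engine to the downloader, and responses on their way back.
Spider Middleware: processes responses on their way from the engine to the spider, and the items and requests the spider produces on their way back.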
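
To make the idea of a hook concrete, here is a minimal sketch of a downloader middleware. The class name, the `handled_by` meta key, and the stand-in request and response objects are assumptions for illustration; only the `process_request` and `process_response` method names come from Scrapy's downloader middleware interface, and the exact signatures may vary between Scrapy versions.

```python
from types import SimpleNamespace


class CustomDownloaderMiddleware:
    """Hypothetical downloader middleware that tags every request it sees."""

    def process_request(self, request, spider):
        # Record that this middleware handled the request.
        request.meta["handled_by"] = "CustomDownloaderMiddleware"
        # Returning None tells Scrapy to keep processing the request normally.
        return None

    def process_response(self, request, response, spider):
        # Pass the response through unchanged.
        return response


# Stand-in objects so the sketch runs without Scrapy installed.
request = SimpleNamespace(meta={})
response = SimpleNamespace(status=200)

mw = CustomDownloaderMiddleware()
request_result = mw.process_request(request, spider=None)
returned = mw.process_response(request, response, spider=None)
print(request.meta)
```

In a real project the class would live in the project's middlewares module and receive Scrapy's own Request and Response objects instead of the stand-ins.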

Middleware Level

The level of a middleware determines when it is executed. Scrapy uses an integer value to represent the level. The smaller the value, the closer the middleware sits to the engine, so it runs earlier on the way in (toward the downloader or the spider) and later on the way back.
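
The ordering rule can be illustrated with a small sketch that sorts a settings-style dictionary by level. The middleware paths and numbers below are hypothetical, and the sort only mimics the rule described here; it is not Scrapy's internal code.

```python
# Hypothetical middleware paths and levels, for illustration only.
MIDDLEWARES = {
    "myproject.middlewares.Third": 900,
    "myproject.middlewares.First": 100,
    "myproject.middlewares.Second": 543,
}


def by_level(middlewares):
    """Return middleware paths sorted from the smallest level to the largest."""
    return sorted(middlewares, key=middlewares.get)


REQUEST_SIDE = by_level(MIDDLEWARES)
for path in REQUEST_SIDE:
    print(MIDDLEWARES[path], path)
```

The position of an entry in the dictionary does not matter; only the integer does.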

Downloader Middleware
The downloader middleware sits between Scrapy's downloader and the engine. Here's an example configuration:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomDownloaderMiddleware': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 400,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 500,
}

In the above configuration:

CustomDownloaderMiddleware has level 543
UserAgentMiddleware has level 400
RetryMiddleware has level 500

For outgoing requests, the execution order is as follows:

UserAgentMiddleware (400)
RetryMiddleware (500)
CustomDownloaderMiddleware (543)

When the engine sends a request, it passes through the middleware in ascending order of level (each middleware's process_request method) and finally reaches the downloader. When the downloader returns a response, it passes through the middleware in descending order of level (each middleware's process_response method) and finally reaches the engine.
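
The two directions can be traced with a small simulation. This is a sketch that does not use Scrapy: `TracingMiddleware` and `run_chain` are hypothetical helpers, and the method signatures are simplified (real downloader middleware methods receive more arguments). The levels are the ones from the configuration above.

```python
class TracingMiddleware:
    """Hypothetical stand-in that records when it sees a request or a response."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def process_request(self, request):
        self.log.append(f"request  -> {self.name}")

    def process_response(self, response):
        self.log.append(f"response <- {self.name}")
        return response


def run_chain(levels):
    """Simulate one request/response round trip through the middleware chain."""
    log = []
    ordered = sorted(levels.items(), key=lambda pair: pair[1])
    chain = [TracingMiddleware(name, log) for name, _ in ordered]
    for mw in chain:  # ascending level on the way to the downloader
        mw.process_request("request")
    log.append("downloader")
    for mw in reversed(chain):  # descending level on the way back to the engine
        mw.process_response("response")
    return log


LEVELS = {
    "CustomDownloaderMiddleware": 543,
    "UserAgentMiddleware": 400,
    "RetryMiddleware": 500,
}
TRACE = run_chain(LEVELS)
print("\n".join(TRACE))
```

The trace shows that the middleware with the largest level is the last to see the request and the first to see the response.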

Spider Middleware

The spider middleware sits between the engine and the spider. Here is an example configuration:

SPIDER_MIDDLEWARES = {
    'myproject.middlewares.CustomSpiderMiddleware': 543,
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
}

In the above configuration:

CustomSpiderMiddleware has level 543
HttpErrorMiddleware has level 50
OffsiteMiddleware has level 500

For responses on their way to the spider, the execution order is as follows:

HttpErrorMiddleware (50)
OffsiteMiddleware (500)
CustomSpiderMiddleware (543)

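When the engine passes a response to the spider, the response goes through the middleware in ascending order of level (each middleware's process_spider_input method) and finally reaches the spider. The items and requests that the spider returns travel back through the middleware in descending order of level (each middleware's process_spider_output method) and finally reach the engine.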
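
A spider middleware implements the two hooks for these two directions. The following is a minimal sketch in the synchronous form: the class body, the "title" field, and the sample results are assumptions for illustration, and the method signatures may vary between Scrapy versions.

```python
class CustomSpiderMiddleware:
    """Hypothetical spider middleware that drops items with an empty title."""

    def process_spider_input(self, response, spider):
        # Runs for each response on its way to the spider.
        # Returning None lets processing continue.
        return None

    def process_spider_output(self, response, result, spider):
        # Runs on whatever the spider produced for this response.
        for element in result:
            if isinstance(element, dict) and not element.get("title"):
                continue  # skip dict items that have no title
            yield element


# Stand-in spider results so the sketch runs without Scrapy installed.
spider_results = [{"title": "first"}, {"title": ""}, {"title": "second"}]

mw = CustomSpiderMiddleware()
input_result = mw.process_spider_input(response=None, spider=None)
kept = list(mw.process_spider_output(response=None, result=spider_results, spider=None))
print(kept)
```

Because process_spider_output runs in descending order of level, a filter like this at level 543 sees the spider's results before a middleware at level 50 does.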

How to set the level of middleware

To set the level of a middleware, define the corresponding dictionary in the Scrapy configuration file settings.py, mapping the import path of each middleware to its level. For example:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomDownloaderMiddleware': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 400,
}

SPIDER_MIDDLEWARES = {
    'myproject.middlewares.CustomSpiderMiddleware': 543,
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
}

In this example, we register a downloader middleware CustomDownloaderMiddleware and a spider middleware CustomSpiderMiddleware, each with level 543, alongside the built-in UserAgentMiddleware (400) and HttpErrorMiddleware (50).

Common middleware and their default levels

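Scrapy provides many built-in middlewares, each with a default level defined in DOWNLOADER_MIDDLEWARES_BASE or SPIDER_MIDDLEWARES_BASE. The defaults have changed between releases, so check the documentation for your version. In recent Scrapy releases, some common downloader middlewares and their default levels are:

UserAgentMiddleware: 500
RetryMiddleware: 550
RedirectMiddleware: 600
CookiesMiddleware: 700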

For spider middleware, common ones are:

HttpErrorMiddleware: 50
OffsiteMiddleware: 500
RefererMiddleware: 700
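
Project settings are combined with these defaults rather than replacing them, and assigning None to a built-in middleware disables it. The sketch below imitates that behavior in plain Python under stated assumptions: `effective_order` is a hypothetical helper, not a Scrapy function, and the base dictionary is only the subset of spider middleware defaults listed above (the real SPIDER_MIDDLEWARES_BASE depends on your Scrapy version).

```python
# Subset of built-in spider middleware defaults, copied from the list above.
SPIDER_MIDDLEWARES_BASE = {
    "scrapy.spidermiddlewares.httperror.HttpErrorMiddleware": 50,
    "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": 500,
    "scrapy.spidermiddlewares.referer.RefererMiddleware": 700,
}

# Project settings: add a custom middleware and disable a built-in one.
SPIDER_MIDDLEWARES = {
    "myproject.middlewares.CustomSpiderMiddleware": 543,
    "scrapy.spidermiddlewares.referer.RefererMiddleware": None,  # None disables it
}


def effective_order(base, custom):
    """Merge defaults with project settings, drop disabled entries, sort by level."""
    merged = {**base, **custom}  # project values override the defaults
    enabled = {path: level for path, level in merged.items() if level is not None}
    return sorted(enabled, key=enabled.get)


ORDER = effective_order(SPIDER_MIDDLEWARES_BASE, SPIDER_MIDDLEWARES)
for path in ORDER:
    print(path)
```

This is why a custom level such as 543 should be chosen with the built-in levels in mind: it places the custom middleware after OffsiteMiddleware (500) and before RefererMiddleware (700) in the merged list.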

Conclusion

Middleware is a powerful feature of the Scrapy framework. By correctly setting the level of the middleware, you can finely control the processing of requests and responses. Understanding and using the level setting rules of the middleware will help build a more flexible and efficient crawler system.