Technology Sharing

【Crawler】Crawler Basics

2024-07-12



1. HTTP Request and Response

HTTP is a client-server protocol, with the client and the server as the two parties in communication. The client sends an HTTP request, and the server receives and processes the request and returns an HTTP response.

1. HTTP Request

An HTTP request consists of a request line, a request header, a blank line, and request data (such as form data in a POST request).

  • The request line contains the request method, the requested URL, and the protocol version. Common request methods include GET, POST, PUT, DELETE, etc.
  • The request header contains other information about the client and the request, such as User-Agent, Accept, Content-Type, etc.
  • A blank line is used to separate the request header and the request data.
  • Request data is usually used for POST requests and contains the submitted data.
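Joined with CRLF line breaks, these four parts form the exact bytes sent over the wire. As a rough sketch (using the same made-up host and data as the example below), a raw request can be assembled by hand:

```python
import json

# Request data (the same hypothetical JSON as in the example)
body = json.dumps({"name": "John", "age": 30})

request_line = "POST /api/users HTTP/1.1"
header_lines = [
    "Host: www.example.com",
    "Content-Type: application/json",
    f"Content-Length: {len(body)}",
]

# Request line + headers + blank line + request data, joined with CRLF
raw_request = "\r\n".join([request_line, *header_lines, "", body])
print(raw_request)
```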

Request example:

POST /api/users HTTP/1.1
Host: www.example.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36
Accept: application/json
Content-Type: application/json
Content-Length: 27

{
  "name": "John",
  "age": 30
}

Request Line: POST /api/users HTTP/1.1

Request Headers: Host, User-Agent, Accept, Content-Type, Content-Length, etc.

Blank line: separates the request headers from the request body

Request Body: the JSON data

2. HTTP Response

An HTTP response consists of a status line, a response header, a blank line, and response data.

  • The status line contains the protocol version, status code, and status message. The status code indicates the processing result of the request, such as 200 for success, 404 for resource not found, and 500 for internal server error.
  • The response header contains other information about the server and the response, such as Server, Content-Type, Content-Length, etc.
  • A blank line is used to separate the response header from the response data.
  • The response data contains the data returned by the server, such as HTML, JSON, etc.
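Conversely, a raw response splits cleanly at the first blank line. A minimal sketch, using a made-up response string for illustration:

```python
# A minimal raw response (invented for illustration)
raw_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
    "\r\n"
    "<h1>Hello, World!</h1>"
)

# Everything before the first blank line is the status line and headers
head, body = raw_response.split("\r\n\r\n", 1)
status_line, *header_lines = head.split("\r\n")

# The status line has three space-separated parts
version, code, message = status_line.split(" ", 2)
print(version, code, message)
print(body)
```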

Assuming the server returns a simple HTML page, the response might be as follows:

HTTP/1.1 200 OK
Date: Sun, 02 Jun 2024 10:20:30 GMT
Server: Apache/2.4.41 (Ubuntu)
Content-Type: text/html; charset=UTF-8
Content-Length: 137
Connection: keep-alive

<!DOCTYPE html>
<html>
<head>
    <title>Example Page</title>
</head>
<body>
    <h1>Hello, World!</h1>
    <p>This is a sample HTML page.</p>
</body>
</html>

Status Line: HTTP/1.1 200 OK

Response Headers: Date, Server, Content-Type, Content-Length, Connection, etc.

Blank line: separates the response headers from the response body

Response Body: the HTML code

3. Status Code

The HTTP status code indicates the result of the server's processing of the request. Common status codes include:

  • 1xx: Informational response, indicating that the request has been received and continues to be processed.
  • 2xx: Success, indicating that the request has been successfully received, understood, and accepted by the server.
  • 3xx: Redirect, indicating that further action is required to complete the request.
  • 4xx: Client Error, indicating an error on the client's side, such as a malformed request or a nonexistent resource.
  • 5xx: Server Error, indicating that an error occurred on the server while processing the request.
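Since the category is just the hundreds digit of the code, a small helper can classify any status code. A minimal sketch:

```python
def status_category(code: int) -> str:
    """Map an HTTP status code to its category name."""
    categories = {
        1: "Informational",
        2: "Success",
        3: "Redirect",
        4: "Client Error",
        5: "Server Error",
    }
    # The hundreds digit selects the category
    return categories.get(code // 100, "Unknown")

print(status_category(200))  # Success
print(status_category(404))  # Client Error
print(status_category(500))  # Server Error
```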


2. Requests Library

Python's Requests library is a powerful and easy-to-use HTTP library.

Before using it, you need to install the Requests library: pip install requests

1. Initiate a GET request

The GET request is used to request data from the server. Using the Requests library to make a GET request is very simple:

import requests
# Initiate a GET request
response = requests.get('https://news.baidu.com')
# Check the response status code
if response.status_code == 200:
    # Print the response content
    print(response.text)
else:
    print(f"Request failed with status code: {response.status_code}")

2. Initiate a POST request

POST requests are used to submit data to the server. For example, websites that require login usually use POST requests to submit usernames and passwords. The method of initiating a POST request using the Requests library is as follows:

import requests

# Define the data to send
data = {
    'username': '123123123',
    'password': '1231231312'
}

# Initiate a POST request
response = requests.post('https://passport.bilibili.com/x/passport-login/web/login', data=data)

# Check the response status code
if response.status_code == 200:
    # Print the response content
    print(response.text)
else:
    print(f"Request failed with status code: {response.status_code}")

3. Setting request headers

Some websites (such as Douban) have anti-crawling mechanisms that block crawlers, so it is necessary to set HTTP request headers to disguise the request as coming from a browser.

import requests

response = requests.get("https://movie.douban.com/top250")
if response.ok:
    print(response.text)
else:
    print("Request failed: " + str(response.status_code))

For example, in the code above no request header is set, so Douban denies us access.

We can open any website in the browser, find a ready-made User-Agent (for example, in the network panel of the browser's developer tools), and put it in our request header.

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36 Edg/125.0.0.0"
}

response = requests.get("https://movie.douban.com/top250", headers=headers)
print(response.text)


In this way, you can access Douban and obtain the content of the webpage.

3. BeautifulSoup Library

BeautifulSoup is a Python library for parsing HTML and XML documents, and is particularly useful for extracting data from web pages.

Before using it, you need to install the BeautifulSoup library: pip install beautifulsoup4

1. Parsing HTML documents

html.parser is a parser built into Python and is suitable for most scenarios. Take the Douban example above:

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36 Edg/125.0.0.0"
}

response = requests.get("https://movie.douban.com/top250", headers=headers)
html = response.text
# Use html.parser to parse the HTML content
soup = BeautifulSoup(html, "html.parser")

2. Find and extract data

BeautifulSoup provides a variety of methods to find and extract data from HTML documents.

BeautifulSoup common methods:

  • find(tag, attributes): Find the first tag that matches the criteria.
  • find_all(tag, attributes): Find all tags that match the criteria.
  • select(css_selector): Use CSS selectors to find tags that match the criteria.
  • get_text(): Get the text content within the tag.
  • attrs: Get the attribute dictionary of a tag.
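The methods above can be tried on a tiny, self-contained snippet; the HTML, tags, and classes here are invented for illustration:

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration
html = """
<div>
  <span class="title">First</span>
  <span class="title">Second</span>
  <a href="https://example.com">link</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.find("span", class_="title").get_text())  # first matching tag
print(len(soup.find_all("span", class_="title")))    # all matching tags
print(soup.select("span.title")[1].get_text())       # CSS selector
print(soup.find("a").attrs["href"])                  # attribute dictionary
```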

I. Finding a single element

The find method is used to find the first element that meets the conditions. For example, to find the first title on the page:

title = soup.find("span", class_="title")
print(title.string)

II. Finding all elements

The find_all method is used to find all elements that meet the conditions. For example, to find all titles on the page:

all_titles = soup.find_all("span", class_="title")
for title in all_titles:
    print(title.string)

III. Using CSS Selectors

The select method allows you to use CSS selectors to find elements. For example, to find all titles:

all_titles = soup.select("span.title")
for title in all_titles:
    print(title.get_text())

IV. Get element attributes

The attrs attribute returns an element's attribute dictionary; a single attribute can also be read with subscript syntax. For example, to get the URLs of all images:

all_images = soup.find_all("img")
for img in all_images:
    print(img['src'])

4. Crawling the Douban Movie List


Movie title: a span tag whose class attribute is title


Rating: a span tag whose class attribute is rating_num

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36 Edg/125.0.0.0"
}

response = requests.get("https://movie.douban.com/top250", headers=headers)
html = response.text
soup = BeautifulSoup(html, "html.parser")

# Get all movie entries
all_movies = soup.find_all("div", class_="item")

for movie in all_movies:
    # Get the movie title (alternate titles contain "/", so skip them)
    titles = movie.find_all("span", class_="title")
    for title in titles:
        title_string = title.get_text()
        if "/" not in title_string:
            movie_title = title_string

    # Get the movie rating
    rating_num = movie.find("span", class_="rating_num").get_text()

    # Print the movie title and rating
    print(f"Movie: {movie_title}, Rating: {rating_num}")


The crawl succeeded, but only the first page was fetched; the remaining pages were not crawled. Analyzing the URL shows that the list is paginated through the start query parameter.
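The Top 250 list spans 10 pages of 25 entries each, so start takes the values 0, 25, ..., 225. The page URLs can be generated like this:

```python
# start parameter: 0, 25, 50, ..., 225 (10 pages of 25 movies each)
urls = [
    f"https://movie.douban.com/top250?start={n}"
    for n in range(0, 250, 25)
]
print(len(urls))   # 10
print(urls[0])     # https://movie.douban.com/top250?start=0
print(urls[-1])    # https://movie.douban.com/top250?start=225
```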


import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36 Edg/125.0.0.0"
}

for start_num in range(0, 250, 25):
    response = requests.get(f"https://movie.douban.com/top250?start={start_num}", headers=headers)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")

    # Get all movie entries
    all_movies = soup.find_all("div", class_="item")

    for movie in all_movies:
        # Get the movie title (alternate titles contain "/", so skip them)
        titles = movie.find_all("span", class_="title")
        for title in titles:
            title_string = title.get_text()
            if "/" not in title_string:
                movie_title = title_string

        # Get the movie rating
        rating_num = movie.find("span", class_="rating_num").get_text()

        # Print the movie title and rating
        print(f"Movie: {movie_title}, Rating: {rating_num}")