2024-07-12
HTTP is a client-server protocol: the client sends an HTTP request, and the server receives the request, processes it, and returns an HTTP response.
An HTTP request consists of a request line, a request header, a blank line, and request data (such as form data in a POST request).
Request example:
POST /api/users HTTP/1.1
Host: www.example.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36
Accept: application/json
Content-Type: application/json
Content-Length: 27
{
"name": "John",
"age": 30
}
Request line: POST /api/users HTTP/1.1
Request headers: Host, User-Agent, Accept, Content-Type, Content-Length, etc.
Blank line: separates the request headers from the request body
Request body: the JSON data
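The four parts above can be assembled by hand into a raw request string. A minimal sketch (using the example host, path, and body from above; real clients do this for you):

```python
# Assemble the example HTTP request from its four parts:
# request line, headers, blank line, body.
body = '{"name": "John", "age": 30}'

request_line = "POST /api/users HTTP/1.1"
headers = [
    "Host: www.example.com",
    "Content-Type: application/json",
    f"Content-Length: {len(body)}",  # length of the body in bytes
]

# A blank line (\r\n\r\n) separates the headers from the body.
raw_request = request_line + "\r\n" + "\r\n".join(headers) + "\r\n\r\n" + body
print(raw_request)
```

Note that Content-Length is computed from the body, which is why it comes out as 27 for this example.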
An HTTP response consists of a status line, a response header, a blank line, and response data.
Assuming the server returns a simple HTML page, the response might be as follows:
HTTP/1.1 200 OK
Date: Sun, 02 Jun 2024 10:20:30 GMT
Server: Apache/2.4.41 (Ubuntu)
Content-Type: text/html; charset=UTF-8
Content-Length: 137
Connection: keep-alive
<!DOCTYPE html>
<html>
<head>
<title>Example Page</title>
</head>
<body>
<h1>Hello, World!</h1>
<p>This is a sample HTML page.</p>
</body>
</html>
Status line: HTTP/1.1 200 OK
Response headers: Date, Server, Content-Type, Content-Length, Connection, etc.
Blank line: separates the response headers from the response body
Response body: the HTML code
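The blank line is what lets a client split a raw response back into its parts. A small sketch, using a shortened version of the response above:

```python
# Split a raw HTTP response into its three parts:
# status line, headers, and body.
raw_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
    "Connection: keep-alive\r\n"
    "\r\n"
    "<h1>Hello, World!</h1>"
)

# Everything before the first blank line is the head; the rest is the body.
head, _, body = raw_response.partition("\r\n\r\n")
lines = head.split("\r\n")
status_line = lines[0]
headers = dict(line.split(": ", 1) for line in lines[1:])

print(status_line)              # HTTP/1.1 200 OK
print(headers["Content-Type"])  # text/html; charset=UTF-8
print(body)                     # <h1>Hello, World!</h1>
```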
The HTTP status code indicates the result of the server's processing of the request. Common status codes include 200 (OK), 301 (Moved Permanently), 403 (Forbidden), 404 (Not Found), and 500 (Internal Server Error).
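Status codes are grouped by their first digit: 1xx informational, 2xx success, 3xx redirection, 4xx client error, 5xx server error. A small helper (hypothetical, for illustration) that maps a code to its category:

```python
# Map an HTTP status code to its category by its first digit.
CATEGORIES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}

def describe(status_code: int) -> str:
    # Integer division by 100 yields the leading digit (200 -> 2, 404 -> 4).
    return CATEGORIES.get(status_code // 100, "Unknown")

print(describe(200))  # Success
print(describe(404))  # Client Error
print(describe(500))  # Server Error
```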
Python's Requests library is a powerful and easy-to-use HTTP library.
Before using it, install it: pip install requests
The GET request is used to request data from the server. Using the Requests library to make a GET request is very simple:
import requests

# Make a GET request
response = requests.get('https://news.baidu.com')
# Check the response status code
if response.status_code == 200:
    # Print the response content
    print(response.text)
else:
    print(f"Request failed, status code: {response.status_code}")
POST requests are used to submit data to the server. For example, websites that require login usually submit the username and password via a POST request. Making a POST request with the Requests library looks like this:
import requests

# Define the data to send
data = {
    'username': '123123123',
    'password': '1231231312'
}
# Make a POST request
response = requests.post('https://passport.bilibili.com/x/passport-login/web/login', data=data)
# Check the response status code
if response.status_code == 200:
    # Print the response content
    print(response.text)
else:
    print(f"Request failed, status code: {response.status_code}")
Some websites (such as Douban) have anti-crawling mechanisms that block crawlers, so you need to set HTTP request headers to disguise your request as coming from a browser.
import requests

response = requests.get("https://movie.douban.com/top250")
if response.ok:
    print(response.text)
else:
    print("Request failed: " + str(response.status_code))
In the code above, no request header is set, so Douban denies us access.
You can open your browser's developer tools on any website, copy a ready-made User-Agent string, and put it in the request headers.
import requests
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36 Edg/125.0.0.0"
}
response = requests.get("https://movie.douban.com/top250", headers=headers)
print(response.text)
In this way, you can access Douban and obtain the content of the webpage.
BeautifulSoup is a Python library for parsing HTML and XML documents, and is particularly useful for extracting data from web pages.
Before using it, you need to install the BeautifulSoup library: pip install beautifulsoup4
html.parser is a parser built into Python and is suitable for most scenarios. Taking the Douban example above:
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36 Edg/125.0.0.0"
}
response = requests.get("https://movie.douban.com/top250", headers=headers)
html = response.text
# Use html.parser to parse the HTML content
soup = BeautifulSoup(html, "html.parser")
BeautifulSoup provides a variety of methods to find and extract data from HTML documents.
BeautifulSoup common methods:
find(tag, attributes): find the first tag that matches the criteria.
find_all(tag, attributes): find all tags that match the criteria.
select(css_selector): use a CSS selector to find matching tags.
get_text(): get the text content within a tag.
attrs: get the attribute dictionary of a tag.
The find method is used to find the first element that meets the conditions. For example, to find the first title on the page:
title = soup.find("span", class_="title")
print(title.string)
The find_all method is used to find all elements that meet the conditions. For example, to find all titles on the page:
all_titles = soup.find_all("span", class_="title")
for title in all_titles:
    print(title.string)
The select method lets you use CSS selectors to find elements. For example, to find all titles:
all_titles = soup.select("span.title")
for title in all_titles:
    print(title.get_text())
You can use the attrs attribute to get an element's attribute dictionary. For example, to get the URLs of all images:
all_images = soup.find_all("img")
for img in all_images:
    print(img['src'])
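Since the live pages above require network access, the same idea can be tried on an inline HTML snippet. Note that attrs returns a plain dict, and img.get('src') is a safer lookup than img['src'] when an attribute might be missing:

```python
from bs4 import BeautifulSoup

# A small inline document, so no network request is needed.
html = '<img src="/a.png" alt="first"><img src="/b.png">'
soup = BeautifulSoup(html, "html.parser")

for img in soup.find_all("img"):
    print(img.attrs)       # the tag's full attribute dictionary
    print(img.get("src"))  # returns None instead of raising if the attribute is absent
```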
Movie title: the HTML tag is span, with the class attribute title
Rating: the HTML tag is span, with the class attribute rating_num
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36 Edg/125.0.0.0"
}
response = requests.get("https://movie.douban.com/top250", headers=headers)
html = response.text
soup = BeautifulSoup(html, "html.parser")
# Get all movie entries
all_movies = soup.find_all("div", class_="item")
for movie in all_movies:
    # Get the movie title
    titles = movie.find_all("span", class_="title")
    for title in titles:
        title_string = title.get_text()
        if "/" not in title_string:
            movie_title = title_string
    # Get the movie rating
    rating_num = movie.find("span", class_="rating_num").get_text()
    # Print the movie title and rating
    print(f"Movie: {movie_title}, Rating: {rating_num}")
The crawl succeeded, but only the first page was fetched. Analyzing the URL shows that paging is controlled by the start query parameter, so we can loop over it to fetch every page.
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36 Edg/125.0.0.0"
}
for start_num in range(0, 250, 25):
    response = requests.get(f"https://movie.douban.com/top250?start={start_num}", headers=headers)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    # Get all movie entries
    all_movies = soup.find_all("div", class_="item")
    for movie in all_movies:
        # Get the movie title
        titles = movie.find_all("span", class_="title")
        for title in titles:
            title_string = title.get_text()
            if "/" not in title_string:
                movie_title = title_string
        # Get the movie rating
        rating_num = movie.find("span", class_="rating_num").get_text()
        # Print the movie title and rating
        print(f"Movie: {movie_title}, Rating: {rating_num}")