2024-07-11
A web crawler is a program that automatically retrieves web page content. It simulates a user browsing the web: it fetches a page's source code by sending HTTP requests, then parses and extracts the required data.
The crawler sends an HTTP request to the target website, consisting of a URL, a request method (such as GET or POST), request headers, and so on. After receiving the request, the server returns an HTTP response, which contains a status code, response headers, and a response body (the web page content).
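As a quick sketch of those request components, the requests library can build a request without sending it (the URL and header values below are illustrative placeholders, not from the cases that follow):

```python
import requests

# Build (but do not send) a GET request to inspect its parts.
req = requests.Request(
    'GET',
    'https://example.com/page',
    headers={'User-Agent': 'my-crawler/1.0'},
)
prepared = req.prepare()

print(prepared.method)                 # GET
print(prepared.url)                    # https://example.com/page
print(prepared.headers['User-Agent'])  # my-crawler/1.0

# Actually sending it would return a response with the fields described above:
#   resp = requests.Session().send(prepared)
#   resp.status_code  -> e.g. 200
#   resp.headers      -> response headers
#   resp.text         -> response body (the page source)
```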
Commonly used Python crawler tools include:
- requests and aiohttp, used to send HTTP requests.
- BeautifulSoup, lxml, and PyQuery, used to parse web page content.
- pandas and SQLite, used to store crawled data.
- asyncio and aiohttp, used to implement asynchronous crawlers and improve crawling efficiency.

Next, we will use 7 small Python crawler cases to help you better learn and understand the basics of Python crawlers. Below is a brief introduction and the source code for each case:
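Before the cases, here is a minimal sketch of the asynchronous idea behind asyncio and aiohttp: coroutines let many fetches run concurrently instead of one after another. The real aiohttp call appears only as a comment; the fetch below simulates network latency with asyncio.sleep, and the URLs are placeholders:

```python
import asyncio

# Simulated fetch: a real crawler would use aiohttp here, e.g.
#   async with session.get(url) as resp:
#       return await resp.text()
async def fetch(url, delay):
    await asyncio.sleep(delay)  # stands in for network latency
    return f'<html>content of {url}</html>'

async def crawl(urls):
    # Launch all fetches concurrently; total time is roughly the
    # slowest single fetch, not the sum of all of them.
    tasks = [fetch(u, 0.01) for u in urls]
    return await asyncio.gather(*tasks)

pages = asyncio.run(crawl(['https://example.com/a', 'https://example.com/b']))
print(len(pages))  # 2
```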
This case uses the BeautifulSoup library to crawl the Douban Top 250 movie titles, ratings, and numbers of reviewers, and saves this information to a CSV file.
```python
import requests
from bs4 import BeautifulSoup
import csv

# Request URL
url = 'https://movie.douban.com/top250'
# Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

# Parse one results page and write a row per movie
def parse_html(html):
    soup = BeautifulSoup(html, 'lxml')
    movie_list = soup.find('ol', class_='grid_view').find_all('li')
    for movie in movie_list:
        title = movie.find('div', class_='hd').find('span', class_='title').get_text()
        rating_num = movie.find('div', class_='star').find('span', class_='rating_num').get_text()
        comment_num = movie.find('div', class_='star').find_all('span')[-1].get_text()
        writer.writerow([title, rating_num, comment_num])

# Fetch all pages and save the data
def save_data():
    f = open('douban_movie_top250.csv', 'a', newline='', encoding='utf-8-sig')
    global writer
    writer = csv.writer(f)
    writer.writerow(['Movie title', 'Rating', 'Number of reviewers'])
    for i in range(10):
        # The list is paginated 25 movies at a time via the start parameter
        page_url = url + '?start=' + str(i * 25)
        response = requests.get(page_url, headers=headers)
        parse_html(response.text)
    f.close()

if __name__ == '__main__':
    save_data()
```
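The tools list above also mentioned pandas for working with stored data. As a sketch, once a CSV in this shape exists it can be loaded back and sorted; the file name and sample rows below are made up for illustration, not real crawl output:

```python
import csv
import pandas as pd

# Write a tiny sample file in the same shape the crawler produces
# (file name and rows are illustrative placeholders).
with open('sample_movies.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'rating', 'reviewers'])
    writer.writerow(['Movie A', '9.7', '1200'])
    writer.writerow(['Movie B', '9.4', '800'])

# Load it back and sort by rating, highest first
df = pd.read_csv('sample_movies.csv')
df = df.sort_values('rating', ascending=False)
print(df.iloc[0]['title'])  # Movie A
```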