
Python crawler principle and 3 small cases (source code)

2024-07-11


1. Crawler Principle

A web crawler is a program used to automatically obtain web page content. It simulates the process of users browsing web pages, obtains the source code of web pages by sending HTTP requests, and uses parsing and extraction techniques to obtain the required data.

1. HTTP request and response process

The crawler sends an HTTP request to the target website, which includes the URL, request method (such as GET or POST), request headers, etc. After receiving the request, the server returns an HTTP response, which includes a status code, response header, and response body (web page content).
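To make this cycle concrete, here is a minimal sketch using the requests library; the URL and the User-Agent value are just placeholders.

import requests

# Placeholder target page and request headers.
url = 'https://movie.douban.com/top250'
headers = {'User-Agent': 'Mozilla/5.0'}

# Send a GET request and receive the HTTP response.
response = requests.get(url, headers=headers, timeout=10)

# The three parts of the response: status code, headers, and body.
print(response.status_code)                  # e.g. 200
print(response.headers.get('Content-Type'))  # one of the response headers
print(response.text[:200])                   # start of the response body (page source)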

2. Common crawler technologies

  • Request libraries: for example requests and aiohttp, used to send HTTP requests.
  • Parsing libraries: for example BeautifulSoup, lxml, and PyQuery, used to parse web page content.
  • Storage libraries: for example pandas and SQLite, used to store the crawled data.
  • Asynchronous libraries: for example asyncio and aiohttp, used to build asynchronous crawlers and improve crawling efficiency (see the sketch after this list).
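To give a feel for the last point, here is a minimal asynchronous sketch that fetches several pages concurrently with asyncio and aiohttp. The URLs are made-up placeholders, not part of any case in this article.

import asyncio
import aiohttp

# Placeholder URLs; replace with the pages you actually want to crawl.
URLS = [
    'https://example.com/page/1',
    'https://example.com/page/2',
    'https://example.com/page/3',
]

async def fetch(session, url):
    # Each fetch yields control while waiting on the network,
    # so many requests can be in flight at the same time.
    async with session.get(url) as response:
        return await response.text()

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in URLS]
        pages = await asyncio.gather(*tasks)
        for url, html in zip(URLS, pages):
            print(url, len(html))

asyncio.run(main())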

2. Commonly used libraries for Python crawlers

1. Request Library

  • requests: A concise and powerful HTTP library that supports keep-alive connections and connection pooling, SSL certificate verification, cookies, and more (see the session sketch after this list).
  • aiohttp: An asynchronous HTTP library based on asyncio, suitable for high-concurrency crawler scenarios.
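As a small illustration of the first point, requests.Session reuses the underlying connection pool and carries cookies across requests automatically. The example.com URLs below are placeholders.

import requests

# A session keeps connections alive and remembers cookies between requests.
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})

# Cookies set by this response are stored in the session...
resp = session.get('https://example.com/login-page', timeout=10)
print(session.cookies.get_dict())

# ...and sent automatically with every later request in the same session.
resp = session.get('https://example.com/data', timeout=10)
print(resp.status_code)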

2. Parsing Library

  • BeautifulSoup: A library for parsing HTML and XML, simple and easy to use, supporting multiple parsers.
  • lxml: An efficient XML and HTML parsing library that supports XPath and CSS selectors.
  • PyQuery: A jQuery-like library for Python; its selector syntax closely mirrors jQuery's and it is easy to pick up (a short parsing comparison follows this list).
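To show the difference in style, the sketch below extracts the same movie titles from a hard-coded HTML snippet twice, once with BeautifulSoup and once with an lxml XPath query. The snippet is made up for illustration.

from bs4 import BeautifulSoup
from lxml import etree

# A tiny HTML fragment in the style of a movie list page.
html = '''
<ol class="grid_view">
  <li><span class="title">Movie A</span></li>
  <li><span class="title">Movie B</span></li>
</ol>
'''

# BeautifulSoup: navigate by tag name and CSS class.
soup = BeautifulSoup(html, 'lxml')
print([span.get_text() for span in soup.find_all('span', class_='title')])

# lxml: the same extraction expressed as an XPath query.
tree = etree.HTML(html)
print(tree.xpath('//span[@class="title"]/text()'))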

3. Storage Library

  • pandas: A powerful data analysis library that provides data structures and data analysis tools and supports multiple file formats.
  • SQLite: A lightweight, file-based database that supports SQL queries and is suitable for small crawler projects (a short storage sketch follows this list).
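The sketch below stores a couple of made-up records both ways: first as a CSV file via pandas, then in an SQLite database through Python's built-in sqlite3 module. The file names and column names are arbitrary.

import sqlite3
import pandas as pd

# A few fabricated rows standing in for crawled data.
rows = [
    {'title': 'Movie A', 'rating': 9.7, 'comments': 3000000},
    {'title': 'Movie B', 'rating': 9.6, 'comments': 2000000},
]
df = pd.DataFrame(rows)

# pandas: write the records to a CSV file (Excel, JSON, etc. work similarly).
df.to_csv('movies.csv', index=False, encoding='utf-8-sig')

# SQLite: store the same records in a lightweight on-disk database.
conn = sqlite3.connect('movies.db')
df.to_sql('movies', conn, if_exists='replace', index=False)
print(pd.read_sql('SELECT title, rating FROM movies', conn))
conn.close()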

Next, we will walk through 3 small Python crawler cases to help you better learn and understand the basics of Python crawling. Below is a brief introduction to each case along with its source code:

Case 1: Crawling Douban Movie Top 250

This case uses the BeautifulSoup library to crawl the titles, ratings, and number of reviewers of the Douban Top 250 movies, and saves the information to a CSV file.

import requests
from bs4 import BeautifulSoup
import csv

# Request URL
url = 'https://movie.douban.com/top250'
# Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

# Parse one page of results and write each movie as a CSV row
def parse_html(html):
    soup = BeautifulSoup(html, 'lxml')
    movie_list = soup.find('ol', class_='grid_view').find_all('li')
    for movie in movie_list:
        title = movie.find('div', class_='hd').find('span', class_='title').get_text()
        rating_num = movie.find('div', class_='star').find('span', class_='rating_num').get_text()
        comment_num = movie.find('div', class_='star').find_all('span')[-1].get_text()
        writer.writerow([title, rating_num, comment_num])

# Fetch all pages and save the data to a CSV file
def save_data():
    f = open('douban_movie_top250.csv', 'a', newline='', encoding='utf-8-sig')
    global writer
    writer = csv.writer(f)
    writer.writerow(['电影名称', '评分', '评价人数'])  # header: title, rating, number of reviewers
    for i in range(10):
        # Each page shows 25 movies; the start parameter selects the page
        url = 'https://movie.douban.com/top250?start=' + str(i * 25)