
Python crawler principle and 3 small cases (source code)

2024-07-11


1. Crawler Principle

A web crawler is a program used to automatically obtain web page content. It simulates the process of users browsing web pages, obtains the source code of web pages by sending HTTP requests, and uses parsing and extraction techniques to obtain the required data.

1. HTTP request and response process

The crawler sends an HTTP request to the target website, which includes the URL, request method (such as GET or POST), request headers, etc. After receiving the request, the server returns an HTTP response, which includes a status code, response header, and response body (web page content).
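To make this cycle concrete, here is a minimal sketch using the requests library; the URL and the User-Agent value are just placeholders.

import requests

# Placeholder target page and request headers.
url = 'https://movie.douban.com/top250'
headers = {'User-Agent': 'Mozilla/5.0'}

# Send a GET request and receive the HTTP response.
response = requests.get(url, headers=headers, timeout=10)

# The three parts of the response: status code, headers, and body.
print(response.status_code)                  # e.g. 200
print(response.headers.get('Content-Type'))  # one of the response headers
print(response.text[:200])                   # start of the response body (page source)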

2. Common crawler technologies

  • Request libraries: for example requests and aiohttp, used to send HTTP requests.
  • Parsing libraries: for example BeautifulSoup, lxml, and PyQuery, used to parse web page content.
  • Storage libraries: for example pandas and SQLite, used to store the crawled data.
  • Asynchronous libraries: for example asyncio and aiohttp, used to build asynchronous crawlers and improve crawling efficiency (see the sketch after this list).
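To give a feel for the last point, here is a minimal asynchronous sketch that fetches several pages concurrently with asyncio and aiohttp. The URLs are made-up placeholders, not part of any case in this article.

import asyncio
import aiohttp

# Placeholder URLs; replace with the pages you actually want to crawl.
URLS = [
    'https://example.com/page/1',
    'https://example.com/page/2',
    'https://example.com/page/3',
]

async def fetch(session, url):
    # Each fetch yields control while waiting on the network,
    # so many requests can be in flight at the same time.
    async with session.get(url) as response:
        return await response.text()

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in URLS]
        pages = await asyncio.gather(*tasks)
        for url, html in zip(URLS, pages):
            print(url, len(html))

asyncio.run(main())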

2. Commonly used libraries for Python crawlers

1. Request Library

  • requests: A concise and powerful HTTP library that supports keep-alive connections and connection pooling, SSL certificate verification, cookies, and more (see the session sketch after this list).
  • aiohttp: An asynchronous HTTP library based on asyncio, suitable for high-concurrency crawler scenarios.
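As a small illustration of the first point, requests.Session reuses the underlying connection pool and carries cookies across requests automatically. The example.com URLs below are placeholders.

import requests

# A session keeps connections alive and remembers cookies between requests.
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})

# Cookies set by this response are stored in the session...
resp = session.get('https://example.com/login-page', timeout=10)
print(session.cookies.get_dict())

# ...and sent automatically with every later request in the same session.
resp = session.get('https://example.com/data', timeout=10)
print(resp.status_code)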

2. Parsing Library

  • BeautifulSoup: A library for parsing HTML and XML, simple and easy to use, supporting multiple parsers.
  • lxml: An efficient XML and HTML parsing library that supports XPath and CSS selectors.
  • PyQuery: A jQuery-like library for Python; its selector syntax closely mirrors jQuery's and it is easy to pick up (a short parsing comparison follows this list).
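To show the difference in style, the sketch below extracts the same movie titles from a hard-coded HTML snippet twice, once with BeautifulSoup and once with an lxml XPath query. The snippet is made up for illustration.

from bs4 import BeautifulSoup
from lxml import etree

# A tiny HTML fragment in the style of a movie list page.
html = '''
<ol class="grid_view">
  <li><span class="title">Movie A</span></li>
  <li><span class="title">Movie B</span></li>
</ol>
'''

# BeautifulSoup: navigate by tag name and CSS class.
soup = BeautifulSoup(html, 'lxml')
print([span.get_text() for span in soup.find_all('span', class_='title')])

# lxml: the same extraction expressed as an XPath query.
tree = etree.HTML(html)
print(tree.xpath('//span[@class="title"]/text()'))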

3. Storage Library

  • pandas: A powerful data analysis library that provides data structures and data analysis tools and supports multiple file formats.
  • SQLite: A lightweight, file-based database that supports SQL queries and is suitable for small crawler projects (a short storage sketch follows this list).
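The sketch below stores a couple of made-up records both ways: first as a CSV file via pandas, then in an SQLite database through Python's built-in sqlite3 module. The file names and column names are arbitrary.

import sqlite3
import pandas as pd

# A few fabricated rows standing in for crawled data.
rows = [
    {'title': 'Movie A', 'rating': 9.7, 'comments': 3000000},
    {'title': 'Movie B', 'rating': 9.6, 'comments': 2000000},
]
df = pd.DataFrame(rows)

# pandas: write the records to a CSV file (Excel, JSON, etc. work similarly).
df.to_csv('movies.csv', index=False, encoding='utf-8-sig')

# SQLite: store the same records in a lightweight on-disk database.
conn = sqlite3.connect('movies.db')
df.to_sql('movies', conn, if_exists='replace', index=False)
print(pd.read_sql('SELECT title, rating FROM movies', conn))
conn.close()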

Next, we will walk through 3 small Python crawler cases to help you better learn and understand the basics of Python crawling. Below is a brief introduction to each case along with its source code:

Case 1: Crawling Douban Movie Top 250

This case uses the BeautifulSoup library to crawl the titles, ratings, and number of reviewers of the Douban Top 250 movies, and saves the information to a CSV file.

import requests
from bs4 import BeautifulSoup
import csv

# Request URL
url = 'https://movie.douban.com/top250'
# Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

# Parse one page of results and write each movie as a CSV row
def parse_html(html):
    soup = BeautifulSoup(html, 'lxml')
    movie_list = soup.find('ol', class_='grid_view').find_all('li')
    for movie in movie_list:
        title = movie.find('div', class_='hd').find('span', class_='title').get_text()
        rating_num = movie.find('div', class_='star').find('span', class_='rating_num').get_text()
        comment_num = movie.find('div', class_='star').find_all('span')[-1].get_text()
        writer.writerow([title, rating_num, comment_num])

# Fetch all pages and save the data to a CSV file
def save_data():
    f = open('douban_movie_top250.csv', 'a', newline='', encoding='utf-8-sig')
    global writer
    writer = csv.writer(f)
    writer.writerow(['电影名称', '评分', '评价人数'])  # header: title, rating, number of reviewers
    for i in range(10):
        # Each page shows 25 movies; the start parameter selects the page
        url = 'https://movie.douban.com/top250?start=' + str(i * 25)