Technology sharing

Python crawler basics and 3 small cases (with source code)

2024-07-11


1. Crawler principles

A web crawler is a program that automatically retrieves web content. It simulates the process of a user browsing pages: it obtains the page source by sending HTTP requests, then uses parsing and extraction techniques to pull out the data it needs.

1. The HTTP request and response process

The crawler sends an HTTP request to the target website. After the server receives the request, it returns an HTTP response containing a status code, response headers, and a response body (the web page content).
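A minimal sketch of this cycle with the requests library (https://example.com is used purely as a placeholder target) looks like this:

    import requests

    # Send an HTTP GET request to the target site (placeholder URL)
    response = requests.get('https://example.com', timeout=10)

    # The three parts of the HTTP response mentioned above
    print(response.status_code)              # status code, e.g. 200
    print(response.headers['Content-Type'])  # one of the response headers
    print(response.text[:200])               # start of the response body (page source)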

2. Common crawler technologies

  • Request libraries: for example requests and aiohttp, used to send HTTP requests.
  • Parsing libraries: for example BeautifulSoup, lxml, and PyQuery, used to parse page content.
  • Storage: for example pandas and SQLite, used to store the crawled data.
  • Async libraries: for example asyncio and aiohttp, used for asynchronous crawling to improve efficiency (a combined sketch of these pieces follows this list).
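To show how these pieces typically fit together, here is a minimal end-to-end sketch; the URL, the tag being extracted, and the output file name are illustrative assumptions rather than a real site:

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd

    # 1. Request library: fetch the page source (placeholder URL)
    html = requests.get('https://example.com/list', timeout=10).text

    # 2. Parsing library: extract the pieces of data we care about
    soup = BeautifulSoup(html, 'lxml')
    titles = [h2.get_text(strip=True) for h2 in soup.find_all('h2')]

    # 3. Storage: keep the crawled data in a structured form
    pd.DataFrame({'title': titles}).to_csv('titles.csv', index=False, encoding='utf-8-sig')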

2. Commonly used Python crawler libraries

1. Request libraries

  • requests: a simple and powerful HTTP library that supports persistent connections and connection pooling, SSL certificate verification, cookies, and more.
  • aiohttp: an asynchronous HTTP library based on asyncio, suited to high-concurrency crawling scenarios (see the sketch after this list).
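As a rough sketch of the asynchronous style (the URLs are placeholders), aiohttp and asyncio let several requests wait on the network at the same time instead of one after another:

    import asyncio
    import aiohttp

    urls = ['https://example.com/page1', 'https://example.com/page2']  # placeholder URLs

    async def fetch(session, url):
        # Each request yields control while waiting on the network
        async with session.get(url) as response:
            return await response.text()

    async def main():
        async with aiohttp.ClientSession() as session:
            # Schedule all requests concurrently and wait for every result
            pages = await asyncio.gather(*(fetch(session, u) for u in urls))
            for url, html in zip(urls, pages):
                print(url, len(html))

    asyncio.run(main())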

2. Parsing libraries

  • BeautifulSoup: a library for parsing HTML and XML that is easy to use and supports multiple parsers.
  • lxml: an efficient XML and HTML parsing library that supports XPath and CSS selectors.
  • PyQuery: a Python version of jQuery, with jQuery-like syntax that is easy to use (a short comparison follows this list).
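The following sketch parses the same hand-written HTML fragment with BeautifulSoup and with lxml, as a quick comparison of the two styles (the fragment itself is invented for illustration):

    from bs4 import BeautifulSoup
    from lxml import etree

    html = '<ul><li class="item">Python</li><li class="item">Crawler</li></ul>'

    # BeautifulSoup: navigate by tag name and class
    soup = BeautifulSoup(html, 'lxml')
    print([li.get_text() for li in soup.find_all('li', class_='item')])

    # lxml: the same extraction expressed as an XPath query
    tree = etree.HTML(html)
    print(tree.xpath('//li[@class="item"]/text()'))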

3. Storage

  • pandas: a powerful data analysis library that provides data structures and analysis tools and supports multiple file formats.
  • SQLite: a lightweight database that supports SQL queries and is well suited to small crawler projects (a storage sketch follows this list).
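A small sketch of both storage options (the rows here are invented for illustration): crawled records are written to a CSV file via pandas and to an SQLite database via the standard-library sqlite3 module:

    import sqlite3
    import pandas as pd

    # Pretend these rows were just crawled
    movies = [('The Shawshank Redemption', 9.7), ('Farewell My Concubine', 9.6)]
    df = pd.DataFrame(movies, columns=['title', 'rating'])

    # Option 1: save to a CSV file with pandas
    df.to_csv('movies.csv', index=False, encoding='utf-8-sig')

    # Option 2: save to a lightweight SQLite database
    with sqlite3.connect('movies.db') as conn:
        df.to_sql('movies', conn, if_exists='replace', index=False)
        print(conn.execute('SELECT COUNT(*) FROM movies').fetchone())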

Next we will work through 3 small crawler cases to help you better learn and understand the basics of Python crawling. An introduction and the source code for each case follow:

Case 1: Crawling the Douban Top 250 movies

This case uses the BeautifulSoup library to crawl the movie titles, ratings, and number of reviewers from the Douban Top 250 list, fetching the list page by page (25 movies per page), and saves the results to a CSV file.

    import requests
    from bs4 import BeautifulSoup
    import csv

    # Request URL
    url = 'https://movie.douban.com/top250'

    # Request headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }

    # Parse one page of the list and write each movie to the CSV file
    def parse_html(html):
        soup = BeautifulSoup(html, 'lxml')
        movie_list = soup.find('ol', class_='grid_view').find_all('li')
        for movie in movie_list:
            title = movie.find('div', class_='hd').find('span', class_='title').get_text()
            rating_num = movie.find('div', class_='star').find('span', class_='rating_num').get_text()
            comment_num = movie.find('div', class_='star').find_all('span')[-1].get_text()
            writer.writerow([title, rating_num, comment_num])

    # Fetch all 10 pages (25 movies each) and save the data
    def save_data():
        f = open('douban_movie_top250.csv', 'a', newline='', encoding='utf-8-sig')
        global writer
        writer = csv.writer(f)
        writer.writerow(['Movie title', 'Rating', 'Number of reviewers'])
        for i in range(10):
            # The start parameter selects the page offset (0, 25, 50, ...)
            page_url = 'https://movie.douban.com/top250?start=' + str(i * 25)
            response = requests.get(page_url, headers=headers)
            parse_html(response.text)
        f.close()

    if __name__ == '__main__':
        save_data()