Technology Sharing

Introduction to Python crawler basics

2024-07-11


Steps

  1. Get web page content:

    1. HTTP requests

    2. Python's Requests library

  2. Parse web page content

    1. HTML page structure

    2. Python's Beautiful Soup library

  3. Store or analyze data

    1. Store in database

    2. Feed the data to an AI model for analysis

    3. Convert the data into charts for visualization
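The three steps above can be sketched end to end. This is a minimal sketch assuming the third-party `requests` and `beautifulsoup4` packages are installed; the inline HTML stands in for a fetched page so the example runs offline.

```python
# Minimal end-to-end sketch of the three steps: fetch, parse, store.
import requests
from bs4 import BeautifulSoup

def fetch(url):
    """Step 1: get the web page content over HTTP."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()  # fail fast on non-2xx responses
    return resp.text

def extract_links(html):
    """Step 2: parse the HTML and collect every hyperlink target."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]

# Step 3: store or analyze -- here we simply collect into a list.
sample_html = '<html><body><a href="/a">A</a><a href="/b">B</a></body></html>'
links = extract_links(sample_html)
print(links)  # ['/a', '/b']
```

In a real crawler you would call `fetch()` on a target URL and write the extracted data to a database instead of printing it.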

DDoS attacks

Sending a large volume of high-frequency requests to a server consumes its resources and degrades service for other users. A carelessly written crawler can have the same effect, so always rate-limit your requests.

Follow the rules

You can check a website's robots.txt file to learn which page paths it allows crawlers to visit.
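Python's standard library can parse these rules for you. A small sketch using `urllib.robotparser`; the rules and paths below are invented for illustration:

```python
# Checking robots.txt rules before crawling, using only the
# standard library's urllib.robotparser.
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, user_agent, path):
    """Return True if `user_agent` may fetch `path` under these rules."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, path)

rules = """\
User-agent: *
Disallow: /private/
"""

print(allowed(rules, "my-crawler", "/public/page.html"))   # True
print(allowed(rules, "my-crawler", "/private/data.html"))  # False
```

In practice you would point `RobotFileParser` at the site's real robots.txt URL with `set_url()` and `read()` before crawling.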

HTTP (Hypertext Transfer Protocol)

  1. A request-response protocol between a client and a server.

  2. Commonly used request methods:

    1. GET: Get data

    2. POST: Create data

  3. A request consists of:
    POST /user/info HTTP/1.1           # request line (method, resource path, protocol version)
    Host: www.example.com              # request header
    User-Agent: curl/7.77.0            # request header
    Accept: */*                        # request header
    
    {"username":"呦呦呦",              # request body
    "email":"[email protected]"}      # request body
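The same request can be assembled in Python. A sketch using the Requests library's `Request`/`PreparedRequest`, which exposes the request line, headers, and body without sending anything over the network (the host and payload mirror the example above):

```python
# Building the POST request above without sending it, to show how
# the request line, headers, and body map onto code.
import requests

req = requests.Request(
    "POST",
    "http://www.example.com/user/info",
    json={"username": "呦呦呦", "email": "[email protected]"},
)
prepared = req.prepare()

print(prepared.method, prepared.path_url)  # request line pieces: POST /user/info
print(prepared.headers["Content-Type"])    # header added for the JSON body
print(prepared.body)                       # request body as sent on the wire
```

To actually send it, pass `prepared` to `requests.Session().send()`, or simply call `requests.post(url, json=...)` directly.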