2024-07-12
Warm-up:
1. Want to browse pictures that keep you up late at night, but can't find the resources?
2. Want to grab train tickets quickly during the holiday travel rush?
3. Want to quickly and accurately find the best-reviewed, best-quality products when shopping online?
What is a crawler:
- A program written to simulate a browser surfing the Internet, and then sent out to crawl data from it.
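For a concrete sense of "simulating the browser", here is a minimal sketch using the third-party requests library (the URL and file name are placeholder examples):

import requests

# Fetch a page the way a browser would, then save it locally
if __name__ == "__main__":
    url = 'https://www.example.com'  # any page you are allowed to crawl
    response = requests.get(url=url)
    with open('page.html', 'w', encoding='utf-8') as fp:
        fp.write(response.text)
    print('page.html saved')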
The value of crawlers:
- Practical applications
- Employment opportunities
Is crawling legal or illegal?
- Crawling itself is not prohibited by law
- But it carries the risk of being used illegally
- There are benign crawlers and malicious crawlers
The risks brought by crawlers can be reflected in the following two aspects:
- The crawler interferes with the normal operation of the visited website
- The crawler crawls certain types of data or information that are protected by law
How to avoid legal trouble when writing and using crawlers?
- Continuously optimize your program so that it does not interfere with the normal operation of the websites it visits (a throttling sketch follows this list)
- When using or disseminating crawled data, review the content; if it contains sensitive information such as a user's trade secrets, stop crawling or disseminating it immediately
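One simple way to avoid burdening a site is to throttle your requests. A minimal sketch, assuming placeholder URLs and a one-second delay:

import time
import requests

urls = ['https://www.example.com/page1', 'https://www.example.com/page2']  # placeholder URLs
for url in urls:
    response = requests.get(url)
    print(url, response.status_code)
    time.sleep(1)  # pause between requests so we don't hammer the server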
Classification of crawlers by usage scenario
- General crawler: an important component of a crawling system; it scrapes whole pages of data.
- Focused crawler: built on top of a general crawler; it scrapes specific, local content within a page.
- Incremental crawler: monitors a website for data updates and captures only the newly updated data (see the sketch below).
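The incremental idea can be sketched with a content hash: only process a page when its content has changed since the last crawl. The in-memory dict here is purely for illustration; a real crawler would persist it.

import hashlib
import requests

seen_hashes = {}  # url -> hash of last-seen content

def crawl_if_updated(url):
    response = requests.get(url)
    digest = hashlib.md5(response.content).hexdigest()
    if seen_hashes.get(url) == digest:
        print(f'{url}: unchanged, skipping')
        return None
    seen_hashes[url] = digest
    print(f'{url}: new or updated, processing')
    return response.text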
The crawler's spear and shield
Anti-crawling mechanisms
- Portal websites can prevent crawlers from scraping their data by deploying corresponding strategies or technical measures.
Anti-anti-crawling strategies
- Crawler programs can defeat a portal website's anti-crawling mechanisms with corresponding strategies or technical measures, and thereby obtain the website's data.
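A common anti-anti-crawling tactic is disguising the crawler as a browser by rotating the User-Agent header. A minimal sketch; the UA strings and URL are example values:

import random
import requests

# A small pool of browser User-Agent strings (examples only)
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15',
]

headers = {'User-Agent': random.choice(user_agents)}  # pick a different identity each run
response = requests.get('https://www.example.com', headers=headers)
print(response.status_code)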
robots.txt protocol: viewable on any website by appending /robots.txt to its domain.
A gentleman's agreement. It specifies which data on the website may be crawled and which may not.
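Python's standard library can check this agreement for you. A sketch with urllib.robotparser; the site and path are just examples:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://www.taobao.com/robots.txt')  # example site
rp.read()
# Ask whether a crawler with the given user agent may fetch a given path
print(rp.can_fetch('*', 'https://www.taobao.com/some/path'))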
http protocol
- Concept: It is a form of data interaction between the server and the client.
Common request headers:
- User-Agent: the identity of the request carrier (the client)
- Connection: whether to disconnect or keep the connection alive after the request completes
Common response header information
- Content-Type: the data type of the server's response to the client
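A sketch of these headers in practice with requests: set the request headers, then inspect what the server sent back (the URL is a placeholder):

import requests

headers = {
    'User-Agent': 'Mozilla/5.0',  # identify the request carrier
    'Connection': 'close',        # disconnect after the request completes
}
response = requests.get('https://www.example.com', headers=headers)
print(response.headers.get('Content-Type'))  # data type of the server's response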
https protocol:
- Concept: secure hypertext transfer protocol (HTTP over an encrypted connection)
Encryption methods:
- Symmetric-key encryption
- Asymmetric-key encryption
- Certificate-based key encryption
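For illustration only, a sketch of the symmetric case using the third-party cryptography package; https negotiates keys for you during the TLS handshake, so you would never implement this by hand for crawling:

from cryptography.fernet import Fernet

key = Fernet.generate_key()  # the single shared secret key
cipher = Fernet(key)

token = cipher.encrypt(b'client data')  # encrypt with the key...
plain = cipher.decrypt(token)           # ...and decrypt with the same key
print(plain)  # b'client data'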
A worked example: paging through KFC's store-list interface and saving each response.

import requests

# Run only when this script is executed as the main program
if __name__ == "__main__":
    # URL of the KFC official site's store-list interface
    url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx'

    # HTTP request headers that make the request look like a browser's
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36 Edg/126.0.0.0'
    }

    # Ask the user for a location to use as the store-search keyword
    keyword = input('enter location:')  # e.g. "北京"

    # Request pages 1 through 9
    for page in range(1, 10):
        # Parameters sent with the request
        params = {
            'op': 'keyword',      # operation type: keyword search
            'cname': '',          # city name (empty here)
            'pid': '',            # other parameter (empty here)
            'keyword': keyword,   # the user-supplied keyword
            'pageIndex': page,    # index of the page being requested
            'pageSize': 1000,     # number of stores per page
        }

        # Try to send the request and handle the response
        try:
            # Send a GET request with the URL, parameters, and headers
            response = requests.get(url=url, params=params, headers=headers)

            # Raise an HTTPError for a 4xx/5xx status code
            response.raise_for_status()

            # Grab the response body
            page_text = response.text

            # Build a file name from the keyword, the page number, and an .html extension
            filename = f'{keyword}_page_{page}.html'

            # Open the file in write mode with utf-8 encoding
            with open(filename, 'w', encoding='utf-8') as fp:
                # Write the response body to the file
                fp.write(page_text)

            # Report that the file was saved successfully
            print(f'{filename} saved successfully!')

        # Catch any exception raised by the requests library
        except requests.RequestException as e:
            # Print the error
            print(f'request error: {e}')