Technology sharing

[Crawler] Data Parsing

2024-07-12



For the data-parsing step, in addition to the BeautifulSoup library covered earlier, there are two other common approaches: regular expressions and XPath.

1. Regular expressions

A regular expression (RE for short) is a tool for describing patterns and matching strings.

Regular expressions are widely used in text processing, data cleaning, text search, and similar scenarios. They use a special syntax to perform complex pattern matching on strings.

Regular expression testing: you can verify your patterns with an online regex tester.

1. Common metacharacters

Metacharacters are special characters with fixed meanings. By default, each metacharacter matches exactly one character and cannot match a newline.

| Metacharacter | Meaning | Example |
|---|---|---|
| `.` | Matches any character except a newline | `a.b` matches `a1b`, `acb` |
| `\w` | Matches a letter, digit, or underscore | `\w+` matches `helloworld_123` |
| `\s` | Matches any whitespace character | `\s+` matches spaces, tabs, etc. |
| `\d` | Matches a digit | `\d+` matches `123456` |
| `\n` | Matches a newline character | `hello\nworld` matches across the newline |
| `\t` | Matches a tab character | `hello\tworld` matches the tab |
| `^` | Matches the start of a string | `^Hello` matches `Hello` at the beginning |
| `$` | Matches the end of a string | `World$` matches `World` at the end |
| `\W` | Matches any character that is not a letter, digit, or underscore | `\W+` matches `!@#$%^` |
| `\D` | Matches any non-digit character | `\D+` matches `abcXYZ` |
| `\S` | Matches any non-whitespace character | `\S+` matches `helloworld123` |
| `a\|b` | Matches character `a` or character `b` | `a\|b` matches `a` and `b` |
| `(...)` | Captures the expression in parentheses as a group | `(abc)` captures `abc` |
| `[...]` | Matches any one character listed inside the square brackets | `[abc]` matches `a`, `b`, or `c` |
| `[^...]` | Matches any character not listed inside the square brackets | `[^abc]` matches any character other than `a`, `b`, `c` |
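Since `.` does not match newlines by default, patterns that must span lines need the `re.S` (`re.DOTALL`) flag, which is also why it appears in the crawler code later. A minimal sketch:

```python
import re

text = "a1b\nacb"

# By default '.' matches any character except a newline,
# so the greedy 'a.*b' stops at the line break...
print(re.search(r"a.*b", text).group())         # a1b

# ...while re.S (re.DOTALL) lets '.' match newlines too
print(re.search(r"a.*b", text, re.S).group())   # a1b\nacb
```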

2. Quantifiers

A quantifier controls how many times the preceding metacharacter (or group) is repeated.

| Quantifier | Meaning |
|---|---|
| `*` | Repeat zero or more times |
| `+` | Repeat one or more times |
| `?` | Repeat zero or one time |
| `{n}` | Repeat exactly n times |
| `{n,}` | Repeat at least n times |
| `{n,m}` | Repeat between n and m times |
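A quick sketch of the quantifiers above:

```python
import re

text = "color colour colouur 2024"

print(re.findall(r"colou?r", text))   # '?': zero or one 'u'  -> ['color', 'colour']
print(re.findall(r"colou+r", text))   # '+': one or more 'u'  -> ['colour', 'colouur']
print(re.findall(r"\d{4}", text))     # exactly four digits   -> ['2024']
```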

Lazy matching `.*?`: matches as few characters as possible. Appending `?` to a repetition quantifier makes it lazy.

Greedy matching `.*`: matches as many characters as possible. Repetition quantifiers are greedy by default.
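The difference matters most when extracting data from HTML; a minimal sketch:

```python
import re

s = "<b>one</b> and <b>two</b>"

# Greedy: '.*' runs to the LAST '</b>' and swallows everything in between
print(re.search(r"<b>(.*)</b>", s).group(1))    # one</b> and <b>two

# Lazy: '.*?' stops at the FIRST '</b>' after each '<b>'
print(re.findall(r"<b>(.*?)</b>", s))           # ['one', 'two']
```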

3. The re module

To work with regular expressions in Python, use the built-in re module, which provides a set of functions for searching, matching, and substituting within strings.

| Function | Description |
|---|---|
| `re.search(pattern, string, flags=0)` | Scans the string and returns the first match object, or `None` if there is no match |
| `re.match(pattern, string, flags=0)` | Matches the pattern at the start of the string; returns `None` if it does not match there |
| `re.fullmatch(pattern, string, flags=0)` | Returns a match object only if the entire string matches the pattern, otherwise `None` |
| `re.findall(pattern, string, flags=0)` | Returns a list of all non-overlapping matches in the string |
| `re.finditer(pattern, string, flags=0)` | Returns an iterator over all non-overlapping match objects in the string |
| `re.sub(pattern, repl, string, count=0, flags=0)` | Replaces every match of the pattern with the replacement string and returns the result |
| `re.split(pattern, string, maxsplit=0, flags=0)` | Splits the string at each match of the pattern and returns the resulting list |
import re

# Sample text (kept in Chinese to demonstrate Unicode matching)
text = "在2024年,Python是最受欢迎的编程语言之一。Python 3.9版本在2020年发布。"

# 1. re.search() scans the string and returns the first match object
# Find the first sequence of digits
search_result = re.search(r'\d+', text)
if search_result:
    print(f"re.search: first number found is '{search_result.group()}' at position {search_result.start()}")

# 2. re.match() matches the pattern at the start of the string
# Check whether the string starts with '在'
match_result = re.match(r'在', text)
if match_result:
    print(f"re.match: matched '{match_result.group()}' at the start of the string")

# 3. re.fullmatch() requires the entire string to match the pattern
# Check whether the whole string consists of Chinese characters only
fullmatch_result = re.fullmatch(r'[\u4e00-\u9fff]+', '在编程')
if fullmatch_result:
    print(f"re.fullmatch: the whole string matches, content is '{fullmatch_result.group()}'")

# 4. re.findall() returns a list of all non-overlapping matches
# Find all digit sequences
findall_result = re.findall(r'\d+', text)
print(f"re.findall: all digit sequences found: {findall_result}")

# 5. re.finditer() returns an iterator over all non-overlapping matches
# Find all digit sequences and print them one by one
finditer_result = re.finditer(r'\d+', text)
for match in finditer_result:
    print(f"re.finditer: found number '{match.group()}' at position {match.start()}")

# 6. re.sub() replaces every match with the replacement string
# Replace all digit sequences with '#'
sub_result = re.sub(r'\d+', '#', text)
print(f"re.sub: string after replacement: {sub_result}")

# 7. re.split() splits the string at each pattern match
# Split on whitespace or punctuation
split_result = re.split(r'[,。 ]+', text)
print(f"re.split: list after splitting: {split_result}")


4. Scraping Douban movies


Starting from the `<li>` tag, the expression works its way step by step toward the `<span class="title">` tag that holds the movie title. The non-greedy `(.*?)` matches as few characters as possible between the explicit tokens, and the named capture group `(?P<name>...)` extracts the title itself.

The regular expression is:

<li>.*?<div class="item">.*?<span class="title">(?P<name>.*?)</span>

Crawler code:

import requests
import re

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36 Edg/125.0.0.0"
}

for start_num in range(0, 250, 25):
    response = requests.get(f"https://movie.douban.com/top250?start={start_num}", headers=headers)
    # Fetch the page source
    html = response.text
    # Parse the data with re
    obj = re.compile(r'<li>.*?<div class="item">.*?<span class="title">(?P<name>.*?)</span>', re.S)
    # Run the match
    result = obj.finditer(html)
    # Print the results
    for it in result:
        print(it.group('name'))

2. XPath

XPath is a query language for XML (and HTML) documents. It selects nodes or node sets through path expressions.

Install the lxml module: `pip install lxml`

1. XPath parsing

Path symbols:

| Symbol | Meaning |
|---|---|
| `/` | Select from the root node. |
| `//` | Select matching nodes anywhere in the document, regardless of their position. |
| `.` | Select the current node. |
| `..` | Select the parent of the current node. |
| `@` | Select attributes. |

Path expressions:

| Expression | Meaning |
|---|---|
| `/bookstore/book` | Select all book child nodes of the bookstore node. |
| `//book` | Select all book nodes in the document, wherever they appear. |
| `bookstore/book[1]` | Select the first book child of the bookstore node. |
| `//title[@lang]` | Select all title nodes that have a lang attribute. |
| `//title[@lang='en']` | Select all title nodes whose lang attribute equals 'en'. |

Common predicates and functions:

  • text(): select an element's text content.
  • @attr: select the value of an element's attribute.
  • contains(): test whether a value contains a given substring.
  • starts-with(): test whether a value starts with a given substring.
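A short lxml sketch (using a made-up bookstore document that mirrors the expressions in the tables above) showing these predicates and functions in action:

```python
from lxml import etree

xml = """
<bookstore>
  <book><title lang="en">Harry Potter</title></book>
  <book><title lang="zh-CN">Python Crash Course</title></book>
</bookstore>
"""

tree = etree.XML(xml)

# //title[@lang='en']/text(): text of every title whose lang is exactly 'en'
print(tree.xpath("//title[@lang='en']/text()"))                 # ['Harry Potter']

# //title/@lang: the lang attribute value of every title
print(tree.xpath("//title/@lang"))                              # ['en', 'zh-CN']

# starts-with(): titles whose lang attribute starts with 'zh'
print(tree.xpath("//title[starts-with(@lang, 'zh')]/text()"))

# contains(): titles whose text contains 'Potter'
print(tree.xpath("//title[contains(text(), 'Potter')]/text()"))
```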
from lxml import etree

html_content = '''
<html>
  <body>
    <div class="movie">
      <span class="title">肖申克的救赎</span>
      <span class="title">The Shawshank Redemption</span>
    </div>
    <div class="movie">
      <span class="title">霸王别姬</span>
      <span class="title">Farewell My Concubine</span>
    </div>
  </body>
</html>
'''

# Parse the HTML
tree = etree.HTML(html_content)

# Extract the movie titles
titles_cn = tree.xpath('//div[@class="movie"]/span[@class="title"][1]/text()')
titles_en = tree.xpath('//div[@class="movie"]/span[@class="title"][2]/text()')

# Print the results
for cn, en in zip(titles_cn, titles_en):
    print(f'Chinese title: {cn}, English title: {en}')
//div[@class="movie"]/span[@class="title"][1]/text()

`//div[@class="movie"]`: selects all div elements whose class is `movie`.

`/span[@class="title"][1]`: selects the first span with class `title` inside each of those divs.

`/text()`: extracts the text content of the span element.

//div[@class="movie"]/span[@class="title"][2]/text()

This expression is the same as the one above, except that it selects the second span with class `title` in each div.

2. Scraping Douban movies

Target: the Douban Top 250 list.

import requests
from lxml import etree

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36 Edg/125.0.0.0"
}

for start_num in range(0, 250, 25):
    response = requests.get(f"https://movie.douban.com/top250?start={start_num}", headers=headers)
    # Fetch the page source
    html = response.text
    # Parse the page with lxml
    html = etree.HTML(html)
    # Extract the movie titles
    titles = html.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[1]/a/span[1]/text()')
    # Extract the ratings
    ratings = html.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[2]/div/span[2]/text()')
    # Print the results
    for title, rating in zip(titles, ratings):
        print(f"Movie: {title}  Rating: {rating}")