
Get started with image and text recognition with ease ~ Simple OCR and PDF-to-txt conversion based on PaddlePaddle

2024-07-08



Preface

This article is aimed at complete beginners on Windows who have no background in visual recognition. Experts, please feel free to skip it ~~
Note:
The PDF conversion in this project does nothing to handle interfering elements such as tables, text boxes, watermarks, annotations, and so on. PDFs processed this way should therefore contain as few of these elements as possible, so that they do not affect the conversion result.

Process

1. Build the environment

Create a Python virtual environment using conda

conda create -n pp python=3.11

2. Install the packages

Install paddlepaddle and paddleocr
GPU version

pip install paddlepaddle-gpu paddleocr

CPU version

pip install paddlepaddle paddleocr

PDF-to-image tool (poppler for Windows, required by pdf2image)
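
Before moving on, it can help to confirm that the packages are importable from the new environment. The following is only a minimal sketch: the helper name `missing_modules` is made up for illustration, and it assumes the packages are imported as `paddle` and `paddleocr`.

```python
import importlib.util


def missing_modules(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]


if __name__ == "__main__":
    missing = missing_modules(["paddle", "paddleocr"])
    if missing:
        print("Not importable:", ", ".join(missing))
    else:
        import paddle

        # True only when the GPU build (paddlepaddle-gpu) is installed
        print("Compiled with CUDA:", paddle.is_compiled_with_cuda())
```

If the script reports that the build is not compiled with CUDA, the CPU package is the one installed, and `use_gpu=True` in the OCR code should probably be changed to `use_gpu=False`.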

https://github.com/oschwartz10612/poppler-windows/releases

pip install pdf2image
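
pdf2image relies on poppler's `pdftoppm` program, which on Windows usually comes from the release archive linked above. The sketch below shows one way to locate it and pass it to `convert_from_path` through its `poppler_path` parameter. The helper names and the example folder are hypothetical; adjust the folder to wherever the archive was actually unzipped.

```python
import os
import shutil


def dir_with_pdftoppm(candidate_dirs):
    """Return the first directory in `candidate_dirs` that contains pdftoppm, else None."""
    for d in candidate_dirs:
        for exe in ("pdftoppm.exe", "pdftoppm"):
            if os.path.isfile(os.path.join(d, exe)):
                return d
    return None


def pdf_to_images(pdf_path, candidate_dirs):
    """Convert a PDF to a list of PIL images, pointing pdf2image at poppler if needed."""
    from pdf2image import convert_from_path  # imported lazily; needs pdf2image installed

    # If pdftoppm is already on PATH, pdf2image finds it without any hint
    poppler_dir = None if shutil.which("pdftoppm") else dir_with_pdftoppm(candidate_dirs)
    return convert_from_path(pdf_path, poppler_path=poppler_dir)


# Hypothetical usage; the folder below is only an example location:
# images = pdf_to_images("pdf/example.pdf", [r"C:\poppler\Library\bin"])
```

Adding poppler's bin folder to the PATH environment variable achieves the same thing without code changes.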

3. Code

Suppose we have a batch of image-based PDFs in a folder named pdf, and we need to convert each PDF file into a corresponding txt file in a folder named txt. You can use this code


import os

import cv2
import numpy as np
from paddleocr import PaddleOCR
from pdf2image import convert_from_path

# Create the PaddleOCR instance once and reuse it for every page.
# lang='ch' selects the Chinese model; use use_gpu=False with the CPU install.
ocr = PaddleOCR(use_angle_cls=True, lang='ch', use_gpu=True)

# Make sure the output folder exists
os.makedirs('txt', exist_ok=True)

for file in os.listdir('pdf'):
    if not file.endswith('.pdf'):
        print(file)  # report and skip anything that is not a PDF
        continue
    txt = file.replace('.pdf', '.txt')
    if os.path.exists('txt/' + txt):
        continue  # this PDF has already been converted

    # Convert the PDF file into a list of images, one per page
    images = convert_from_path('pdf/' + file)

    with open('txt/' + txt, 'w', encoding='utf-8') as txt_writer:
        # Go through every page image and recognize its text
        for i, image in enumerate(images):
            print('Page {}'.format(i + 1))
            # Convert the PIL image (RGB) into a BGR array for recognition
            image = cv2.cvtColor(np.array(image), cv2.COLOR_RGB2BGR)

            # Recognize the text in the image
            result = ocr.ocr(image, cls=True)

            # Write the recognized text, one line per detected text line
            try:
                for lines in result:
                    for line in lines:
                        txt_writer.write(line[1][0] + '\n')
            except Exception:
                print(file + ' recognition failed')
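
The expression `line[1][0]` in the code above picks the recognized string out of each result entry. The sketch below illustrates that layout, assuming each entry has the form `[box, (text, confidence)]` and that a page with no detected text comes back as `None`. The sample data is made up, and the helper `page_text` and its `min_confidence` parameter are hypothetical additions, not part of PaddleOCR.

```python
def page_text(page_result, min_confidence=0.0):
    """Collect the recognized strings from one page of OCR output."""
    if not page_result:  # a page with no detected text may be None
        return []
    return [text for _box, (text, score) in page_result if score >= min_confidence]


# Made-up sample: one page with two text lines, then an empty page
sample = [
    [
        [[[10, 10], [200, 10], [200, 40], [10, 40]], ("Hello", 0.98)],
        [[[10, 50], [200, 50], [200, 80], [10, 80]], ("w0rld", 0.41)],
    ],
    None,
]

for page in sample:
    print(page_text(page, min_confidence=0.5))
```

Checking for an empty page explicitly avoids relying on the try/except to catch the error raised when iterating over `None`.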


4. Notes

This code can only extract text reliably from text-only PDFs. Once images or tables appear on a page, the recognition quality for that page gets worse.
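
One possible way to spot such pages is to look at the confidence scores that come with each recognized line. This is only a rough sketch: it assumes that low average confidence tends to go along with poor recognition, which may not always hold. The function names and the 0.8 threshold are arbitrary choices for illustration, and `pages` is assumed to be a list holding one list of `[box, (text, confidence)]` entries per page.

```python
def mean_confidence(page_result):
    """Average confidence of one page; 0.0 when nothing was recognized."""
    if not page_result:
        return 0.0
    scores = [score for _box, (_text, score) in page_result]
    return sum(scores) / len(scores)


def pages_to_review(pages, threshold=0.8):
    """Return 1-based page numbers whose mean confidence is below `threshold`."""
    return [i for i, page in enumerate(pages, start=1)
            if mean_confidence(page) < threshold]
```

Pages flagged this way can then be checked by hand instead of trusting the generated txt file.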