A zero-difficulty introduction to image and text recognition ~ Simple OCR for PDFs and conversion to txt with PaddlePaddle (飞桨)

2024-07-08

Preface

This post is for complete beginners on Windows who have no background in visual recognition. Experienced users, feel free to skip it~~
Note:
The PDF OCR in this project does not handle interference such as tables, drawings, or text watermarks. To avoid hurting recognition quality, use PDFs that contain as few of these as possible.

Process

1. Build environment

Create a Python virtual environment with conda, then activate it

conda create -n pp python=3.11
conda activate pp

2. Install packages

Install paddlepaddle and paddleocr
GPU version

pip install paddlepaddle-gpu paddleocr

CPU version

pip install paddlepaddle paddleocr
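
After installing, it can help to confirm that both packages are visible from the active environment. The following is a minimal sketch using only the standard library; the helper name missing_packages is made up for this illustration, and it only checks that the modules can be found, not that the GPU build actually works.

```python
import importlib.util


def missing_packages(names):
    """Return the import names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "paddle" is the import name of the paddlepaddle / paddlepaddle-gpu package
    missing = missing_packages(["paddle", "paddleocr"])
    print("missing:", missing if missing else "none")
```

If either name is reported as missing, the pip command was probably run outside the pp environment.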

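PDF-to-image converter

pdf2image relies on Poppler. On Windows, download a release from the link below, unzip it, and add its bin folder to PATH

https://github.com/oschwartz10612/poppler-windows/releases

pip install pdf2image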
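
A common stumbling block at this step is that Poppler is downloaded but not on PATH. The sketch below checks for the pdftoppm executable, one of the Poppler tools that pdf2image calls. The helper name find_poppler_tool is hypothetical and only for illustration.

```python
import shutil


def find_poppler_tool(tool="pdftoppm"):
    """Return the full path of a Poppler executable on PATH, or None if it is not found."""
    return shutil.which(tool)


if __name__ == "__main__":
    path = find_poppler_tool()
    if path is None:
        print("pdftoppm not found: add Poppler's bin folder to PATH")
    else:
        print("Poppler found at", path)
```

If changing PATH is not an option, convert_from_path also accepts a poppler_path argument that points at the bin folder directly.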

3. Specific code

Suppose the pdf folder holds a batch of PDF files, and each one needs to be converted into a corresponding txt file in the txt folder. The code below does this.
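
As a small aside before the script, the pairing of each PDF with its txt output can be sketched on its own. This is only an illustration with pathlib. The function name pending_jobs is made up here, and the pdf and txt folder names are taken from the script.

```python
from pathlib import Path


def pending_jobs(pdf_dir="pdf", txt_dir="txt"):
    """List (pdf_path, txt_path) pairs for PDFs that have no txt output yet."""
    pdf_dir, txt_dir = Path(pdf_dir), Path(txt_dir)
    jobs = []
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        # report.pdf in the pdf folder maps to report.txt in the txt folder
        txt_path = txt_dir / (pdf_path.stem + ".txt")
        if not txt_path.exists():
            jobs.append((pdf_path, txt_path))
    return jobs
```

Skipping files whose txt already exists means an interrupted run can be restarted without redoing finished PDFs.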


import os

import cv2
import numpy as np
from pdf2image import convert_from_path
from paddleocr import PaddleOCR

# Create the PaddleOCR instance once and reuse it for every file.
# lang='ch' selects the Chinese model; change lang to switch models.
# Set use_gpu=False if you installed the CPU version.
ocr = PaddleOCR(use_angle_cls=True, lang='ch', use_gpu=True)

os.makedirs('txt', exist_ok=True)

for file in os.listdir('pdf'):
    if not file.endswith('.pdf'):
        print(file)  # report non-PDF files that are skipped
        continue
    txt_path = os.path.join('txt', file.replace('.pdf', '.txt'))
    if os.path.exists(txt_path):
        continue  # already converted
    # Convert the PDF into a list of PIL images, one per page
    images = convert_from_path(os.path.join('pdf', file))

    with open(txt_path, 'w', encoding='utf-8') as txt_writer:
        # Run OCR on each page image
        for i, image in enumerate(images):
            print('Page {}'.format(i + 1))
            # Convert the PIL image (RGB) to a BGR array for recognition
            image = cv2.cvtColor(np.array(image), cv2.COLOR_RGB2BGR)
            result = ocr.ocr(image, cls=True)
            # Write the recognized text, one detected line per row
            try:
                for lines in result:
                    for line in lines:
                        txt_writer.write(line[1][0] + '\n')
            except Exception:
                print(file + ': recognition failed')
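
The script reads the text of each detected line as line[1][0], which suggests that every entry has the shape [box, (text, confidence)], grouped into one list per image. Under that assumption (taken from the indexing above, not from a guaranteed API contract), the extraction step can be sketched as a small function that also skips pages with no detected text instead of raising. The name result_to_lines and the sample data are invented for illustration.

```python
def result_to_lines(result):
    """Flatten a PaddleOCR-style result into a list of recognized strings."""
    texts = []
    for page in result or []:
        if not page:  # a page with no detected text may be None or empty
            continue
        for _box, (text, _confidence) in page:
            texts.append(text)
    return texts


# Hand-written sample in the assumed shape: [box, (text, confidence)] per line
sample = [
    [
        [[[0, 0], [10, 0], [10, 5], [0, 5]], ("Hello", 0.98)],
        [[[0, 6], [10, 6], [10, 11], [0, 11]], ("World", 0.91)],
    ],
    None,  # stands in for a blank page
]
print(result_to_lines(sample))  # ['Hello', 'World']
```

Handling the empty page explicitly keeps a single blank page from being reported as a failure for the whole file.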


4. Note

Since this code only extracts text from PDFs, recognition quality on a page degrades once the page contains pictures or tables. Please bear with this limitation.
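
One possible mitigation, not part of the original script, is to drop low-confidence lines, since text picked up from pictures or table borders may score lower than ordinary body text. The sketch below assumes each entry on a page has the shape [box, (text, confidence)]. The function name keep_confident and the 0.8 threshold are arbitrary choices for illustration and would need tuning on real documents.

```python
def keep_confident(page, min_confidence=0.8):
    """Return the texts on one page whose confidence is at least min_confidence."""
    return [
        text
        for _box, (text, confidence) in (page or [])
        if confidence >= min_confidence
    ]


# Hand-written sample; None stands in for the box coordinates
page = [
    [None, ("Chapter 1", 0.97)],
    [None, ("l|l||", 0.42)],
]
print(keep_confident(page))  # ['Chapter 1']
```

A stricter threshold removes more noise but can also discard genuine text, so treat it as a trade-off rather than a fix.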