Batch extract PDF table content

2024-07-12

1 Background

Extracting data from tables in PDF files is a common task in everyday office work. For example, we may want the table data from a company's annual report, and that PDF can run to hundreds of pages.

2 Traditional methods

Copying and pasting the data from each PDF table into an Excel sheet, one table at a time, is far too inefficient.

3 Office Automation

Here is my approach, which solves the problem with Python. It uses pdfplumber to read the PDF file, automatically extracts the tables on each page, and writes each table to a new Excel file in a loop. The idea is the same as the traditional method, but how long does it take? Don't blink: I only need to press the run button, and every step runs automatically, with all the files generated and named. I opened one or two at random to check, and the data in them was accurate.
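The naming step mentioned above could be sketched roughly as follows. This is only an illustration: the helper name `table_output_path` and the `_page…_table…` suffix are assumptions made for this example, not the naming used in the script in section 4.

```python
from pathlib import Path


def table_output_path(pdf_path: str, page_number: int, table_number: int) -> Path:
    """Build the Excel file name for one extracted table.

    page_number and table_number are 1-based, matching what a reader sees.
    The function name and the suffix pattern are hypothetical choices.
    """
    pdf = Path(pdf_path)
    # Reuse the PDF's folder and base name, then append page and table numbers.
    return pdf.with_name(f"{pdf.stem}_page{page_number}_table{table_number}.xlsx")


print(table_output_path("./report.pdf", 3, 2))  # report_page3_table2.xlsx
```

Deriving the output name from the PDF path keeps each Excel file next to its source and makes it easy to tell which page and table it came from.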

4 Code Implementation


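# Import the required packages
import pdfplumber
import pandas as pd

# Open the PDF file (Kweichow Moutai 2019 annual report); it is closed automatically at the end
with pdfplumber.open("./贵州茅台2019年年报.pdf") as pdf:
    # Loop over every page
    for i, page in enumerate(pdf.pages):
        # Extract all tables on the current page
        tables = page.extract_tables()
        print(f'Page {i+1} contains {len(tables)} table(s)')
        for j, table in enumerate(tables):
            # Build a DataFrame from the table rows
            df = pd.DataFrame(table)
            # Write one Excel file per table; the name encodes page (第…页) and table (第…张表)
            df.to_excel(f'./贵州茅台2019年年报_第{i+1}页_第{j+1}张表.xlsx')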
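Tables extracted this way often need a little cleanup before they go into Excel: `extract_tables()` returns each table as a list of rows, and a cell can be `None` or contain line breaks where the text wrapped. The following is a minimal sketch of such a cleanup step; the helper name `clean_table` and the sample data are made up for illustration and are not part of the original script.

```python
from typing import List, Optional

Row = List[Optional[str]]


def clean_table(table: List[Row]) -> List[List[str]]:
    """Normalize one table as returned by page.extract_tables() (hypothetical helper)."""
    cleaned = []
    for row in table:
        # Replace missing cells with "" and collapse line breaks and extra spaces.
        cells = [" ".join(cell.split()) if cell else "" for cell in row]
        # Skip rows in which every cell is empty.
        if any(cells):
            cleaned.append(cells)
    return cleaned


# Made-up sample shaped like pdfplumber's output: a list of rows of cell strings.
sample = [["Item", "2019\nAmount"], [None, None], ["Revenue", " 100 "]]
print(clean_table(sample))  # [['Item', '2019 Amount'], ['Revenue', '100']]
```

If the first row of a cleaned table is a header row (an assumption that will not hold for every table), it can be passed to pandas as `pd.DataFrame(rows[1:], columns=rows[0])` instead of `pd.DataFrame(rows)`. Note that joining wrapped lines with a space may not be what you want for Chinese text, where an empty separator might fit better.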
