
Batch-extracting tables from PDF files

2024-07-12


1 Background

Extracting data from tables in PDF files is a task that comes up regularly in everyday office work. For example, we might need the table data from a company's annual report, a PDF that can run to hundreds of pages.

2 Traditional methods

Copying the data out of the PDF and pasting it into Excel one table at a time is far too inefficient.

3 Office Automation

Here is my approach, using Python. pdfplumber reads the PDF file and extracts the tables on each page, and a loop writes each table to a new Excel file. The idea is the same as the manual method, but how long does it take? Don't blink: I only have to press the Run button. Every step runs automatically, and every file is generated and named for me. I opened one or two of the files at random to check, and the data in them was accurate.
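One detail is worth seeing before the full script. pdfplumber's extract_tables() returns each table as a list of rows, and each row as a list of cell strings, with None for empty cells. The sketch below shows one possible way to tidy such a table before it goes into Excel. The helper name clean_table, the sample data, and the assumption that the first non-empty row holds the column names are all mine, not part of the original script.

```python
def clean_table(table):
    """Split one extracted table into (header, rows) with tidy cell text.

    `table` is assumed to be a list of rows, each row a list of str or None,
    which is the shape extract_tables() produces for a single table.
    """
    cleaned = []
    for row in table:
        # Turn None into "", replace line breaks inside a cell with a space
        # (for Chinese text you may prefer to drop the line break entirely),
        # and trim surrounding whitespace
        cells = [(cell or "").replace("\n", " ").strip() for cell in row]
        # Skip rows in which every cell is empty
        if any(cells):
            cleaned.append(cells)
    if not cleaned:
        return [], []
    # Assumption: the first non-empty row holds the column names
    return cleaned[0], cleaned[1:]


# Made-up sample data, not taken from a real report
sample = [
    ["Item", "2019", "2018"],
    [None, None, None],
    ["Operating\nrevenue", "100", " 90 "],
]
header, rows = clean_table(sample)
print(header)
print(rows)
```

With pandas, the result can be passed on as pd.DataFrame(rows, columns=header), and calling to_excel with index=False keeps the row index out of the sheet.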

4 Code Implementation

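    # Import the required packages
    import pdfplumber
    import pandas as pd

    # Open the PDF file (Kweichow Moutai 2019 annual report);
    # the context manager closes it once the loop has finished
    with pdfplumber.open("./贵州茅台2019年年报.pdf") as pdf:
        # Loop over every page, counting from 1
        for i, page in enumerate(pdf.pages, start=1):
            # Extract all tables on this page
            tables = page.extract_tables()
            print(f"Page {i} contains {len(tables)} table(s)")
            for j, table in enumerate(tables, start=1):
                # Build a DataFrame from the table rows
                df = pd.DataFrame(table)
                # Write it to its own Excel file, named by page and table number
                df.to_excel(f"./贵州茅台2019年年报_第{i}页_第{j}张表.xlsx")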

Result: