2024-07-12
Extracting data from tables in a PDF file is a common task in everyday office work. For example, to collect the table data from a company's annual report, you may be facing a PDF with hundreds of pages.
Copying and pasting the data from each PDF table into Excel one by one is far too slow.
Here is how I solve it with Python. I use pdfplumber to open the PDF, extract the tables on each page, and write each table to a new Excel file in a loop. The idea is the same as the manual approach, but how long does it take? Don't blink: I only press the Run button, every step runs automatically, and every file is generated and named for me. I opened one or two files at random to check, and the data in them was accurate.
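    # Import the libraries (writing .xlsx with pandas also requires openpyxl)
    import pdfplumber
    import pandas as pd

    # Open the PDF file; the with block closes it when the loop finishes
    with pdfplumber.open("./贵州茅台2019年年报.pdf") as pdf:
        # Loop over every page
        for i, page in enumerate(pdf.pages, start=1):
            # Extract all tables on the current page
            tables = page.extract_tables()
            print(f"Page {i} has {len(tables)} table(s)")
            for j, table in enumerate(tables, start=1):
                # Build a DataFrame from the extracted rows
                df = pd.DataFrame(table)
                # Write each table to its own Excel file
                df.to_excel(f"./贵州茅台2019年年报_第{i}页_第{j}张表.xlsx")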
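Each table that `extract_tables()` returns is a plain list of rows, and each row is a list of cell values. Empty cells typically come back as `None`, and text that wraps inside a cell keeps its line breaks. As an optional extra, here is a minimal sketch of a cleanup step you could run before building the DataFrame. The helper names `clean_cell`, `clean_table`, and `output_name` are my own assumptions for illustration, not part of pdfplumber or the script above, and the sample rows are made up.

```python
def clean_cell(cell):
    """Return a tidy string for one extracted cell (hypothetical helper)."""
    if cell is None:
        return ""
    # Collapse line breaks and repeated whitespace left by wrapped cell text
    return " ".join(str(cell).split())


def clean_table(rows):
    """Clean every cell and drop rows that end up completely empty."""
    cleaned = [[clean_cell(cell) for cell in row] for row in rows]
    return [row for row in cleaned if any(row)]


def output_name(stem, page_no, table_no):
    """Build one Excel file name per table (the naming scheme is an assumption)."""
    return f"{stem}_page{page_no}_table{table_no}.xlsx"


# Made-up rows in the list-of-lists shape that extract_tables() returns
sample = [["Item", "2019\nAmount"], [None, None], ["Revenue", " 100 "]]
print(clean_table(sample))
print(output_name("report", 3, 2))
```

To use it, pass the cleaned rows to `pd.DataFrame` instead of the raw table. Dropping empty rows is a design choice: if blank rows carry meaning in your report, keep them.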
Result of running the script: