
Batch-extract the content of a specified area of PDF files to Excel, and automatically rename files by the first line of text in the PDF (v1.3), with ideas and code implementation

2024-07-12


This update adds support for image-based and scanned PDFs: the content of a specified area can now be batch-extracted from them as well. This works by taking a screenshot of the specified area and then running OCR on it to recognize the text. Accuracy may therefore be somewhat lower, although digits are usually recognized without much trouble. For the best extraction results, use a purely electronic (text-based) PDF file.
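Before a region can be cropped out of a rendered page image for OCR, its coordinates have to be mapped to pixels. A minimal sketch of that conversion, assuming the region is stored in PDF points (1/72 inch) and the page is rendered at a chosen DPI. The function name and the 300 DPI default are illustrative and are not taken from the original tool:

```python
import math
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def points_to_pixels(rect: Rect, dpi: int = 300) -> Tuple[int, int, int, int]:
    """Convert a region in PDF points to pixel coordinates at the given DPI.

    The box is rounded outward (floor on the top-left corner, ceil on the
    bottom-right corner) so that the crop never cuts into the region.
    """
    x0, y0, x1, y1 = rect
    return (
        math.floor(x0 * dpi / 72),
        math.floor(y0 * dpi / 72),
        math.ceil(x1 * dpi / 72),
        math.ceil(y1 * dpi / 72),
    )
```

The resulting pixel box can then be passed to whatever image-cropping and OCR step is in use. A higher DPI generally gives the OCR engine more detail to work with, at the cost of larger images.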


Requirement 1: I have a large number of electronic PDF documents in the same format, and I need to extract numbers or text from specific areas of each one.

Requirement 2: I have a batch of PDF documents whose file names are all garbled. I need to batch-rename these files using the title text on the first line of the first page of each PDF.

Note on unsuitable scenarios: if the area to be extracted sits at a different location in each PDF file, for example the number I want is at coordinates (30, 30) in the first file but at (35, 35) in the second, the software will not be able to extract the text reliably. This code therefore applies only when the PDF documents share a consistent layout and the text to be extracted is in essentially the same position in every file.

Idea for requirement 1: pick any PDF file as a sample, use the code to draw boxes around the areas to be extracted, and save the coordinates of those areas. When batch-processing the PDFs afterwards, extract the text or numbers at the corresponding positions using the saved coordinates.
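The save-then-reuse idea can be sketched as follows. This is an illustration under stated assumptions, not the downloadable code: it assumes a PDF library has already produced a list of words with bounding boxes in the form (x0, y0, x1, y1, text), with the origin at the top-left of the page. The JSON file format and all function names are hypothetical.

```python
import json
from typing import Dict, List, Tuple

Rect = Tuple[float, float, float, float]       # (x0, y0, x1, y1)
Word = Tuple[float, float, float, float, str]  # bounding box plus text


def save_regions(regions: Dict[str, Rect], path: str) -> None:
    """Store the boxes marked on the sample PDF so later runs can reuse them."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(regions, fh, indent=2)


def load_regions(path: str) -> Dict[str, Rect]:
    """Load the saved boxes; JSON lists are converted back to tuples."""
    with open(path, encoding="utf-8") as fh:
        return {name: tuple(box) for name, box in json.load(fh).items()}


def text_in_region(words: List[Word], region: Rect) -> str:
    """Join every word whose center point lies inside the region."""
    rx0, ry0, rx1, ry1 = region
    hits = []
    for x0, y0, x1, y1, text in words:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        if rx0 <= cx <= rx1 and ry0 <= cy <= ry1:
            hits.append((y0, x0, text))
    # Sort top-to-bottom, then left-to-right. Words on one line whose
    # top edges differ slightly may need a tolerance in real documents.
    hits.sort()
    return " ".join(text for _, _, text in hits)


def extract_row(words: List[Word], regions: Dict[str, Rect]) -> Dict[str, str]:
    """Build one output row (one per PDF) from the saved regions."""
    return {name: text_in_region(words, box) for name, box in regions.items()}
```

Each dictionary returned by `extract_row` corresponds to one row of the final spreadsheet, with the region names as column headers. Testing the center point of each word, rather than requiring the whole word to fit inside the box, makes the match slightly more tolerant of boxes that are drawn a little too tight.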

Idea diagram:

The final result diagram:

Limitations of this approach and points to note:

1 The data to be extracted must be at the same location in every file of a batch. For example, if the number to be extracted is at coordinates (100, 100) in the first PDF file, it must be at that position in every subsequent file; if the position changes, the required data cannot be extracted. Enlarging the coordinate range of the area mitigates this problem to some extent.

2 If the extracted text is incomplete, the selection box is probably slightly too small. The code includes a function for enlarging an individual area.
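Enlarging an area, as both points suggest, amounts to pushing each edge of the box outward. A minimal sketch, assuming boxes are (x0, y0, x1, y1) tuples with the origin at the top-left. The function name and the optional clamping to the page size are illustrative and may differ from the function in the downloadable code:

```python
from typing import Optional, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def expand_region(
    rect: Rect,
    margin: float,
    page_size: Optional[Tuple[float, float]] = None,
) -> Rect:
    """Grow a box by `margin` on every side.

    If `page_size` (width, height) is given, the result is clamped so the
    box never extends beyond the page.
    """
    x0, y0, x1, y1 = rect
    x0, y0, x1, y1 = x0 - margin, y0 - margin, x1 + margin, y1 + margin
    if page_size is not None:
        width, height = page_size
        x0, y0 = max(0.0, x0), max(0.0, y0)
        x1, y1 = min(width, x1), min(height, y1)
    return (x0, y0, x1, y1)
```

A larger margin tolerates more drift between files, but it also raises the chance of picking up neighboring text, so the margin is best kept as small as the documents allow.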


Idea for requirement 2: the file names of a batch of PDF documents are all garbled, and the files need to be batch-renamed according to the title on the first page of each PDF. This is straightforward: parse the PDF file, read the first line of its content, and rename the file accordingly. The code is not complicated, so it is not included on this page.
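For readers who want a starting point, the renaming step might look roughly like the sketch below. It is not the author's code. Reading the first line of a PDF is left to a caller-supplied function (`read_first_line`, a hypothetical name) that would wrap whichever PDF library is in use; the sketch only covers turning that line into a safe file name and avoiding name collisions.

```python
import re
from pathlib import Path
from typing import Callable, Dict

# Characters that are not allowed in file names on Windows, plus control characters.
INVALID_CHARS = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def title_to_filename(first_line: str, max_len: int = 100) -> str:
    """Turn a title line into a safe file name stem."""
    name = " ".join(first_line.split())   # collapse runs of whitespace
    name = INVALID_CHARS.sub("_", name)   # replace forbidden characters
    name = name[:max_len].strip(" .")     # trim length and stray dots/spaces
    return name or "untitled"


def unique_path(folder: Path, stem: str, suffix: str = ".pdf") -> Path:
    """Append _1, _2, ... until the name does not clash with an existing file."""
    candidate = folder / f"{stem}{suffix}"
    counter = 1
    while candidate.exists():
        candidate = folder / f"{stem}_{counter}{suffix}"
        counter += 1
    return candidate


def rename_by_first_line(
    folder: Path, read_first_line: Callable[[Path], str]
) -> Dict[str, str]:
    """Rename every *.pdf in `folder`; return a mapping of old name to new name."""
    renamed = {}
    # sorted() materializes the listing first, so renaming does not disturb the loop.
    for pdf in sorted(folder.glob("*.pdf")):
        stem = title_to_filename(read_first_line(pdf))
        if stem == pdf.stem:
            continue  # already has the desired name
        target = unique_path(folder, stem)
        pdf.rename(target)
        renamed[pdf.name] = target.name
    return renamed
```

Returning the old-to-new mapping makes it easy to log the changes or undo them if a batch turns out to have been renamed incorrectly.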

Code:

    from typing import Optional, Dict, List

    from solapi.magic_eden.site_api.utils.consts import MEAPIUrls
    from solapi.magic_eden.site_api.utils.data import (
        collection_stats_cleaner,
        collection_info_cleaner,
        collection_list_stats_cleaner,
    )
    from solapi.magic_eden.site_api.utils.types import MECollectionStats, MECollectionInfo, MECollectionMetrics
    from solapi.utils.api import BaseApi


    class MagicEdenCollectionApi(BaseApi):
        # The *_dirty methods return the raw API payload;
        # their counterparts pass that payload through a cleaner.

        def get_collection_stats_dirty(self, symbol: str) -> Optional[Dict]:
            url = f'{MEAPIUrls.COLLECTION_STATS}{symbol}'
            res = self._get_request(url)
            return res.get('results') if isinstance(res, dict) else None

        def get_collection_info_dirty(self, symbol: str) -> Optional[Dict]:
            url = f'{MEAPIUrls.COLLECTION_INFO}{symbol}'
            res = self._get_request(url)
            return res if bool(res) else None

        def get_collection_stats(self, symbol: str) -> Optional[MECollectionStats]:
            # Returns None implicitly when no data came back.
            data = self.get_collection_stats_dirty(symbol)
            if data:
                return collection_stats_cleaner(data)

        def get_collection_info(self, symbol: str) -> Optional[MECollectionInfo]:
            data = self.get_collection_info_dirty(symbol)
            if data:
                return collection_info_cleaner(data)

        def get_collection_list_stats_dirty(self):
            url = MEAPIUrls.COLLECTION_LIST_STATS
            res = self._get_request(url)
            return res.get('results') if isinstance(res, dict) else None

        def get_collection_list_stats(self) -> Optional[List[MECollectionMetrics]]:
            data = self.get_collection_list_stats_dirty()
            if data:
                return [collection_list_stats_cleaner(x) for x in data]

        def get_collection_list_dirty(self):
            url = MEAPIUrls.COLLECTION_LIST
            res = self._get_request(url)
            return res.get('collections') if isinstance(res, dict) else None

        def get_collection_list(self) -> Optional[List[MECollectionInfo]]:
            data = self.get_collection_list_dirty()
            if data:
                return [collection_info_cleaner(x) for x in data]

Code download link:

Link: https://pan.baidu.com/s/1WQQ8kaDilaagjoK5IrYZzA

Extraction code: 1111