An Analysis of an Open-Source Tool for Converting PDF to Markdown

2024-07-12

Marker: An Analysis of an Open-Source Tool for Converting PDF to Markdown
Marker is an open-source project developed by Vik Paruchuri and hosted on GitHub. Its core function is converting PDF files to Markdown. A closer look at the project follows.

Project overview:

Project link: https://github.com/VikParuchuri/marker.git
Maintainer: VikParuchuri
Main function: converts PDF to Markdown quickly and accurately, and supports a range of document types, in particular books and scientific papers.

Technical features:

Deep learning models: Marker uses a series of deep learning models to extract text, detect page layout, clean and format text blocks, and finally combine the blocks into a Markdown document.
OCR support: where OCR is needed, Marker can use OCR tools such as Surya and Tesseract to keep text extraction accurate.
Multi-platform support: Marker runs on GPU, CPU, or MPS, so it fits a range of hardware environments.
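
To make the GPU, CPU, or MPS choice concrete, here is a minimal sketch of how a script might pick a PyTorch device before running a conversion. This is an illustration only and not Marker's own selection code; the function name `pick_device` is hypothetical, and the sketch falls back to CPU when PyTorch is not installed.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch reports."""
    try:
        import torch  # optional dependency: fall back to CPU if it is missing
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # The MPS backend only exists in newer PyTorch builds, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```

On a machine without a supported GPU this prints `cpu`, which matches the `on device cpu` lines in the conversion log further down.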

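Feature details:

Document processing: removes headers, footers, and other artifacts; formats tables and code blocks; extracts and saves images.
Language support: Marker is designed to work with any language, and users can pass a list of languages to improve OCR results.
Equation conversion: most equations are converted to LaTeX, which makes it easy to embed mathematical formulas in the Markdown output.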
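
As a rough illustration of what header and footer removal involves, the sketch below drops any line that repeats on most pages. This is a simple frequency heuristic chosen for the example, not the model-based approach Marker uses; the function name, the `min_share` threshold, and the sample pages are all made up.

```python
from collections import Counter


def strip_repeated_lines(pages, min_share=0.6):
    """Drop lines that occur on at least `min_share` of the pages."""
    # Count each distinct non-empty line once per page.
    seen_on = Counter()
    for page in pages:
        seen_on.update({line.strip() for line in page if line.strip()})
    threshold = min_share * len(pages)
    repeated = {line for line, count in seen_on.items() if count >= threshold}
    return [[line for line in page if line.strip() not in repeated] for page in pages]


pages = [
    ["Journal of Examples", "Introduction", "Text on page one."],
    ["Journal of Examples", "More text on page two."],
    ["Journal of Examples", "Closing text on page three."],
]
cleaned = strip_repeated_lines(pages)
print(cleaned[0])  # ['Introduction', 'Text on page one.']
```

A heuristic like this misses headers that change per page (for example, ones that include a page number), which is one reason a layout model is the more robust choice.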

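Performance:

Speed and accuracy: Marker performs well on both speed and accuracy, with a notable advantage over other tools such as nougat.
Resource usage: on an A6000 Ada, each task uses about 4 GB of VRAM on average, which allows several documents to be processed in parallel.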
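
The 4 GB per task figure gives a quick way to estimate how many documents can be processed side by side. The sketch below does that arithmetic; the 48 GB card size and the optional reserve are assumptions for the example, not numbers from the project.

```python
def max_parallel_tasks(total_vram_gb, vram_per_task_gb=4.0, reserve_gb=0.0):
    """Estimate how many tasks fit in VRAM at a fixed per-task cost."""
    usable = total_vram_gb - reserve_gb
    if usable <= 0:
        return 0
    return int(usable // vram_per_task_gb)


# Assumed card sizes; about 4 GB per task is the average quoted above.
print(max_parallel_tasks(48))                # 12
print(max_parallel_tasks(48, reserve_gb=2))  # 11
print(max_parallel_tasks(24))                # 6
```

Because 4 GB is an average, leaving some headroom with `reserve_gb` is a reasonable precaution against out-of-memory errors on larger documents.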

Usage:

Installation: install the marker-pdf package with pip.

pip install marker-pdf 

(GraphRAG) PS D:python-workspaceGraphRAG> pip install marker-pdf 
Looking in indexes: https://mirrors.aliyun.com/pypi/simple/
Collecting marker-pdf
  Downloading https://mirrors.aliyun.com/pypi/packages/05/c1/782f56407ea60bd35c127c829b8e43da99a0da41f6c9ee002cab97e430c5/marker_pdf-0.2.15-py3-none-any.whl (63 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63.2/63.2 kB 563.9 kB/s eta 0:00:00
Requirement already satisfied: Pillow<11.0.0,>=10.1.0 in e:programdataminiconda3envsgraphraglibsite-packages (from marker-pdf) (10.4.0)
Requirement already satisfied: filetype<2.0.0,>=1.2.0 in e:programdataminiconda3envsgraphraglibsite-packages (from marker-pdf) (1.2.0)
Collecting ftfy<7.0.0,>=6.1.1 (from marker-pdf)
  Downloading https://mirrors.aliyun.com/pypi/packages/f4/f0/21efef51304172736b823689aaf82f33dbc64f54e9b046b75f5212d5cee7/ftfy-6.2.0-py3-none-any.whl (54 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 54.4/54.4 kB 353.5 kB/s eta 0:00:00
Requirement already satisfied: grpcio<2.0.0,>=1.63.0 in e:programdataminiconda3envsgraphraglibsite-packages (from marker-pdf) (1.64.1)
Requirement already satisfied: numpy<2.0.0,>=1.26.1 in e:programdataminiconda3envsgraphraglibsite-packages (from marker-pdf) (1.26.4)
Collecting pdftext<0.4.0,>=0.3.10 (from marker-pdf)
  Downloading https://mirrors.aliyun.com/pypi/packages/54/78/8dd39d5ed3b90fb7ecaa20f92ff09c4594877a88501de6352d22e8c53aa0/pdftext-0.3.10-py3-none-any.whl (25 kB)
Requirement already satisfied: pydantic<3.0.0,>=2.4.2 in e:programdataminiconda3envsgraphraglibsite-packages (from marker-pdf) (2.8.0)
Collecting pydantic-settings<3.0.0,>=2.0.3 (from marker-pdf)
  Downloading https://mirrors.aliyun.com/pypi/packages/e8/4f/aad03d5f711717d94d7de9684cb542343b392df1ad6889118636674fc983/pydantic_settings-2.3.4-py3-none-any.whl (22 kB)
Requirement already satisfied: python-dotenv<2.0.0,>=1.0.0 in e:programdataminiconda3envsgraphraglibsite-packages (from marker-pdf) (1.0.1)
Collecting rapidfuzz<4.0.0,>=3.8.1 (from marker-pdf)
  Downloading https://mirrors.aliyun.com/pypi/packages/60/a6/6c2f5e9be933150a6d55ffce4ff6d9701ddfc5b267c789a84674eadbd373/rapidfuzz-3.9.4-cp311-cp311-win_amd64.whl (1.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 684.6 kB/s eta 0:00:00
Requirement already satisfied: regex<2025.0.0,>=2024.4.28 in e:programdataminiconda3envsgraphraglibsite-packages (from marker-pdf) (2024.5.15)
Requirement already satisfied: scikit-learn<2.0.0,>=1.3.2 in e:programdataminiconda3envsgraphraglibsite-packages (from marker-pdf) (1.5.0)
Collecting surya-ocr<0.5.0,>=0.4.14 (from marker-pdf)
  Downloading https://mirrors.aliyun.com/pypi/packages/62/a8/dd78c484fa9a459e388a31aa3a45d23eb454c6aeb2a17710284631088615/surya_ocr-0.4.14-py3-none-any.whl (94 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 94.5/94.5 kB 773.2 kB/s eta 0:00:00
Collecting tabulate<0.10.0,>=0.9.0 (from marker-pdf)
  Downloading https://mirrors.aliyun.com/pypi/packages/40/44/4a5f08c96eb108af5cb50b41f76142f0afa346dfa99d5296fe7202a11854/tabulate-0.9.0-py3-none-any.whl (35 kB)
Collecting texify<0.2.0,>=0.1.10 (from marker-pdf)
  Downloading https://mirrors.aliyun.com/pypi/packages/76/26/c12d194dd90bd78b524a7054e9125685efc32149d29005ca61c72ff4c126/texify-0.1.10-py3-none-any.whl (30 kB)
Collecting torch<3.0.0,>=2.2.2 (from marker-pdf)
  Downloading https://mirrors.aliyun.com/pypi/packages/d3/1d/a257913c89572de61316461db91867f87519146e58132cdeace3d9ffbe1f/torch-2.3.1-cp311-cp311-win_amd64.whl (159.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 159.8/159.8 MB 635.4 kB/s eta 0:00:00
Requirement already satisfied: tqdm<5.0.0,>=4.66.1 in e:programdataminiconda3envsgraphraglibsite-packages (from marker-pdf) (4.66.4)
Collecting transformers<5.0.0,>=4.36.2 (from marker-pdf)
  Downloading https://mirrors.aliyun.com/pypi/packages/20/5c/244db59e074e80248fdfa60495eeee257e4d97c3df3487df68be30cd60c8/transformers-4.42.3-py3-none-any.whl (9.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.3/9.3 MB 639.8 kB/s eta 0:00:00
Collecting wcwidth<0.3.0,>=0.2.12 (from ftfy<7.0.0,>=6.1.1->marker-pdf)
  Downloading https://mirrors.aliyun.com/pypi/packages/fd/84/fd2ba7aafacbad3c4201d395674fc6348826569da3c0937e75505ead3528/wcwidth-0.2.13-py2.py3-none-any.whl (34 kB)
Collecting pypdfium2<5.0.0,>=4.29.0 (from pdftext<0.4.0,>=0.3.10->marker-pdf)
  Downloading https://mirrors.aliyun.com/pypi/packages/25/bd/56d9ec6b9f0fc4e0d95288759f3179f0fcd34b1a1526b75673d2f6d5196f/pypdfium2-4.30.0-py3-none-win_amd64.whl (2.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.9/2.9 MB 627.3 kB/s eta 0:00:00
Requirement already satisfied: annotated-types>=0.4.0 in e:programdataminiconda3envsgraphraglibsite-packages (from pydantic<3.0.0,>=2.4.2->marker-pdf) (0.7.0)
Requirement already satisfied: pydantic-core==2.20.0 in e:programdataminiconda3envsgraphraglibsite-packages (from pydantic<3.0.0,>=2.4.2->marker-pdf) (2.20.0)
Requirement already satisfied: typing-extensions>=4.6.1 in e:programdataminiconda3envsgraphraglibsite-packages (from pydantic<3.0.0,>=2.4.2->marker-pdf) (4.12.2)
Requirement already satisfied: scipy>=1.6.0 in e:programdataminiconda3envsgraphraglibsite-packages (from scikit-learn<2.0.0,>=1.3.2->marker-pdf) (1.12.0)
Requirement already satisfied: joblib>=1.2.0 in e:programdataminiconda3envsgraphraglibsite-packages (from scikit-learn<2.0.0,>=1.3.2->marker-pdf) (1.4.2)
Requirement already satisfied: threadpoolctl>=3.1.0 in e:programdataminiconda3envsgraphraglibsite-packages (from scikit-learn<2.0.0,>=1.3.2->marker-pdf) (3.5.0)
Collecting opencv-python<5.0.0.0,>=4.9.0.80 (from surya-ocr<0.5.0,>=0.4.14->marker-pdf)
  Downloading https://mirrors.aliyun.com/pypi/packages/ec/6c/fab8113424af5049f85717e8e527ca3773299a3c6b02506e66436e19874f/opencv_python-4.10.0.84-cp37-abi3-win_amd64.whl (38.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 38.8/38.8 MB 546.5 kB/s eta 0:00:00
Collecting filelock (from torch<3.0.0,>=2.2.2->marker-pdf)
  Downloading https://mirrors.aliyun.com/pypi/packages/ae/f0/48285f0262fe47103a4a45972ed2f9b93e4c80b8fd609fa98da78b2a5706/filelock-3.15.4-py3-none-any.whl (16 kB)
Collecting sympy (from torch<3.0.0,>=2.2.2->marker-pdf)
  Downloading https://mirrors.aliyun.com/pypi/packages/61/53/e18c8c97d0b2724d85c9830477e3ebea3acf1dcdc6deb344d5d9c93a9946/sympy-1.12.1-py3-none-any.whl (5.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.7/5.7 MB 624.0 kB/s eta 0:00:00
Requirement already satisfied: networkx in e:programdataminiconda3envsgraphraglibsite-packages (from torch<3.0.0,>=2.2.2->marker-pdf) (3.3)
Collecting jinja2 (from torch<3.0.0,>=2.2.2->marker-pdf)
  Downloading https://mirrors.aliyun.com/pypi/packages/31/80/3a54838c3fb461f6fec263ebf3a3a41771bd05190238de3486aae8540c36/jinja2-3.1.4-py3-none-any.whl (133 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 133.3/133.3 kB 492.1 kB/s eta 0:00:00
Requirement already satisfied: fsspec in e:programdataminiconda3envsgraphraglibsite-packages (from torch<3.0.0,>=2.2.2->marker-pdf) (2024.6.1)
Collecting mkl<=2021.4.0,>=2021.1.1 (from torch<3.0.0,>=2.2.2->marker-pdf)
  Downloading https://mirrors.aliyun.com/pypi/packages/fe/1c/5f6dbf18e8b73e0a5472466f0ea8d48ce9efae39bd2ff38cebf8dce61259/mkl-2021.4.0-py2.py3-none-win_amd64.whl (228.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 228.5/228.5 MB 597.0 kB/s eta 0:00:00
Requirement already satisfied: colorama in e:programdataminiconda3envsgraphraglibsite-packages (from tqdm<5.0.0,>=4.66.1->marker-pdf) (0.4.6)
Collecting huggingface-hub<1.0,>=0.23.2 (from transformers<5.0.0,>=4.36.2->marker-pdf)
  Downloading https://mirrors.aliyun.com/pypi/packages/69/d6/73f9d1b7c4da5f0544bc17680d0fa9932445423b90cd38e1ee77d001a4f5/huggingface_hub-0.23.4-py3-none-any.whl (402 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 402.6/402.6 kB 598.2 kB/s eta 0:00:00
Requirement already satisfied: packaging>=20.0 in e:programdataminiconda3envsgraphraglibsite-packages (from transformers<5.0.0,>=4.36.2->marker-pdf) (23.2)
Requirement already satisfied: pyyaml>=5.1 in e:programdataminiconda3envsgraphraglibsite-packages (from transformers<5.0.0,>=4.36.2->marker-pdf) (6.0.1)
Requirement already satisfied: requests in e:programdataminiconda3envsgraphraglibsite-packages (from transformers<5.0.0,>=4.36.2->marker-pdf) (2.32.3)
Collecting safetensors>=0.4.1 (from transformers<5.0.0,>=4.36.2->marker-pdf)
  Downloading https://mirrors.aliyun.com/pypi/packages/cb/f6/19f268662be898ff2a23ac06f8dd0d2956b2ecd204c96e1ee07ba292c119/safetensors-0.4.3-cp311-none-win_amd64.whl (287 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 287.3/287.3 kB 571.3 kB/s eta 0:00:00
Collecting tokenizers<0.20,>=0.19 (from transformers<5.0.0,>=4.36.2->marker-pdf)
  Downloading https://mirrors.aliyun.com/pypi/packages/65/8e/6d7d72b28f22c422cff8beae10ac3c2e4376b9be721ef8167b7eecd1da62/tokenizers-0.19.1-cp311-none-win_amd64.whl (2.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.2/2.2 MB 625.9 kB/s eta 0:00:00
Collecting intel-openmp==2021.* (from mkl<=2021.4.0,>=2021.1.1->torch<3.0.0,>=2.2.2->marker-pdf)
  Downloading https://mirrors.aliyun.com/pypi/packages/6f/21/b590c0cc3888b24f2ac9898c41d852d7454a1695fbad34bee85dba6dc408/intel_openmp-2021.4.0-py2.py3-none-win_amd64.whl (3.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.5/3.5 MB 487.9 kB/s eta 0:00:00
Collecting tbb==2021.* (from mkl<=2021.4.0,>=2021.1.1->torch<3.0.0,>=2.2.2->marker-pdf)
  Downloading https://mirrors.aliyun.com/pypi/packages/f1/24/500811330b3b070e5995c3275181dbcd00c06cef26c6ebfe6ee1ca9b6223/tbb-2021.13.0-py3-none-win_amd64.whl (286 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 286.9/286.9 kB 505.8 kB/s eta 0:00:00
Collecting MarkupSafe>=2.0 (from jinja2->torch<3.0.0,>=2.2.2->marker-pdf)
  Downloading https://mirrors.aliyun.com/pypi/packages/b7/a2/c78a06a9ec6d04b3445a949615c4c7ed86a0b2eb68e44e7541b9d57067cc/MarkupSafe-2.1.5-cp311-cp311-win_amd64.whl (17 kB)
Requirement already satisfied: charset-normalizer<4,>=2 in e:programdataminiconda3envsgraphraglibsite-packages (from requests->transformers<5.0.0,>=4.36.2->marker-pdf) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in e:programdataminiconda3envsgraphraglibsite-packages (from requests->transformers<5.0.0,>=4.36.2->marker-pdf) (3.7)
Requirement already satisfied: urllib3<3,>=1.21.1 in e:programdataminiconda3envsgraphraglibsite-packages (from requests->transformers<5.0.0,>=4.36.2->marker-pdf) (2.2.2)
Requirement already satisfied: certifi>=2017.4.17 in e:programdataminiconda3envsgraphraglibsite-packages (from requests->transformers<5.0.0,>=4.36.2->marker-pdf) (2024.6.2)
Collecting mpmath<1.4.0,>=1.1.0 (from sympy->torch<3.0.0,>=2.2.2->marker-pdf)
  Downloading https://mirrors.aliyun.com/pypi/packages/43/e3/7d92a15f894aa0c9c4b49b8ee9ac9850d6e63b03c9c32c0367a13ae62209/mpmath-1.3.0-py3-none-any.whl (536 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 kB 495.3 kB/s eta 0:00:00
Installing collected packages: wcwidth, tbb, mpmath, intel-openmp, tabulate, sympy, safetensors, rapidfuzz, pypdfium2, opencv-python, mkl, MarkupSafe, ftfy, filelock, jinja2, huggingface-hub, torch, tokenizers, pydantic-settings, transformers, pdftext, texify, surya-ocr, marker-pdf
Successfully installed MarkupSafe-2.1.5 filelock-3.15.4 ftfy-6.2.0 huggingface-hub-0.23.4 intel-openmp-2021.4.0 jinja2-3.1.4 marker-pdf-0.2.15 mkl-2021.4.0 mpmath-1.3.0 opencv-python-4.10.0.84 pdftext-0.3.10 pydantic-settings-2.3.4 pypdfium2-4.30.0 rapidfuzz-3.9.4 safetensors-0.4.3 surya-ocr-0.4.14 sympy-1.12.1 tabulate-0.9.0 tbb-2021.13.0 texify-0.1.10 tokenizers-0.19.1 torch-2.3.1 transformers-4.42.3 wcwidth-0.2.13
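
After an install like the one logged above, a short check can confirm which version ended up in the active environment. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("marker-pdf")
print(version if version else "marker-pdf is not installed in this environment")
```

In the environment from the log this would be expected to report 0.2.15, the version pip says it installed.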


Usage example:

```bash
(GraphRAG) PS D:python-workspaceGraphRAG> marker_single GPT.pdf ./folder --batch_multiplier 2 --max_pages 52 --langs English
config.json: 100%|█████████████████████████████████████████████████████████████████████| 1.18k/1.18k [00:00<?, ?B/s] 
model.safetensors: 100%|█████████████████████████████████████████████████████████| 120M/120M [00:07<00:00, 16.7MB/s] 
Loaded detection model vikp/surya_det2 on device cpu with dtype torch.float32
preprocessor_config.json: 100%|████████████████████████████████████████████████████████████| 430/430 [00:00<?, ?B/s] 
config.json: 100%|█████████████████████████████████████████████████████████████████████| 1.57k/1.57k [00:00<?, ?B/s] 
model.safetensors: 100%|█████████████████████████████████████████████████████████| 120M/120M [00:06<00:00, 18.0MB/s] 
Loaded detection model vikp/surya_layout2 on device cpu with dtype torch.float32
preprocessor_config.json: 100%|████████████████████████████████████████████████████████████| 430/430 [00:00<?, ?B/s] 
config.json: 100%|█████████████████████████████████████████████████████████████████████| 5.04k/5.04k [00:00<?, ?B/s] 
model.safetensors: 100%|█████████████████████████████████████████████████████████| 550M/550M [00:34<00:00, 16.2MB/s] 
generation_config.json: 100%|██████████████████████████████████████████████████████████████| 160/160 [00:00<?, ?B/s] 
Loaded reading order model vikp/surya_order on device cpu with dtype torch.float32
preprocessor_config.json: 100%|████████████████████████████████████████████████████████████| 684/684 [00:00<?, ?B/s] 
config.json: 100%|█████████████████████████████████████████████████████████████| 6.91k/6.91k [00:00<00:00, 6.82MB/s] 
model.safetensors: 100%|███████████████████████████████████████████████████████| 1.05G/1.05G [01:04<00:00, 16.2MB/s] 
generation_config.json: 100%|██████████████████████████████████████████████████████████████| 181/181 [00:00<?, ?B/s]
Loaded recognition model vikp/surya_rec on device cpu with dtype torch.float32
preprocessor_config.json: 100%|█████████████████████████████████████████████████████| 608/608 [00:00<00:00, 605kB/s]
config.json: 100%|█████████████████████████████████████████████████████████████████████| 4.92k/4.92k [00:00<?, ?B/s]
model.safetensors: 100%|█████████████████████████████████████████████████████████| 625M/625M [00:38<00:00, 16.4MB/s]
generation_config.json: 100%|██████████████████████████████████████████████████████████████| 191/191 [00:00<?, ?B/s]
Loaded texify model to cpu with torch.float32 dtype
preprocessor_config.json: 100%|████████████████████████████████████████████████████████████| 617/617 [00:00<?, ?B/s]
tokenizer_config.json: 100%|███████████████████████████████████████████████████████████| 4.49k/4.49k [00:00<?, ?B/s]
tokenizer.json: 100%|██████████████████████████████████████████████████████████| 2.14M/2.14M [00:00<00:00, 2.85MB/s]
added_tokens.json: 100%|███████████████████████████████████████████████████████████████| 18.3k/18.3k [00:00<?, ?B/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████████| 552/552 [00:00<00:00, 6.29MB/s] 
Detecting bboxes: 100%|███████████████████████████████████████████████████████████████| 7/7 [05:49<00:00, 49.99s/it] 
Recognizing Text: 100%|███████████████████████████████████████████████████████████████| 1/1 [00:11<00:00, 11.37s/it] 
Detecting bboxes: 100%|███████████████████████████████████████████████████████████████| 5/5 [05:32<00:00, 66.45s/it] 
Finding reading order: 100%|██████████████████████████████████████████████████████████| 5/5 [03:15<00:00, 39.04s/it] 
Saved markdown to the ./folderGPT folder
```
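
The progress lines in the log also show how long the CPU run took. As a small worked example, the sketch below parses the elapsed time of each processing stage and adds them up. The log excerpt is copied from the run above with the progress bars shortened; the regular expression assumes the `MM:SS` elapsed format seen in this log and would need adjusting for runs longer than an hour.

```python
import re

LOG = """\
Detecting bboxes: 100%|████| 7/7 [05:49<00:00, 49.99s/it]
Recognizing Text: 100%|████| 1/1 [00:11<00:00, 11.37s/it]
Detecting bboxes: 100%|████| 5/5 [05:32<00:00, 66.45s/it]
Finding reading order: 100%|████| 5/5 [03:15<00:00, 39.04s/it]
"""

# Capture the stage name and the elapsed MM:SS that follows the opening bracket.
PATTERN = re.compile(r"^(?P<stage>[^:]+):.*\[(?P<m>\d+):(?P<s>\d{2})<", re.MULTILINE)


def stage_seconds(log):
    """Return (stage, elapsed_seconds) pairs in the order they appear."""
    return [(m["stage"], int(m["m"]) * 60 + int(m["s"])) for m in PATTERN.finditer(log)]


stages = stage_seconds(LOG)
total = sum(seconds for _, seconds in stages)
minutes, seconds = divmod(total, 60)
print(f"{total} s = {minutes} min {seconds} s")  # 887 s = 14 min 47 s
```

So the processing stages alone took just under 15 minutes on CPU, not counting the model downloads, and the two bounding-box detection passes account for most of it.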

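Configuration: users can adjust Marker's behavior through environment variables or a configuration file, for example to set the OCR engine, choose the GPU device, or configure memory usage.
Command-line tools: Marker provides command-line tools that convert a single PDF or several PDFs in batch.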
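
One way to combine the two points above is a small script that prepares one `marker_single` command per PDF and an environment with a device override. This is a sketch under assumptions: the helper names are hypothetical, the flags are the ones used in the usage example in this post, and the `TORCH_DEVICE` variable name should be checked against the project's settings for the installed version.

```python
import os
from pathlib import Path


def build_batch_commands(input_dir, output_dir, langs="English", max_pages=None):
    """Build one marker_single command (as an argument list) per PDF in input_dir."""
    commands = []
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        cmd = ["marker_single", str(pdf), str(output_dir), "--langs", langs]
        if max_pages is not None:
            cmd += ["--max_pages", str(max_pages)]
        commands.append(cmd)
    return commands


def build_env(torch_device=None):
    """Copy the current environment and optionally set a device override."""
    env = dict(os.environ)
    if torch_device:
        env["TORCH_DEVICE"] = torch_device  # assumed variable name; verify before use
    return env


# To actually run the conversions (requires marker-pdf to be installed):
#   import subprocess
#   for cmd in build_batch_commands("pdfs", "out"):
#       subprocess.run(cmd, env=build_env("cpu"), check=True)
```

Building argument lists instead of shell strings avoids quoting problems with file names that contain spaces.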




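Commercial use and licensing:

Commercial restrictions: research and personal use are free, but commercial use is subject to some restrictions. The model weights are licensed under cc-by-nc-sa-4.0, although the author waives this for qualifying small organizations.
Dual licensing: a dual-license option is available for commercial users who need to remove the GPL license requirements or who exceed the revenue limit.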


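Community and support:

Discord community: users can discuss Marker's future development and related questions on Discord.
Documentation and examples: the GitHub repository provides detailed documentation and examples to help users get started quickly.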



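Summary:
Marker is a capable, easy-to-use PDF-to-Markdown tool that combines deep learning models with OCR to convert documents efficiently and accurately. It supports many document types and languages, offers a range of configuration options and command-line tools, and is backed by community support and thorough documentation.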