arxiv:2409.18839

MinerU: An Open-Source Solution for Precise Document Content Extraction

Published on Sep 27, 2024
Submitted by wanderkid on Sep 30, 2024

Abstract

Document content analysis has been a crucial research area in computer vision. Despite significant advancements in methods such as OCR, layout detection, and formula recognition, existing open-source solutions struggle to consistently deliver high-quality content extraction due to the diversity in document types and content. To address these challenges, we present MinerU, an open-source solution for high-precision document content extraction. MinerU leverages the sophisticated PDF-Extract-Kit models to extract content from diverse documents effectively and employs finely-tuned preprocessing and postprocessing rules to ensure the accuracy of the final results. Experimental results demonstrate that MinerU consistently achieves high performance across various document types, significantly enhancing the quality and consistency of content extraction. The MinerU open-source project is available at https://github.com/opendatalab/MinerU.

Community


@wanderkid There seems to be no PDF available on the arXiv page: "No document for '2409.18839'".
https://arxiv.org/pdf/2409.18839 returns an error, but
https://arxiv.org/pdf/2409.18839? (with the trailing question mark) seems to work for some reason.

Paper author Paper submitter

@KT313 Thank you for pointing this out. It seems there is a temporary issue with the arXiv server. Please wait for a while and try accessing the document again later. If the problem persists, feel free to contact us.

