[ArXiv] PDF-Wukong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling
Xudong Xie*, Liang Yin*, Hao Yan*, Yang Liu*, Jing Ding, Minghui Liao, Yuliang Liu, Wei Chen, Xiang Bai
💡 Monkey series projects:✨.
[CVPR'24] Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models
Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, Xiang Bai
TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document
Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, Xiang Bai
Mini-Monkey: Multi-Scale Adaptive Cropping for Multimodal Large Language Models
Mingxin Huang, Yuliang Liu, Dingkang Liang, Lianwen Jin, Xiang Bai
2024.10.10
🚀 We release the paper PDF-Wukong.
Text-only and Image-only indicate that the QA pairs are generated based on either a single text paragraph or an image extracted from the PDF. Meanwhile, Image-text, Section, and Cross-paragraph denote that the QA pairs are generated from a paragraph and its corresponding references, an entire section, or non-contiguous paragraphs, respectively.
PaperPDF is publicly available on Hugging Face Datasets: PaperPDF.
The repository is structured as follows.
PaperPDF
│
├── Original PDFs          # Original PDF documents
│
├── filter.py              # Code for filtering data based on rules
│
├── Parsed Data
│   ├── PaperPDF.py        # Code for extracting text and image information from XML documents
│   ├── pdf_xml            # XML files generated by Grobid from the PDF documents
│   └── pdf_figure
│       ├── figure         # Extracted images from the PDF documents
│       └── data           # Metadata of the images
│
├── Train
│   ├── train_100w.jsonl   # The complete training set (1 million samples)
│   ├── train_50w.jsonl    # 500,000 training samples for ablation studies
│   └── train_10w.jsonl    # 100,000 training samples for ablation studies
│
└── Test
    └── test.jsonl         # The test set
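The train and test splits are stored as `.jsonl` files. A minimal sketch for streaming them is given here, assuming each non-empty line holds one JSON object (as the extension suggests). The helper name `iter_records` and the path in the usage note are illustrative and not part of the repository.

```python
import json
from pathlib import Path
from typing import Iterator, Union


def iter_records(path: Union[str, Path]) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSON Lines file.

    Assumption: every non-empty line is a standalone JSON object.
    Reading line by line avoids loading the whole file into memory,
    which matters for the larger training splits.
    """
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

For example, `for record in iter_records("PaperPDF/Test/test.jsonl"): ...` would walk the test set one record at a time, with the path adjusted to wherever the dataset was downloaded.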
For each instance in the dataset, the following fields are provided:
{
  {
    "PDF name": "1507.04291v1",
    "Category": "single-text_img",
    "Query": "According to Table 1, which sections discuss TCB-included Chebyshev kernels for both position and velocity?",
    "Answer": ["Sections 5.3.3 and 5.3.4 discuss TCB-included Chebyshev kernels for both position and velocity.", "Sections 5.3.3."],
    "Evidence": {
      "Texts": [{"idx": 11, "Text": "The six SPK data types, listed in Table 1, for ephemerides of natural solar system bodies..."}],
      "Figures": [{"idx": 220, "Caption": "Table 1: Double precision kernel data types of interest.", "Figure": "1507.04291v1-Table1-1.png"}]
    }
  }
  ...
}
PDF name: a string containing the name of the PDF document.
Category: a string representing the category of the query, which can be one of the following: single-text_only, single-image_only, multi-text_image, multi-section, multi-cross_paragraph.
Query: a string containing the question posed to the PDF.
Answer: an array of the two generated answers; the training set and the test set use different prompts for the answers (see the Dataset Creation section below for more details).
Evidence: an object containing supporting texts and figures (if provided) from the PDF document.
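To illustrate how these fields fit together, here is a hedged sketch that takes one already-parsed record and pulls out its category parts and evidence. The helper names `split_category` and `collect_evidence` are hypothetical. The split on the first hyphen assumes category strings follow the `scope-kind` pattern of the values listed above, and the evidence lookup assumes the `Texts` and `Figures` keys from the example instance, either of which may be absent.

```python
from typing import Dict, List, Tuple


def split_category(category: str) -> Tuple[str, str]:
    """Split a category such as 'multi-cross_paragraph' at its first hyphen.

    Assumption: the part before the hyphen is the scope ('single' or
    'multi') and the remainder names the evidence type.
    """
    scope, _, kind = category.partition("-")
    return scope, kind


def collect_evidence(record: Dict) -> Tuple[List[str], List[str]]:
    """Return (evidence texts, figure file names) for one record.

    Missing or empty 'Texts' / 'Figures' entries yield empty lists,
    since figures are only present when provided.
    """
    evidence = record.get("Evidence") or {}
    texts = [item["Text"] for item in (evidence.get("Texts") or [])]
    figures = [item["Figure"] for item in (evidence.get("Figures") or [])]
    return texts, figures


# A shortened record, modelled on the example instance above.
sample = {
    "PDF name": "1507.04291v1",
    "Category": "multi-text_image",
    "Evidence": {
        "Texts": [{"idx": 11, "Text": "The six SPK data types, listed in Table 1..."}],
        "Figures": [{"idx": 220, "Figure": "1507.04291v1-Table1-1.png"}],
    },
}
scope, kind = split_category(sample["Category"])
texts, figures = collect_evidence(sample)
```

The figure names returned this way are file names only; resolving them against the `pdf_figure/figure` directory is left to the caller.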
If you wish to refer to the baseline results published here, please use the following BibTeX entry:
@article{xie2024pdfwukong,
title={PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling},
author={Xie, Xudong and Yin, Liang and Yan, Hao and Liu, Yang and Ding, Jing and Liao, Minghui and Liu, Yuliang and Chen, Wei and Bai, Xiang},
year={2024},
journal={arXiv preprint arXiv:2410.05970},
url={https://arxiv.org/abs/2410.05970},
}
The PDF-Wukong project is intended for non-commercial use only. For commercial inquiries, please contact haoyan at [email protected].