feat(loader): implement markdown parsing in MathpixPDFReader #498

eliasjudin · 2024-11-15T13:11:43Z

Add functionality to properly handle PDF content:

- Add parse_markdown_text_to_tables method to separate tables and text
- Fix load_data implementation to properly process documents
- Fix lazy_load_data method
- Improve document metadata handling for tables and text sections

The loader now correctly processes PDFs through the Mathpix API and converts the content to proper Document objects.

Description

Added functionality to properly parse and process PDF content in MathpixPDFReader:
- Implemented parse_markdown_text_to_tables method to correctly separate tables and text sections
- Fixed load_data implementation to process documents with proper metadata
- Added lazy_load_data method for memory-efficient document loading
- Improved document metadata handling for both tables and text sections

The loader now correctly processes PDFs through the Mathpix API and converts the content to proper Document objects.
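
To illustrate the table/text separation described above, here is a minimal sketch. It is not the code from this PR: the function names `split_tables_and_text` and `to_records` are hypothetical, and it assumes tables appear in the Mathpix output as `\begin{table} ... \end{table}` blocks, which may not match the real response format.

```python
import re
from typing import Dict, List, Tuple

# Assumption: tables are delimited by LaTeX table environments.
TABLE_RE = re.compile(r"\\begin\{table\}.*?\\end\{table\}", re.DOTALL)


def split_tables_and_text(content: str) -> Tuple[List[str], List[str]]:
    """Return (tables, texts) found in the given content string."""
    tables = [m.group(0).strip() for m in TABLE_RE.finditer(content)]
    # Splitting on the same pattern leaves only the non-table parts.
    texts = [part.strip() for part in TABLE_RE.split(content) if part.strip()]
    return tables, texts


def to_records(content: str) -> List[Dict[str, object]]:
    """Attach a simple 'type' metadata field to every section (hypothetical schema)."""
    tables, texts = split_tables_and_text(content)
    records = [{"text": t, "metadata": {"type": "table"}} for t in tables]
    records += [{"text": t, "metadata": {"type": "text"}} for t in texts]
    return records


sample = "Intro text.\n\\begin{table}\nA & B\n\\end{table}\nClosing text."
tables, texts = split_tables_and_text(sample)
```

In a real loader each record would presumably become a Document object, with the metadata carrying whatever fields the rest of the pipeline expects.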

Type of change

New features (non-breaking change).
Bug fix (non-breaking change).
Breaking change (fix or feature that would cause existing functionality not to work as expected).

Checklist

I have performed a self-review of my code.
I have added thorough tests if it is a core feature.
There is a reference to the original bug report and related work.
I have commented on my code, particularly in hard-to-understand areas.
The feature is well documented.

taprosoft · 2024-11-16T04:31:02Z

@eliasjudin please check the failed CI and comments.

libs/kotaemon/kotaemon/loaders/mathpix_loader.py

…ation Remove early returns using super() in load_data and lazy_load_data methods that were preventing the actual implementation from being executed. This fixes the "not implemented" error while maintaining the full PDF reader functionality.
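
The effect of such an early return can be sketched as follows. The classes here are hypothetical stand-ins, not the actual kotaemon reader classes; the sketch only assumes that the base class raises a "not implemented" error, as the commit message suggests.

```python
from typing import Iterator, List


class BaseReader:
    """Stand-in base class whose load_data is not implemented."""

    def load_data(self, path: str) -> List[str]:
        raise NotImplementedError("not implemented")


class BrokenReader(BaseReader):
    def load_data(self, path: str) -> List[str]:
        # Early return: the base class raises, so the line below never runs.
        return super().load_data(path)
        return [f"parsed:{path}"]  # unreachable


class FixedReader(BaseReader):
    def lazy_load_data(self, path: str) -> Iterator[str]:
        # Yield items one at a time instead of building the full list up front.
        yield f"parsed:{path}"

    def load_data(self, path: str) -> List[str]:
        # The eager variant simply drains the lazy generator.
        return list(self.lazy_load_data(path))
```

Calling `BrokenReader().load_data("a.pdf")` raises `NotImplementedError`, while `FixedReader` reaches its own implementation.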

eliasjudin · 2024-11-17T12:23:49Z

need to resolve this issue

1 validation error for CiteEvidence
evidences
  Input should be a valid list [type=list_type, input_value='["Let $M$ be a multiplic...\in M$ and $x \\in X$"]', input_type=str]
    For further information visit https://errors.pydantic.dev/2.9/v/list_type
Got 0 cited docs

any suggestions?

eliasjudin · 2024-11-17T13:03:15Z

Page numbering is incorrect: every item is reported as coming from page 1.

eliasjudin · 2024-11-17T13:04:42Z

Fixed the CiteEvidence validation error reported earlier (evidences arrived as a str instead of a list) by adding a check in citation.py: if the value is a str, it is wrapped in a list.
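
A minimal sketch of that kind of check, assuming a hypothetical helper named `coerce_evidences` (the actual change in citation.py is not shown in this thread). As an extra step that the comment above does not mention, the sketch first tries to decode the string as a JSON list and only falls back to wrapping it when decoding fails, which it does for strings containing unescaped LaTeX such as `\in`.

```python
import json
from typing import Any, List


def coerce_evidences(value: Any) -> List[str]:
    """Return a list of evidence strings from a list or a raw string."""
    if isinstance(value, list):
        return value
    if isinstance(value, str):
        try:
            decoded = json.loads(value)
            if isinstance(decoded, list):
                return [str(item) for item in decoded]
        except ValueError:
            # Not valid JSON (e.g. an invalid "\i" escape); fall through.
            pass
        # Fallback: treat the whole string as a single evidence item.
        return [value]
    raise TypeError(f"unsupported evidences type: {type(value).__name__}")
```

The result can then be passed to the model that expects a list, so validation no longer fails on a plain string.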

taprosoft changed the title ~~✨ feat(loader): implement markdown parsing in MathpixPDFReader~~ feat(loader): implement markdown parsing in MathpixPDFReader Nov 16, 2024

taprosoft reviewed Nov 16, 2024

libs/kotaemon/kotaemon/loaders/mathpix_loader.py (outdated)

taprosoft reviewed Nov 16, 2024

libs/kotaemon/kotaemon/loaders/mathpix_loader.py (outdated)
