feat(loader): implement markdown parsing in MathpixPDFReader #498

Open. Wants to merge 2 commits into main.


@eliasjudin eliasjudin commented Nov 15, 2024

Add functionality to properly handle PDF content:

  • Add parse_markdown_text_to_tables method to separate tables and text
  • Fix load_data implementation to properly process documents
  • Fix lazy_load_data method
  • Improve document metadata handling for tables and text sections

The loader now correctly processes PDFs through the Mathpix API and converts the content to proper Document objects.

Description

  • Added functionality to properly parse and process PDF content in MathpixPDFReader:
    • Implemented parse_markdown_text_to_tables method to correctly separate tables and text sections
    • Fixed load_data implementation to process documents with proper metadata
    • Added lazy_load_data method for memory-efficient document loading
    • Improved document metadata handling for both tables and text sections
  • The loader now correctly processes PDFs through the Mathpix API and converts the content to proper Document objects
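
For readers unfamiliar with the idea, here is a minimal sketch of separating tables from text in a Markdown string. It is not the code from this PR: the function name `split_tables_and_text` is hypothetical, and it assumes pipe-style Markdown tables (a header row followed by a `| --- |` separator row), which may differ from what the Mathpix API actually returns.

```python
import re
from typing import List, Tuple

# A separator row such as "| --- | :---: |" sits directly below a table header.
_SEPARATOR = re.compile(r"^\s*\|?\s*:?-{3,}:?\s*(\|\s*:?-{3,}:?\s*)*\|?\s*$")


def split_tables_and_text(markdown: str) -> Tuple[List[str], List[str]]:
    """Return (tables, texts) found in a Markdown string (hypothetical helper)."""
    lines = markdown.splitlines()
    tables: List[str] = []
    texts: List[str] = []
    buffer: List[str] = []  # non-table lines collected since the last table

    def flush_text() -> None:
        text = "\n".join(buffer).strip()
        if text:
            texts.append(text)
        buffer.clear()

    i = 0
    while i < len(lines):
        is_header = (
            "|" in lines[i]
            and i + 1 < len(lines)
            and _SEPARATOR.match(lines[i + 1]) is not None
        )
        if is_header:
            flush_text()
            # Consume the header, the separator, and every following row.
            j = i + 2
            while j < len(lines) and "|" in lines[j]:
                j += 1
            tables.append("\n".join(lines[i:j]))
            i = j
        else:
            buffer.append(lines[i])
            i += 1
    flush_text()
    return tables, texts
```

Each returned table or text block could then be wrapped in its own Document with metadata marking it as a table or as text, which is the separation the description refers to.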

Type of change

  • New features (non-breaking change).
  • Bug fix (non-breaking change).
  • Breaking change (fix or feature that would cause existing functionality not to work as expected).

Checklist

  • I have performed a self-review of my code.
  • I have added thorough tests if it is a core feature.
  • There is a reference to the original bug report and related work.
  • I have commented on my code, particularly in hard-to-understand areas.
  • The feature is well documented.

@taprosoft changed the title from "✨ feat(loader): implement markdown parsing in MathpixPDFReader" to "feat(loader): implement markdown parsing in MathpixPDFReader" on Nov 16, 2024
@taprosoft (Collaborator)

@eliasjudin please check the failed CI and comments.

…ation

Remove early returns using super() in load_data and lazy_load_data methods that were preventing the actual implementation from being executed. This fixes the "not implemented" error while maintaining the full PDF reader functionality.
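
The failure mode this commit message describes can be illustrated with a small sketch. The class names below are hypothetical stand-ins, and the base class is assumed to raise `NotImplementedError`, which matches the "not implemented" error mentioned above but is not confirmed by the thread.

```python
from typing import List


class BaseReader:
    """Stand-in for a reader base class whose load_data is abstract (assumed)."""

    def load_data(self, path: str) -> List[str]:
        raise NotImplementedError("load_data is not implemented")


class BrokenReader(BaseReader):
    def load_data(self, path: str) -> List[str]:
        # The early return delegates to the base class, which raises,
        # so the real implementation below is never reached.
        return super().load_data(path)
        return [f"parsed:{path}"]  # unreachable


class FixedReader(BaseReader):
    def load_data(self, path: str) -> List[str]:
        # With the early return removed, the actual implementation runs.
        return [f"parsed:{path}"]
```

Calling `BrokenReader().load_data("a.pdf")` raises `NotImplementedError`, while `FixedReader().load_data("a.pdf")` returns the parsed result.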

@eliasjudin (Author)

Need to resolve this issue:

1 validation error for CiteEvidence
evidences
  Input should be a valid list [type=list_type, input_value='["Let $M$ be a multiplic...\in M$ and $x \\in X$"]', input_type=str]
    For further information visit https://errors.pydantic.dev/2.9/v/list_type
Got 0 cited docs

Any suggestions?

@eliasjudin (Author)

Page numbering is incorrect: every item is reported as coming from page 1.

@eliasjudin (Author)

Fixed the CiteEvidence validation error reported above by adding a check in citation.py: if the evidences value is a str, it is wrapped in a list.
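
The kind of check described here might look like the sketch below. The actual change to citation.py is not shown in this thread, so the helper name `ensure_list` and where it would be called from are assumptions.

```python
from typing import Any, List


def ensure_list(value: Any) -> List[Any]:
    """Wrap a bare string in a list; copy any other iterable into a list.

    Hypothetical helper: it only illustrates the "check if str, then wrap"
    idea and is not the code from citation.py.
    """
    if isinstance(value, str):
        return [value]
    return list(value)
```

Note that wrapping keeps the string as-is, so a value that looks like a serialized list ends up as a single evidence item rather than being split into its elements.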
