MarkdownSplitter: Smarter Table Splitting with Header Preservation #422

Open
hburrichter opened this issue Oct 19, 2024 · 3 comments

@hburrichter

Hello @benbrandt,

First, thank you for your work on the text-splitter library!

I would like to propose a feature enhancement for smarter table splitting that preserves the header in all chunks.

Feature Request:

  • Implement a smarter table splitting feature that automatically includes the header row in each split section of a table.
  • Ensure that the markdown formatting is preserved across all splits.

Use Case:

This feature would be useful for markdown documents with large tables to maintain the readability and formatting consistency in each table chunk.

Example:

Consider a markdown table that is too large to fit in a single section:

| Header 1 | Header 2 | Header 3 |
|----------|----------|----------|
| Row 1    | Data     | More Data|
| Row 2    | Data     | More Data|
| Row 3    | Data     | More Data|
| Row 4    | Data     | More Data|

If the table needs to be split after the second row, the desired output would be:

Split 1:

| Header 1 | Header 2 | Header 3 |
|----------|----------|----------|
| Row 1    | Data     | More Data|
| Row 2    | Data     | More Data|

Split 2:

| Header 1 | Header 2 | Header 3 |    <!-- currently missing -->
|----------|----------|----------|    <!-- currently missing -->
| Row 3    | Data     | More Data|
| Row 4    | Data     | More Data|
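
Until something like this exists in the library, the desired output above could probably be approximated by post-processing the chunks. The following is only a rough sketch in Python under several assumptions: `repeat_table_headers` and `is_table_row` are hypothetical helpers (not part of text-splitter), the chunks are plain strings split on row boundaries, and every table row uses leading and trailing pipes as in the example.

```python
import re

# Assumption: a delimiter row looks like |---|:---:|---| with outer pipes.
DELIMITER_ROW = re.compile(r"^\s*\|(\s*:?-+:?\s*\|)+\s*$")


def is_table_row(line: str) -> bool:
    """Treat any line that starts with a pipe as a table row."""
    return line.lstrip().startswith("|")


def repeat_table_headers(chunks: list[str]) -> list[str]:
    """Prepend the most recent table header to chunks that begin mid-table."""
    header = None  # (header row, delimiter row) of the table being split
    result = []
    for chunk in chunks:
        lines = chunk.strip("\n").split("\n")
        has_own_header = (
            len(lines) >= 2
            and is_table_row(lines[0])
            and bool(DELIMITER_ROW.match(lines[1]))
        )
        # A chunk that opens with a data row continues the previous table.
        if header and is_table_row(lines[0]) and not has_own_header:
            chunk = "\n".join([*header, chunk.strip("\n")])
        result.append(chunk)
        # Remember any header that appears in this chunk ...
        for above, below in zip(lines, lines[1:]):
            if is_table_row(above) and DELIMITER_ROW.match(below):
                header = (above.rstrip(), below.rstrip())
        # ... and forget it once a chunk ends outside of a table.
        if not is_table_row(lines[-1]):
            header = None
    return result


chunks = [
    "| Header 1 | Header 2 | Header 3 |\n"
    "|----------|----------|----------|\n"
    "| Row 1    | Data     | More Data|\n"
    "| Row 2    | Data     | More Data|",
    "| Row 3    | Data     | More Data|\n"
    "| Row 4    | Data     | More Data|",
]
fixed = repeat_table_headers(chunks)
```

Note that the repaired chunks are longer than what the splitter produced, so the chunk capacity would need to leave room for the two extra rows.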

Thank you for considering this feature request!

@benbrandt
Owner

Thanks for reaching out @hburrichter ! Would it be fine for you if it was returned as context, like I would do with headers? #116

This would allow you to choose what you want to do with it, but would require that you have enough buffer in your chunk to add it.

Basically your request is that if the chunk starts with a table row that is not the first row, that the first row gets added?
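
To make the "returned as context" idea a bit more concrete, here is a rough Python sketch of what such a result could look like. It is not the library's API: `ChunkWithContext` and `with_table_context` are hypothetical names, and the detection assumes tables with leading and trailing pipes that are split on row boundaries.

```python
import re
from typing import NamedTuple, Optional

# Assumption: a delimiter row looks like |---|:---:|---| with outer pipes.
DELIMITER_ROW = re.compile(r"^\s*\|(\s*:?-+:?\s*\|)+\s*$")


class ChunkWithContext(NamedTuple):
    text: str                    # the chunk exactly as the splitter produced it
    table_header: Optional[str]  # header + delimiter rows if the chunk starts mid-table


def with_table_context(chunks: list[str]) -> list[ChunkWithContext]:
    """Pair each chunk with the table header it depends on, without changing the chunk."""
    header: Optional[str] = None
    out = []
    for chunk in chunks:
        lines = chunk.strip("\n").split("\n")
        first_is_row = lines[0].lstrip().startswith("|")
        has_own_header = len(lines) > 1 and bool(DELIMITER_ROW.match(lines[1]))
        mid_table = header is not None and first_is_row and not has_own_header
        out.append(ChunkWithContext(chunk, header if mid_table else None))
        # Track the latest header row; drop it once a chunk ends outside a table.
        for above, below in zip(lines, lines[1:]):
            if above.lstrip().startswith("|") and DELIMITER_ROW.match(below):
                header = f"{above.rstrip()}\n{below.rstrip()}"
        if not lines[-1].lstrip().startswith("|"):
            header = None
    return out


chunks = [
    "| Header 1 | Header 2 | Header 3 |\n"
    "|----------|----------|----------|\n"
    "| Row 1    | Data     | More Data|\n"
    "| Row 2    | Data     | More Data|",
    "| Row 3    | Data     | More Data|\n"
    "| Row 4    | Data     | More Data|",
]
annotated = with_table_context(chunks)
```

Because the chunk text is left untouched, the original chunks can still be recovered from the result, and the caller decides whether to prepend `table_header` (and whether there is enough room left in the chunk budget to do so).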

@hburrichter
Author

hburrichter commented Nov 3, 2024

> Basically your request is that if the chunk starts with a table row that is not the first row, that the first row gets added?

Yes @benbrandt , that is correct! This would fix any markdown rendering issues and add valuable context information to each chunk.

I think for compatibility reasons (merging the chunks should return the original text), returning the header row as context/metadata might be a good solution, as you have already pointed out.

Another option might be to make this feature opt-in and put it behind a configuration parameter in the MarkdownSplitter class, e.g., include_header_in_chunks or preserve_table_header. The splitter could then directly return the table chunks with preserved headers, reducing the need for manual post-processing.

@mrfragger

Something similar to this: https://youtu.be/s_Vh9HIeLVg?list=PLNUVZZ6hfXX1Y4Is-SbbMF_HutRDJBwiO&t=1691
There, the HTML table continues on the next page of a PDF rendered with Typst (Quarto).
