MarkdownSplitter: Smarter Table Splitting with Header Preservation #422

hburrichter · 2024-10-19T14:34:27Z

First, thank you for your work on the text-splitter library!

I would like to propose a feature enhancement for smarter table splitting that preserves the header in all chunks.

Feature Request:

Implement a smarter table splitting feature that automatically includes the header row in each split section of a table.
Ensure that the markdown formatting is preserved across all splits.

Use Case:

This feature would be useful for markdown documents with large tables to maintain the readability and formatting consistency in each table chunk.

Example:

Consider a markdown table that is too large to fit in a single section:

| Header 1 | Header 2 | Header 3 |
|----------|----------|----------|
| Row 1    | Data     | More Data|
| Row 2    | Data     | More Data|
| Row 3    | Data     | More Data|
| Row 4    | Data     | More Data|

If the table needs to be split after the second row, the desired output would be:

Split 1:

| Header 1 | Header 2 | Header 3 |
|----------|----------|----------|
| Row 1    | Data     | More Data|
| Row 2    | Data     | More Data|

Split 2:

| Header 1 | Header 2 | Header 3 |    <!-- currently missing -->
|----------|----------|----------|    <!-- currently missing -->
| Row 3    | Data     | More Data|
| Row 4    | Data     | More Data|

Thank you for considering this feature request!

benbrandt · 2024-10-22T08:31:07Z

Thanks for reaching out, @hburrichter! Would it be fine for you if it was returned as context, like I would do with headers (#116)?

This would allow you to choose what you want to do with it, but would require that you have enough buffer in your chunk to add it.

Basically your request is that if the chunk starts with a table row that is not the first row, that the first row gets added?
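
As a rough illustration of that rule, the Python sketch below looks up the rows a chunk is missing without modifying the chunk itself. It is not the library's API: `table_header_context` is a hypothetical helper, the offset is assumed to be a character offset into the original text, and a line is treated as a table row only if it starts with `|`.

```python
import re
from typing import Optional

# Delimiter row of a pipe table, e.g. |---|:---:|---|
DELIMITER_ROW = re.compile(r"^\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?\s*$")


def is_table_line(line: str) -> bool:
    # Simplification: only rows written with a leading pipe are recognized.
    return line.lstrip().startswith("|")


def table_header_context(text: str, offset: int) -> Optional[str]:
    """Return the header rows missing from a chunk that starts at `offset`,
    or None if the chunk does not start in the middle of a table."""
    lines = text.splitlines(keepends=True)
    # Locate the line that contains `offset`.
    pos, current = 0, None
    for i, line in enumerate(lines):
        if pos <= offset < pos + len(line):
            current = i
            break
        pos += len(line)
    if current is None or not is_table_line(lines[current]):
        return None
    # Walk up to the first line of the contiguous table block.
    first = current
    while first > 0 and is_table_line(lines[first - 1]):
        first -= 1
    if first + 1 >= len(lines) or not DELIMITER_ROW.match(lines[first + 1]):
        return None  # no delimiter row, so not a table with a header
    # Header and delimiter rows that lie before the chunk's first line.
    missing = lines[first:min(current, first + 2)]
    return "".join(missing) or None


text = (
    "| Header 1 | Header 2 | Header 3 |\n"
    "|----------|----------|----------|\n"
    "| Row 1    | Data     | More Data|\n"
    "| Row 2    | Data     | More Data|\n"
    "| Row 3    | Data     | More Data|\n"
    "| Row 4    | Data     | More Data|\n"
)
split_at = text.index("| Row 3")
# Hand-built (offset, chunk) pairs standing in for splitter output.
chunks = [(0, text[:split_at]), (split_at, text[split_at:])]

for offset, chunk in chunks:
    print(repr(table_header_context(text, offset)))

# The chunks are untouched, so joining them still yields the original text.
assert "".join(chunk for _, chunk in chunks) == text
```

The caller decides whether to prepend the returned rows, which is why the chunk needs enough spare capacity to hold them.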

hburrichter · 2024-11-03T16:04:06Z

> Basically your request is that if the chunk starts with a table row that is not the first row, that the first row gets added?

Yes, @benbrandt, that is correct! This would fix any markdown rendering issues and add valuable context information to each chunk.

I think for compatibility reasons (merging chunks should return the original text), returning the heading row as context/metadata might be a good solution as you have already pointed out.

Another option might be to make this feature opt-in and put it behind a configuration parameter in the `MarkdownSplitter` class, e.g., `include_header_in_chunks` or `preserve_table_header`. The splitter could then directly return the table chunks with preserved headers, reducing the need for manual post-processing.
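
Until something like that exists, the opt-in behavior could be approximated as a post-processing step over the chunks in document order. The Python sketch below is an assumption-laden illustration, not the splitter's implementation: `add_table_headers` is a hypothetical function, `preserve_table_header` is borrowed from the suggestion above as a plain keyword argument, and table rows are recognized only by a leading `|`.

```python
import re

# Delimiter row of a pipe table, e.g. |---|:---:|---|
DELIMITER_ROW = re.compile(r"^\s*\|\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?\s*$")


def is_table_line(line: str) -> bool:
    return line.lstrip().startswith("|")


def add_table_headers(chunks: list[str], preserve_table_header: bool = False) -> list[str]:
    """Prepend the header rows to chunks that start in the middle of a table.

    Hypothetical post-processing step; `chunks` must be in document order.
    """
    if not preserve_table_header:
        return list(chunks)  # opt-in: default output is unchanged
    result = []
    header = None  # header + delimiter rows of the table still open, if any
    for chunk in chunks:
        lines = chunk.splitlines(keepends=True)
        starts_in_table = bool(lines) and is_table_line(lines[0])
        starts_with_header = (
            starts_in_table and len(lines) > 1 and bool(DELIMITER_ROW.match(lines[1]))
        )
        continues_table = starts_in_table and not starts_with_header
        result.append(header + chunk if header and continues_table else chunk)
        # Work out which table, if any, is still open at the end of this chunk.
        current = header if continues_table else None
        for i, line in enumerate(lines):
            if not is_table_line(line):
                current = None  # the table ended
            elif current is None and i + 1 < len(lines) and DELIMITER_ROW.match(lines[i + 1]):
                current = line + lines[i + 1]
                if not current.endswith("\n"):
                    current += "\n"
        header = current
    return result


chunks = [
    "Intro\n\n| H1 | H2 |\n|----|----|\n| a | b |\n",
    "| c | d |\n\nOutro\n",
    "| x | y |\n",
]
for chunk in add_table_headers(chunks, preserve_table_header=True):
    print(repr(chunk))
```

With the flag enabled, the chunks no longer concatenate back to the original text and can exceed the configured chunk size by the length of the header rows, which is the main argument for keeping the behavior opt-in.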

mrfragger · 2024-12-19T17:25:19Z

Something similar to this: https://youtu.be/s_Vh9HIeLVg?list=PLNUVZZ6hfXX1Y4Is-SbbMF_HutRDJBwiO&t=1691
There, an HTML table is continued on the next page of a PDF generated with Typst (Quarto).
