MarkdownSplitter: Return preceding headers #116

benbrandt · 2024-03-23T12:32:56Z

One benefit of having the extra Markdown structure, other than better split points, is that we can use the relevant headings to provide extra context for a given chunk.

It would be great to have an alternate chunk method that returns not only the chunk but also any relevant context. Something like:

pub fn chunks_with_context<'splitter, 'text: 'splitter>(
    &'splitter self,
    text: &'text str,
    chunk_capacity: impl ChunkCapacity + 'splitter,
) -> impl Iterator<Item = (&'text str, Context)> + 'splitter;

Where Context is something like:

HashMap<HeadingLevel, &'text str>

with the corresponding header text of the most recent heading at each level.

This would traverse the document until it reaches the offset of a given chunk, keeping a reference to the heading at each level it encounters. If it encounters a level it has already seen, it replaces that entry with the new heading and also removes any references to lower (more deeply nested) heading levels.
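A minimal sketch of that traversal, under stated assumptions: headings have already been extracted into a list sorted by byte offset, and `Heading`, `Context`, and `context_at` are hypothetical names for illustration, not part of the crate's API. Levels are plain `u8` values (1 to 6) and the map is a `BTreeMap` so that iteration runs from the outermost heading inward. A heading that starts exactly at the chunk offset is counted as context here, which is also an assumption.

```rust
use std::collections::BTreeMap;

/// Hypothetical heading record. A real implementation would build these
/// from the Markdown parser's heading events and their byte ranges.
pub struct Heading<'text> {
    pub level: u8,
    pub offset: usize,
    pub text: &'text str,
}

/// Most recent heading text at each level, keyed by heading level.
pub type Context<'text> = BTreeMap<u8, &'text str>;

/// Collect the headings in effect at `chunk_offset`.
/// Assumes `headings` is sorted by `offset`.
pub fn context_at<'text>(headings: &[Heading<'text>], chunk_offset: usize) -> Context<'text> {
    let mut context = Context::new();
    for heading in headings.iter().take_while(|h| h.offset <= chunk_offset) {
        // Drop the stale entry at this level and everything nested below it.
        context.retain(|&level, _| level < heading.level);
        context.insert(heading.level, heading.text);
    }
    context
}
```

Rebuilding the map from the start for every chunk is quadratic in the worst case. Since chunks are produced in offset order, a real iterator would more likely keep one running map and advance it as each chunk is emitted.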

Todo:

Define how context should be returned, e.g. should this just be a hashmap of headings?
This should be opt-in, so that if it isn't desired, the extra computation isn't performed.


jackbravo · 2024-07-24T17:50:13Z

This sounds very interesting, and in line with this article, which says this should improve the relevancy of chunks and the accuracy of results:

https://d-star.ai/solving-the-out-of-context-chunk-problem-for-rag

The example is very illustrative:

We’ll use Nike’s 2023 10-K to illustrate this. Here are the first 10 sections we identified:

Add contextual chunk headers

The purpose of the chunk header is to add context to the chunk text. Rather than using the chunk text by itself when embedding and reranking the chunk, we use the concatenation of the chunk header and the chunk text, as shown in the image above. This helps the ranking models (embeddings and rerankers) retrieve the correct chunks.
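As a rough illustration of that concatenation step, here is a small sketch. The function name `with_chunk_header`, the `" > "` separator, and the blank line between header and chunk are all assumptions made for the example; the quoted article's exact header format is not reproduced here.

```rust
/// Build the text to embed: the heading context (outermost first) followed
/// by the chunk text. `headings` is a hypothetical slice holding the heading
/// texts in effect for this chunk.
pub fn with_chunk_header(headings: &[&str], chunk: &str) -> String {
    if headings.is_empty() {
        // No preceding headings: embed the chunk as-is.
        return chunk.to_string();
    }
    format!("{}\n\n{}", headings.join(" > "), chunk)
}
```

For example, `with_chunk_header(&["Guide", "Usage"], "Run the binary.")` yields `"Guide > Usage\n\nRun the binary."`. The original chunk can still be stored separately, so the header only affects what the embedding and reranking models see.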

benbrandt added this to text-splitter Roadmap Mar 23, 2024

benbrandt converted this from a draft issue Mar 23, 2024

benbrandt moved this from Backlog to Ready in text-splitter Roadmap Mar 23, 2024

benbrandt added the enhancement (New feature or request) label Apr 8, 2024

benbrandt mentioned this issue Oct 22, 2024

MarkdownSplitter: Smarter Table Splitting with Header Preservation #422

Open
