MarkdownSplitter: Return preceding headers #116

Open
2 tasks
benbrandt opened this issue Mar 23, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@benbrandt
Owner

benbrandt commented Mar 23, 2024

One benefit of the extra Markdown structure, beyond better split points, is that we can give each chunk extra context from the headings that apply to it.

It would be great to have an alternative chunking method that returns not only the chunk but also any relevant context. Something like:

pub fn chunks_with_context<'splitter, 'text: 'splitter>(
    &'splitter self,
    text: &'text str,
    chunk_capacity: impl ChunkCapacity + 'splitter,
) -> impl Iterator<Item = (&'text str, Context)> + 'splitter;

Where Context is something like:

HashMap<HeadingLevel, &'text str>

with the corresponding header text of the most recent heading at each level.

This would traverse the document until it reaches the offset of a given chunk, keeping a reference to the heading at each level it encounters. If it encounters a level it has already seen, it replaces the stored heading with the new one and also removes any references to lower (more deeply nested) heading levels, since those belonged to the previous section.
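A minimal sketch of that traversal, to make the replace-and-clear rule concrete. Everything here is hypothetical and not part of the crate: the `Context` alias, `parse_atx_heading`, and `context_at` are illustrative names, a `BTreeMap<u8, &str>` stands in for `HashMap<HeadingLevel, &'text str>`, and a naive line-based ATX heading check replaces a real Markdown parser (it ignores Setext headings and code fences). It also clears deeper levels on every heading, not only on levels already seen.

```rust
use std::collections::BTreeMap;

/// Hypothetical context type: heading level (1-6) mapped to the text of the
/// most recent heading at that level.
pub type Context<'text> = BTreeMap<u8, &'text str>;

/// Naive ATX heading check (`# Title`). Assumption: a real implementation
/// would rely on a Markdown parser instead of scanning lines.
fn parse_atx_heading(line: &str) -> Option<(u8, &str)> {
    let trimmed = line.trim_start();
    let level = trimmed.bytes().take_while(|&b| b == b'#').count();
    if level == 0 || level > 6 {
        return None;
    }
    let rest = &trimmed[level..];
    // The run of `#` must be followed by a space or the end of the line.
    if !(rest.is_empty() || rest.starts_with(' ')) {
        return None;
    }
    Some((level as u8, rest.trim()))
}

/// Collect the headings in effect at byte `offset` of `text`.
/// Headings that start at or after `offset` are not included.
pub fn context_at<'text>(text: &'text str, offset: usize) -> Context<'text> {
    let mut context = Context::new();
    let mut line_start = 0;
    for line in text.split_inclusive('\n') {
        if line_start >= offset {
            break;
        }
        if let Some((level, title)) = parse_atx_heading(line) {
            // Drop the stored heading at this level and at every deeper
            // level, then record the new heading.
            context.retain(|&seen, _| seen < level);
            context.insert(level, title);
        }
        line_start += line.len();
    }
    context
}
```

For a chunk that starts under `## B` in a document whose earlier section was `## A` with a nested `### A1`, this returns only the level-1 title and `B`, because reaching `## B` discards both `A` and `A1`.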

Todo:

  • Define how context should be returned, e.g. should this just be a HashMap of headings?
  • This should be opt-in, so that if it isn't desired, the extra computation isn't performed.
@benbrandt benbrandt converted this from a draft issue Mar 23, 2024
@benbrandt benbrandt moved this from Backlog to Ready in text-splitter Roadmap Mar 23, 2024
@benbrandt benbrandt added the enhancement New feature or request label Apr 8, 2024
@jackbravo

This sounds very interesting, and it is in line with this article, which says that adding this kind of context should improve the relevancy of chunks and the accuracy of results:

https://d-star.ai/solving-the-out-of-context-chunk-problem-for-rag

The example is very illustrative:

We’ll use Nike’s 2023 10-K to illustrate this. Here are the first 10 sections we identified:

image

Add contextual chunk headers

image

The purpose of the chunk header is to add context to the chunk text. Rather than using the chunk text by itself when embedding and reranking the chunk, we use the concatenation of the chunk header and the chunk text, as shown in the image above. This helps the ranking models (embeddings and rerankers) retrieve the correct chunks.
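As a rough illustration of that concatenation, here is a small sketch. The function name `with_chunk_header`, the ` > ` separator, and the blank line between header and chunk are all assumptions; the article's actual header format is only shown in the image, so this is just one plausible layout.

```rust
/// Hypothetical helper: prefix a chunk with its heading path before
/// passing it to an embedding or reranking model.
/// `headings` is ordered from the outermost heading to the innermost.
pub fn with_chunk_header(headings: &[&str], chunk: &str) -> String {
    if headings.is_empty() {
        // No context available: embed the chunk text unchanged.
        return chunk.to_string();
    }
    // Assumed layout: "Outer > Inner", a blank line, then the chunk text.
    format!("{}\n\n{}", headings.join(" > "), chunk)
}
```

The original chunk text would still be what is stored and returned to the user; only the string handed to the ranking models carries the header.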
