-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
community[patch]: Update ElasticSearch mappings to successfully add documents from TextSplitter #3629
Merged
Merged
community[patch]: Update ElasticSearch mappings to successfully add documents from TextSplitter #3629
Changes from 4 commits
Commits
Show all changes
8 commits
Select commit
Hold shift + click to select a range
06a8aaa
failing test that shows how the loc format from text splitters confli…
mattraibert 877cfb9
explicitly declare .metadata.loc as an object in elasticsearch
mattraibert 9a6dd70
Throw an error if inserting vectors into elasticsearch fails.
mattraibert 1b6c124
Merge branch 'main' into es-loc-bug
bracesproul 5efe680
Merge branch 'main' of https://github.com/hwchase17/langchainjs into …
jacoblee93 2686b7c
Lint + docs format
jacoblee93 9abe474
Update docs
jacoblee93 7b228ff
Format
jacoblee93 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there we can generalize this a bit more to handle all object properties?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for reviewing!
One reason I did it this way is that I tried to be as focused as possible on fixing the bug and leaving other existing behavior alone.
We could possibly use object type for every metadata field and perhaps it's the right thing to do given thatmetadata
is defined asRecord<string, any>
. I think it would be like this:Edit: on reviewing Elasticsearch docs, you can't store numbers or strings in an object field so this would lead to a similar problem to the original bug.
But I think there are a couple of reasonable arguments for keeping
metadata
fields declared askeyword
in Elasticsearch. Searching through the codebase, I see thatloc
is the only field that is ever anything but a flat number or string. And I think being able to filter bykeyword
type fields is much simpler and more performant in Elasticsearch. For my use case, for example, I want to augment an embedding based search with keyword filtering, so I'd prefer to keep them as keywords.Another solution to #2857 could be to change the way
loc
is stored so that it can be stored in a keyword field. TheTextSplitter
s could store something likeline_from
andline_to
. In that case, it might make sense to update the type ofmetadata
as well. This might be a bigger lift since it would potentially affect anything downstream of thoseTextSplitter
s.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah we can't change text splitter metadata yet, though I think flattening it would be nice when we're ready to make a breaking change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ok with this but can we add a TODO to remove in the code when it's time for breaking changes? Also remember to run
yarn lint
!