community[patch]: Update ElasticSearch mappings to successfully add documents from TextSplitter #3629

Merged
merged 8 commits into from
Dec 15, 2023
19 changes: 15 additions & 4 deletions libs/langchain-community/src/vectorstores/elasticsearch.ts
@@ -135,7 +135,11 @@ export class ElasticVectorSearch extends VectorStore {
text: documents[idx].pageContent,
},
]);
- await this.client.bulk({ refresh: true, operations });
+ const results = await this.client.bulk({ refresh: true, operations });
+ if (results.errors) {
+ let reasons = results.items.map((result) => result.index?.error?.reason);
+ throw new Error(`Failed to insert documents:\n${reasons.join("\n")}`);
+ }
return documentIds;
}
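
A note on the error handling added in this hunk: the bulk response's `items` array has one entry per operation, including the ones that succeeded, so mapping over every item yields `undefined` for documents that were indexed without error. Below is a minimal sketch of one way to report only the failures. The `BulkItemResult` and `BulkResult` interfaces and the `collectBulkFailures` helper are hypothetical, simplified stand-ins written for illustration; they are not the Elasticsearch client's own types and not code from this PR.

```typescript
// Hypothetical, simplified shapes standing in for the client's bulk response types.
interface BulkItemResult {
  index?: { status: number; error?: { type: string; reason?: string } };
}

interface BulkResult {
  errors: boolean;
  items: BulkItemResult[];
}

// Return one message per failed item, tagged with the item's position in the request.
// Successful items carry no `error` object and are skipped.
function collectBulkFailures(result: BulkResult): string[] {
  if (!result.errors) return [];
  const failures: string[] = [];
  result.items.forEach((item, idx) => {
    const error = item.index?.error;
    if (error) {
      failures.push(`item ${idx}: ${error.reason ?? error.type}`);
    }
  });
  return failures;
}
```

With a helper like this, the thrown message would list only the documents that were rejected, which may be easier to read when a large batch contains a handful of mapping conflicts.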

@@ -266,16 +270,23 @@ export class ElasticVectorSearch extends VectorStore {
mappings: {
dynamic_templates: [
{
- // map all metadata properties to be keyword
- "metadata.*": {
+ // map all metadata properties to be keyword except loc
+ metadata_except_loc: {
Collaborator

Is there a way we can generalize this a bit more to handle all object properties?

Contributor Author

@mattraibert mattraibert Dec 13, 2023

Thanks for reviewing!

One reason I did it this way is that I tried to be as focused as possible on fixing the bug and leaving other existing behavior alone.

We could possibly use object type for every metadata field and perhaps it's the right thing to do given that metadata is defined as Record<string, any>. I think it would be like this:

            "metadata.*": {
              match_mapping_type: "*",
              match: "metadata.*",
              mapping: { type: "object" },
            },

Edit: on reviewing Elasticsearch docs, you can't store numbers or strings in an object field so this would lead to a similar problem to the original bug.

But I think there are a couple of reasonable arguments for keeping metadata fields declared as keyword in Elasticsearch. Searching through the codebase, I see that loc is the only field that is ever anything but a flat number or string. And I think being able to filter by keyword type fields is much simpler and more performant in Elasticsearch. For my use case, for example, I want to augment an embedding based search with keyword filtering, so I'd prefer to keep them as keywords.

Another solution to #2857 could be to change the way loc is stored so that it can be stored in a keyword field. The TextSplitters could store something like line_from and line_to. In that case, it might make sense to update the type of metadata as well. This might be a bigger lift since it would potentially affect anything downstream of those TextSplitters.
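
To make that last idea concrete, here is a rough sketch of what flattening loc might look like. The `flattenLoc` helper and the `SplitterMetadata` type are hypothetical and only illustrate the shape of the change; the `line_from` and `line_to` names come from the suggestion above, the `loc.lines.from`/`loc.lines.to` shape is taken from the test in this PR, and none of this is code from the PR or from the TextSplitters.

```typescript
// Hypothetical metadata shape, modeled on the documents used in this PR's test.
type SplitterMetadata = Record<string, unknown> & {
  loc?: { lines?: { from: number; to: number } };
};

// Replace the nested `loc.lines` object with two flat numeric fields so that
// every metadata value is a scalar and can be mapped as a keyword.
// Metadata without `loc.lines` is returned as an unchanged shallow copy.
function flattenLoc(metadata: SplitterMetadata): Record<string, unknown> {
  const lines = metadata.loc?.lines;
  if (!lines) return { ...metadata };
  const { loc: _loc, ...rest } = metadata;
  return { ...rest, line_from: lines.from, line_to: lines.to };
}
```

For example, `flattenLoc({ a: 1, loc: { lines: { from: 2, to: 5 } } })` would produce an object with `a`, `line_from`, and `line_to` and no `loc` key. Anything downstream that reads `metadata.loc` would need to change accordingly, which is the "bigger lift" mentioned above.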

Collaborator

Yeah we can't change text splitter metadata yet, though I think flattening it would be nice when we're ready to make a breaking change.

Collaborator

@jacoblee93 jacoblee93 Dec 13, 2023

I'm ok with this but can we add a TODO to remove in the code when it's time for breaking changes? Also remember to run yarn lint!

match_mapping_type: "*",
match: "metadata.*",
+ unmatch: "metadata.loc",
mapping: { type: "keyword" },
},
},
],
properties: {
text: { type: "text" },
- metadata: { type: "object" },
+ metadata: {
+ type: "object",
+ properties: {
+ loc: { type: "object" }, // explicitly define loc as an object
+ },
+ },
embedding: {
type: "dense_vector",
dims: dimension,
@@ -107,4 +107,30 @@ describe("ElasticVectorSearch", () => {
const results = await store.similaritySearch("*", 11);
expect(results).toHaveLength(11);
});

+ test.skip("ElasticVectorSearch integration with text splitting metadata", async () => {
+ const createdAt = new Date().getTime();
+ const documents = [
+ new Document({
+ pageContent: "hello",
+ metadata: { a: createdAt, loc: { lines: { from: 1, to: 1 } } },
+ }),
+ new Document({
+ pageContent: "car",
+ metadata: { a: createdAt, loc: { lines: { from: 2, to: 2 } } },
+ }),
+ ];
+
+ await store.addDocuments(documents);
+
+ const results1 = await store.similaritySearch("hello!", 1);
+
+ expect(results1).toHaveLength(1);
+ expect(results1).toEqual([
+ new Document({
+ metadata: { a: createdAt, loc: { lines: { from: 1, to: 1 } } },
+ pageContent: "hello",
+ }),
+ ]);
+ });
});