-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor: Rework Document.id
generation
#6122
Conversation
Document.id
generationDocument.id
generation
Could you please explain the consequences of not using |
@julian-risch you calculate your own id and create the
Now that both have been changed we can change set the It also gives much more freedom to the user as they might have different requirements for the |
For making the migration of indexing pipelines from 1.x to 2.x as simple as possible, a DocumentIDGenerator could become a Haystack component then? One that you could use just before the DocumentWriter. Or do you have some other plan in mind? |
There are several solutions I think, first one that comes to mind would be handling this in the Probably we're also going to have a converter from old In any case I don't see this as a blocker. |
I have to agree with @julian-risch here, in the sense that I'm not sure what's going on. When did we decide to remove
Consider for example a system where Documents have a If you did, at least let's have this reasoning documented somewhere in detail (like in this PR description) for future reference. Right now this comes very much out of the blue for me and I don't see the advantage of it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanations. I am convinced it will be easy to later allow users to generate ids based on a custom list of fields so that use case is no blocker and the changes in this PR are low risk. I just have a few small change requests in the comments below and then it's ready to go.
Let's also adapt the docstring for id_hash_keys
and mention that it is ignored. I understand that other PRs like #6125 need to be merged before id_hash_keys
can be removed completely. Splitting the work in multiple smaller PRs helps with reviewing and I appreciate that. 👍 It would have further helped to have a list of related PRs or an epic with an overview of the issues in the PR description to quickly understand the plan.
metadata = self.metadata or {} | ||
score = self.score if self.score is not None else None | ||
embedding = self.embedding.tolist() if self.embedding is not None else None | ||
data = f"{text}{array}{dataframe}{blob}{mime_type}{metadata}{score}{embedding}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's leave out score
from the id generation. If the same document is returned from two different retrievers in hybrid retrieval it can have two different scores but should be treated as the same document.
preview: | ||
- | | ||
Rework `Document.id` generation, if an `id` is not explicitly set it's generated | ||
using all `Document` fields |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's mention that id_hash_keys
is ignored.
@@ -128,7 +129,7 @@ def test_valid_run(self): | |||
|
|||
assert "documents" in result | |||
assert len(result["documents"]) == top_k | |||
assert result["documents"][0].embedding == [1.0, 1.0, 1.0, 1.0] | |||
assert (result["documents"][0].embedding == [1.0, 1.0, 1.0, 1.0]).all() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use assert np.array_equal(result["documents"][0].embedding, [1.0, 1.0, 1.0, 1.0])
instead. The issue with .all()
for comparisons is that it behavior differs if an array is empty. For example:
(np.arange(1) == []).all()
results in True
.
) | ||
|
||
assert result | ||
assert "retriever" in result | ||
results_docs = result["retriever"]["documents"] | ||
assert results_docs | ||
assert len(results_docs) == top_k | ||
assert results_docs[0].embedding == [1.0, 1.0, 1.0, 1.0] | ||
assert (results_docs[0].embedding == [1.0, 1.0, 1.0, 1.0]).all() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert np.array_equal(results_docs[0].embedding, [1.0, 1.0, 1.0, 1.0])
@@ -256,12 +256,12 @@ def test_embedding_retrieval(self): | |||
docstore = InMemoryDocumentStore(embedding_similarity_function="cosine") | |||
# Tests if the embedding retrieval method returns the correct document based on the input query embedding. | |||
docs = [ | |||
Document(text="Hello world", embedding=[0.1, 0.2, 0.3, 0.4]), | |||
Document(text="Haystack supports multiple languages", embedding=[1.0, 1.0, 1.0, 1.0]), | |||
Document(text="Hello world", embedding=np.array(np.array([0.1, 0.2, 0.3, 0.4]))), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's remove the duplicate np.array(...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! 👍
Proposed Changes:
Change how
Document.id
is generated when not explicitly set.Now it uses all
Document
fields.Document.id_hash_keys
is not used anymore but it will removed in a later PR as there are several Components that sest it.How did you test it?
Ran unit tests.
Notes for the reviewer
This is the first of a series of PRs to rework
Document
.Checklist
fix:
,feat:
,build:
,chore:
,ci:
,docs:
,style:
,refactor:
,perf:
,test:
.