refactor: Rework Document.id generation #6122

Merged
silvanocerza merged 8 commits into main from doc-id-rework on Oct 20, 2023

Conversation

silvanocerza
Contributor

Proposed Changes:

Change how Document.id is generated when it is not explicitly set.
It now uses all Document fields.

Document.id_hash_keys is no longer used, but it will be removed in a later PR, as there are several Components that still set it.
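Roughly, the new scheme boils down to the sketch below: fold every Document field into a single string and hash it. This is a simplified standalone sketch, not the exact code in this PR; the helper name and the use of hashlib.sha256 are assumptions, while the field names match the 2.x preview Document dataclass quoted later in this review.

```python
import hashlib

def generate_id(text, array, dataframe, blob, mime_type, metadata, score, embedding) -> str:
    # Concatenate the string representation of every Document field...
    data = f"{text}{array}{dataframe}{blob}{mime_type}{metadata}{score}{embedding}"
    # ...and hash it, so identical field values always yield the same id.
    return hashlib.sha256(data.encode("utf-8")).hexdigest()

# Any difference in any field (including metadata) produces a different id.
print(generate_id("Hello world", None, None, None, "text/plain", {"owner": "alice"}, None, None))
```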

How did you test it?

Ran unit tests.

Notes for the reviewer

This is the first of a series of PRs to rework Document.


@silvanocerza silvanocerza added the 2.x Related to Haystack v2.0 label Oct 19, 2023
@silvanocerza silvanocerza self-assigned this Oct 19, 2023
@silvanocerza silvanocerza requested a review from a team as a code owner October 19, 2023 11:32
@silvanocerza silvanocerza requested review from julian-risch and removed request for a team October 19, 2023 11:32
@silvanocerza silvanocerza changed the title Rework Document.id generation refactor: Rework Document.id generation Oct 19, 2023
@silvanocerza silvanocerza requested a review from a team as a code owner October 19, 2023 11:36
@silvanocerza silvanocerza requested review from dfokina and removed request for a team October 19, 2023 11:36
@julian-risch
Member

Could you please explain the consequences of not using Document.id_hash_keys anymore? As a user, how would I make sure that two Documents that have the same values for all fields except a metadata field timestamp are treated as duplicates? For example, the older Document is already indexed and then I write the newer one to the DocumentStore too and expect the old Document to be overwritten.

@silvanocerza
Contributor Author

@julian-risch you calculate your own id and create the Document with it; in that case we don't generate one.

id_hash_keys was necessary because creating a Document with a chosen id was impossible in the first iteration: the id argument was completely ignored. Changing it explicitly afterwards wasn't feasible either, since the dataclass was frozen.

Now that both have been changed, we can set the id however we please, either calculating it beforehand or afterwards.

It also gives much more freedom to the users, as they might have different requirements for the id. Using id_hash_keys forces them into our own way of doing it.
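For example, a minimal sketch of that workflow, assuming the 2.x preview Document dataclass with the fields quoted elsewhere in this PR and the post-PR behavior where an explicitly passed id is kept as-is:

```python
import hashlib
from haystack.preview import Document  # 2.x preview import path at the time of this PR

text = "Hello world"
metadata = {"owner": "alice", "timestamp": "2023-10-19"}

# Hash only the fields that matter for deduplication in this application,
# e.g. the text and the owner, deliberately ignoring the timestamp.
custom_id = hashlib.sha256(f"{text}{metadata['owner']}".encode("utf-8")).hexdigest()

# The explicitly passed id is kept as-is; nothing is generated for this Document.
doc = Document(id=custom_id, text=text, metadata=metadata)
```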

@julian-risch
Member

To make the migration of indexing pipelines from 1.x to 2.x as simple as possible, could a DocumentIDGenerator then become a Haystack component? One that you could use just before the DocumentWriter. Or do you have some other plan in mind?
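To make the idea concrete, a rough sketch of what such a (currently hypothetical) component could look like, assuming the 2.x preview component API; the class name, the id_fields parameter, and the hashing are made up for illustration:

```python
from typing import Any, Dict, List
import hashlib

from haystack.preview import component, Document  # 2.x preview imports, assumed here

@component
class DocumentIDGenerator:
    """Hypothetical component that assigns ids, mimicking the 1.x id_hash_keys behavior."""

    def __init__(self, id_fields: List[str]):
        # Document attributes and/or metadata keys to fold into the id; the choice is up to the user.
        self.id_fields = id_fields

    @component.output_types(documents=List[Document])
    def run(self, documents: List[Document]) -> Dict[str, Any]:
        for doc in documents:
            values = [str(getattr(doc, field, doc.metadata.get(field, ""))) for field in self.id_fields]
            # Overwriting the id is possible now that the dataclass is no longer frozen.
            doc.id = hashlib.sha256("".join(values).encode("utf-8")).hexdigest()
        return {"documents": documents}
```

In an indexing pipeline it would then be connected right before the DocumentWriter.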

@silvanocerza
Contributor Author

There are several possible solutions, I think; the first one that comes to mind would be handling this in the converters. I'm not sure whether it's the best, though.

We're probably also going to have a converter from the old Document to the new one, so it could be handled there too.

In any case, I don't see this as a blocker.

@ZanSara
Contributor

ZanSara commented Oct 19, 2023

I have to agree with @julian-risch here, in the sense that I'm not sure what's going on. When did we decide to remove id_hash_keys?

> id_hash_keys was necessary because creating a Document with a chosen id was impossible in the first iteration: the id argument was completely ignored. Changing it explicitly afterwards wasn't feasible either, since the dataclass was frozen.

id_hash_keys has many other uses. It lets users include the values of important metadata in the ID, which allows Documents with duplicate content where that is needed.

Consider, for example, a system where Documents have an owner metadata field. Regardless of the content, I may want documents with different owners to never count as duplicates. In this sense id_hash_keys works perfectly: adding owner to the keys solves the problem. The same can be done with name (as in source file name) to avoid losing parts of a larger document if they appear to be duplicated in another. These use cases are just off the top of my head; I'm sure there are others. Did we take any of this into account before moving forward with this change? How would users manage to do the same? Right now IDs are created in many places while indexing; using an additional node would be complicated, in my opinion.

If you did, at least let's have this reasoning documented somewhere in detail (like in this PR description) for future reference. Right now this comes very much out of the blue for me and I don't see the advantage of it.

Member

@julian-risch julian-risch left a comment


Thanks for the explanations. I am convinced it will be easy to later allow users to generate ids based on a custom list of fields, so that use case is no blocker and the changes in this PR are low risk. I just have a few small change requests in the comments below and then it's ready to go.
Let's also adapt the docstring for id_hash_keys and mention that it is ignored. I understand that other PRs like #6125 need to be merged before id_hash_keys can be removed completely. Splitting the work into multiple smaller PRs helps with reviewing, and I appreciate that. 👍 It would have further helped to have a list of related PRs, or an epic with an overview of the issues, in the PR description to quickly understand the plan.

metadata = self.metadata or {}
score = self.score if self.score is not None else None
embedding = self.embedding.tolist() if self.embedding is not None else None
data = f"{text}{array}{dataframe}{blob}{mime_type}{metadata}{score}{embedding}"
Member


Let's leave out score from the id generation. If the same document is returned from two different retrievers in hybrid retrieval it can have two different scores but should be treated as the same document.
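In other words, the hashed string would be built without score, roughly:

```python
# Same concatenation as above, but leaving `score` out so that retrieval scores
# returned by different retrievers don't change the document's id.
data = f"{text}{array}{dataframe}{blob}{mime_type}{metadata}{embedding}"
```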

preview:
  - |
    Rework `Document.id` generation, if an `id` is not explicitly set it's generated
    using all `Document` fields
Member


Let's mention that id_hash_keys is ignored.

@@ -128,7 +129,7 @@ def test_valid_run(self):

assert "documents" in result
assert len(result["documents"]) == top_k
assert result["documents"][0].embedding == [1.0, 1.0, 1.0, 1.0]
assert (result["documents"][0].embedding == [1.0, 1.0, 1.0, 1.0]).all()
Member


Let's use assert np.array_equal(result["documents"][0].embedding, [1.0, 1.0, 1.0, 1.0]) instead. The issue with using .all() for comparisons is that its behavior differs if an array is empty. For example,
(np.arange(1) == []).all() results in True.
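Here is a self-contained illustration of that pitfall with plain NumPy:

```python
import numpy as np

# Comparing a non-empty array against an empty one broadcasts to an empty boolean
# array, and .all() over an empty array is vacuously True, so the check would pass
# even though the values clearly differ.
assert (np.arange(1) == []).all()

# np.array_equal also compares shapes, so it reports the mismatch correctly.
assert not np.array_equal(np.arange(1), [])
assert np.array_equal(np.array([1.0, 1.0, 1.0, 1.0]), [1.0, 1.0, 1.0, 1.0])
```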

)

assert result
assert "retriever" in result
results_docs = result["retriever"]["documents"]
assert results_docs
assert len(results_docs) == top_k
- assert results_docs[0].embedding == [1.0, 1.0, 1.0, 1.0]
+ assert (results_docs[0].embedding == [1.0, 1.0, 1.0, 1.0]).all()
Member


assert np.array_equal(results_docs[0].embedding, [1.0, 1.0, 1.0, 1.0])

@@ -256,12 +256,12 @@ def test_embedding_retrieval(self):
docstore = InMemoryDocumentStore(embedding_similarity_function="cosine")
# Tests if the embedding retrieval method returns the correct document based on the input query embedding.
docs = [
Document(text="Hello world", embedding=[0.1, 0.2, 0.3, 0.4]),
Document(text="Haystack supports multiple languages", embedding=[1.0, 1.0, 1.0, 1.0]),
Document(text="Hello world", embedding=np.array(np.array([0.1, 0.2, 0.3, 0.4]))),
Member


Let's remove the duplicate np.array(...)
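For example, the constructed Document would then read (assuming the 2.x preview import path):

```python
import numpy as np
from haystack.preview import Document

# Single np.array call instead of the nested np.array(np.array(...)).
doc = Document(text="Hello world", embedding=np.array([0.1, 0.2, 0.3, 0.4]))
```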

Member

@julian-risch julian-risch left a comment


LGTM! 👍

@silvanocerza silvanocerza merged commit 3f98bd9 into main Oct 20, 2023
@silvanocerza silvanocerza deleted the doc-id-rework branch October 20, 2023 08:34