refactor: Rework Document.id generation #6122

Merged
silvanocerza merged 8 commits into main from doc-id-rework on Oct 20, 2023

Conversation

silvanocerza
Contributor

Proposed Changes:

Change how Document.id is generated when it is not explicitly set.
It now uses all Document fields.

Document.id_hash_keys is no longer used, but it will be removed in a later PR, as there are several Components that still set it.
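Roughly, the new scheme boils down to the sketch below: fold every Document field into a single string and hash it. This is a simplified standalone sketch, not the exact code in this PR; the helper name and the use of hashlib.sha256 are assumptions, while the field names match the 2.x preview Document dataclass quoted later in this review.

```python
import hashlib

def generate_id(text, array, dataframe, blob, mime_type, metadata, score, embedding) -> str:
    # Concatenate the string representation of every Document field...
    data = f"{text}{array}{dataframe}{blob}{mime_type}{metadata}{score}{embedding}"
    # ...and hash it, so identical field values always yield the same id.
    return hashlib.sha256(data.encode("utf-8")).hexdigest()

# Any difference in any field (including metadata) produces a different id.
print(generate_id("Hello world", None, None, None, "text/plain", {"owner": "alice"}, None, None))
```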

How did you test it?

Ran unit tests.

Notes for the reviewer

This is the first of a series of PRs to rework Document.


@silvanocerza silvanocerza added the 2.x Related to Haystack v2.0 label Oct 19, 2023
@silvanocerza silvanocerza self-assigned this Oct 19, 2023
@silvanocerza silvanocerza requested a review from a team as a code owner October 19, 2023 11:32
@silvanocerza silvanocerza requested review from julian-risch and removed request for a team October 19, 2023 11:32
@silvanocerza silvanocerza changed the title Rework Document.id generation refactor: Rework Document.id generation Oct 19, 2023
@silvanocerza silvanocerza requested a review from a team as a code owner October 19, 2023 11:36
@silvanocerza silvanocerza requested review from dfokina and removed request for a team October 19, 2023 11:36
@julian-risch
Member

Could you please explain the consequences of not using Document.id_hash_keys anymore? As a user, how would I make sure that two Documents that have the same values for all fields except a metadata field timestamp are treated as duplicates? For example, the older Document is already indexed and then I write the newer one to the DocumentStore too and expect the old Document to be overwritten.

@silvanocerza
Contributor Author

@julian-risch you calculate your own id and create the Document with it; in that case we don't generate one.

id_hash_keys was necessary because creating a Document with a chosen id was impossible in the first iteration: the id argument was completely ignored. Changing it explicitly afterwards wasn't feasible either, since the dataclass was frozen.

Now that both have been changed, we can set the id however we please, either calculating it beforehand or afterwards.

It also gives much more freedom to the users, as they might have different requirements for the id. Using id_hash_keys forces them into our own way of doing it.
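For example, a minimal sketch of that workflow, assuming the 2.x preview Document dataclass with the fields quoted elsewhere in this PR and the post-PR behavior where an explicitly passed id is kept as-is:

```python
import hashlib
from haystack.preview import Document  # 2.x preview import path at the time of this PR

text = "Hello world"
metadata = {"owner": "alice", "timestamp": "2023-10-19"}

# Hash only the fields that matter for deduplication in this application,
# e.g. the text and the owner, deliberately ignoring the timestamp.
custom_id = hashlib.sha256(f"{text}{metadata['owner']}".encode("utf-8")).hexdigest()

# The explicitly passed id is kept as-is; nothing is generated for this Document.
doc = Document(id=custom_id, text=text, metadata=metadata)
```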

@julian-risch
Member

To make the migration of indexing pipelines from 1.x to 2.x as simple as possible, could a DocumentIDGenerator then become a Haystack component? One that you could use just before the DocumentWriter. Or do you have some other plan in mind?
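To make the idea concrete, a rough sketch of what such a (currently hypothetical) component could look like, assuming the 2.x preview component API; the class name, the id_fields parameter, and the hashing are made up for illustration:

```python
from typing import Any, Dict, List
import hashlib

from haystack.preview import component, Document  # 2.x preview imports, assumed here

@component
class DocumentIDGenerator:
    """Hypothetical component that assigns ids, mimicking the 1.x id_hash_keys behavior."""

    def __init__(self, id_fields: List[str]):
        # Document attributes and/or metadata keys to fold into the id; the choice is up to the user.
        self.id_fields = id_fields

    @component.output_types(documents=List[Document])
    def run(self, documents: List[Document]) -> Dict[str, Any]:
        for doc in documents:
            values = [str(getattr(doc, field, doc.metadata.get(field, ""))) for field in self.id_fields]
            # Overwriting the id is possible now that the dataclass is no longer frozen.
            doc.id = hashlib.sha256("".join(values).encode("utf-8")).hexdigest()
        return {"documents": documents}
```

In an indexing pipeline it would then be connected right before the DocumentWriter.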

@silvanocerza
Contributor Author

There are several possible solutions, I think; the first one that comes to mind would be handling this in the converters. I'm not sure whether it's the best, though.

We're probably also going to have a converter from the old Document to the new one, so it could be handled there too.

In any case, I don't see this as a blocker.

@ZanSara
Contributor

ZanSara commented Oct 19, 2023

I have to agree with @julian-risch here, in the sense that I'm not sure what's going on. When did we decide to remove id_hash_keys?

> id_hash_keys was necessary because creating a Document with a chosen id was impossible in the first iteration: the id argument was completely ignored. Changing it explicitly afterwards wasn't feasible either, since the dataclass was frozen.

id_hash_keys has many other uses. It lets users include the values of important metadata in the ID, which allows Documents with duplicate content where that is needed.

Consider, for example, a system where Documents have an owner metadata field. Regardless of the content, I may want documents with different owners to never count as duplicates. In this sense id_hash_keys works perfectly: adding owner to the keys solves the problem. The same can be done with name (as in source file name) to avoid losing parts of a larger document if they appear to be duplicated in another. These use cases are just off the top of my head; I'm sure there are others. Did we take any of this into account before moving forward with this change? How would users manage to do the same? Right now IDs are created in many places while indexing; using an additional node would be complicated, in my opinion.

If you did, at least let's have this reasoning documented somewhere in detail (like in this PR description) for future reference. Right now this comes very much out of the blue for me and I don't see the advantage of it.

Member

@julian-risch julian-risch left a comment


Thanks for the explanations. I am convinced it will be easy to later allow users to generate ids based on a custom list of fields, so that use case is no blocker and the changes in this PR are low risk. I just have a few small change requests in the comments below and then it's ready to go.
Let's also adapt the docstring for id_hash_keys and mention that it is ignored. I understand that other PRs like #6125 need to be merged before id_hash_keys can be removed completely. Splitting the work into multiple smaller PRs helps with reviewing, and I appreciate that. 👍 It would have further helped to have a list of related PRs, or an epic with an overview of the issues, in the PR description to quickly understand the plan.

metadata = self.metadata or {}
score = self.score if self.score is not None else None
embedding = self.embedding.tolist() if self.embedding is not None else None
data = f"{text}{array}{dataframe}{blob}{mime_type}{metadata}{score}{embedding}"
Member


Let's leave out score from the id generation. If the same document is returned from two different retrievers in hybrid retrieval it can have two different scores but should be treated as the same document.
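In other words, the hashed string would be built without score, roughly:

```python
# Same concatenation as above, but leaving `score` out so that retrieval scores
# returned by different retrievers don't change the document's id.
data = f"{text}{array}{dataframe}{blob}{mime_type}{metadata}{embedding}"
```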

preview:
  - |
    Rework `Document.id` generation, if an `id` is not explicitly set it's generated
    using all `Document` fields
Member


Let's mention that id_hash_keys is ignored.

@@ -128,7 +129,7 @@ def test_valid_run(self):

assert "documents" in result
assert len(result["documents"]) == top_k
assert result["documents"][0].embedding == [1.0, 1.0, 1.0, 1.0]
assert (result["documents"][0].embedding == [1.0, 1.0, 1.0, 1.0]).all()
Member


Let's use assert np.array_equal(result["documents"][0].embedding, [1.0, 1.0, 1.0, 1.0]) instead. The issue with using .all() for comparisons is that its behavior differs if an array is empty. For example,
(np.arange(1) == []).all() results in True.
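Here is a self-contained illustration of that pitfall with plain NumPy:

```python
import numpy as np

# Comparing a non-empty array against an empty one broadcasts to an empty boolean
# array, and .all() over an empty array is vacuously True, so the check would pass
# even though the values clearly differ.
assert (np.arange(1) == []).all()

# np.array_equal also compares shapes, so it reports the mismatch correctly.
assert not np.array_equal(np.arange(1), [])
assert np.array_equal(np.array([1.0, 1.0, 1.0, 1.0]), [1.0, 1.0, 1.0, 1.0])
```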

)

assert result
assert "retriever" in result
results_docs = result["retriever"]["documents"]
assert results_docs
assert len(results_docs) == top_k
- assert results_docs[0].embedding == [1.0, 1.0, 1.0, 1.0]
+ assert (results_docs[0].embedding == [1.0, 1.0, 1.0, 1.0]).all()
Member


assert np.array_equal(results_docs[0].embedding, [1.0, 1.0, 1.0, 1.0])

@@ -256,12 +256,12 @@ def test_embedding_retrieval(self):
docstore = InMemoryDocumentStore(embedding_similarity_function="cosine")
# Tests if the embedding retrieval method returns the correct document based on the input query embedding.
docs = [
Document(text="Hello world", embedding=[0.1, 0.2, 0.3, 0.4]),
Document(text="Haystack supports multiple languages", embedding=[1.0, 1.0, 1.0, 1.0]),
Document(text="Hello world", embedding=np.array(np.array([0.1, 0.2, 0.3, 0.4]))),
Member


Let's remove the duplicate np.array(...)
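For example, the constructed Document would then read (assuming the 2.x preview import path):

```python
import numpy as np
from haystack.preview import Document

# Single np.array call instead of the nested np.array(np.array(...)).
doc = Document(text="Hello world", embedding=np.array([0.1, 0.2, 0.3, 0.4]))
```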

Member

@julian-risch julian-risch left a comment


LGTM! 👍

@silvanocerza silvanocerza merged commit 3f98bd9 into main Oct 20, 2023
@silvanocerza silvanocerza deleted the doc-id-rework branch October 20, 2023 08:34