Handling duplicates #34

Merged — 17 commits, merged Aug 31, 2020

Conversation

@gwbischof (Contributor) commented Aug 13, 2020

Fixes: #33

@gwbischof marked this pull request as ready for review August 28, 2020 20:38
@danielballan (Member) left a comment

Looks good. Do you think it's worth adding an ignore_duplicates flag to __init__ so that this is configurable and can be run in a "strict" mode for debugging purposes?
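
For concreteness, a minimal sketch of what such a flag might look like follows. The class layout, parameter names, and _insert helper are illustrative assumptions, not the actual suitcase-mongo code.

```python
import logging

from pymongo.errors import DuplicateKeyError

logger = logging.getLogger(__name__)


class Serializer:
    # Sketch only: parameter names and structure are assumptions,
    # not the real suitcase-mongo Serializer.
    def __init__(self, metadatastore_db, asset_registry_db, ignore_duplicates=True):
        self._metadatastore_db = metadatastore_db
        self._asset_registry_db = asset_registry_db
        # ignore_duplicates=False gives the "strict" mode described above:
        # a duplicate insert raises instead of being logged and skipped.
        self._ignore_duplicates = ignore_duplicates

    def _insert(self, collection, doc):
        try:
            collection.insert_one(doc)
        except DuplicateKeyError:
            if not self._ignore_duplicates:
                raise
            logger.warning("Ignored duplicate document with uid %r", doc.get("uid"))
```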

@tacaswell (Contributor)

Do you think it's worth adding an ignore_duplicates flag to __init__ so that this is configurable and can be run in a "strict" mode for debugging purposes?

I agree with this.

Having the check is better than just dropping the document, but I am still very confused by why we are getting multiple inserts and worry that it is indicative of a much more fundamental problem that needs to be run down.

@danielballan (Member)

I am still very confused by why we are getting multiple inserts and worry that it is indicative of a much more fundamental problem that needs to be run down.

AFAIK we have never actually seen multiple inserts. But we know that it's theoretically possible for a Kafka consumer process to fail between successful insertion of a document and confirming to Kafka that the document has been successfully processed and need not be re-sent ("committing"). If we start seeing any log messages from this, we'll investigate, but I would expect to never see this in practice.

When this is used in-process with a RunEngine, it should fail hard if it receives a duplicate; but when it is used downstream of a message bus that is designed to err on the side of retrying, it should tolerate duplicates in this fashion and log them at a high level to prompt a human to investigate.
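
To illustrate the failure window described above, here is a hedged sketch of a Kafka consumer loop with manual offset commits. It is not code from this PR; the broker address, topic name, payload encoding, and database names are all assumptions.

```python
# Illustration only: the at-least-once window between inserting a document
# and committing the Kafka offset. All names below are assumptions.
import msgpack
from confluent_kafka import Consumer
from pymongo import MongoClient
from suitcase.mongo_normalized import Serializer

client = MongoClient("localhost:27017")
serializer = Serializer(client["metadatastore"], client["asset_registry"])

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "mongo-serializer",
    "enable.auto.commit": False,
})
consumer.subscribe(["bluesky-documents"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    # Assumes (name, doc) pairs are msgpack-encoded on the bus.
    name, doc = msgpack.unpackb(msg.value(), raw=False)
    serializer(name, doc)  # insert into MongoDB
    # A crash right here means the document was inserted but the offset was
    # never committed, so Kafka re-delivers the message and the serializer
    # sees a duplicate after restart.
    consumer.commit(message=msg)
```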

@gwbischof (Contributor, Author)

An important note here: the resource uid index has been made unique. There are one or two beamlines that may have resource uids that are not unique. The plan is to require them to update their profiles so that they only create resources with unique uids.
This change was needed to dedupe resource documents.
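
As an illustration of the index change being described, a unique index on uid can be declared with pymongo roughly as follows; the database and collection names here are assumptions.

```python
# Sketch of a unique index on resource uid; database/collection names
# are assumptions, not necessarily what suitcase-mongo uses internally.
import pymongo

client = pymongo.MongoClient("localhost:27017")
resources = client["asset_registry"]["resource"]

# With unique=True, inserting a second resource with the same uid raises
# pymongo.errors.DuplicateKeyError instead of silently creating a duplicate.
resources.create_index([("uid", pymongo.ASCENDING)], unique=True)
```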

@tacaswell (Contributor)

👍 for paranoia!

@danielballan (Member)

The plan is to require them to update their profiles so that they only create resources with unique uids.

It's important to be clear that this alone won't work: we need to generate unique uids going forward and update their databases to remove all instances of duplicate uids from the past. The first is comparatively easy (and, I think, already taken care of everywhere). The second will require significant effort and great care.
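
For illustration, one way to audit an existing database for duplicate resource uids before enforcing uniqueness might look like the sketch below. It is not a migration plan, and the server address and collection names are assumptions.

```python
# Illustrative only: find resource uids that appear more than once.
import pymongo

client = pymongo.MongoClient("localhost:27017")
resources = client["asset_registry"]["resource"]

duplicates = resources.aggregate([
    {"$group": {"_id": "$uid", "count": {"$sum": 1}}},
    {"$match": {"count": {"$gt": 1}}},
])
for dup in duplicates:
    print(f"uid {dup['_id']} appears {dup['count']} times")
```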

@danielballan merged commit c154220 into bluesky:master Aug 31, 2020
danielballan added a commit to danielballan/suitcase-mongo that referenced this pull request Dec 18, 2020
This conflicts with indexing requirements imposed by databroker.v0.
We need old versions of databroker.v0 to work against the same database
as this does.