Handling duplicates #34

Merged — 17 commits, merged Aug 31, 2020

Conversation

@gwbischof (Contributor) commented Aug 13, 2020

Fixes: #33

@gwbischof marked this pull request as ready for review August 28, 2020 20:38
@danielballan (Member) left a comment

Looks good. Do you think it's worth adding an ignore_duplicates flag to __init__ so that this is configurable and can be run in a "strict" mode for debugging purposes?
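
For concreteness, a minimal sketch of what such a flag might look like follows. The class layout, parameter names, and _insert helper are illustrative assumptions, not the actual suitcase-mongo code.

```python
import logging

from pymongo.errors import DuplicateKeyError

logger = logging.getLogger(__name__)


class Serializer:
    # Sketch only: parameter names and structure are assumptions,
    # not the real suitcase-mongo Serializer.
    def __init__(self, metadatastore_db, asset_registry_db, ignore_duplicates=True):
        self._metadatastore_db = metadatastore_db
        self._asset_registry_db = asset_registry_db
        # ignore_duplicates=False gives the "strict" mode described above:
        # a duplicate insert raises instead of being logged and skipped.
        self._ignore_duplicates = ignore_duplicates

    def _insert(self, collection, doc):
        try:
            collection.insert_one(doc)
        except DuplicateKeyError:
            if not self._ignore_duplicates:
                raise
            logger.warning("Ignored duplicate document with uid %r", doc.get("uid"))
```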

@tacaswell (Contributor)

Do you think it's worth adding an ignore_duplicates flag to __init__ so that this is configurable and can be run in a "strict" mode for debugging purposes?

I agree with this.

Having the check is better than just dropping the document, but I am still very confused by why we are getting multiple inserts and worry that it is indicative of a much more fundamental problem that needs to be run down.

@danielballan (Member)

I am still very confused by why we are getting multiple inserts and worry that it is indicative of a much more fundamental problem that needs to be run down.

AFAIK we have never actually seen multiple inserts. But we know that it's theoretically possible for a Kafka consumer process to fail between successful insertion of a document and confirming to Kafka that the document has been successfully processed and need not be re-sent ("committing"). If we start seeing any log messages from this, we'll investigate, but I would expect to never see this in practice.

When this is used in-process with a RunEngine, it should fail hard if it receives a duplicate; but when it is used downstream of a message bus that is designed to err on the side of retrying, it should tolerate duplicates in this fashion and log them at a high level to prompt a human to investigate.
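
To illustrate the failure window described above, here is a hedged sketch of a Kafka consumer loop with manual offset commits. It is not code from this PR; the broker address, topic name, payload encoding, and database names are all assumptions.

```python
# Illustration only: the at-least-once window between inserting a document
# and committing the Kafka offset. All names below are assumptions.
import msgpack
from confluent_kafka import Consumer
from pymongo import MongoClient
from suitcase.mongo_normalized import Serializer

client = MongoClient("localhost:27017")
serializer = Serializer(client["metadatastore"], client["asset_registry"])

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "mongo-serializer",
    "enable.auto.commit": False,
})
consumer.subscribe(["bluesky-documents"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    # Assumes (name, doc) pairs are msgpack-encoded on the bus.
    name, doc = msgpack.unpackb(msg.value(), raw=False)
    serializer(name, doc)  # insert into MongoDB
    # A crash right here means the document was inserted but the offset was
    # never committed, so Kafka re-delivers the message and the serializer
    # sees a duplicate after restart.
    consumer.commit(message=msg)
```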

@gwbischof (Contributor, Author)

An important note here: the resource uid index has been made unique. There are one or two beamlines that may have resource uids that are not unique. The plan is to require them to update their profiles so that they only create resources with unique uids.
This change was needed to dedupe resource documents.
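
As an illustration of the index change being described, a unique index on uid can be declared with pymongo roughly as follows; the database and collection names here are assumptions.

```python
# Sketch of a unique index on resource uid; database/collection names
# are assumptions, not necessarily what suitcase-mongo uses internally.
import pymongo

client = pymongo.MongoClient("localhost:27017")
resources = client["asset_registry"]["resource"]

# With unique=True, inserting a second resource with the same uid raises
# pymongo.errors.DuplicateKeyError instead of silently creating a duplicate.
resources.create_index([("uid", pymongo.ASCENDING)], unique=True)
```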

@tacaswell (Contributor)

👍 for paranoia!

@danielballan (Member)

The plan is to require them to update their profiles so that they only create resources with unique uids.

It's important to be clear that this alone won't work: we need to generate unique uids going forward and update their databases to remove all instances of duplicate uids from the past. The first is comparatively easy (and, I think, already taken care of everywhere). The second will require significant effort and great care.
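
For illustration, one way to audit an existing database for duplicate resource uids before enforcing uniqueness might look like the sketch below. It is not a migration plan, and the server address and collection names are assumptions.

```python
# Illustrative only: find resource uids that appear more than once.
import pymongo

client = pymongo.MongoClient("localhost:27017")
resources = client["asset_registry"]["resource"]

duplicates = resources.aggregate([
    {"$group": {"_id": "$uid", "count": {"$sum": 1}}},
    {"$match": {"count": {"$gt": 1}}},
])
for dup in duplicates:
    print(f"uid {dup['_id']} appears {dup['count']} times")
```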

@danielballan merged commit c154220 into bluesky:master Aug 31, 2020
danielballan added a commit to danielballan/suitcase-mongo that referenced this pull request Dec 18, 2020
This conflicts with indexing requirements imposed by databroker.v0.
We need old versions of databroker.v0 to work against the same database
as this does.