Handling duplicates #34
Conversation
Looks good. Do you think it's worth adding an `ignore_duplicates` flag to `__init__` so that this is configurable and can be run in a "strict" mode for debugging purposes?
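A minimal sketch of what such a flag might look like. This is illustrative only: the `Serializer` class, the `DuplicateUid` exception, and the in-memory store are hypothetical stand-ins (the real serializer writes to a database), and `ignore_duplicates` is the name proposed above, not a confirmed API:

```python
import logging

logger = logging.getLogger(__name__)


class DuplicateUid(Exception):
    """Raised in strict mode when a document's uid has already been inserted."""


class Serializer:
    # Hypothetical sketch: the real serializer targets MongoDB, not a dict.
    def __init__(self, ignore_duplicates=True):
        self._ignore_duplicates = ignore_duplicates
        self._documents = {}  # uid -> document

    def insert(self, doc):
        uid = doc["uid"]
        if uid in self._documents:
            if self._ignore_duplicates:
                # Tolerate the duplicate (e.g. a message-bus re-delivery)
                # but log loudly to prompt a human to investigate.
                logger.error("Duplicate document received: %s", uid)
                return
            # Strict mode: fail hard, useful for in-process debugging.
            raise DuplicateUid(f"Document with uid {uid!r} already inserted")
        self._documents[uid] = doc
```

In-process with a RunEngine one could pass `ignore_duplicates=False` to fail hard on a duplicate; downstream of a retry-happy message bus, the tolerant default merely logs it.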
I agree with this. Having the check is better than just dropping the document, but I am still very confused by why we are getting multiple inserts and worry that it is indicative of a much more fundamental problem that needs to be run down.
AFAIK we have never actually seen multiple inserts. But we know that it's theoretically possible for a Kafka consumer process to fail between successful insertion of a document and confirming to Kafka that the document has been successfully processed and need not be re-sent ("committing"). If we start seeing any log messages from this, we'll investigate, but I would expect to never see this in practice. When this is used in-process with a RunEngine, it should fail hard if it receives a duplicate, but when this is used downstream of a message bus that is designed to err on the side of retrying, it should tolerate duplicates in this fashion but log them at a high level to prompt a human to investigate.
An important note here is that the resource uid index has been made unique. There are one or two beamlines that may have resource uids that are not unique. The plan is to require them to update their profiles to only make resources with unique uids.
👍 for paranoia!
It's important to be clear that this won't work unless we generate unique uids going forward and update the databases to remove all instances of duplicate uids from the past. The first is comparatively easy (and, I think, already taken care of everywhere). The second will require significant effort and great care.
This conflicts with indexing requirements imposed by databroker.v0. We need old versions of databroker.v0 to work against the same database as this does.
Fixes: #33