Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The ingest tracker drops updates if they come in too close together #4781

Closed
alexwlchan opened this issue Sep 17, 2020 · 2 comments · Fixed by wellcomecollection/storage-service#739
Labels
🐛 Bug 📦 Storage service Work related to the storage service

Comments

@alexwlchan
Copy link
Contributor

When all the services are running and processing messages immediately, we've seen the ingest tracker drop updates, e.g.

  • service 1 sends "I'm done working"
  • service 2 sends "I'm starting up"

If those two messages arrive in quick succession, one of them gets dropped.

e.g. 49737f31-35f7-4d09-b63b-514f2a46b29c is missing its message from the bag versioner.

The error in the logs is extremely unhelpful and unenlightening in this regard:

uk.ac.wellcome.platform.storage.ingests_tracker.tracker.UpdateNonExistentIngestError: null

because we swallow the entire error.

@alexwlchan
Copy link
Contributor Author

I don't think we'll know the best way to fix this until we can see what's going on.

I've put up an initial patch which means we'll log the proper error (wellcomecollection/storage-service#738) – once we've got that merged, we can deploy it in staging and fire a stream of ingest updates into the tracker to see what happens.

@alexwlchan
Copy link
Contributor Author

Looking in the logs, I saw the following sequence of events:

  • Ingest worker makes an HTTP PATCH to the ingest tracker ("Please update ingest XYZ")
  • The ingest tracker API returns an HTTP 409 Conflict
  • The ingest worker treats a Conflict as a deterministic error, and deletes the message

The ingest tracker API returns a Conflict if the underlying IngestTracker class returns an instance of StateConflictError, but I couldn't see why the DynamoIngestTracker we use would return that.

I wrote a small test to see what happens in the Dynamo tracker if you fire in lots of updates at the same time. Mostly I got Left(IngestStoreUnexpectedError: UpdateWriteError), but I saw one Left(UpdateNonExistentIngestError). That's interesting – DynamoDB is eventually consistent, and this error type is a StateConflictError.


So here's my new theory for what happens:

If multiple ingest updates arrive simultaneously, depending on consistency, DynamoDB returns either:

  • IngestStoreUnexpectedError. The ingest tracker API turns this into an HTTP 500, which the ingest worker treats as a non-deterministic failure and it sends the message back to the queue for retrying.
  • UpdateNonExistentIngestError. The ingest tracker API turns this into an HTTP 409, which the ingest worker treats as a deterministic failure and it deletes the message.

The latter only occurs rarely, subject to DynamoDB consistency, which is why we haven't seen this more often – but if we're putting a lot of load on DynamoDB (say, sending an update for every ingest in the storage service), you'd expect to see this more often.

I'm going to fix this by adding a new status code to the ingest tracker: if you try to update an ingest that doesn't exist, you get an HTTP 404 Not Found error, rather than a StateConflict.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🐛 Bug 📦 Storage service Work related to the storage service
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant