The ingest tracker drops updates if they come in too close together #4781

alexwlchan · 2020-09-17T14:07:13Z

When all the services are running and processing messages immediately, we've seen the ingest tracker drop updates, e.g.

service 1 sends "I'm done working"
service 2 sends "I'm starting up"

If those two messages arrive in quick succession, one of them gets dropped.

e.g. 49737f31-35f7-4d09-b63b-514f2a46b29c is missing its message from the bag versioner.

The error in the logs is extremely unhelpful and unenlightening in this regard:

uk.ac.wellcome.platform.storage.ingests_tracker.tracker.UpdateNonExistentIngestError: null

because we swallow the entire error.

The text was updated successfully, but these errors were encountered:

alexwlchan · 2020-09-17T14:12:48Z

I don't think we'll know the best way to fix this until we can see what's going on.

I've put up an initial patch which means we'll log the proper error (wellcomecollection/storage-service#738) – once we've got that merged, we can deploy it in staging and fire a stream of ingest updates into the tracker to see what happens.

alexwlchan · 2020-09-18T10:34:06Z

Looking in the logs, I saw the following sequence of events:

Ingest worker makes an HTTP PATCH to the ingest tracker ("Please update ingest XYZ")
The ingest tracker API returns an HTTP 409 Conflict
The ingest worker treats a Conflict as a deterministic error, and deletes the message

The ingest tracker API returns a Conflict if the underlying IngestTracker class returns an instance of StateConflictError, but I couldn't see why the DynamoIngestTracker we use would return that.

I wrote a small test to see what happens in the Dynamo tracker if you fire in lots of updates at the same time. Mostly I got Left(IngestStoreUnexpectedError: UpdateWriteError), but I saw one Left(UpdateNonExistentIngestError). That's interesting – DynamoDB is eventually consistent, and this error type is a StateConflictError.

So here's my new theory for what happens:

If multiple ingest updates arrive simultaneously, depending on consistency, DynamoDB returns either:

IngestStoreUnexpectedError. The ingest tracker API turns this into an HTTP 500, which the ingest worker treats as a non-deterministic failure and it sends the message back to the queue for retrying.
UpdateNonExistentIngestError. The ingest tracker API turns this into an HTTP 409, which the ingest worker treats as a deterministic failure and it deletes the message.

The latter only occurs rarely, subject to DynamoDB consistency, which is why we haven't seen this more often – but if we're putting a lot of load on DynamoDB (say, sending an update for every ingest in the storage service), you'd expect to see this more often.

I'm going to fix this by adding a new status code to the ingest tracker: if you try to update an ingest that doesn't exist, you get an HTTP 404 Not Found error, rather than a StateConflict.

alexwlchan added 🐛 Bug 📦 Storage service Work related to the storage service labels Sep 17, 2020

alexwlchan mentioned this issue Sep 17, 2020

Keep the full UpdateError in the ingests tracker wellcomecollection/storage-service#738

Closed

alexwlchan mentioned this issue Sep 18, 2020

If the ingest worker tries to update a non-existent ingest, retry wellcomecollection/storage-service#739

Merged

alexwlchan closed this as completed in wellcomecollection/storage-service#739 Sep 18, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

The ingest tracker drops updates if they come in too close together #4781

The ingest tracker drops updates if they come in too close together #4781

alexwlchan commented Sep 17, 2020

alexwlchan commented Sep 17, 2020

alexwlchan commented Sep 18, 2020

The ingest tracker drops updates if they come in too close together #4781

The ingest tracker drops updates if they come in too close together #4781

Comments

alexwlchan commented Sep 17, 2020

alexwlchan commented Sep 17, 2020

alexwlchan commented Sep 18, 2020