fix: improve error logging for ddbToEs sync #68
Conversation
console.error('Failed to update ES records', e);
throw e;
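For context, here is a minimal sketch (TypeScript) of the pattern under review: catch the batch failure, log it with context, and rethrow so the invocation is marked as failed and the event source mapping retries the batch. The handler and helper names are illustrative, not the project's actual identifiers.

```ts
import { DynamoDBStreamEvent, DynamoDBRecord } from 'aws-lambda';

// Hypothetical helper: bulk-writes a batch of stream records to Elasticsearch.
async function updateEsRecords(records: DynamoDBRecord[]): Promise<void> {
    // ... index/delete each record in ES (omitted) ...
}

export async function handleDdbToEsEvent(event: DynamoDBStreamEvent): Promise<void> {
    try {
        await updateEsRecords(event.Records);
    } catch (e) {
        // Surface the failure in CloudWatch with context, then rethrow so the
        // invocation fails; swallowing the error would silently drop the batch.
        console.error('Failed to update ES records', e);
        throw e;
    }
}
```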
I think this will result in some messages that were processed successfully being sent to the DLQ, since a single failing message makes the whole batch fail, and retrying the same batch will keep failing.
It is common to use a batch size of 1 to work around this issue. An alternative is to enable BisectBatchOnFunctionError, although I haven't used that setting before and I'm not sure how it interacts with MaximumRetryAttempts.
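As a reference for the settings being discussed, here is a minimal CDK-style sketch (assuming aws-cdk-lib; the project's actual deployment tooling may differ) showing where BatchSize, BisectBatchOnFunctionError, MaximumRetryAttempts, and an on-failure DLQ are configured on the stream-to-Lambda mapping:

```ts
import { Duration } from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as sqs from 'aws-cdk-lib/aws-sqs';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import { DynamoEventSource, SqsDlq } from 'aws-cdk-lib/aws-lambda-event-sources';
import { Construct } from 'constructs';

// Assumed to exist elsewhere in the stack; declared here only to keep the sketch self-contained.
declare const scope: Construct;
declare const resourceTable: dynamodb.Table;
declare const ddbToEsLambda: lambda.Function;

const ddbToEsDlq = new sqs.Queue(scope, 'DdbToEsDlq', { retentionPeriod: Duration.days(14) });

ddbToEsLambda.addEventSource(
    new DynamoEventSource(resourceTable, {
        startingPosition: lambda.StartingPosition.LATEST,
        batchSize: 15,              // the batch size mentioned later in the thread
        bisectBatchOnError: true,   // BisectBatchOnFunctionError: split a failing batch and retry the halves
        retryAttempts: 4,           // MaximumRetryAttempts
        onFailure: new SqsDlq(ddbToEsDlq), // records that still fail after retries land here
    }),
);
```

Note that how retryAttempts interacts with bisecting is exactly the under-documented behavior the thread is unsure about, so these values would need to be validated empirically.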
Yep, that is exactly the case: some messages will succeed, but the batch will fail if a single message fails.
I worry a batch size of 1 may slow down our sync too much. I looked into BisectBatch, but as you mentioned I was not sure how it works with MaxRetry, and I wasn't able to find documentation around it either. I suspect that it will bisect at most MaxRetry times.
These writes are mostly idempotent, but there could be a case where a resource's availability flips because of this, i.e. 1) an "AVAILABLE" write fails and goes to the DLQ, 2) a "DELETE" write passes, 3) a DLQ redrive changes the ES doc from DELETED -> AVAILABLE.
A thing to note: this DLQ redrive is a manual process, and in reality I suspect this operation would need a runbook laying out when to redrive the DLQ and when not to.
I think that guaranteeing that only the failed messages go to the DLQ is a very desirable property of the system. Otherwise, ops become harder for customers for no good reason (why are there so many DLQ messages? How come only 6% of them actually failed? How can I know which of them actually failed?)
Another desirable property is handling out-of-order messages. Our current implementation does not do that (which is not the same as idempotency). It could be achieved by updating ES only if the vid of the incoming message is higher than the vid of the document in ES. This would make it safe to redrive DLQ messages. I think we can tackle this later as a separate issue.
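One way to implement the "only write if the incoming vid is higher" idea is Elasticsearch's external versioning, sketched below with the @elastic/elasticsearch client. The index name, client setup, and error handling are assumptions, not the project's actual code:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: process.env.ES_ENDPOINT ?? 'https://localhost:9200' });

// With version_type 'external', ES rejects any write whose version is not strictly
// greater than the stored one, so stale or redriven messages cannot overwrite a
// newer document state.
async function upsertIfNewer(id: string, vid: number, doc: Record<string, unknown>): Promise<void> {
    try {
        await client.index({
            index: 'resource', // index name is illustrative
            id,
            version: vid, // vid carried on the incoming stream record
            version_type: 'external',
            body: doc,
        });
    } catch (e: any) {
        if (e?.meta?.statusCode === 409) {
            // Version conflict: ES already holds a document with an equal or higher vid; drop the stale write.
            return;
        }
        throw e;
    }
}
```

Deletes could be version-gated the same way, which is what would prevent the DELETED -> AVAILABLE flip described earlier in the thread.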
IMO sending only the failed messages to the DLQ should be done now (it can still be a different PR). I agree that BisectBatch has scarce documentation, but it is worth testing out. Maybe MaxRetry=4 and BisectBatch=true with our BatchSize=15 will effectively isolate the error to a single record. The cheap alternative is MaxRetry=1.

> I worry a batch size of 1 may slow down our sync too much

My intuition tells me the same, but we need data in order to discard that approach.
If we're making the change to throw an error, we should set maxRetry to 1. Otherwise, if we fail to update ES because we have too many open sockets, the retries will make it even worse. Also, out of curiosity, how will this link up with pushing items to the DLQ? Is there a mechanism that will automatically push the records of failed requests to the DLQ?
In reply to @nguyen102:
In the case of ES throttling, we would want to retry, right? Perhaps it is better for the Lambda itself to do the retries?
The EventStream tries
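If the Lambda itself were to retry throttling-style errors before giving up, it could look roughly like the wrapper below. This is only a sketch of the idea, not the project's code, and the attempt count and backoff values are arbitrary:

```ts
// Retry a unit of work a few times with exponential backoff before letting the
// error propagate (and therefore fail the invocation / trigger the stream retry).
async function withRetries<T>(fn: () => Promise<T>, attempts = 3, baseDelayMs = 200): Promise<T> {
    let lastError: unknown;
    for (let i = 0; i < attempts; i += 1) {
        try {
            return await fn();
        } catch (e) {
            lastError = e;
            // Back off 200ms, 400ms, 800ms, ... before the next attempt.
            await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
        }
    }
    throw lastError;
}

// Usage (with the hypothetical helper from the earlier sketch):
// await withRetries(() => updateEsRecords(event.Records));
```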
Ah, got it. I assume the change for the DLQ will be in the
I think this would be a separate PR? This PR is focused on improving logging -- happy to discuss otherwise too.
Yep, agreed. I'm ok with it being in a separate PR. Until then, should we change the
IMO we have a correct setup right now. Lambda code has zero retries, and the event stream handles the retries.
Why? It's very likely that #69 is the cause of the
Approving since this is a step in the right direction in terms of error logging.
It may be a good idea to sync up offline about the retry & DLQ setup since there are many comments about it.
Oh I see. Assuming #69 is the solution, I'm ok with approving this PR and merging in #69. We can gather more data after this and figure out the best step forward.
Issue #, if available: #18, awslabs/fhir-works-on-aws-search-es#21
The goal of this PR is to improve error logging for the ddbToEs sync.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.