
fix: improve error logging for ddbToEs sync #68

Merged
rsmayda merged 2 commits into mainline from logging on Apr 26, 2021

Conversation

@rsmayda (Contributor) commented Apr 22, 2021:

Issue #, if available: #18, awslabs/fhir-works-on-aws-search-es#21

The goals of this PR are:

  1. Throw an error if the Lambda can't process the batch (causing retries/failures).
  2. If errors do occur, put in the logs:
    1. the resources that were 'bad'
    2. all the resources in this batch that may not have been processed

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
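
For orientation, the kind of logging and failure behavior described in the goals above might look roughly like the sketch below. This is an illustration only, not the actual diff in src/ddbToEs/index.ts; it assumes a v7-style @elastic/elasticsearch client, and the function name and batch handling are made up.

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: process.env.ES_ENDPOINT ?? 'https://localhost:9200' });

async function bulkWriteAndLogFailures(operations: object[]): Promise<void> {
    const { body } = await client.bulk({ body: operations });
    if (body.errors) {
        // Goal 2.1: log the 'bad' resources, i.e. the bulk items ES rejected.
        const failedItems = body.items.filter((item: any) => {
            const action = item.index ?? item.update ?? item.delete ?? item.create;
            return action && action.error;
        });
        console.error('Failed to update ES records', JSON.stringify(failedItems));
        // Goal 2.2: note the full batch, since any of it may not have been processed.
        console.error(`Batch of ${operations.length} operations may be partially unprocessed`);
        // Goal 1: fail the invocation so the event source mapping retries the
        // batch and eventually routes it to the DLQ.
        throw new Error(`ddbToEs sync failed for ${failedItems.length} document(s)`);
    }
}
```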

@rsmayda rsmayda self-assigned this Apr 22, 2021
@rsmayda rsmayda requested review from carvantes and nguyen102 April 22, 2021 02:53
@rsmayda rsmayda added the enhancement label Apr 22, 2021
Review thread on src/ddbToEs/index.ts (outdated):
```ts
);

console.error('Failed to update ES records', e);
throw e;
```
A Contributor commented:

I think that this will result in some messages that were processed successfully being sent to the DLQ, since a single message makes the batch fail and retrying the same batch will continue to fail.

It is common to use a batch size of 1 to work around this issue. An alternative is to enable BisectBatchOnFunctionError, although I haven't used that setting before and I'm not sure how it interacts with MaximumRetryAttempts:

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-lambda-eventsourcemapping.html#cfn-lambda-eventsourcemapping-bisectbatchonfunctionerror
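
For reference, the knobs being discussed map directly onto the EventSourceMapping properties shown below. This is a hedged sketch using the AWS SDK v3 purely to illustrate the fields; in this project they would normally be set through serverless.yaml / CloudFormation rather than at runtime, and the mapping UUID, values, and queue ARN are placeholders.

```ts
import { LambdaClient, UpdateEventSourceMappingCommand } from '@aws-sdk/client-lambda';

const lambda = new LambdaClient({});

async function tuneDdbToEsMapping(): Promise<void> {
    await lambda.send(
        new UpdateEventSourceMappingCommand({
            UUID: '00000000-0000-0000-0000-000000000000', // placeholder mapping id
            BatchSize: 15,
            // Split a failing batch in half and retry each half, narrowing the
            // failure down toward the bad record(s) instead of retrying everything.
            BisectBatchOnFunctionError: true,
            MaximumRetryAttempts: 4,
            // After retries are exhausted, send batch metadata (not the records
            // themselves) to an SQS DLQ for manual redrive.
            DestinationConfig: {
                OnFailure: { Destination: 'arn:aws:sqs:us-east-1:123456789012:ddb-to-es-dlq' },
            },
        }),
    );
}
```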

@rsmayda (Contributor, Author) commented Apr 22, 2021:

Yep, that is exactly the case: some messages will succeed, but the batch will fail if a single message fails.

I worry a batch size of 1 may slow down our sync too much. I looked into BisectBatch, but as you mentioned I was not sure how it works with MaxRetry, and I wasn't able to find documentation around it either. I suspect that it will bisect at most MaxRetry times.

These writes are mostly idempotent, but there could be a case where a resource's availability flips because of this, i.e. 1) an "AVAILABLE" write fails and goes to the DLQ, 2) a "DELETE" write passes, 3) a DLQ redrive changes the ES doc from DELETED -> AVAILABLE.

A thing to note: this DLQ redrive is a manual process, and in reality I suspect this operation would need a runbook laying out when to redrive the DLQ and when not to.

The same Contributor replied:

I think that guaranteeing that only the failed messages go to the DLQ is a very desirable property of the system. Otherwise ops become harder for customers for no good reason (why are there so many DLQ messages? How come only 6% of them actually failed? How can I know which of them actually failed?)

Another desirable property is handling out-of-order messages. Our current implementation does not do that (which is not the same as idempotency). It could be achieved by updating ES only if the vid of the incoming message is higher than the vid of the document in ES. This would make it safe to redrive DLQ messages. I think we can tackle this later as a separate issue.

IMO sending only the failed messages to the DLQ should be done now (can still be a different PR). I agree that BisectBatch has scarce documentation, but it is worth testing out. Maybe MaxRetry=4 and BisectBatch=true with our BatchSize=15 will effectively isolate the error to a single record.

The cheap alternative is MaxRetry=1.

I worry a batch size of 1 may slow down our sync too much

My intuition tells me the same, but we need data in order to discard that approach.
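
For concreteness, the vid guard suggested above could look roughly like the following. This is a minimal sketch, not part of this PR: it assumes a v7-style @elastic/elasticsearch client and Elasticsearch external versioning, and the index name, document shape, and error handling are illustrative only.

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: process.env.ES_ENDPOINT ?? 'https://localhost:9200' });

async function upsertIfNewer(index: string, id: string, vid: number, doc: object): Promise<void> {
    try {
        // With external versioning, Elasticsearch rejects the write unless the
        // supplied version is greater than the stored one, so stale or
        // out-of-order messages (including DLQ redrives) become no-ops.
        await client.index({
            index,
            id,
            version: vid,
            version_type: 'external',
            body: doc,
        });
    } catch (e: any) {
        if (e?.meta?.statusCode === 409) {
            // A document with an equal or higher vid is already indexed; ignore.
            return;
        }
        throw e;
    }
}
```

Deletes would presumably need the same treatment (a versioned delete or a tombstone) for the AVAILABLE/DELETED scenario described earlier to stay safe.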

@nguyen102 (Contributor) commented:

If we're making the change to throw an error, we should set the maxRetry to 1.
https://github.com/awslabs/fhir-works-on-aws-deployment/blob/a03bdc6b27fe5eca8708d415f98cd60e30b1f897/serverless.yaml#L119

Otherwise, if we fail to update ES because we have too many open sockets, then the retries will make it even worse.

Also, out of curiosity, how will this link up with pushing items to the DLQ? Is there a mechanism that will automatically push the record of failed requests to the DLQ?

@rsmayda (Contributor, Author) commented Apr 22, 2021:

In reply to @nguyen102:

maxRetry to 1.

In the case of ES throttling we would want to retry, right? Perhaps it is better for the Lambda itself to do retries?

how will this link up with pushing items to the DLQ? Is there a mechanism that will automatically push the record of failed requests to the DLQ?

The EventStream retries up to maxRetry times; if the Lambda still fails past that, a message containing the EventStream batch information is dropped into the DLQ.
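
To illustrate that last point, the OnFailure record that lands in the DLQ carries batch metadata rather than the FHIR resources themselves, roughly along the lines below. The field names are recalled from the AWS docs for DynamoDB stream event source mappings and should be treated as approximate.

```ts
// Approximate shape of an OnFailure DLQ message for a DynamoDB stream event
// source mapping; verify field names against the AWS docs before relying on them.
interface DdbToEsDlqMessage {
    requestContext: {
        condition: string; // e.g. 'RetryAttemptsExhausted'
        approximateInvokeCount: number;
    };
    DDBStreamBatchInfo: {
        shardId: string;
        startSequenceNumber: string;
        endSequenceNumber: string;
        batchSize: number;
        streamArn: string;
    };
}
```

Because only this metadata is stored, redriving means re-reading that sequence-number range from the stream (or re-indexing from DynamoDB), which is presumably part of why a runbook would be needed.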

@rsmayda rsmayda requested a review from carvantes April 23, 2021 13:31
@nguyen102 (Contributor) commented:

In reply to @rsmayda:

In the case of ES throttling we would want to retry, right? Perhaps it is better for the Lambda itself to do retries?

Yep, agreed. We should have the Lambda itself retry, instead of having the stream invoke the Lambda again.

The EventStream retries up to maxRetry times; if the Lambda still fails past that, a message containing the EventStream batch information is dropped into the DLQ.

Ah, got it. I assume the change for the DLQ will be in the deployment package then?

@rsmayda (Contributor, Author) commented Apr 23, 2021:

@nguyen102: Ah, got it. I assume the change for the DLQ will be in the deployment package then?

Yep: awslabs/fhir-works-on-aws-deployment#295

@rsmayda (Contributor, Author) commented Apr 23, 2021:

@nguyen102: Yep, agreed. We should have the Lambda itself retry, instead of having the stream invoke the Lambda again.

I think this would be a separate PR? This PR is focusing on improving logging -- happy to discuss otherwise too

@nguyen102 (Contributor) commented:

@nguyen102: Yep, agreed. We should have the Lambda itself retry, instead of having the stream invoke the Lambda again.

I think this would be a separate PR? This PR is focusing on improving logging -- happy to discuss otherwise too

Yep, agreed. I'm ok with it being in a separate PR. Until then, should we change the maxRetry in the deployment package to 1? From my experience, ES errors usually occur from Lambda trying to open too many socket connections. Having the retries will make it worse.
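
For reference, the "have the Lambda itself retry" idea would look roughly like the sketch below. This is a hedged illustration, not code from this PR or the repository; writeBatchToEs stands in for the real ddbToEs bulk write, and the attempt count and backoff numbers are arbitrary.

```ts
async function writeWithBackoff(
    writeBatchToEs: () => Promise<void>,
    maxAttempts = 3,
): Promise<void> {
    for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
        try {
            await writeBatchToEs();
            return;
        } catch (e) {
            if (attempt === maxAttempts) {
                // Let the final error escape so the event source mapping (and
                // eventually the DLQ) still sees the failure.
                throw e;
            }
            // Exponential backoff between in-Lambda attempts: 400ms, 800ms, ...
            await new Promise((resolve) => setTimeout(resolve, 200 * 2 ** attempt));
        }
    }
}
```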

@carvantes (Contributor) commented Apr 26, 2021:

We should have the Lambda itself retry, instead of having the stream invoke the Lambda again.

IMO we have a correct setup right now: Lambda code has zero retries, and the EventSourceMapping has MaximumRetryAttempts=3. Are you suggesting having the Lambda code do retries? If so, what's the benefit we are pursuing in doing so? It'd make the e2e retry behavior more complex.


Having the retries will make it worse.

Why?

It's very likely that #69 is the cause of the EMFILE errors (creating lots and lots of ES clients is a bad idea). Plus, retries don't increase the number of concurrent requests to ES. Streams MUST be processed in order, so retries on a batch will block the processing of all the other incoming messages until max retries are reached.

@carvantes (Contributor) left a review:

Approving since this is a step in the right direction in terms of error logging.

It may be a good idea to sync up offline about the retry & DLQ setup since there are many comments about it.

@nguyen102 (Contributor) commented:

Streams MUST be processed in order, so retries on a batch will block the processing of all the other incoming messages until max retries are reached.

Oh I see, assuming #69 is the solution to the EMFILE errors, that would alleviate my concern about the retries. Good point about the retries not causing the EMFILE errors to get worse. Without the retries we would still have EMFILE errors as the Lambda processes new batches and once again encounters the error, although the retries do cause the EMFILE error to persist three times longer if we get into this state.

I'm ok with approving this PR and merging in #69. We can gather more data after this and figure out the best step forward.

@rsmayda rsmayda merged commit 5774b34 into mainline Apr 26, 2021
@rsmayda rsmayda deleted the logging branch April 26, 2021 14:05