
fix: improve error logging for ddbToEs sync #68

Merged
rsmayda merged 2 commits into mainline from logging on Apr 26, 2021

Conversation

@rsmayda (Contributor) commented Apr 22, 2021:

Issue #, if available: #18, awslabs/fhir-works-on-aws-search-es#21

The goals of this PR are:

  1. Throw an error if the Lambda can't process the batch (causing retries/failures).
  2. If errors do occur, put in the logs:
    1. the resources that were 'bad'
    2. all the resources in this batch that may not have been processed

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
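
For orientation, the kind of logging and failure behavior described in the goals above might look roughly like the sketch below. This is an illustration only, not the actual diff in src/ddbToEs/index.ts; it assumes a v7-style @elastic/elasticsearch client, and the function name and batch handling are made up.

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: process.env.ES_ENDPOINT ?? 'https://localhost:9200' });

async function bulkWriteAndLogFailures(operations: object[]): Promise<void> {
    const { body } = await client.bulk({ body: operations });
    if (body.errors) {
        // Goal 2.1: log the 'bad' resources, i.e. the bulk items ES rejected.
        const failedItems = body.items.filter((item: any) => {
            const action = item.index ?? item.update ?? item.delete ?? item.create;
            return action && action.error;
        });
        console.error('Failed to update ES records', JSON.stringify(failedItems));
        // Goal 2.2: note the full batch, since any of it may not have been processed.
        console.error(`Batch of ${operations.length} operations may be partially unprocessed`);
        // Goal 1: fail the invocation so the event source mapping retries the
        // batch and eventually routes it to the DLQ.
        throw new Error(`ddbToEs sync failed for ${failedItems.length} document(s)`);
    }
}
```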

@rsmayda rsmayda self-assigned this Apr 22, 2021
@rsmayda rsmayda requested review from carvantes and nguyen102 April 22, 2021 02:53
@rsmayda rsmayda added the enhancement label Apr 22, 2021
Review thread on src/ddbToEs/index.ts (outdated):
```ts
);

console.error('Failed to update ES records', e);
throw e;
```
A Contributor commented:

I think that this will result in some messages that were processed successfully being sent to the DLQ, since a single message makes the batch fail and retrying the same batch will continue to fail.

It is common to use a batch size of 1 to work around this issue. An alternative is to enable BisectBatchOnFunctionError, although I haven't used that setting before and I'm not sure how it interacts with MaximumRetryAttempts:

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-lambda-eventsourcemapping.html#cfn-lambda-eventsourcemapping-bisectbatchonfunctionerror
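
For reference, the knobs being discussed map directly onto the EventSourceMapping properties shown below. This is a hedged sketch using the AWS SDK v3 purely to illustrate the fields; in this project they would normally be set through serverless.yaml / CloudFormation rather than at runtime, and the mapping UUID, values, and queue ARN are placeholders.

```ts
import { LambdaClient, UpdateEventSourceMappingCommand } from '@aws-sdk/client-lambda';

const lambda = new LambdaClient({});

async function tuneDdbToEsMapping(): Promise<void> {
    await lambda.send(
        new UpdateEventSourceMappingCommand({
            UUID: '00000000-0000-0000-0000-000000000000', // placeholder mapping id
            BatchSize: 15,
            // Split a failing batch in half and retry each half, narrowing the
            // failure down toward the bad record(s) instead of retrying everything.
            BisectBatchOnFunctionError: true,
            MaximumRetryAttempts: 4,
            // After retries are exhausted, send batch metadata (not the records
            // themselves) to an SQS DLQ for manual redrive.
            DestinationConfig: {
                OnFailure: { Destination: 'arn:aws:sqs:us-east-1:123456789012:ddb-to-es-dlq' },
            },
        }),
    );
}
```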

@rsmayda (Contributor, Author) commented Apr 22, 2021:

Yep, that is exactly the case: some messages will succeed, but the batch will fail if a single message fails.

I worry a batch size of 1 may slow down our sync too much. I looked into BisectBatch, but as you mentioned I was not sure how it works with MaxRetry, and I wasn't able to find documentation around it either. I suspect that it will bisect at most MaxRetry times.

These writes are mostly idempotent, but there could be a case where a resource's availability flips because of this, i.e. 1) an "AVAILABLE" write fails and goes to the DLQ, 2) a "DELETE" write passes, 3) a DLQ redrive changes the ES doc from DELETED -> AVAILABLE.

A thing to note: this DLQ redrive is a manual process, and in reality I suspect this operation would need a runbook laying out when to redrive the DLQ and when not to.

The same Contributor replied:

I think that guaranteeing that only the failed messages go to the DLQ is a very desirable property of the system. Otherwise ops become harder for customers for no good reason (why are there so many DLQ messages? How come only 6% of them actually failed? How can I know which of them actually failed?)

Another desirable property is handling out-of-order messages. Our current implementation does not do that (which is not the same as idempotency). It could be achieved by updating ES only if the vid of the incoming message is higher than the vid of the document in ES. This would make it safe to redrive DLQ messages. I think we can tackle this later as a separate issue.

IMO sending only the failed messages to the DLQ should be done now (can still be a different PR). I agree that BisectBatch has scarce documentation, but it is worth testing out. Maybe MaxRetry=4 and BisectBatch=true with our BatchSize=15 will effectively isolate the error to a single record.

The cheap alternative is MaxRetry=1.

I worry a batch size of 1 may slow down our sync too much

My intuition tells me the same, but we need data in order to discard that approach.
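
For concreteness, the vid guard suggested above could look roughly like the following. This is a minimal sketch, not part of this PR: it assumes a v7-style @elastic/elasticsearch client and Elasticsearch external versioning, and the index name, document shape, and error handling are illustrative only.

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: process.env.ES_ENDPOINT ?? 'https://localhost:9200' });

async function upsertIfNewer(index: string, id: string, vid: number, doc: object): Promise<void> {
    try {
        // With external versioning, Elasticsearch rejects the write unless the
        // supplied version is greater than the stored one, so stale or
        // out-of-order messages (including DLQ redrives) become no-ops.
        await client.index({
            index,
            id,
            version: vid,
            version_type: 'external',
            body: doc,
        });
    } catch (e: any) {
        if (e?.meta?.statusCode === 409) {
            // A document with an equal or higher vid is already indexed; ignore.
            return;
        }
        throw e;
    }
}
```

Deletes would presumably need the same treatment (a versioned delete or a tombstone) for the AVAILABLE/DELETED scenario described earlier to stay safe.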

@nguyen102 (Contributor) commented:

If we're making the change to throw an error, we should set the maxRetry to 1.
https://github.com/awslabs/fhir-works-on-aws-deployment/blob/a03bdc6b27fe5eca8708d415f98cd60e30b1f897/serverless.yaml#L119

Otherwise, if we fail to update ES because we have too many open sockets, then the retries will make it even worse.

Also, out of curiosity, how will this link up with pushing items to the DLQ? Is there a mechanism that will automatically push the record of failed requests to the DLQ?

@rsmayda (Contributor, Author) commented Apr 22, 2021:

In reply to @nguyen102:

maxRetry to 1.

In the case of ES throttling we would want to retry, right? Perhaps it is better for the Lambda itself to do retries?

how will this link up with pushing items to the DLQ? Is there a mechanism that will automatically push the record of failed requests to the DLQ?

The EventStream retries up to maxRetry times; if the Lambda still fails past that, a message containing the EventStream batch information is dropped into the DLQ.
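
To illustrate that last point, the OnFailure record that lands in the DLQ carries batch metadata rather than the FHIR resources themselves, roughly along the lines below. The field names are recalled from the AWS docs for DynamoDB stream event source mappings and should be treated as approximate.

```ts
// Approximate shape of an OnFailure DLQ message for a DynamoDB stream event
// source mapping; verify field names against the AWS docs before relying on them.
interface DdbToEsDlqMessage {
    requestContext: {
        condition: string; // e.g. 'RetryAttemptsExhausted'
        approximateInvokeCount: number;
    };
    DDBStreamBatchInfo: {
        shardId: string;
        startSequenceNumber: string;
        endSequenceNumber: string;
        batchSize: number;
        streamArn: string;
    };
}
```

Because only this metadata is stored, redriving means re-reading that sequence-number range from the stream (or re-indexing from DynamoDB), which is presumably part of why a runbook would be needed.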

@rsmayda rsmayda requested a review from carvantes April 23, 2021 13:31
@nguyen102 (Contributor) commented:

In reply to @rsmayda:

In the case of ES throttling we would want to retry, right? Perhaps it is better for the Lambda itself to do retries?

Yep, agreed. We should have the Lambda itself retry, instead of having the stream invoke the Lambda again.

The EventStream retries up to maxRetry times; if the Lambda still fails past that, a message containing the EventStream batch information is dropped into the DLQ.

Ah, got it. I assume the change for the DLQ will be in the deployment package then?

@rsmayda (Contributor, Author) commented Apr 23, 2021:

@nguyen102: Ah, got it. I assume the change for the DLQ will be in the deployment package then?

Yep: awslabs/fhir-works-on-aws-deployment#295

@rsmayda (Contributor, Author) commented Apr 23, 2021:

@nguyen102: Yep, agreed. We should have the Lambda itself retry, instead of having the stream invoke the Lambda again.

I think this would be a separate PR? This PR is focusing on improving logging -- happy to discuss otherwise too

@nguyen102 (Contributor) commented:

@nguyen102: Yep, agreed. We should have the Lambda itself retry, instead of having the stream invoke the Lambda again.

I think this would be a separate PR? This PR is focusing on improving logging -- happy to discuss otherwise too

Yep, agreed. I'm ok with it being in a separate PR. Until then, should we change the maxRetry in the deployment package to 1? From my experience, ES errors usually occur from Lambda trying to open too many socket connections. Having the retries will make it worse.
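
For reference, the "have the Lambda itself retry" idea would look roughly like the sketch below. This is a hedged illustration, not code from this PR or the repository; writeBatchToEs stands in for the real ddbToEs bulk write, and the attempt count and backoff numbers are arbitrary.

```ts
async function writeWithBackoff(
    writeBatchToEs: () => Promise<void>,
    maxAttempts = 3,
): Promise<void> {
    for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
        try {
            await writeBatchToEs();
            return;
        } catch (e) {
            if (attempt === maxAttempts) {
                // Let the final error escape so the event source mapping (and
                // eventually the DLQ) still sees the failure.
                throw e;
            }
            // Exponential backoff between in-Lambda attempts: 400ms, 800ms, ...
            await new Promise((resolve) => setTimeout(resolve, 200 * 2 ** attempt));
        }
    }
}
```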

@carvantes (Contributor) commented Apr 26, 2021:

We should have the Lambda itself retry, instead of having the stream invoke the Lambda again.

IMO we have a correct setup right now: Lambda code has zero retries, and the EventSourceMapping has MaximumRetryAttempts=3. Are you suggesting having the Lambda code do retries? If so, what's the benefit we are pursuing in doing so? It'd make the e2e retry behavior more complex.


Having the retries will make it worse.

Why?

It's very likely that #69 is the cause of the EMFILE errors (creating lots and lots of ES clients is a bad idea). Plus, retries don't increase the number of concurrent requests to ES. Streams MUST be processed in order, so retries on a batch will block the processing of all the other incoming messages until max retries are reached.

@carvantes (Contributor) left a review:

Approving since this is a step in the right direction in terms of error logging.

It may be a good idea to sync up offline about the retry & DLQ setup since there are many comments about it.

@nguyen102 (Contributor) commented:

Streams MUST be processed in order, so retries on a batch will block the processing of all the other incoming messages until max retries are reached.

Oh I see, assuming #69 is the solution to the EMFILE errors, that would alleviate my concern about the retries. Good point about the retries not causing the EMFILE errors to get worse. Without the retries we would still have EMFILE errors as the Lambda processes new batches and once again encounters the error, although the retries do cause the EMFILE error to persist three times longer if we get into this state.

I'm ok with approving this PR and merging in #69. We can gather more data after this and figure out the best step forward.

@rsmayda rsmayda merged commit 5774b34 into mainline Apr 26, 2021
@rsmayda rsmayda deleted the logging branch April 26, 2021 14:05