Before erroring out due to a missing entry in the service map, run service discovery once #511

Open
filmaj opened this issue Dec 17, 2021 · 0 comments

filmaj commented Dec 17, 2021

Relevant discussion in Discord: https://discord.com/channels/880272256100601927/884130225280139394/921455993135697972

This morning I was involved in a production deploy that led to some weird, one-off runtime errors. We work in a rather large arc app that has pretty consistent traffic and takes about 5 minutes to deploy the full CloudFormation stack, and we added a new @event. After deploying to prod, we saw the Lambda for a DynamoDB data stream trigger we have in place start erroring out with 'unknown event '. This is the code that raises the error we were seeing: https://github.com/architect/functions/blob/main/src/events/publish-topic.js#L15-L19

Taking a look at the CloudFormation event logs for the deploy, I noticed this timeline of events:

  1. 2021-12-17 09:20:42 UTC-0500 NewEventTopic CREATE_IN_PROGRESS: the SNS topic begins to be created
  2. 2021-12-17 09:20:43 UTC-0500 NewEventLambda CREATE_IN_PROGRESS: the Lambda for the new event begins to be created
  3. 2021-12-17 09:20:45 UTC-0500 DataStreamLambda UPDATE_IN_PROGRESS: code for the data stream Lambda begins to be updated
  4. 2021-12-17 09:20:52 UTC-0500 NewEventLambda CREATE_COMPLETE: new event Lambda creation complete
  5. 2021-12-17 09:20:53 UTC-0500 NewEventTopic CREATE_COMPLETE: SNS topic creation complete
  6. 2021-12-17 09:20:55 UTC-0500 DataStreamLambda UPDATE_COMPLETE: data stream Lambda update complete; the new code in the data stream Lambda that publishes to the new topic is now live at this point, IIUC?
  7. 2021-12-17 09:20:56 UTC-0500 NewEventTopicParam CREATE_IN_PROGRESS: the SSM parameter that tells @architect/functions’ service map where the new event SNS topic exists begins to be created
  8. 2021-12-17 09:20:59 UTC-0500 NewEventTopicParam CREATE_COMPLETE: the SSM parameter informing the service map is now ready

So, if my snooping around is correct, there is a window of about 4 seconds between steps 6 and 8: the new data stream Lambda code is already live, but the SSM parameter that tells arc/functions where the new event's SNS topic exists is not ready yet. Any execution of the data stream Lambda during that window would not know where to look for the new event, and would trigger the ReferenceError inside arc/functions.

Two solutions considered:

  1. Can we front-load creation of the SSM parameters during a deploy? Probably only via DependsOn, and the only way I can see this working is if we mark every Lambda as depending on every SSM parameter. Probably overkill?
  2. Instead of erroring out immediately in arc/functions when something can't be found in the service map, retrieve the service map from SSM Parameter Store one more time. So rather than throwing a ReferenceError right away, we run the service discovery routine once more, and only error out if the entry still can't be found.
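
To make option 2 concrete, here is a rough sketch of the retry-once idea. This is not the actual @architect/functions code: `createLookup`, `discover`, `lookup`, and the shape of the service map are all hypothetical names made up for illustration, and `discover` just stands in for whatever routine reads the service map from SSM Parameter Store.

```javascript
// Hypothetical helper, NOT the real @architect/functions API.
// `discover` is an assumed async function that returns the service map,
// e.g. { events: { 'some-event': '<topic ARN>' } }.
function createLookup(discover) {
  let cache = null; // service map cached across warm invocations

  async function load(force) {
    if (!cache || force) cache = await discover();
    return cache;
  }

  return async function lookup(type, name) {
    let services = await load(false);
    let found = services[type] && services[type][name];
    if (found) return found;

    // Miss: the SSM parameter may have been created after this container
    // cached the map, so run service discovery exactly one more time.
    services = await load(true);
    found = services[type] && services[type][name];
    if (found) return found;

    // Still missing after the retry: now it is a genuine error.
    throw new ReferenceError(`${name} not found in ${type} service map`);
  };
}

// Usage with a fake discovery routine that only "sees" the new event on
// its second call, imitating the deploy window described above.
async function demo() {
  let calls = 0;
  const fakeDiscover = async () => {
    calls += 1;
    return calls === 1
      ? { events: {} }
      : { events: { 'new-event': 'arn:aws:sns:example' } };
  };
  const lookup = createLookup(fakeDiscover);
  const arn = await lookup('events', 'new-event');
  return { arn, calls };
}
```

The trade-off is one extra round trip to Parameter Store on a miss, which only happens on the error path, so the happy path stays as fast as it is today. A name that really doesn't exist still fails, just one discovery call later.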