Feat/add back repair form lock #1179
I am quite confused about this PR.
Previously we removed the lock due to the issues we were facing on prod - this was most likely due to the lack of response to forms causing unexpected 4xx errors, in addition to the large number of repos we were repairing at the same time.
But there is only one response per form, right? Unless we submitted multiple forms, it is unlikely that this caused the high number of 404s, right?
In addition, the time for locking has been set at 15 minutes instead of a scaling amount as discussed offline - this is because repos are being locked and repaired concurrently and will unlock when their respective operation completes without having to wait for the entire set to complete.
I remember you mentioned that some repos, like nlb, took longer than 15 minutes to complete. Is this going to be an issue?
I was unable to verify this, but we originally observed that the env health was degraded due to a high number of errors. As we were running this off-peak, it's possible that this was the only request going through, which caused the env health issues.
For long-running repos, it might indeed be a problem - I think this is a trade-off to consider! A longer lock time increases the risk of the user being locked out of their repo for longer in the case where the DynamoDB query to unlock the repo fails.
May I get a bit more context on why we couldn't verify this ya? Did we look through the web.out logs during this time? Maybe this is due to the multiple-retries option in forms, and can be solved by just returning early?
For long running repos, it might indeed be a problem - i think this is a trade-off to consider! By having a longer lock time, it increases the risk of the user being locked out of their repo for a longer time in the case where the dynamodb query to unlock the repo fails
Could we be more intentional about this? I.e. we take the worst-case time + 5 mins?
Also, running out of ideas here: could it be that, the way the form is set up, it struggles to modify the content if there is a lock? My understanding here is that it is a no, since the lock logic is configured at a middleware level, and as such cannot be affected by the lock logic.
There were no errors in the logs at the time! This was probably because our logs only cover the application level, whereas the error was mostly a gateway timeout (due to us not returning a response to forms).
Yup, this only applies to the unhappy state where we fail to release the lock - normally the lock is released once the operation is done. We don't have an accurate time for the worst-case scenario at the moment; I thought 15 minutes was a reasonable estimate for a single repo.
There's a check for lock status at the start of the process - if it's unable to acquire the lock, the repo immediately goes to the failure state, which is handled gracefully and is reflected in the email sent to users.
This reverts commit a746288.
Tested with repos - they are able to lock/unlock independently so no need for scaling lock time
397654c to 2943859
This PR adds back the lock for the site repair form. Previously we removed the lock due to the issues we were facing on prod - this was most likely due to the lack of a response to forms causing unexpected 4xx errors, in addition to the large number of repos we were repairing at the same time. This PR adds the response to forms so that we can avoid that issue.
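To illustrate the "respond to forms first" idea, here is a minimal, framework-free sketch. It is not the code in this PR: `handleRepairForm`, `Respond`, `Schedule`, and the 202 status code are hypothetical names and choices made for the example, and the scheduler is passed in so that the ordering can be checked without a real server.

```typescript
// Hypothetical types: how the handler replies to the form service and
// how it defers work. Neither name comes from the actual codebase.
type Respond = (statusCode: number) => void;
type Schedule = (task: () => void) => void;

function handleRepairForm(
  repoNames: string[],
  respond: Respond,
  schedule: Schedule,
  repairRepo: (repo: string) => void,
): void {
  // Acknowledge immediately so the form service is not left waiting
  // on long-running repairs (assumed status code: 202 Accepted).
  respond(202);

  // Queue each repo's repair separately; nothing here blocks the response.
  repoNames.forEach((repo) => {
    schedule(() => repairRepo(repo));
  });
}
```

In a Node service, `schedule` could plausibly be `setImmediate` or a job queue; that choice is an assumption and is not something stated in this PR.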
In addition, the time for locking has been set at 15 minutes instead of a scaling amount as discussed offline - this is because repos are being locked and repaired concurrently and will unlock when their respective operation completes without having to wait for the entire set to complete.
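The per-repo locking described above can be sketched roughly as follows. This is an in-memory stand-in under stated assumptions, not the DynamoDB-backed implementation: `LockTable`, `acquireLock`, `releaseLock`, and `repairWithLock` are hypothetical names, the clock is passed in as `now` for clarity, and releasing the lock when the repair throws is an assumption of this sketch.

```typescript
// Fixed lock duration per repo, as described in this PR.
const LOCK_TTL_MS = 15 * 60 * 1000;

// Hypothetical in-memory lock table: repo name -> lock expiry (epoch ms).
type LockTable = Record<string, number>;

function acquireLock(table: LockTable, repo: string, now: number): boolean {
  const expiry = table[repo];
  // A lock that has not yet expired blocks a new acquisition.
  if (expiry !== undefined && expiry > now) return false;
  table[repo] = now + LOCK_TTL_MS;
  return true;
}

function releaseLock(table: LockTable, repo: string): void {
  delete table[repo];
}

type RepairResult = {
  repo: string;
  status: "repaired" | "failed";
  reason?: string;
};

function repairWithLock(
  table: LockTable,
  repo: string,
  now: number,
  repair: () => void,
): RepairResult {
  // Check the lock at the start; a locked repo goes straight to failure.
  if (!acquireLock(table, repo, now)) {
    return { repo, status: "failed", reason: "locked" };
  }
  try {
    repair();
    return { repo, status: "repaired" };
  } catch (err) {
    return { repo, status: "failed", reason: "repair error" };
  } finally {
    // Each repo unlocks as soon as its own operation finishes.
    releaseLock(table, repo);
  }
}
```

Because each repo holds its own entry, one repo finishing (or failing) does not affect the others, and if a release is ever missed the lock lapses on its own after the 15 minutes.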