Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat/add back repair form lock #1179

Merged
merged 3 commits into from
Mar 7, 2024

Conversation

alexanderleegs
Copy link
Contributor

This PR adds back the lock for site repair form. Previously we removed the lock due to the issues we were facing on prod - this was most likely due to the lack of response to forms causing unexpected 4xx errors, in addition to the large number of repos we were repairing at the same time. This PR adds in the response to forms so that we can avoid that issue.

In addition, the time for locking has been set at 15 minutes instead of a scaling amount as discussed offline - this is because repos are being locked and repaired concurrently and will unlock when their respective operation completes without having to wait for the entire set to complete.

@alexanderleegs alexanderleegs requested a review from a team March 1, 2024 01:25
Copy link
Contributor

@kishore03109 kishore03109 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am quite confused about this pr.

Previously we removed the lock due to the issues we were facing on prod - this was most likely due to the lack of response to forms causing unexpected 4xx errors, in addition to the large number of repos we were repairing at the same time.

but there is only one response per form right?

unless we submitted multiple forms, it is unlikely that this caused the high number of 404s right?

In addition, the time for locking has been set at 15 minutes instead of a scaling amount as discussed offline - this is because repos are being locked and repaired concurrently and will unlock when their respective operation completes without having to wait for the entire set to complete.

i remember you mentioned that some repos like nlb took longer than 15 minutes to complete. is this going to be an issue?

Copy link
Contributor Author

I was unable to verify this, but we originally observed that the env health was degraded due to a high number of errors - as we were running this off-peak, it's possible that this was the only request going through, which caused the env health issues.

For long running repos, it might indeed be a problem - i think this is a trade-off to consider! By having a longer lock time, it increases the risk of the user being locked out of their repo for a longer time in the case where the dynamodb query to unlock the repo fails

@kishore03109
Copy link
Contributor

kishore03109 commented Mar 1, 2024

I was unable to verify this, but we originally observed that the env health was degraded due to a high number of errors - as we were running this off-peak, it's possible that this was the only request going through, which caused the env health issues.

may i get a bit more context on why we couldnt verify this ya? did we look through the web.out logs during this time? maybe this might be to the multiple retries option by forms, and can be solved by just returning early?

For long running repos, it might indeed be a problem - i think this is a trade-off to consider! By having a longer lock time, it increases the risk of the user being locked out of their repo for a longer time in the case where the dynamodb query to unlock the repo fails

could we be more intentional about this? ie we take the worst case time + 5 mins?
i am assuming we have the logic to delete the lock once the operation is done, this is in case something goes wrong and it takes a lot longer to resolve this?

also running of ideas here, could it be that the way the form is set up, it struggles to modify the content if there is a lock? my understanding here is that it is a no, since the lock logic is configured at a middleware level, and as thus cannot be affected by the lock logic

@alexanderleegs
Copy link
Contributor Author

may i get a bit more context on why we couldnt verify this ya? did we look through the web.out logs during this time? maybe this might be to the multiple retries option by forms, and can be solved by just returning early?

There were no errors in the logs at the time! This was probably due to our logs only being for application level, whereas the error was mostly due to gateway timeout (due to us not returning a response to forms)

could we be more intentional about this? ie we take the worst case time + 5 mins? i am assuming we have the logic to delete the lock once the operation is done, this is in case something goes wrong and it takes a lot longer to resolve this?

Yup this is in an unhappy state where we fail to release the lock - normally once the operation is done the lock is released. We don't have an accurate time for the worst case scenario at the moment, I thought 15 minutes is a reasonable estimate for a single repo

also running of ideas here, could it be that the way the form is set up, it struggles to modify the content if there is a lock? my understanding here is that it is a no, since the lock logic is configured at a middleware level, and as thus cannot be affected by the lock logic

There's a check for lock status at the start of the process - in the case that it's unable to acquire the lock, the repo is immediately goes to the failure state, which is handled gracefully and is represented in the email sent to users

Tested with repos - they are able to lock/unlock independently so no need for scaling lock time
@alexanderleegs alexanderleegs force-pushed the feat/add-back-repair-form-lock branch from 397654c to 2943859 Compare March 7, 2024 06:55
@alexanderleegs alexanderleegs merged commit 48768e1 into develop Mar 7, 2024
7 of 9 checks passed
@mergify mergify bot deleted the feat/add-back-repair-form-lock branch March 7, 2024 06:55
@alexanderleegs alexanderleegs mentioned this pull request Mar 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants