Feat/add back repair form lock #1179

alexanderleegs · 2024-03-01T01:25:35Z

This PR adds back the lock for site repair form. Previously we removed the lock due to the issues we were facing on prod - this was most likely due to the lack of response to forms causing unexpected 4xx errors, in addition to the large number of repos we were repairing at the same time. This PR adds in the response to forms so that we can avoid that issue.

In addition, the time for locking has been set at 15 minutes instead of a scaling amount as discussed offline - this is because repos are being locked and repaired concurrently and will unlock when their respective operation completes without having to wait for the entire set to complete.

kishore03109

i am quite confused about this pr.

Previously we removed the lock due to the issues we were facing on prod - this was most likely due to the lack of response to forms causing unexpected 4xx errors, in addition to the large number of repos we were repairing at the same time.

but there is only one response per form right?

unless we submitted multiple forms, it is unlikely that this caused the high number of 404s right?

In addition, the time for locking has been set at 15 minutes instead of a scaling amount as discussed offline - this is because repos are being locked and repaired concurrently and will unlock when their respective operation completes without having to wait for the entire set to complete.

i remember you mentioned that some repos like nlb took longer than 15 minutes to complete. is this going to be an issue?

alexanderleegs · 2024-03-01T01:47:16Z

I was unable to verify this, but we originally observed that the env health was degraded due to a high number of errors - as we were running this off-peak, it's possible that this was the only request going through, which caused the env health issues.

For long running repos, it might indeed be a problem - i think this is a trade-off to consider! By having a longer lock time, it increases the risk of the user being locked out of their repo for a longer time in the case where the dynamodb query to unlock the repo fails

kishore03109 · 2024-03-01T01:53:23Z

I was unable to verify this, but we originally observed that the env health was degraded due to a high number of errors - as we were running this off-peak, it's possible that this was the only request going through, which caused the env health issues.

may i get a bit more context on why we couldnt verify this ya? did we look through the web.out logs during this time? maybe this might be to the multiple retries option by forms, and can be solved by just returning early?

For long running repos, it might indeed be a problem - i think this is a trade-off to consider! By having a longer lock time, it increases the risk of the user being locked out of their repo for a longer time in the case where the dynamodb query to unlock the repo fails

could we be more intentional about this? ie we take the worst case time + 5 mins?
i am assuming we have the logic to delete the lock once the operation is done, this is in case something goes wrong and it takes a lot longer to resolve this?

also running of ideas here, could it be that the way the form is set up, it struggles to modify the content if there is a lock? my understanding here is that it is a no, since the lock logic is configured at a middleware level, and as thus cannot be affected by the lock logic

alexanderleegs · 2024-03-01T05:44:53Z

may i get a bit more context on why we couldnt verify this ya? did we look through the web.out logs during this time? maybe this might be to the multiple retries option by forms, and can be solved by just returning early?

There were no errors in the logs at the time! This was probably due to our logs only being for application level, whereas the error was mostly due to gateway timeout (due to us not returning a response to forms)

could we be more intentional about this? ie we take the worst case time + 5 mins? i am assuming we have the logic to delete the lock once the operation is done, this is in case something goes wrong and it takes a lot longer to resolve this?

Yup this is in an unhappy state where we fail to release the lock - normally once the operation is done the lock is released. We don't have an accurate time for the worst case scenario at the moment, I thought 15 minutes is a reasonable estimate for a single repo

also running of ideas here, could it be that the way the form is set up, it struggles to modify the content if there is a lock? my understanding here is that it is a no, since the lock logic is configured at a middleware level, and as thus cannot be affected by the lock logic

There's a check for lock status at the start of the process - in the case that it's unable to acquire the lock, the repo is immediately goes to the failure state, which is handled gracefully and is represented in the email sent to users

This reverts commit a746288.

Tested with repos - they are able to lock/unlock independently so no need for scaling lock time

alexanderleegs requested a review from a team March 1, 2024 01:25

kishore03109 reviewed Mar 1, 2024

View reviewed changes

alexanderleegs requested a review from kishore03109 March 4, 2024 07:55

kishore03109 approved these changes Mar 6, 2024

View reviewed changes

alexanderleegs added 3 commits March 7, 2024 14:54

Revert "Revert "Chore/lock repos when repairing (#1149)""

cb49bce

This reverts commit a746288.

fix: return 200 to forms and adjust lock time

3ec00a2

chore: revert to 15 minutes

2943859

Tested with repos - they are able to lock/unlock independently so no need for scaling lock time

alexanderleegs force-pushed the feat/add-back-repair-form-lock branch from 397654c to 2943859 Compare March 7, 2024 06:55

alexanderleegs merged commit 48768e1 into develop Mar 7, 2024
7 of 9 checks passed

mergify bot deleted the feat/add-back-repair-form-lock branch March 7, 2024 06:55

alexanderleegs mentioned this pull request Mar 7, 2024

0.69.0 #1190

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feat/add back repair form lock #1179

Feat/add back repair form lock #1179

alexanderleegs commented Mar 1, 2024

kishore03109 left a comment •

edited

Loading

alexanderleegs commented Mar 1, 2024

kishore03109 commented Mar 1, 2024 •

edited

Loading

alexanderleegs commented Mar 1, 2024

Feat/add back repair form lock #1179

Feat/add back repair form lock #1179

Conversation

alexanderleegs commented Mar 1, 2024

kishore03109 left a comment • edited Loading

Choose a reason for hiding this comment

alexanderleegs commented Mar 1, 2024

kishore03109 commented Mar 1, 2024 • edited Loading

alexanderleegs commented Mar 1, 2024

kishore03109 left a comment •

edited

Loading

kishore03109 commented Mar 1, 2024 •

edited

Loading