Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework RUCSS retry strategy #6222

Closed
MathieuLamiot opened this issue Oct 13, 2023 · 7 comments · Fixed by #6256
Closed

Rework RUCSS retry strategy #6222

MathieuLamiot opened this issue Oct 13, 2023 · 7 comments · Fixed by #6256
Assignees
Labels
effort: [M] 3-5 days of estimated development time module: remove unused css needs: documentation Issues which need to create or update a documentation priority: high Issues which should be resolved as quickly as possible type: enhancement Improvements that slightly enhance existing functionality and are fast to implement
Milestone

Comments

@MathieuLamiot
Copy link
Contributor

MathieuLamiot commented Oct 13, 2023

Context
The current retry strategy is as follows in check_job_status, supposed to be triggered every minute:

  • Retry up to 3 times in case we don't have a 200 Response with a result.
  • In case of 408, we create submit a new job, and replace the job ID in the DB. Number of retries does not reset.

This has several issues:

  • We store jobs longer than 3 minutes in the SaaS and we often don't take advantage of this.
  • There might be some queue on the SaaS, delaying the job for more than 3 minutes, at peak hours.
  • If a 408 happens on the 2nd try, the new job won't have many chances.

Expected behavior
Rework the behavior in case the SaaS does not return a job result so that:

  1. Have a different retry behavior if the job has been found but has no result yet (not processed yet) or if it was not found on the SaaS.
  2. DEFAULT BEHAVIOR: In case the job has been found but with no results (400) or if the request to the server times out, the plugin must retry for longer than 3 minutes after the job has been sent. This timing must be filterable, but a good target would be up to 1 hour with exponential back-off.
  3. In case the job has not been found on the SaaS (404), the job must be set to Fail without further retries.
  4. In case of 408, a new job must be created. The retry strategy must be reset.
  5. In case of 422 (job result invalid format), the job must be set to Fail without further retries.
  6. In case of 500 or 401, the job must be set to Fail without further retries.
  7. In any other case, the job must follow the same retry strategy as for a 400 (this is the default behavior).

In case of timeout during the request from the plugin to the server, we follow the same strategy as with a 400. A dedicated error code and error message should be handled on the plugin side (maybe 504 Server Timeout ?)

Suggested solutions
For point 1, we should leverage the details provided here.
For point 2, the ideal approach would be an exponential backoff: Try after 1 minute, then 2 minutes, then 5, then 10, then 10 for instance. Another quick&dirty approach could be to increase the number of retries. Both should be groomed to decide.

EDIT
After discussions, we will go with the exponential strategy implementation and distinct error codes.

Acceptance Criteria

  • Non-regression in the case of successful job (code 200): the row must be set to successful.
  • Submit a new job and force the job fetch to return a 400: the plugin must retry with exponential back-off (increasing delays between fetch tries), increment retries every time, log the error in error messages, and keep trying with a pending/in-progress state for 1 hour. After 1 hour since the job submission, the row must be set to failed.
  • Submit a new job and force the job fetch to timeout: the plugin must retry with exponential back-off (increasing delays between fetch tries), increment retries every time, log the error in error messages, and keep trying with a pending/in-progress state for 1 hour. After 1 hour since the job submission, the row must be set to failed.
  • Submit a new job and force the job fetch to return a 404: the plugin must fetch the results once and put the job to fail ultimately.
  • Submit a new job and force the job fetch to return a 408: the row must be set to 'to-submit' so a new job is sent.
  • Submit a new job and force the job fetch to return a 422: the plugin must fetch the results once and put the job to fail ultimately.
  • Submit a new job a force the first job fetch to return a 400 or a timeout, then the next one to return a 200 with correct results. The plugin must therefore retry once and put the row as successful on the 2nd try.
    NOTE: There is a known issue on the SaaS side for the Unauthorized '401') case. It will be considered like a 400 by the plugin. This use-case is to be discarded for now when testing the plugin.
@piotrbak piotrbak added type: enhancement Improvements that slightly enhance existing functionality and are fast to implement priority: high Issues which should be resolved as quickly as possible module: remove unused css needs: grooming labels Oct 23, 2023
@Miraeld Miraeld self-assigned this Oct 30, 2023
@Miraeld
Copy link
Contributor

Miraeld commented Oct 30, 2023

Scope a solution

To solve this issue, we are planning to overhaul the retry system, and it could be interesting to consider adopting a Strategy design pattern approach to achieve this. Here's the proposed plan:

  1. Strategy Design Pattern: We'll utilize the Strategy design pattern, which you can learn more about here. This pattern allows us to define various strategies for handling different API responses.

  2. Namespace Organization: All the code related to these strategies will be placed within a dedicated Strategy namespace for better organization.

  3. Interface Definition: An interface will be created to establish the contract for all these strategies. This interface serves as a blueprint for all future strategy implementations.

  4. Strategy Classes: We will develop multiple classes, each implementing the interface, to represent different strategies for handling API responses. These classes will define what actions to take based on the API result. Which means we will be able to a different strategy depending on the job_result from the API.

  5. Controller Integration: Within our application's controller (the "RUCSS Controller"), we will integrate these strategy classes to make decisions on how to handle the API response. Depending on the result received from the API, we will choose the appropriate strategy to execute.

  6. Future-Proofing: By adopting the Strategy pattern, we are making our system flexible and ready for future changes. It will be easy to add new strategies or modify existing ones as needed without major code alterations.

While Mathieu says:

In case the job has been found but with no results, the plugin must retry for longer than 3 minutes after the job has been sent. This timing must be filterable, but a good target would be up to 30 minutes.
In case the job has not been found on the SaaS, the plugin must retry once, but not more.
In case of 408, creating a new job must also reset the retry strategy.

We can see here different strategies that will have to be implemented.

For point 2, the ideal approach would be an exponential backoff: Try after 1 minute, then 2 minutes, then 5, then 10, then 10 for instance. Another quick&dirty approach could be to increase the number of retries. Both should be groomed to decide.

I think it would be nice to go with the exponential approach. For this, we can create a new column in the database called not_proceed_before which would be a timestamp.
Then in the strategy, we could check the column retries to get the information to which level we are, and calculate a timestamp stored in not_proceed_before.
Then in the function that process the in-progress jobs here, we can add a check to make sure the not_proceed_before timestamp of the pending job is < now() .

Development steps:

  • [] Implement the Strategy Interface,
  • [] Implement the Strategies
  • [] Use strategies within the RUCSS Controller

Effort estimation:

M

Is a refactor needed in that part of the codebase?

Yes, as explain in the solution

@Miraeld Miraeld removed their assignment Oct 30, 2023
@CrochetFeve0251
Copy link
Contributor

@Miraeld could you provide an estimation of the effort for the exponential waiting time?

@CrochetFeve0251
Copy link
Contributor

CrochetFeve0251 commented Oct 30, 2023

I would like to complete the grooming: to switch between different strategies we gonna use a StrategyFactory which gonna take in parameter the response from the API and which return the right strategy based on it.
We could then set the strategy on the context.

Otherwise that seems good to me.

@Miraeld
Copy link
Contributor

Miraeld commented Oct 30, 2023

@CrochetFeve0251 , I would say a S for me, but an XS for you and probably others :)

@Miraeld Miraeld self-assigned this Oct 31, 2023
@Miraeld Miraeld added effort: [M] 3-5 days of estimated development time and removed needs: grooming labels Oct 31, 2023
@MathieuLamiot
Copy link
Contributor Author

Error code refinement discussed here: https://wp-media.slack.com/archives/CUT7FLHF1/p1698749275154029

@MathieuLamiot
Copy link
Contributor Author

I updated the ticket based on the Slack discussion, and added Acceptance Criteria as well to better frame the behavior.

@MathieuLamiot
Copy link
Contributor Author

Another quick update to simplify the 404 and 422 behavior (fail immediately).
I documented a recap of the expected behavior on Notion: https://www.notion.so/wpmedia/TDD-RUCSS-SaaS-f69dfa7c2a9f4687a7fe7c2b3920aaa0?pvs=4#b50c86a037754405b5c18ed58704832c

@piotrbak piotrbak added this to the 3.15.5 milestone Nov 20, 2023
github-merge-queue bot pushed a commit that referenced this issue Nov 23, 2023
Co-authored-by: Vasilis Manthos <[email protected]>
Co-authored-by: COQUARD Cyrille <[email protected]>
@piotrbak piotrbak added the needs: documentation Issues which need to create or update a documentation label Nov 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
effort: [M] 3-5 days of estimated development time module: remove unused css needs: documentation Issues which need to create or update a documentation priority: high Issues which should be resolved as quickly as possible type: enhancement Improvements that slightly enhance existing functionality and are fast to implement
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants