Rework RUCSS retry strategy #6222

MathieuLamiot · 2023-10-13T11:03:25Z

Context
The current retry strategy is as follows in check_job_status, supposed to be triggered every minute:

Retry up to 3 times in case we don't have a 200 Response with a result.
In case of 408, we create submit a new job, and replace the job ID in the DB. Number of retries does not reset.

This has several issues:

We store jobs longer than 3 minutes in the SaaS and we often don't take advantage of this.
There might be some queue on the SaaS, delaying the job for more than 3 minutes, at peak hours.
If a 408 happens on the 2nd try, the new job won't have many chances.

Expected behavior
Rework the behavior in case the SaaS does not return a job result so that:

Have a different retry behavior if the job has been found but has no result yet (not processed yet) or if it was not found on the SaaS.
DEFAULT BEHAVIOR: In case the job has been found but with no results (400) or if the request to the server times out, the plugin must retry for longer than 3 minutes after the job has been sent. This timing must be filterable, but a good target would be up to 1 hour with exponential back-off.
In case the job has not been found on the SaaS (404), the job must be set to Fail without further retries.
In case of 408, a new job must be created. The retry strategy must be reset.
In case of 422 (job result invalid format), the job must be set to Fail without further retries.
In case of 500 or 401, the job must be set to Fail without further retries.
In any other case, the job must follow the same retry strategy as for a 400 (this is the default behavior).

In case of timeout during the request from the plugin to the server, we follow the same strategy as with a 400. A dedicated error code and error message should be handled on the plugin side (maybe 504 Server Timeout ?)

Suggested solutions
For point 1, we should leverage the details provided here.
For point 2, the ideal approach would be an exponential backoff: Try after 1 minute, then 2 minutes, then 5, then 10, then 10 for instance. Another quick&dirty approach could be to increase the number of retries. Both should be groomed to decide.

EDIT
After discussions, we will go with the exponential strategy implementation and distinct error codes.

Acceptance Criteria

Non-regression in the case of successful job (code 200): the row must be set to successful.
Submit a new job and force the job fetch to return a 400: the plugin must retry with exponential back-off (increasing delays between fetch tries), increment retries every time, log the error in error messages, and keep trying with a pending/in-progress state for 1 hour. After 1 hour since the job submission, the row must be set to failed.
Submit a new job and force the job fetch to timeout: the plugin must retry with exponential back-off (increasing delays between fetch tries), increment retries every time, log the error in error messages, and keep trying with a pending/in-progress state for 1 hour. After 1 hour since the job submission, the row must be set to failed.
Submit a new job and force the job fetch to return a 404: the plugin must fetch the results once and put the job to fail ultimately.
Submit a new job and force the job fetch to return a 408: the row must be set to 'to-submit' so a new job is sent.
Submit a new job and force the job fetch to return a 422: the plugin must fetch the results once and put the job to fail ultimately.
Submit a new job a force the first job fetch to return a 400 or a timeout, then the next one to return a 200 with correct results. The plugin must therefore retry once and put the row as successful on the 2nd try.
NOTE: There is a known issue on the SaaS side for the Unauthorized '401') case. It will be considered like a 400 by the plugin. This use-case is to be discarded for now when testing the plugin.

The text was updated successfully, but these errors were encountered:

Miraeld · 2023-10-30T10:56:30Z

Scope a solution

To solve this issue, we are planning to overhaul the retry system, and it could be interesting to consider adopting a Strategy design pattern approach to achieve this. Here's the proposed plan:

Strategy Design Pattern: We'll utilize the Strategy design pattern, which you can learn more about here. This pattern allows us to define various strategies for handling different API responses.
Namespace Organization: All the code related to these strategies will be placed within a dedicated Strategy namespace for better organization.
Interface Definition: An interface will be created to establish the contract for all these strategies. This interface serves as a blueprint for all future strategy implementations.
Strategy Classes: We will develop multiple classes, each implementing the interface, to represent different strategies for handling API responses. These classes will define what actions to take based on the API result. Which means we will be able to a different strategy depending on the job_result from the API.
Controller Integration: Within our application's controller (the "RUCSS Controller"), we will integrate these strategy classes to make decisions on how to handle the API response. Depending on the result received from the API, we will choose the appropriate strategy to execute.
Future-Proofing: By adopting the Strategy pattern, we are making our system flexible and ready for future changes. It will be easy to add new strategies or modify existing ones as needed without major code alterations.

While Mathieu says:

In case the job has been found but with no results, the plugin must retry for longer than 3 minutes after the job has been sent. This timing must be filterable, but a good target would be up to 30 minutes.
In case the job has not been found on the SaaS, the plugin must retry once, but not more.
In case of 408, creating a new job must also reset the retry strategy.

We can see here different strategies that will have to be implemented.

For point 2, the ideal approach would be an exponential backoff: Try after 1 minute, then 2 minutes, then 5, then 10, then 10 for instance. Another quick&dirty approach could be to increase the number of retries. Both should be groomed to decide.

I think it would be nice to go with the exponential approach. For this, we can create a new column in the database called not_proceed_before which would be a timestamp.
Then in the strategy, we could check the column retries to get the information to which level we are, and calculate a timestamp stored in not_proceed_before.
Then in the function that process the in-progress jobs here, we can add a check to make sure the not_proceed_before timestamp of the pending job is < now() .

Development steps:

[] Implement the Strategy Interface,
[] Implement the Strategies
[] Use strategies within the RUCSS Controller

Effort estimation:

M

Is a refactor needed in that part of the codebase?

Yes, as explain in the solution

CrochetFeve0251 · 2023-10-30T12:00:10Z

@Miraeld could you provide an estimation of the effort for the exponential waiting time?

CrochetFeve0251 · 2023-10-30T12:41:29Z

I would like to complete the grooming: to switch between different strategies we gonna use a StrategyFactory which gonna take in parameter the response from the API and which return the right strategy based on it.
We could then set the strategy on the context.

Otherwise that seems good to me.

Miraeld · 2023-10-30T13:03:23Z

@CrochetFeve0251 , I would say a S for me, but an XS for you and probably others :)

MathieuLamiot · 2023-11-01T10:48:53Z

Error code refinement discussed here: https://wp-media.slack.com/archives/CUT7FLHF1/p1698749275154029

MathieuLamiot · 2023-11-02T11:47:50Z

I updated the ticket based on the Slack discussion, and added Acceptance Criteria as well to better frame the behavior.

MathieuLamiot · 2023-11-02T18:29:13Z

Another quick update to simplify the 404 and 422 behavior (fail immediately).
I documented a recap of the expected behavior on Notion: https://www.notion.so/wpmedia/TDD-RUCSS-SaaS-f69dfa7c2a9f4687a7fe7c2b3920aaa0?pvs=4#b50c86a037754405b5c18ed58704832c

Co-authored-by: Vasilis Manthos <[email protected]> Co-authored-by: COQUARD Cyrille <[email protected]>

piotrbak added type: enhancement Improvements that slightly enhance existing functionality and are fast to implement priority: high Issues which should be resolved as quickly as possible module: remove unused css needs: grooming labels Oct 23, 2023

Miraeld self-assigned this Oct 30, 2023

Miraeld removed their assignment Oct 30, 2023

Miraeld self-assigned this Oct 31, 2023

Miraeld added effort: [M] 3-5 days of estimated development time and removed needs: grooming labels Oct 31, 2023

Miraeld mentioned this issue Oct 31, 2023

Closes #6222: Revamp retry process rucss #6256

Merged

8 tasks

MathieuLamiot mentioned this issue Nov 17, 2023

Prevent infinite loop for RUCSS jobs returning 408 consistently. #6275

Closed

piotrbak added this to the 3.15.5 milestone Nov 20, 2023

vmanthos closed this as completed in #6256 Nov 23, 2023

github-merge-queue bot pushed a commit that referenced this issue Nov 23, 2023

Closes #6222: Revamp retry process rucss (#6256)

da05112

Co-authored-by: Vasilis Manthos <[email protected]> Co-authored-by: COQUARD Cyrille <[email protected]>

piotrbak added the needs: documentation Issues which need to create or update a documentation label Nov 23, 2023

This was referenced Nov 24, 2023

Error code when RUCSS API can't be reached #6289

Closed

Refine the RUCSS retry strategy for edge-cases of the backoff retries #6290

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rework RUCSS retry strategy #6222

Rework RUCSS retry strategy #6222

MathieuLamiot commented Oct 13, 2023 •

edited

Loading

Miraeld commented Oct 30, 2023

CrochetFeve0251 commented Oct 30, 2023

CrochetFeve0251 commented Oct 30, 2023 •

edited

Loading

Miraeld commented Oct 30, 2023

MathieuLamiot commented Nov 1, 2023

MathieuLamiot commented Nov 2, 2023

MathieuLamiot commented Nov 2, 2023

Rework RUCSS retry strategy #6222

Rework RUCSS retry strategy #6222

Comments

MathieuLamiot commented Oct 13, 2023 • edited Loading

Miraeld commented Oct 30, 2023

Scope a solution

Development steps:

Effort estimation:

Is a refactor needed in that part of the codebase?

CrochetFeve0251 commented Oct 30, 2023

CrochetFeve0251 commented Oct 30, 2023 • edited Loading

Miraeld commented Oct 30, 2023

MathieuLamiot commented Nov 1, 2023

MathieuLamiot commented Nov 2, 2023

MathieuLamiot commented Nov 2, 2023

MathieuLamiot commented Oct 13, 2023 •

edited

Loading

CrochetFeve0251 commented Oct 30, 2023 •

edited

Loading