feat(webhooks): implement automatic retries for failed webhook deliveries using scheduler #3842

SanchithHegde · 2024-02-27T08:30:01Z

Type of Change

Description

This PR adds support for automatically retrying failed outgoing webhook deliveries, with the aid of the scheduler (process tracker).

Behavior (Initial Delivery)

The behavior when an outgoing webhook is being sent is:

During the initial webhook delivery attempt, a process tracker task is added before sending the webhook, since we launch the webhook delivery task on a background thread that we have no control over. The scheduled time for the task is determined from configuration (persisted in database/Redis), which is explained later.
Next, we trigger the webhook to the merchant and an analytics event is raised, as before.
1. In case the initial delivery attempt succeeds, the process tracker task is marked completed. The end.
2. In case the initial delivery attempt fails, no further action is taken; the webhook delivery will be attempted again by the process tracker at the scheduled time.
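The ordering above (record the task first, then send) can be illustrated with a minimal sketch. `TaskState`, `initial_delivery`, and the `deliver` closure are hypothetical names used only for illustration and do not appear in the PR; the retry delay is passed in as a plain number of seconds on the assumption that it comes from the retry configuration.

```rust
/// State of the (hypothetical) process tracker task after the initial attempt.
#[derive(Debug, PartialEq, Eq)]
pub enum TaskState {
    /// Task is waiting for the scheduler to pick it up after this many seconds.
    Pending { retry_after_seconds: i32 },
    /// Initial delivery succeeded, so no retry is needed.
    Completed,
}

/// Runs the initial delivery flow: record the task first, then send the webhook.
/// `deliver` returns `true` when the merchant endpoint accepted the webhook.
pub fn initial_delivery(retry_after_seconds: i32, deliver: impl FnOnce() -> bool) -> TaskState {
    // The task is added before sending, so a retry is already scheduled even if
    // the background delivery never reports back.
    let task = TaskState::Pending { retry_after_seconds };

    if deliver() {
        // Case 1: success, the task is marked completed.
        TaskState::Completed
    } else {
        // Case 2: failure, nothing else to do; the scheduler retries later.
        task
    }
}
```

Because the task exists before the webhook is sent, a failed or lost initial attempt needs no extra bookkeeping in this sketch: the pending task is simply left for the scheduler.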

Behavior (Automatic Retry)

During the automatic retry attempt(s) from process tracker, the current information about the resource (about which the webhook is being sent) is fetched from the database and will be included in the webhook payload.
If the status of the resource is different than when the event for the webhook was created (the resource has now transitioned to another status), then the process tracker task is aborted with business status indicating the status mismatch.
When triggering the webhook to the merchant:
1. If the retry attempt succeeds, the process tracker task is marked completed.
2. If the retry attempt fails, the process tracker task is scheduled again for retry at the scheduled time determined from configuration.
3. If the number of retries exceeds the maximum number of retries, then the process tracker task is marked completed with business status indicating that retries exceeded.
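The retry rules above amount to a small decision procedure. The sketch below is one way to express it, under the assumption that the caller already knows the resource status at event creation, the current status, and the next retry delay (if any). `RetryOutcome` and `decide_retry_outcome` are illustrative names, not the PR's actual types.

```rust
/// Outcome of one automatic retry attempt, mirroring the cases described above.
#[derive(Debug, PartialEq, Eq)]
pub enum RetryOutcome {
    /// The resource status changed since the event was created; the task is aborted.
    AbortedStatusMismatch,
    /// Delivery succeeded; the task is marked completed.
    Completed,
    /// Delivery failed; the task is scheduled again after this many seconds.
    RescheduledAfter(i32),
    /// Delivery failed and no retries remain.
    RetriesExceeded,
}

/// Decides what happens to the task for a single retry attempt.
/// `deliver` is only called when the resource status still matches.
/// `next_delay_seconds` is `None` once the configured retries are used up.
pub fn decide_retry_outcome(
    status_at_event_creation: &str,
    current_status: &str,
    deliver: impl FnOnce() -> bool,
    next_delay_seconds: Option<i32>,
) -> RetryOutcome {
    // A status mismatch short-circuits before any webhook is sent.
    if status_at_event_creation != current_status {
        return RetryOutcome::AbortedStatusMismatch;
    }
    if deliver() {
        return RetryOutcome::Completed;
    }
    match next_delay_seconds {
        Some(delay) => RetryOutcome::RescheduledAfter(delay),
        None => RetryOutcome::RetriesExceeded,
    }
}
```

Taking `deliver` as a closure keeps the status check ahead of the network call, which matches the order in which the steps are described.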

Configuring Retry Intervals

The retry intervals for webhook deliveries are determined by runtime configuration (persisted in the database/Redis), using the key pt_mapping_outgoing_webhooks. In case the configuration value is not available in storage, a default configuration value is used, which is hardcoded in the application:

hyperswitch/crates/scheduler/src/consumer/types/process_data.rs

Lines 69 to 89 in 1d3bf5c

    
    default_mapping: RetryMapping {
        // 1st attempt happens after 1 minute
        start_after: 60,
        frequency: vec![
            // 2nd and 3rd attempts happen at intervals of 5 minutes each
            60 * 5,
            // 4th, 5th, 6th, 7th and 8th attempts happen at intervals of 10 minutes each
            60 * 10,
            // 9th, 10th, 11th, 12th and 13th attempts happen at intervals of 1 hour each
            60 * 60,
            // 14th, 15th and 16th attempts happen at intervals of 6 hours each
            60 * 60 * 6,
        ],
        count: vec![
            2, // 2nd and 3rd attempts
            5, // 4th, 5th, 6th, 7th and 8th attempts
            5, // 9th, 10th, 11th, 12th and 13th attempts
            3, // 14th, 15th and 16th attempts
        ],
    },
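To make the relationship between `start_after`, `frequency` and `count` concrete, here is a minimal sketch of a lookup that follows the comments in the snippet above. It is not the scheduler's actual `get_delay()` implementation: the `RetryMapping` struct is a simplified stand-in with the same field names, `delay_before_attempt` is a hypothetical helper, and the 1-based attempt numbering is an assumption taken from those comments.

```rust
/// Simplified stand-in for the retry mapping shown above; all values are in seconds.
pub struct RetryMapping {
    pub start_after: i32,
    pub frequency: Vec<i32>,
    pub count: Vec<i32>,
}

/// Returns the delay (in seconds) before the given 1-based attempt,
/// or `None` when the attempt number is outside the configured schedule.
pub fn delay_before_attempt(mapping: &RetryMapping, attempt: i32) -> Option<i32> {
    if attempt < 1 {
        return None;
    }
    // The 1st attempt is governed by `start_after` alone.
    if attempt == 1 {
        return Some(mapping.start_after);
    }
    // Each (frequency, count) pair covers the next `count` attempts.
    let mut last_attempt_in_bucket = 1;
    for (&frequency, &count) in mapping.frequency.iter().zip(mapping.count.iter()) {
        last_attempt_in_bucket += count;
        if attempt <= last_attempt_in_bucket {
            return Some(frequency);
        }
    }
    // Past the last bucket: no further attempts are scheduled.
    None
}
```

With the default values, this sketch yields 60 seconds before the 1st attempt, 300 seconds before the 2nd and 3rd, and no delay at all for a 17th attempt, since the schedule ends after the 16th.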

If the default configuration needs to be overridden, this can be done using the configuration create (or update) endpoint:

curl --location 'http://localhost:8080/configs/' \
  --header 'Content-Type: application/json' \
  --header 'Accept: application/json' \
  --header 'api-key: <ADMIN_API_KEY>' \
  --data '{
    "key": "pt_mapping_outgoing_webhooks",
    "value": "{\"default_mapping\":{\"start_after\":60,\"frequency\":[15,30],\"count\":[2,3]},\"custom_merchant_mapping\":{}}"
  }'

This would override the configuration for all merchants. Note that frequency and start_after are specified in seconds. To override the configuration only for a specific merchant, add an object with the same shape as default_mapping to the custom_merchant_mapping field, keyed by the merchant ID:

{
  "default_mapping": {
    "start_after": 60,
    "frequency": [15, 30],
    "count": [2, 3]
  },
  "custom_merchant_mapping": {
    "merchant_id1": {
      "start_after": 30,
      "frequency": [300],
      "count": [2]
    }
  }
}
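The per-merchant override can be pictured as a map lookup with a fallback. The sketch below assumes that merchants without an entry in `custom_merchant_mapping` use `default_mapping`; the struct and method names (`OutgoingWebhookRetryConfig`, `mapping_for`) are hypothetical and only the field names come from the JSON above.

```rust
use std::collections::HashMap;

/// Retry schedule for one scope; all values are in seconds.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RetryMapping {
    pub start_after: i32,
    pub frequency: Vec<i32>,
    pub count: Vec<i32>,
}

/// In-memory shape of the `pt_mapping_outgoing_webhooks` value (illustrative only).
pub struct OutgoingWebhookRetryConfig {
    pub default_mapping: RetryMapping,
    /// Overrides keyed by merchant ID.
    pub custom_merchant_mapping: HashMap<String, RetryMapping>,
}

impl OutgoingWebhookRetryConfig {
    /// Returns the merchant-specific mapping if one exists, else the default.
    pub fn mapping_for(&self, merchant_id: &str) -> &RetryMapping {
        self.custom_merchant_mapping
            .get(merchant_id)
            .unwrap_or(&self.default_mapping)
    }
}
```

For the JSON above, `mapping_for("merchant_id1")` would return the mapping with `start_after: 30`, while any other merchant ID would resolve to the default mapping with `start_after: 60`.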

Motivation and Context

Closes #217.

How did you test it?

As of now, outgoing webhooks are supported for payments, refunds, disputes and mandates. I've extensively tested payments outgoing webhooks for the different cases, and performed sanity testing on disputes and refunds webhooks to verify that they are retried in case the initial delivery attempt fails. Incoming mandates webhooks are integrated for the GoCardless connector, but the integration is broken and I couldn't test mandates webhook retries.

As for simulating failed webhook deliveries, the merchant webhook URL would have to be configured to a URL which does not accept POST requests. After a couple of failed retries, the URL can be updated to a valid URL to try out the case where the retried delivery attempt succeeds.

Since the hardcoded default retry configuration spans multiple hours, I configured the application to use shorter intervals:

curl --location 'http://localhost:8080/configs/' \
  --header 'Content-Type: application/json' \
  --header 'Accept: application/json' \
  --header 'api-key: <ADMIN_API_KEY>' \
  --data '{
    "key": "pt_mapping_outgoing_webhooks",
    "value": "{\"default_mapping\":{\"start_after\":60,\"frequency\":[15,30],\"count\":[2,3]},\"custom_merchant_mapping\":{}}"
  }'

The process_tracker table can be queried for the relevant records using this query:

SELECT * FROM process_tracker WHERE runner = 'OUTGOING_WEBHOOK_RETRY_WORKFLOW' ORDER BY schedule_time DESC;

If the initial delivery attempt is successful, the business status of the process tracker entry is set to INITIAL_DELIVERY_ATTEMPT_SUCCESSFUL.
If one of the retried delivery attempts is successful, the business status of the process tracker entry is set to COMPLETED_BY_PT.
If none of the delivery attempts is successful, the business status of the process tracker entry is set to RETRIES_EXCEEDED.

In my case, the configured count was [2,3], so the maximum number of retries turns out to be 2 + 3 = 5.
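The arithmetic above is just the sum of the `count` entries. As a tiny sketch (the helper name `max_retries` is hypothetical):

```rust
/// Maximum number of retries implied by a `count` list: the sum of its entries.
pub fn max_retries(count: &[i32]) -> i32 {
    count.iter().sum()
}
```

For the test configuration, `max_retries(&[2, 3])` evaluates to 5.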

As for testing refunds and disputes webhooks, the screenshots are attached below:

Refunds (note the event_type and event_class fields):
Disputes (note the event_type and event_class fields). The business status is RESOURCE_STATUS_MISMATCH because the dispute status changed between the time the event was created and the time the webhook was retried:

In addition, successful and failed task additions should raise appropriate metrics (TASKS_ADDED_COUNT and TASK_ADDITION_FAILURES_COUNT respectively), and suitable logs are emitted when delivering webhooks to the merchant fails and when retry configs are read from the database.

The PR also adds unit tests for a scheduler utility function I refactored, which can be run using the command:

cargo test --package scheduler --lib -- utils::tests

Checklist

I formatted the code cargo +nightly fmt --all
I addressed lints thrown by cargo clippy
I reviewed the submitted code
I added unit tests for my changes where possible
I added a CHANGELOG entry if applicable

hrithikesh026

LGTM

SanchithHegde added 29 commits February 20, 2024 15:12

chore(api_models): reorder serde derive attribute

b9034dd

refactor(webhooks): remove merchant_account parameter from `create_event_and_trigger_outgoing_webhook()` function

158e08e

refactor(webhooks): change visibility of functions to private

19fb227

The visibility of `create_event_and_trigger_outgoing_webhook()` and `trigger_webhook_to_merchant()` functions is now private instead of public.

refactor(webhooks): extract event logging logic to a function

7c1dbf7

refactor(webhooks): rename get_outgoing_webhook_event_type() method of `OutgoingWebhookEventMetric` trait to `get_outgoing_webhook_event_content()`

e6d61b6

refactor(webhooks): accept delivery attempt as parameter when triggering webhook

c905365

refactor(process_tracker): update make_process_tracker_new() to accept a `tag` parameter

fba4567

feat(webhooks): add task to process tracker before initial outgoing webhook delivery attempt

6d4a41b

chore: fix typo

bca272d

refactor(webhooks): add merchant_id field in `OutgoingWebhookTrackingData`

b389645

refactor(webhooks): accept event_id instead of event as parameter when raising analytics event

f42a168

refactor(webhooks): decide outgoing webhook type just before sending outgoing webhook

41a421e

Merge branch 'main' into outgoing-webhooks-automatic-retry

ed23a00

chore: fix errors after merging main

39ec3cf

chore: fix typo

c9db463

refactor(scheduler): simplify get_delay() utility function

7852838

fix: use Debug impl to log errors when unable to fetch workflow retry configuration

30fa7fa

fix(process_data): remove rename_all = camelCase annotation from `ConnectorPTMapping` and `PaymentMethodsPTMapping` types

f21bd2d

fix(workflows): fix payments sync workflow being incorrectly scheduled for 2nd retry attempt instead of 1st retry attempt

eb96701

feat(webhooks): automatically retry delivery of failed webhooks using scheduler

63a188d

refactor(webhooks): populate timestamp in webhook payload from event in database when retrying webhooks from scheduler

0aec985

refactor(webhooks): extract out common error, success and error response handling code to closures

b66a1ee

feat(outgoing_webhook_retry): add support for retrying refunds, mandates and disputes webhooks

339dc26

Merge branch 'main' into outgoing-webhooks-automatic-retry

0c85f28

refactor(outgoing_webhook_retry): instrument function calls and log current resource status

22838c0

refactor(webhooks): make process tracker task insertion optional during initial delivery attempt

6df96b8

refactor(webhooks): raise metrics in case of task insertion success/failure

55b9dd0

fix(refunds): remove unused function

20e7f8c

Merge branch 'main' into outgoing-webhooks-automatic-retry

1d3bf5c

SanchithHegde added the A-core Area: Core flows label Feb 27, 2024

Merge branch 'main' into outgoing-webhooks-automatic-retry

38185cc

SanchithHegde dismissed sahkal’s stale review via 38185cc February 28, 2024 10:48

sahkal previously approved these changes Feb 28, 2024

hrithikesh026 reviewed Feb 28, 2024

crates/router/src/workflows/outgoing_webhook_retry.rs (resolved)

hrithikesh026 previously approved these changes Feb 28, 2024

Merge branch 'main' into outgoing-webhooks-automatic-retry

2c91e0f

SanchithHegde dismissed stale reviews from hrithikesh026 and sahkal via 2c91e0f February 29, 2024 11:00

sahkal previously approved these changes Feb 29, 2024

hrithikesh026 previously approved these changes Feb 29, 2024

sai-harsha-vardhan previously approved these changes Mar 1, 2024

crates/router/src/workflows/outgoing_webhook_retry.rs (resolved)

Narayanbhat166 previously approved these changes Mar 1, 2024

crates/scheduler/src/consumer/types/process_data.rs (resolved)

crates/scheduler/src/utils.rs (outdated, resolved)

SanchithHegde added 2 commits March 3, 2024 18:49

Merge branch 'main' into outgoing-webhooks-automatic-retry

b2dd060

chore(scheduler): add unit tests for get_delay() utility function

a38212f

SanchithHegde dismissed stale reviews from Narayanbhat166, sai-harsha-vardhan, hrithikesh026, and sahkal via a38212f March 3, 2024 13:47

chore: fix typos

52c0934

SanchithHegde requested a review from a team as a code owner March 3, 2024 14:09

Narayanbhat166 approved these changes Mar 4, 2024

sahkal approved these changes Mar 4, 2024

Narayanbhat166 mentioned this pull request Mar 4, 2024

[BUG] webhooks are not sent again if it fails to send the first time during Psync #3408

Closed

2 tasks

lsampras approved these changes Mar 4, 2024

Gnanasundari24 added this pull request to the merge queue Mar 4, 2024

Merged via the queue into main with commit 5bb67c7 Mar 4, 2024
13 of 15 checks passed

Gnanasundari24 deleted the outgoing-webhooks-automatic-retry branch March 4, 2024 06:45

SanchithHegde removed the S-waiting-on-review Status: This PR has been implemented and needs to be reviewed label Mar 4, 2024

SanchithHegde mentioned this pull request Mar 6, 2024

[BUG] Outgoing webhook retry scheduler tasks remain in pending state if merchant webhook URL is not configured #3995

Closed
