feat(webhooks): implement automatic retries for failed webhook deliveries using scheduler #3842

Merged
merged 36 commits into from
Mar 4, 2024

@SanchithHegde SanchithHegde commented Feb 27, 2024

Type of Change

  • Bugfix
  • New feature
  • Enhancement
  • Refactoring
  • Dependency updates
  • Documentation
  • CI/CD

Description

This PR adds support for automatically retrying outgoing webhook deliveries in case of failed deliveries, with the aid of the scheduler (process tracker).

Behavior (Initial Delivery)

The behavior when an outgoing webhook is being sent is:

  1. During the initial webhook delivery attempt, a process tracker task is added before sending the webhook, since we launch the webhook delivery task on a background thread that we have no control over. The scheduled time for the task is determined from configuration (persisted in the database/Redis), which is explained later.

  2. Next, we trigger the webhook to the merchant and an analytics event is raised, as before.

    1. In case the initial delivery attempt succeeds, the process tracker task is marked completed. The end.
    2. In case the initial delivery attempt fails, no further action is taken; the webhook delivery will be attempted again by the process tracker at the scheduled time.
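
The ordering described above (register the task first, then send) can be sketched as follows. This is a minimal illustration and not the code from this PR: `TaskState`, `initial_delivery`, and the `send` closure are hypothetical names, and the business status string is the one reported in the testing section of this description.

```rust
/// Hypothetical, simplified view of a process tracker task.
#[derive(Debug, PartialEq, Eq)]
pub enum TaskState {
    /// The task is waiting to be picked up by the scheduler.
    Pending { schedule_after_seconds: i64 },
    /// The task is finished and needs no further processing.
    Completed { business_status: &'static str },
}

/// Sketch of the initial delivery flow.
/// `send` stands in for the HTTP call to the merchant and returns `true` on success.
pub fn initial_delivery(start_after_seconds: i64, send: impl FnOnce() -> bool) -> TaskState {
    // Step 1: the retry task exists before the webhook is sent, so a failed
    // delivery on the background thread still leaves a task to retry from.
    let task = TaskState::Pending {
        schedule_after_seconds: start_after_seconds,
    };

    // Step 2: attempt the delivery.
    if send() {
        // Success: the task is marked completed.
        TaskState::Completed {
            business_status: "INITIAL_DELIVERY_ATTEMPT_SUCCESSFUL",
        }
    } else {
        // Failure: nothing else is done here; the scheduler retries later.
        task
    }
}
```

Creating the task before sending means a failure in the background task never leaves a webhook without a scheduled retry.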

Behavior (Automatic Retry)

  1. During the automatic retry attempt(s) from process tracker, the current information about the resource (about which the webhook is being sent) is fetched from the database and will be included in the webhook payload.

  2. If the status of the resource differs from its status when the event for the webhook was created (that is, the resource has since transitioned to another status), then the process tracker task is aborted with a business status indicating the status mismatch.

  3. When triggering the webhook to the merchant:

    1. If the retry attempt succeeds, the process tracker task is marked completed.
    2. If the retry attempt fails, the process tracker task is scheduled again for retry at the scheduled time determined from configuration.
    3. If the number of retries exceeds the maximum number of retries, then the process tracker task is marked completed with a business status indicating that the retries were exceeded.
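
The retry steps above amount to a small decision procedure, sketched below. The type and function names (`RetryOutcome`, `run_retry_attempt`) are hypothetical and chosen for illustration; only the business status strings come from this description. The `next_delay_seconds` parameter is an assumption standing in for "the configuration still allows another retry".

```rust
/// What happens to the process tracker task after one retry attempt.
#[derive(Debug, PartialEq, Eq)]
pub enum RetryOutcome {
    /// The resource changed status since the event was created; nothing is sent.
    AbortedStatusMismatch,
    /// The retried delivery succeeded.
    Completed,
    /// The retried delivery failed and another attempt is scheduled.
    Rescheduled { delay_seconds: i64 },
    /// The retried delivery failed and no attempts remain.
    RetriesExceeded,
}

impl RetryOutcome {
    /// Business status recorded on the task, for outcomes that end it.
    pub fn business_status(&self) -> Option<&'static str> {
        match self {
            Self::AbortedStatusMismatch => Some("RESOURCE_STATUS_MISMATCH"),
            Self::Completed => Some("COMPLETED_BY_PT"),
            Self::RetriesExceeded => Some("RETRIES_EXCEEDED"),
            Self::Rescheduled { .. } => None,
        }
    }
}

/// Sketch of a single retry attempt.
/// `send` stands in for the HTTP call to the merchant and returns `true` on success.
/// `next_delay_seconds` is `None` once the configured retries are used up.
pub fn run_retry_attempt(
    status_at_event_creation: &str,
    current_status: &str,
    next_delay_seconds: Option<i64>,
    send: impl FnOnce() -> bool,
) -> RetryOutcome {
    // The status check comes first, so a stale event is never delivered.
    if status_at_event_creation != current_status {
        return RetryOutcome::AbortedStatusMismatch;
    }
    if send() {
        return RetryOutcome::Completed;
    }
    match next_delay_seconds {
        Some(delay_seconds) => RetryOutcome::Rescheduled { delay_seconds },
        None => RetryOutcome::RetriesExceeded,
    }
}
```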

Configuring Retry Intervals

The retry intervals for webhook deliveries are determined by runtime configuration (persisted in the database/Redis), using the key pt_mapping_outgoing_webhooks. In case the configuration value is not available in storage, a default configuration value is used, which is hardcoded in the application:

default_mapping: RetryMapping {
    // 1st attempt happens after 1 minute
    start_after: 60,
    frequency: vec![
        // 2nd and 3rd attempts happen at intervals of 5 minutes each
        60 * 5,
        // 4th, 5th, 6th, 7th and 8th attempts happen at intervals of 10 minutes each
        60 * 10,
        // 9th, 10th, 11th, 12th and 13th attempts happen at intervals of 1 hour each
        60 * 60,
        // 14th, 15th and 16th attempts happen at intervals of 6 hours each
        60 * 60 * 6,
    ],
    count: vec![
        2, // 2nd and 3rd attempts
        5, // 4th, 5th, 6th, 7th and 8th attempts
        5, // 9th, 10th, 11th, 12th and 13th attempts
        3, // 14th, 15th and 16th attempts
    ],
},
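
To make the pairing of frequency and count concrete, here is a minimal sketch of how a delay could be looked up for a given attempt. It follows the attempt numbering used in the comments above and is not the scheduler's actual implementation, which may count attempts differently. The `RetryMapping` struct below is a stand-in with assumed `i32` fields, and `default_mapping()` and `delay_before_attempt()` are hypothetical helpers.

```rust
/// Stand-in for the mapping shown above; the field types are an assumption.
#[derive(Debug, Clone)]
pub struct RetryMapping {
    pub start_after: i32,
    pub frequency: Vec<i32>,
    pub count: Vec<i32>,
}

/// The hardcoded default values from the description, in seconds.
pub fn default_mapping() -> RetryMapping {
    RetryMapping {
        start_after: 60,
        frequency: vec![60 * 5, 60 * 10, 60 * 60, 60 * 60 * 6],
        count: vec![2, 5, 5, 3],
    }
}

/// Delay in seconds before the given scheduled attempt (1-based),
/// or `None` once every configured attempt has been used.
pub fn delay_before_attempt(mapping: &RetryMapping, attempt: i32) -> Option<i32> {
    if attempt < 1 {
        return None;
    }
    if attempt == 1 {
        // The first scheduled attempt uses `start_after`.
        return Some(mapping.start_after);
    }
    // Later attempts walk the (frequency, count) pairs in order:
    // each frequency applies to the next `count` attempts.
    let mut remaining = attempt - 1;
    for (frequency, count) in mapping.frequency.iter().zip(mapping.count.iter()) {
        if remaining <= *count {
            return Some(*frequency);
        }
        remaining -= *count;
    }
    None
}
```

With the default values, attempts 2 and 3 map to 300 seconds, attempts 4 through 8 to 600 seconds, and any attempt past the 16th yields `None`.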

If the default configuration needs to be overridden, this can be done using the configuration create (or update) endpoint:

curl --location 'http://localhost:8080/configs/' \
  --header 'Content-Type: application/json' \
  --header 'Accept: application/json' \
  --header 'api-key: <ADMIN_API_KEY>' \
  --data '{
    "key": "pt_mapping_outgoing_webhooks",
    "value": "{\"default_mapping\":{\"start_after\":60,\"frequency\":[15,30],\"count\":[2,3]},\"custom_merchant_mapping\":{}}"
  }'

This would override the configuration for all merchants. Note that frequency and start_after are specified in seconds. If it needs to be overridden only for a specific merchant, then the custom_merchant_mapping field must contain an object similar to default_mapping, keyed by the merchant ID:

{
  "default_mapping": {
    "start_after": 60,
    "frequency": [15, 30],
    "count": [2, 3]
  },
  "custom_merchant_mapping": {
    "merchant_id1": {
      "start_after": 30,
      "frequency": [300],
      "count": [2]
    }
  }
}
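
The per-merchant override can be read as a lookup with a fallback, sketched below. The sketch assumes that merchants without an entry in custom_merchant_mapping use default_mapping. The struct and function names (`OutgoingWebhookRetryConfig`, `mapping_for`, `example_config`) are hypothetical, and JSON parsing is left out to keep the example free of external crates.

```rust
use std::collections::HashMap;

/// Stand-in for one retry mapping; the field types are an assumption.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RetryMapping {
    pub start_after: i32,
    pub frequency: Vec<i32>,
    pub count: Vec<i32>,
}

/// Hypothetical in-memory form of the `pt_mapping_outgoing_webhooks` value.
#[derive(Debug, Clone)]
pub struct OutgoingWebhookRetryConfig {
    pub default_mapping: RetryMapping,
    pub custom_merchant_mapping: HashMap<String, RetryMapping>,
}

impl OutgoingWebhookRetryConfig {
    /// Returns the merchant-specific mapping when one is configured,
    /// and the default mapping otherwise.
    pub fn mapping_for(&self, merchant_id: &str) -> &RetryMapping {
        self.custom_merchant_mapping
            .get(merchant_id)
            .unwrap_or(&self.default_mapping)
    }
}

/// Builds the same configuration as the JSON example above.
pub fn example_config() -> OutgoingWebhookRetryConfig {
    let mut custom_merchant_mapping = HashMap::new();
    custom_merchant_mapping.insert(
        "merchant_id1".to_string(),
        RetryMapping {
            start_after: 30,
            frequency: vec![300],
            count: vec![2],
        },
    );
    OutgoingWebhookRetryConfig {
        default_mapping: RetryMapping {
            start_after: 60,
            frequency: vec![15, 30],
            count: vec![2, 3],
        },
        custom_merchant_mapping,
    }
}
```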

Motivation and Context

Closes #217.

How did you test it?

As of now, outgoing webhooks are supported for payments, refunds, disputes and mandates. I've extensively tested payments outgoing webhooks for the different cases, and done sanity testing on disputes and refunds webhooks to verify that they are retried in case the initial delivery attempt fails. Incoming mandates webhooks are integrated for the GoCardless connector, but the integration is broken, so I couldn't test mandates webhook retries.

As for simulating failed webhook deliveries, the merchant webhook URL would have to be configured to a URL which does not accept POST requests. After a couple of failed retries, the URL can be updated to a valid URL to try out the case where the retried delivery attempt succeeds.

Since the hardcoded default retry configuration spans multiple hours, I configured the application to use shorter intervals:

curl --location 'http://localhost:8080/configs/' \
  --header 'Content-Type: application/json' \
  --header 'Accept: application/json' \
  --header 'api-key: <ADMIN_API_KEY>' \
  --data '{
    "key": "pt_mapping_outgoing_webhooks",
    "value": "{\"default_mapping\":{\"start_after\":60,\"frequency\":[15,30],\"count\":[2,3]},\"custom_merchant_mapping\":{}}"
  }'

The process_tracker table can be queried for the relevant records using this query:

SELECT * FROM process_tracker WHERE runner = 'OUTGOING_WEBHOOK_RETRY_WORKFLOW' ORDER BY schedule_time DESC;
  1. If the initial delivery attempt is successful, the business status of the process tracker entry is set to INITIAL_DELIVERY_ATTEMPT_SUCCESSFUL:

    Screenshot of process tracker entry where initial delivery attempt was successful

  2. If one of the retried delivery attempts is successful, the business status of the process tracker entry is set to COMPLETED_BY_PT:

    Screenshot of process tracker entry where retried delivery attempt was successful

  3. If none of the delivery attempts are successful, the business status of the process tracker entry is set to RETRIES_EXCEEDED:

    Screenshot of process tracker entry where retries failed

    In my case, the configured count was [2,3], so the maximum number of retries turns out to be 2 + 3 = 5.

As for testing refunds and disputes webhooks, the screenshots are attached below:

  1. Refunds (note the event_type and event_class fields):

    Screenshot of failed refund webhook being retried

  2. Disputes (note the event_type and event_class fields). The business status is RESOURCE_STATUS_MISMATCH since the dispute status changed between the time it was created and the time the webhook was being retried:

    Screenshot of dispute webhook being aborted due to status mismatch

In addition, successful and failed task additions should raise appropriate metrics (TASKS_ADDED_COUNT and TASK_ADDITION_FAILURES_COUNT respectively), and suitable logs are emitted when delivering webhooks to the merchant fails and when retry configs are read from the database.

The PR also adds unit tests for a scheduler utility function I refactored, which can be run using the command:

cargo test --package scheduler --lib -- utils::tests

Screenshot of unit tests run

Checklist

  • I formatted the code using cargo +nightly fmt --all
  • I addressed lints thrown by cargo clippy
  • I reviewed the submitted code
  • I added unit tests for my changes where possible
  • I added a CHANGELOG entry if applicable

…event_and_trigger_outgoing_webhook()` function
The visibility of `create_event_and_trigger_outgoing_webhook()` and `trigger_webhook_to_merchant()` functions is now private instead of public.
… of `OutgoingWebhookEventMetric` trait to `get_outgoing_webhook_event_content()`
…onnectorPTMapping` and `PaymentMethodsPTMapping` types
…d for 2nd retry attempt instead of 1st retry attempt
…in database when retrying webhooks from scheduler
@SanchithHegde SanchithHegde added the A-core Area: Core flows label Feb 27, 2024
sahkal
sahkal previously approved these changes Feb 28, 2024
hrithikesh026
hrithikesh026 previously approved these changes Feb 28, 2024
@hrithikesh026 hrithikesh026 left a comment

LGTM

sahkal
sahkal previously approved these changes Feb 29, 2024
hrithikesh026
hrithikesh026 previously approved these changes Feb 29, 2024
Narayanbhat166
Narayanbhat166 previously approved these changes Mar 1, 2024
crates/scheduler/src/utils.rs Outdated
@SanchithHegde SanchithHegde requested a review from a team as a code owner March 3, 2024 14:09
@Gnanasundari24 Gnanasundari24 added this pull request to the merge queue Mar 4, 2024
Merged via the queue into main with commit 5bb67c7 Mar 4, 2024
13 of 15 checks passed
@Gnanasundari24 Gnanasundari24 deleted the outgoing-webhooks-automatic-retry branch March 4, 2024 06:45
@SanchithHegde SanchithHegde removed the S-waiting-on-review Status: This PR has been implemented and needs to be reviewed label Mar 4, 2024
Labels
A-core Area: Core flows A-process-tracker Area: Process tracker A-webhooks Area: Webhook flows C-feature Category: Feature request or enhancement
Development

Successfully merging this pull request may close these issues.

[FEATURE] Schedule webhook for retry
7 participants