Svix worker does not work as expected #1511

Closed
ducpham01 opened this issue Nov 8, 2024 · 9 comments · Fixed by #1515
Labels: server (Issues regarding the server component)

@ducpham01

ducpham01 commented Nov 8, 2024

Bug Report

Version

V0.74

Platform

Ubuntu 20.04

Description

We run Svix in both worker and API mode within one container, using RedisCluster and Postgres.
One day, the worker stopped working even though the API was still receiving requests:

  • It logged 200 for requests and 202 for newly created messages.
  • It logged "event sent >" entries, which indicate that tasks were put in the queue.
  • It did not log any "event recv <" entries.
  • In the database, Svix saved messages in the message table, but there were no corresponding message attempts.
  • There was no indication of an error with the database or RedisCluster.

As a result, messages were not delivered to our clients.
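
The database symptom above (messages saved without corresponding attempts) can be checked with an anti-join. This is a minimal sketch run against an in-memory SQLite database with a simplified, hypothetical schema: the table names follow the thread, but the column names (`msg_id`, `created_at`) and the sample rows are assumptions, and the real Svix tables have many more columns.

```python
import sqlite3

# Hypothetical, simplified schema; column names are assumptions.
SCHEMA = """
CREATE TABLE message (id TEXT PRIMARY KEY, created_at TEXT NOT NULL);
CREATE TABLE messageattempt (
    id TEXT PRIMARY KEY,
    msg_id TEXT NOT NULL REFERENCES message(id)
);
"""

# Anti-join: keep messages that have no row at all in messageattempt.
UNATTEMPTED_SQL = """
SELECT m.id
FROM message AS m
LEFT JOIN messageattempt AS a ON a.msg_id = m.id
WHERE a.id IS NULL
ORDER BY m.created_at
"""


def find_unattempted(conn):
    """Return ids of messages with no delivery attempt, oldest first."""
    return [row[0] for row in conn.execute(UNATTEMPTED_SQL)]


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO message VALUES (?, ?)",
        [
            ("msg_1", "2024-11-08T01:00:00Z"),
            ("msg_2", "2024-11-08T02:00:00Z"),
            ("msg_3", "2024-11-08T03:00:00Z"),
        ],
    )
    # Only msg_1 was ever attempted in this made-up data set.
    conn.execute("INSERT INTO messageattempt VALUES (?, ?)", ("atmpt_1", "msg_1"))
    return find_unattempted(conn)
```

A message whose application has no matching endpoints would also show up in this list, so a non-empty result is a hint to investigate rather than proof of a stalled worker.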

We got Svix working again by restarting the container, and messages can now be delivered to clients, but there is another problem.

Svix continuously logs:

Error executing task: error returned from database: null value in column "id" of relation "messagedestination" violates not-null constraint

The database logs show:

INSERT INTO "messagedestination" VALUES (DEFAULT) RETURNING "id"

Failing row contains (null, null, null, null, null, null, null).

null value in column "id" of relation "messagedestination" violates not-null constraint.

Looking at the source code, I found only one place that inserts into the messagedestination table: process_task. I think the variable destinations, which is mapped from endpoints, could be null.
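
To illustrate the suspected failure mode, here is a minimal sketch using SQLite rather than Postgres. It is not Svix's code: `insert_destinations`, the `msg_id` and `endp_id` columns, and the idea that the insert was built from an empty list of destinations are all assumptions made for illustration. SQLite spells an insert with no values as `DEFAULT VALUES`, which stands in here for the `VALUES (DEFAULT)` statement seen in the database log.

```python
import sqlite3


def insert_destinations(conn, msg_id, endpoint_ids):
    """Insert one messagedestination row per endpoint; return the row count."""
    if not endpoint_ids:
        # Guard: no endpoints remain, so skip the insert instead of
        # sending a statement that carries no column values.
        return 0
    rows = [(f"{msg_id}:{ep}", msg_id, ep) for ep in endpoint_ids]
    conn.executemany(
        "INSERT INTO messagedestination (id, msg_id, endp_id) VALUES (?, ?, ?)",
        rows,
    )
    return len(rows)


def demo():
    conn = sqlite3.connect(":memory:")
    # Hypothetical, simplified table: NOT NULL columns without defaults.
    conn.execute(
        "CREATE TABLE messagedestination ("
        "id TEXT NOT NULL, msg_id TEXT NOT NULL, endp_id TEXT NOT NULL)"
    )
    try:
        # An insert with no values tries to store NULL in every column.
        conn.execute("INSERT INTO messagedestination DEFAULT VALUES")
        unguarded = "ok"
    except sqlite3.IntegrityError as exc:
        unguarded = str(exc)
    guarded = insert_destinations(conn, "msg_1", [])
    return unguarded, guarded
```

In this sketch the unguarded insert fails with a not-null violation, while the guarded helper simply returns 0 when there are no endpoints to write.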

I hope to get help with some questions:

  • Why did the Svix worker stop working without showing any sign of a problem? Is this fixed in newer versions of Svix?
  • What is the root cause of the not-null constraint error above? Can it be reproduced locally? Is it fixed in newer versions of Svix?
@tu-pm

tu-pm commented Nov 8, 2024

Basically, the background worker that sends messages to endpoints just stopped working with no sign of failure, and after we restarted it, it fails repeatedly due to some corrupt data.

@tasn
Member

tasn commented Nov 8, 2024

Team is looking!

@svix-james
Contributor

Are the errors preventing you from processing other messages in the queue? My expectation is that other items should still get processed even if some are failing due to the error you reported.

Note that our more recent versions support a Redis DLQ, which sends repeatedly failing messages to a dead-letter queue. This might help clear up the errors you're facing.

@ducpham01
Author

> Has upgrading to the latest version fixed it? 0.74 is 2 years old and I'm sure a lot has changed.

I have two problems here. The second one, the not-null constraint error (Error executing task: error returned from database: null value in column "id" of relation "messagedestination" violates not-null constraint), is fixed in the latest version (v1.39.0).

I can reproduce this error locally:

  • Set up two separate containers, one for the API server and one for the worker.
  • Send a message to the API server while the worker is down, then delete the destination endpoint.
  • Restart the worker.
  • The worker then logs: Error executing task: error returned from database: null value in column "id" of relation "messagedestination" violates not-null constraint

As I understand it, the version of Svix I am running doesn't check for null endpoints before mapping them to messagedestination rows, which results in null inserts into the database. This behavior was fixed as of v1.4.12.

@ducpham01
Author

> Are the errors preventing you from processing other messages in the queue? My expectation is that other items should still get processed even if some are failing due to the error you reported.

The database constraint error doesn't prevent Svix from processing other messages; it is just noisy and consumes CPU and memory.

Do you have any suggestions for finding out why the Svix worker stopped working unexpectedly without any sign of failure? I created a message, and it wasn't sent to my endpoint until the container was restarted.

@tu-pm

tu-pm commented Nov 11, 2024

@svix-james

> Are the errors preventing you from processing other messages in the queue?

No, other messages are handled properly when the worker is restarted.

We're able to reproduce the error by doing the following steps:

  1. Split the worker and api server into 2 containers and only run the api server
  2. Send a message to an endpoint
  3. Delete the endpoint
  4. Start worker to deliver the message => The worker process handling the task repeatedly fails with the reported error message
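
Steps 2 and 3 can be scripted against the API server while the worker is stopped. The sketch below only builds the two HTTP requests and does not send them. The paths are the Svix REST routes for creating a message and deleting an endpoint as I understand them; the base URL, token, IDs, and the `repro.test` event type are placeholders, and `svix_request` and `repro_requests` are hypothetical helper names.

```python
import json
import urllib.request


def svix_request(base_url, token, method, path, body=None):
    """Build (but do not send) an authenticated JSON request."""
    data = json.dumps(body).encode() if body is not None else None
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    url = base_url.rstrip("/") + path
    return urllib.request.Request(url, data=data, headers=headers, method=method)


def repro_requests(base_url, token, app_id, endpoint_id):
    """Requests for steps 2 and 3; send them while the worker is stopped."""
    send_message = svix_request(
        base_url,
        token,
        "POST",
        f"/api/v1/app/{app_id}/msg/",
        # Placeholder event type and payload.
        {"eventType": "repro.test", "payload": {"hello": "world"}},
    )
    delete_endpoint = svix_request(
        base_url,
        token,
        "DELETE",
        f"/api/v1/app/{app_id}/endpoint/{endpoint_id}/",
    )
    return [send_message, delete_endpoint]
```

Passing each request to `urllib.request.urlopen` in order, and then starting the worker, should correspond to step 4 above.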

We're also able to pinpoint the problematic code that causes the problem in version 0.74 and are able to handle the failed tasks by undeleting the endpoint.

The remaining mystery is why the worker suddenly stopped working with no sign of errors. This is troubling, since the worker is the vital part that actually sends messages to our customers' APIs.

We're planning to upgrade to a newer version of Svix to see if this problem reappears, and we have the following questions:

  1. Is there anything to look out for when upgrading a major version?
  2. Which Svix version do you recommend we aim for?
  3. Is there any mechanism to detect if the worker is alive and processing the message? For example, is there any way to get an alert if the task queue length is not decreasing for an appropriate amount of time?
  4. How do we use the DLQ feature and how do we monitor and get alerts with troubled messages?

Thank you for your help.

@tasn tasn added the server Issues regarding the server component label Nov 11, 2024
@svix-james
Contributor

> Is there anything to look out for when upgrading a major version?

Just double-check release notes, which I believe you have already done.

> Which Svix version do you recommend we aim for?

I would upgrade to the latest version, though of course you should test in a non-production environment prior to deploying.

> Is there any mechanism to detect if the worker is alive and processing the message? For example, is there any way to get an alert if the task queue length is not decreasing for an appropriate amount of time?

What we generally do in our own environments is run the API server inside of the worker container as well and probe the /health endpoint. This endpoint will do basic checks of queue, database, and cache health.
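
A minimal sketch of such a probe, suitable for a container health check or a cron job. The function name `is_healthy` is made up for illustration, and the full URL of the health endpoint (host, port, and exact path) is an assumption to verify against your deployment.

```python
import urllib.error
import urllib.request


def is_healthy(url, timeout=5.0):
    """Return True if the health URL answers with a 2xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, refused connections, and timeouts.
        return False
```

A health check script could call this with the health URL of the API server running next to the worker and exit non-zero when it returns False, so that the orchestrator restarts the container.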

Beyond that, we publish some basic OTEL metrics that monitor depth of the various queues. You can monitor the svix.queue.* metrics (e.g., svix.queue.depth_main) to ensure that queues are draining properly.
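
One way to turn a queue depth metric into a "queue is not draining" alert is to look at a trailing window of samples. The sketch below is a hypothetical helper, not part of Svix: it assumes you already collect (timestamp, depth) samples from a metric such as svix.queue.depth_main through your own metrics pipeline.

```python
def queue_is_stalled(samples, window_seconds):
    """Decide whether a queue looks stuck.

    samples: list of (unix_timestamp, depth) tuples, oldest first.
    Returns True when, over the trailing window, the queue was never
    empty and its depth never decreased between consecutive samples.
    """
    if not samples:
        return False
    latest_ts = samples[-1][0]
    cutoff = latest_ts - window_seconds
    if samples[0][0] > cutoff:
        # Not enough history to cover the whole window yet.
        return False
    window = [depth for ts, depth in samples if ts >= cutoff]
    if len(window) < 2 or min(window) <= 0:
        return False
    return all(later >= earlier for earlier, later in zip(window, window[1:]))
```

For example, `queue_is_stalled([(0, 5), (60, 5), (120, 7), (180, 7)], 120)` returns True, while a steadily shrinking queue returns False. A busy queue whose inflow matches its outflow can also look flat, so it is worth pairing this check with the health probe.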

> How do we use the DLQ feature and how do we monitor and get alerts with troubled messages?

The DLQ feature is enabled by default in recent Svix versions and will move messages to a queue with a default name of {queue}_svix_dlq. You can monitor this queue through OTEL metrics as well -- svix.queue.depth_dlq -- and re-drive it as per this: https://github.com/svix/svix-webhooks?tab=readme-ov-file#re-driving-redis-dlq

Otherwise, you'll need to manually inspect messages in this queue to try and troubleshoot why they failed.

@svix-james
Contributor

> We're able to reproduce the error by doing the following steps:
>
>   1. Split the worker and api server into 2 containers and only run the api server
>   2. Send a message to an endpoint
>   3. Delete the endpoint
>   4. Start worker to deliver the message => The worker process handling the task repeatedly fails with the reported error message

Also, thanks for the repro! I'm able to reproduce this in our latest code and can confirm this is still an issue. We will have a fix shortly 🤞

@svix-james svix-james self-assigned this Nov 11, 2024
jaymell added a commit to jaymell/svix-webhooks that referenced this issue Nov 11, 2024
Ensure that we don't try to process a message for which endpoints
no longer exist. Fixes svix#1511.
@jaymell
Contributor

jaymell commented Nov 11, 2024

@tu-pm We have released version 1.40.0 to address the issue: https://github.com/svix/svix-webhooks/releases/tag/v1.40.0
