`redis.exception.WatchError` is not really an error #2639

chrisguidry · 2024-06-26T20:14:48Z

What problem do you want to solve?

When upgrading opentelemetry-instrumentation-redis from 0.39b0 to 0.46b0, we picked up the (very nice) fixes to instrumentation of the async redis client. However, all of our arq (https://github.com/samuelcolvin/arq/) workers started marking tons of traces as errors because they got a WatchError exception.

The redis WatchError is not really an error, it's a flow-control and concurrency-management mechanism (kind of like a LockError, etc). The fact that we're marking all of those spans as errored introduces a lot of noise to something that's a totally reasonable pattern and use of redis.

Here's an example of how it's used for concurrency control: https://github.com/samuelcolvin/arq/blob/main/arq/connections.py#L178C20-L180

Describe the solution you'd like

I'd recommend that we adjust the instrumentation to not mark spans as failed when they encounter WatchError (or to at least make this behavior configurable). It does make sense to still mark them as events on the current span, though.

Describe alternatives you've considered

No response

Additional Context

No response

Would you like to implement a fix?

None

The text was updated successfully, but these errors were encountered:

xrmx · 2024-06-27T08:09:25Z

Yeah, makes sense.

zhihali · 2024-06-28T19:44:51Z

May I take this one?

xrmx · 2024-06-29T13:02:21Z

@Charlie-lizhihan sure

zhihali · 2024-07-04T13:42:20Z

Hi, @xrmx @chrisguidry
When an WatchError occur, the current instrumentation will set span as

{
    "name": "SET",
    "context": {
        "trace_id": "0x878718c68c75dbc45f875358f21f1820",
        "span_id": "0x4302e43a40344ddd",
        "trace_state": "[]"
    },
    "kind": "SpanKind.CLIENT",
    "parent_id": null,
    "start_time": "2024-07-04T13:37:05.133151Z",
    "end_time": "2024-07-04T13:37:05.136076Z",
    "status": {
        "status_code": "ERROR",
        "description": "WatchError: Watched variable changed."
    },
    "attributes": {
        "db.statement": "SET ? ?",
        "db.system": "redis",
        "db.redis.database_index": 0,
        "net.peer.name": "localhost",
        "net.peer.port": 6379,
        "net.transport": "ip_tcp",
        "db.redis.pipeline_length": 1
    },
    "events": [
        {
            "name": "exception",
            "timestamp": "2024-07-04T13:37:05.136058Z",
            "attributes": {
                "exception.type": "redis.exceptions.WatchError",
                "exception.message": "Watched variable changed.",
                "exception.stacktrace": "Traceback (most recent call last):\n  File.....",
                "exception.escaped": "False"
            }
        }
    ],
    "links": [],
    "resource": {
        "attributes": {
            "telemetry.sdk.language": "python",
            "telemetry.sdk.name": "opentelemetry",
            "telemetry.sdk.version": "1.25.0",
            "service.name": "unknown_service"
        },
        "schema_url": ""
    }
}

My idea is to change the
Status Code: The status_code should be "UNSET" instead of "ERROR"

The updated span will looks like

{
    "name": "SET",
    "context": {
        "trace_id": "0xa3dfeee4a56d35db967d5009c1cce330",
        "span_id": "0x0843393d6d86562d",
        "trace_state": "[]"
    },
    "kind": "SpanKind.CLIENT",
    "parent_id": null,
    "start_time": "2024-07-04T10:19:59.647866Z",
    "end_time": "2024-07-04T10:19:59.651123Z",
    "status": {
        "status_code": "UNSET"
    },
    "attributes": {
        "db.statement": "SET ? ?",
        "db.system": "redis",
        "db.redis.database_index": 0,
        "net.peer.name": "localhost",
        "net.peer.port": 6379,
        "net.transport": "ip_tcp",
        "db.redis.pipeline_length": 1
    },
    "events": [
        {
            "name": "exception",
            "timestamp": "2024-07-04T10:19:59.651113Z",
            "attributes": {
                "exception.type": "redis.exceptions.WatchError",
                "exception.message": "Watched variable changed.",
                "exception.stacktrace": "Traceback (most recent call last):\n  File......",
                "exception.escaped": "False"
            }
        }
    ],
    "links": [],
    "resource": {
        "attributes": {
            "telemetry.sdk.language": "python",
            "telemetry.sdk.name": "opentelemetry",
            "telemetry.sdk.version": "1.25.0",
            "service.name": "unknown_service"
        },
        "schema_url": ""
    }
}

Does this sounds good?

chrisguidry · 2024-07-04T15:19:06Z

That sounds perfect to me! Thank you for taking this one on

emdneto · 2024-07-15T22:42:16Z

@chrisguidry, would you care to give it a try with the fix from #2668?

chrisguidry · 2024-07-16T17:30:17Z

Will do, likely a little later this week

chrisguidry · 2024-07-17T13:51:32Z

Hi @emdneto I tested and left a note on the PR, I believe we're swallowing the error, but we should really be re-raising it

In open-telemetry#2668, we started to avoid having the `redis.WatchError` mark the span as having failed, but we were inadvertently suppressing that exception from the calling code. `redis.WatchError` is used for flow-of-control concurrency in many applications, and applications depend on catching the error to handle concurrent changes to keys during redis pipelines. This re-raises the `WatchError` to keep the instrumentation transparent to the application. Fixes open-telemetry#2639

chrisguidry added the feature-request label Jun 26, 2024

xrmx added the good first issue Good for newcomers label Jun 27, 2024

zhihali mentioned this issue Jul 5, 2024

Handle redis.exceptions.WatchError as a non-error event in instrumentation #2668

Merged

10 tasks

emdneto closed this as completed Jul 15, 2024

zhihali mentioned this issue Jul 16, 2024

REQUEST: New membership for zhihali open-telemetry/community#2214

Closed

6 tasks

chrisguidry mentioned this issue Jul 17, 2024

Re-raise redis.WatchErrors when they occur #2721

Merged

11 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`redis.exception.WatchError` is not really an error #2639

`redis.exception.WatchError` is not really an error #2639

chrisguidry commented Jun 26, 2024

xrmx commented Jun 27, 2024

zhihali commented Jun 28, 2024

xrmx commented Jun 29, 2024

zhihali commented Jul 4, 2024 •

edited

Loading

chrisguidry commented Jul 4, 2024

emdneto commented Jul 15, 2024

chrisguidry commented Jul 16, 2024

chrisguidry commented Jul 17, 2024

redis.exception.WatchError is not really an error #2639

redis.exception.WatchError is not really an error #2639

Comments

chrisguidry commented Jun 26, 2024

What problem do you want to solve?

Describe the solution you'd like

Describe alternatives you've considered

Additional Context

Would you like to implement a fix?

xrmx commented Jun 27, 2024

zhihali commented Jun 28, 2024

xrmx commented Jun 29, 2024

zhihali commented Jul 4, 2024 • edited Loading

chrisguidry commented Jul 4, 2024

emdneto commented Jul 15, 2024

chrisguidry commented Jul 16, 2024

chrisguidry commented Jul 17, 2024

`redis.exception.WatchError` is not really an error #2639

`redis.exception.WatchError` is not really an error #2639

zhihali commented Jul 4, 2024 •

edited

Loading