cli,server: disable latency jump detection with `start-single-node` and/or when using Docker on macOS #98066

knz · 2023-03-06T17:10:42Z

Describe the problem

When running inside a Docker container on macOS, the TCP stack has very irregular latencies. This causes spurious log messages for folk who are exploring / testing using cockroach start-single-node.

Like this:

2023-03-06 10:56:36 W230306 15:56:34.770789 623 2@rpc/clock_offset.go:226 ⋮ [n1,rnode=1,raddr=‹b17287e17171:26257›,class=default,heartbeat] 40  latency jump (prev avg 64.54ms, current 108.40ms)
2023-03-06 10:56:45 W230306 15:56:44.095660 623 2@rpc/clock_offset.go:226 ⋮ [n1,rnode=1,raddr=‹b17287e17171:26257›,class=default,heartbeat] 41  latency jump (prev avg 70.86ms, current 131.61ms)
2023-03-06 10:57:01 W230306 15:56:59.634686 623 2@rpc/clock_offset.go:226 ⋮ [n1,rnode=1,raddr=‹b17287e17171:26257›,class=default,heartbeat] 42  latency jump (prev avg 72.80ms, current 242.88ms)

To Reproduce

run cockroach start-single-node on macOS via the Docker image.

Expected behavior

We should avoid the spurious warnings in that case.

Note that this logging mechanism is otherwise desirable in production clusters. A latency jump from, say, 70ms to 242ms between nodes, or on a node connecting to itself via the loopback interface, is a MAJOR operational event and should be reported.

The anomalous situation here is caused by macOS. So if we make the situation better in that case, that should not be at the expense of proper network observability for everyone else.

Jira issue: CRDB-25060

The text was updated successfully, but these errors were encountered:

For months I've seen this misfire in nearly every single log line I've looked at, and I've had to grep it out in many L2 incidents. Maybe it works better when we suppress it for latencies <=50ms. Touches cockroachdb#96262. Fixes cockroachdb#98066. Epic: none Release note: None

98517: multitenant: add AdminMerge capability r=knz a=ecwall Fixes #95138 AdminMerge is currently only called by the system tenant even though it is named similarly to other Admin* functions so it does not need its own capability for now. This changes its required capability from noCapCheckNeeded to onlySystemTenant to prevent secondary tenants from calling it. Release note: None 98580: rpc: bump threshold for latency jump reporting r=erikgrinaker a=tbg For months I've seen this misfire in nearly every single log line I've looked at, and I've had to grep it out in many L2 incidents. Maybe it works better when we suppress it for latencies <=50ms. Touches #96262. Fixes #98066. Epic: none Release note: None 98688: colexecbase: fix a recently introduced bug with identity cast r=yuzefovich a=yuzefovich This commit fixes a recently introduced bug that can occur when we're randomizing `coldata.BatchSize()` (which we do in tests). In particular, we capped a global singleton at one batch size value, but later we can change it to a higher value, which could lead to index out of bounds. This is now fixed by always using the max batch size. Fixes: #98660. Release note: None 98700: opt: fix hoist of ANY comparison with tuples r=mgartner a=mgartner #### opt: fix hoist of ANY comparison with tuples Prior to this commit, when hoisting Any expressions like `<left> = ANY (SELECT <right> ...)`, we constructed `(IsNot <left|right> Null)` expressions which are equivalent to `<left|right> IS DISTINCT FROM NULL`. As discovered in #46675, these expressions have different behavior than `<left> IS NOT NULL` when `<left>` is a tuple. As a result, the hoisting transformations could construct invalid plans that cause incorrect results. This commit fixes the issue by using `IsTupleNotNull` expressions when `<left>` and `<right> are tupleq. Fixes #98691 Release note (bug fix): A bug has been fixes that caused incorrect results of ANY comparisons of tuples. For example, an expression like `(x, y) = ANY (SELECT a, b FROM t WHERE ...)` could return `true` instead of the correct result of `NULL` when `x` and `y` were `NULL`, or `a` and `b` were `NULL`. This could only occur if the subquery was correlated, i.e., it references columns from the outer part of the query. This bug was present since the cost-based optimizer was introduced in version 2.1. Co-authored-by: Evan Wall <[email protected]> Co-authored-by: Tobias Grieger <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]> Co-authored-by: Marcus Gartner <[email protected]>

blathers-crl bot added the A-observability-inf label Mar 6, 2023

knz added the B-os-macos Issues specific to macOS. label Mar 6, 2023

tbg mentioned this issue Mar 14, 2023

rpc: bump threshold for latency jump reporting #98580

Merged

craig bot closed this as completed in 3fed3d4 Mar 15, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cli,server: disable latency jump detection with `start-single-node` and/or when using Docker on macOS #98066

cli,server: disable latency jump detection with `start-single-node` and/or when using Docker on macOS #98066

knz commented Mar 6, 2023 •

edited by cockroach-jira-scripts

Loading

cli,server: disable latency jump detection with start-single-node and/or when using Docker on macOS #98066

cli,server: disable latency jump detection with start-single-node and/or when using Docker on macOS #98066

Comments

knz commented Mar 6, 2023 • edited by cockroach-jira-scripts Loading

cli,server: disable latency jump detection with `start-single-node` and/or when using Docker on macOS #98066

cli,server: disable latency jump detection with `start-single-node` and/or when using Docker on macOS #98066

knz commented Mar 6, 2023 •

edited by cockroach-jira-scripts

Loading