sql: flake in TestExecBuild/autocommit due to non-deterministic `ResolveIntent` in KV batch #103772

knz · 2023-05-23T13:17:45Z

Describe the problem
Observed here

=== RUN   TestExecBuild_autocommit
initialized metamorphic constant "logictest-global-mvcc-range-tombstone" with value true
initialized metamorphic constant "logictest-use-mvcc-range-tombstones-for-point-deletes" with value true
I230523 11:14:31.113596 1 (gostd) rand.go:243  [-] 1  random seed: 443818906144832873
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/56e1e960ced543bbe2593e1496bb4696/logTestExecBuild_autocommit3512975457
    test_log_scope.go:79: use -show-logs to present logs inline
    logic.go:2940: let $rangeid = 63
    logic.go:2894:
        /home/roach/.cache/bazel/_bazel_roach/c5a4e7d36696d9cd970af2045211a7df/sandbox/processwrapper-sandbox/12742/execroot/com_github_cockroachdb_cockroach/bazel-out/k8-fastbuild/bin/pkg/sql/opt/exec/execbuilder/tests/local/local_test_/local_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/opt/exec/execbuilder/testdata/autocommit:678: SELECT operation, message FROM [SHOW KV TRACE FOR SESSION]
        WHERE message       LIKE '%r63: sending batch%'
          AND message   NOT LIKE '%PushTxn%'
          AND message   NOT LIKE '%QueryTxn%'
          AND operation NOT LIKE '%async%'
        expected:
            dist sender send  r63: sending batch 2 CPut to (n1,s1):1
            dist sender send  r63: sending batch 2 Get to (n1,s1):1
            dist sender send  r63: sending batch 1 EndTxn to (n1,s1):1
        but found (query options: "") :
            dist sender send  r63: sending batch 2 CPut to (n1,s1):1
            dist sender send  r63: sending batch 2 Get to (n1,s1):1
            dist sender send  r63: sending batch 1 ResolveIntent to (n1,s1):1
            dist sender send  r63: sending batch 1 EndTxn to (n1,s1):1

Expected behavior

check why this ResolveIntent is in there (this is a new failure mode)
And potentially make the test impervious to it.

cc @nvanbenschoten @mgartner

Jira issue: CRDB-28194

The text was updated successfully, but these errors were encountered:

nvanbenschoten · 2023-05-23T15:33:24Z

This could be because of #103265. If the Get has to resolve a conflicting intent, it may now end up in the trace. We're already ignoring trace events from other forms of transaction contention. We can add ResolveIntent in there as well.

103148: server: server identity returns nodeID for system tenant r=aadityasondhi a=dhartunian Previously, the server identity object would hide the nodeID from identity objects that contained a set tenantID. This was done to create separation of concerns between a secondary tenant and the rest of the cluster. However, this identity is used in populating log payloads with contextual metadata. That led to #103112 due to the fact that the system tenant sets its tenantId to 1 which led to a hidden nodeID. We now always emit the nodeID when the tenantID is set to the system tenant. Fixes #103112 Release note (bug fix): 23.1.0 contained a bug where the `node_id` field would be omitted in logs. This fix restores that value. 103709: server,jobs: Better handle node drain r=miretskiy a=miretskiy Rework job registry drain signal to terminate the drain as soon as the last job that was watching for drain signal completes its drain Epic: CRDB-26978 Release note: None 103781: rules: add deprecation notices, remove op-rule tests r=postamar a=postamar This commit is a follow-up to #96769 which removes superfluous test definitions. Informs #96769. Release note: None 103792: sql/opt: deflake TestExecBuild/autocommit r=knz a=nvanbenschoten Fixes #103772. The test became flaky after 3fc29ad, which causes intent resolution to end up in foreground requests' traces. Release note: None Co-authored-by: David Hartunian <[email protected]> Co-authored-by: Yevgeniy Miretskiy <[email protected]> Co-authored-by: Marius Posta <[email protected]> Co-authored-by: Nathan VanBenschoten <[email protected]>

For context, the automatic test tenant machinery is currently dependent on a CCL enterprise license check. (This is fundamentally not necessary - see cockroachdb#103772 - but sadly this is the way it is for now) Prior to this patch, if the user or a test selected the creation of a test tenant, but the test code forgot to import the required CCL go package, the framework would announce that "a test tenant was created" but it was actually silently failing to do so. This led to confusing investigations where a test tenant was expected, a test was appearing to succeed, but with a release build the same condition would fail. This commit enhances the situation by ensuring we have clear logging output when the test tenant cannot be created due to the missing CCL import. Release note: None

106738: logic: skip_on_retry works when errors are expected r=Xiang-Gu a=Xiang-Gu Previously, we have `skip_on_retry` directive for logic test which, when set, it skips the rest of test if a statement fails with TransactionRetryError. However, it won't skip if the statement is expected to fail with certain error message. This PR ensures that whenever we have a TransactionRetryError and `skip_on_retry` is set, we always skip the rest of the test, even if the stmt is expected to fail. fixes #104464 Release note: None 106759: streamingccl: unskip TestStreamDeleteRange r=msbutler a=stevendanna This test had previously timed out. The timeout we saw was the result of a couple of issues. When waiting for all delete ranges, our loop exit condition was very strict. We would only stop looking for rows if the number of delete ranges was exactly 3. If, however, we got 4 delete ranges, with 2 coming in a single batch, we would never hit this condition. How would that happen though? One possibility are rangefeed duplicates. Another, and what appears to have been happening in this test, is that the representation of the range deletes observed by the rangefeed consumer is slightly different depending on whether the range delete is delivered as part of a catch-up scan or as part of the rangefeeds steady state. I believe this is because the range deletes overlap but are issued at different time points. When we get them as part of the steady state, we get a trimmed version of the original event. When we get them as part of the catch-up scan, we get them broke up at the point of overlap. Fixes #93568 Epic: none Release note: None 106814: testutils: add helper to target transactions for retries r=lidorcarmel a=stevendanna This helper makes it a little quicker to write a test that tests whether a particular transaction is retry safe. Informs #106417 Epic: none Release note: none 106822: spanconfigccl: remove uses of `TODOTestTenantDisabled` r=stevendanna a=knz Informs #76378 . Epic: CRDB-18499 There's a mix of tests that control their tenants directly, and tests that should really work with virtualization enabled but don't. Followup issues: #106821 and #106818. Release note: None 106832: server: bark loudly if the test tenant cannot be created r=herkolategan a=knz Informs #76378 Informs #103772. Epic: CRDB-18499 For context, the automatic test tenant machinery is currently dependent on a CCL enterprise license check. (This is fundamentally not necessary - see #103772 - but sadly this is the way it is for now) Prior to this patch, if the user or a test selected the creation of a test tenant, but the test code forgot to import the required CCL go package, the framework would announce that "a test tenant was created" but it was actually silently failing to do so. This led to confusing investigations where a test tenant was expected, a test was appearing to succeed, but with a release build the same condition would fail. This commit enhances the situation by ensuring we have clear logging output when the test tenant cannot be created due to the missing CCL import. Release note: None Co-authored-by: Xiang Gu <[email protected]> Co-authored-by: Steven Danna <[email protected]> Co-authored-by: Raphael 'kena' Poss <[email protected]>

For context, the automatic test tenant machinery is currently dependent on a CCL enterprise license check. (This is fundamentally not necessary - see cockroachdb#103772 - but sadly this is the way it is for now) Prior to this patch, if the user or a test selected the creation of a test tenant, but the test code forgot to import the required CCL go package, the framework would announce that "a test tenant was created" but it was actually silently failing to do so. This led to confusing investigations where a test tenant was expected, a test was appearing to succeed, but with a release build the same condition would fail. This commit enhances the situation by ensuring we have clear logging output when the test tenant cannot be created due to the missing CCL import. Release note: None

blathers-crl bot added the T-sql-queries SQL Queries Team label May 23, 2023

knz mentioned this issue May 23, 2023

sql: fix the index/table span generation in SHOW RANGES #103667

Merged

nvanbenschoten self-assigned this May 23, 2023

nvanbenschoten mentioned this issue May 23, 2023

sql/opt: deflake TestExecBuild/autocommit #103792

Merged

craig bot closed this as completed in 1f16ff2 May 23, 2023

exalate-issue-sync bot removed the T-sql-queries SQL Queries Team label May 23, 2023

knz mentioned this issue Jul 14, 2023

server: bark loudly if the test tenant cannot be created #106832

Merged

mgartner added this to SQL Queries Jul 24, 2023

mgartner moved this to Done in SQL Queries Jul 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sql: flake in TestExecBuild/autocommit due to non-deterministic `ResolveIntent` in KV batch #103772

sql: flake in TestExecBuild/autocommit due to non-deterministic `ResolveIntent` in KV batch #103772

knz commented May 23, 2023 •

edited by cockroach-jira-scripts

Loading

nvanbenschoten commented May 23, 2023

sql: flake in TestExecBuild/autocommit due to non-deterministic ResolveIntent in KV batch #103772

sql: flake in TestExecBuild/autocommit due to non-deterministic ResolveIntent in KV batch #103772

Comments

knz commented May 23, 2023 • edited by cockroach-jira-scripts Loading

nvanbenschoten commented May 23, 2023

sql: flake in TestExecBuild/autocommit due to non-deterministic `ResolveIntent` in KV batch #103772

sql: flake in TestExecBuild/autocommit due to non-deterministic `ResolveIntent` in KV batch #103772

knz commented May 23, 2023 •

edited by cockroach-jira-scripts

Loading