Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

logictest: schema_change_in_txn failed with TransactionRetryError #104464

Closed
cockroach-teamcity opened this issue Jun 7, 2023 · 12 comments · Fixed by #106738
Closed

logictest: schema_change_in_txn failed with TransactionRetryError #104464

cockroach-teamcity opened this issue Jun 7, 2023 · 12 comments · Fixed by #106738
Assignees
Labels
branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Jun 7, 2023

pkg/sql/logictest/tests/local-legacy-schema-changer/local-legacy-schema-changer_test.TestLogic_schema_change_in_txn failed with artifacts on release-23.1 @ 1a5dc2aad16ca4185afcda75d923804324dcfeb9:

resident:                     8486912
abandoned_vm:                       0
extent_avail:                       0
                           n_lock_ops (#/sec)       n_waiting (#/sec)      n_spin_acq (#/sec)  n_owner_switch (#/sec)   total_wait_ns   (#/sec)     max_wait_ns  max_n_thds
large                              26       0               0       0               0       0              11       0               0         0               0           0
extent_avail                       30       0               0       0               0       0              13       0               0         0               0           0
extents_dirty                      29       0               0       0               0       0              13       0               0         0               0           0
extents_muzzy                      26       0               0       0               0       0              11       0               0         0               0           0
extents_retained                   30       0               0       0               0       0              13       0               0         0               0           0
decay_dirty                        28       0               0       0               0       0              12       0               0         0               0           0
decay_muzzy                        26       0               0       0               0       0              11       0               0         0               0           0
base                               55       0               0       0               0       0              12       0               0         0               0           0
tcache_list                        26       0               0       0               0       0              11       0               0         0               0           0
hpa_shard                           0       0               0       0               0       0               0       0               0         0               0           0
hpa_shard_grow                      0       0               0       0               0       0               0       0               0         0               0           0
hpa_sec                             0       0               0       0               0       0               0       0               0         0               0           0
bins:           size ind    allocated      nmalloc (#/sec)      ndalloc (#/sec)    nrequests   (#/sec)  nshards      curregs     curslabs  nonfull_slabs regs pgs   util       nfills (#/sec)     nflushes (#/sec)       nslabs     nreslabs (#/sec)      n_lock_ops (#/sec)       n_waiting (#/sec)      n_spin_acq (#/sec)  n_owner_switch (#/sec)   total_wait_ns   (#/sec)     max_wait_ns  max_n_thds
                     ---
large:          size ind    allocated      nmalloc (#/sec)      ndalloc (#/sec)    nrequests (#/sec)  curlextents
                     ---
             8388608  72      8388608            2       0            1       0            2       0            1
                     ---
extents:        size ind       ndirty        dirty       nmuzzy        muzzy    nretained     retained       ntotal        total
                     ---
             2097152  31            0            0            0            0            1      2093056            1      2093056
                     ---
Bytes in small extent cache: 0
HPA shard stats:
  Purge passes: 0 (0 / sec)
  Purges: 0 (0 / sec)
  Hugeifies: 0 (0 / sec)
  Dehugifies: 0 (0 / sec)

  In full slabs:
      npageslabs: 0 huge, 0 nonhuge
      nactive: 0 huge, 0 nonhuge 
      ndirty: 0 huge, 0 nonhuge 
      nretained: 0 huge, 0 nonhuge 
  In empty slabs:
      npageslabs: 0 huge, 0 nonhuge
      nactive: 0 huge, 0 nonhuge 
      ndirty: 0 huge, 0 nonhuge 
      nretained: 0 huge, 0 nonhuge 

                size ind npageslabs_huge    nactive_huge     ndirty_huge  npageslabs_nonhuge     nactive_nonhuge      ndirty_nonhuge   nretained_nonhuge
                     ---
--- End jemalloc statistics ---
[05:20:19] --- total progress: 191 statements
--- total: 191 tests, 2 failures
    --- FAIL: TestLogic_schema_change_in_txn/rollback_mutations (0.76s)
Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/sql-queries

This test on roachdash | Improve this report!

Jira issue: CRDB-28541

@cockroach-teamcity cockroach-teamcity added branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Jun 7, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Jun 7, 2023
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Jun 7, 2023
@DrewKimball
Copy link
Collaborator

Error from the logs:

=== RUN   TestLogic_schema_change_in_txn/rollback_mutations
[05:20:13] INSERT INTO customers (k) VALUES ('z'), ('x');
[05:20:13] BEGIN;
[05:20:13] ALTER TABLE customers ADD i INT DEFAULT 5;
[05:20:13] ALTER TABLE customers ADD j INT DEFAULT 4;
[05:20:13] ALTER TABLE customers ADD l INT DEFAULT 3;
[05:20:13] ALTER TABLE customers ADD m CHAR;
[05:20:13] ALTER TABLE customers ADD n CHAR DEFAULT 'a';
[05:20:13] CREATE INDEX j_idx ON customers (j);
[05:20:13] CREATE INDEX l_idx ON customers (l);
[05:20:13] CREATE INDEX m_idx ON customers (m);
[05:20:13] CREATE UNIQUE INDEX i_idx ON customers (i);
[05:20:13] CREATE UNIQUE INDEX n_idx ON customers (n);
[05:20:13] SELECT status, started, job_type FROM crdb_internal.jobs WHERE status='pending';
[05:20:13] 	-- OK;
[05:20:13] COMMIT;
[05:20:13] 	-- OK;
[05:20:13] 	-- FAIL
    logic.go:2548: 
         
        /home/roach/.cache/bazel/_bazel_roach/c5a4e7d36696d9cd970af2045211a7df/sandbox/processwrapper-sandbox/3889/execroot/com_github_cockroachdb_cockroach/bazel-out/k8-fastbuild/bin/pkg/sql/logictest/tests/local-legacy-schema-changer/local-legacy-schema-changer_test_/local-legacy-schema-changer_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/schema_change_in_txn:726: 
        expected:
        violates unique constraint
        got:
        pq: restart transaction: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE - failed preemptive refresh due to a conflict: committed value on key /Table/15/1/100/2/1): "sql txn" meta={id=963a1c7c key=/Table/3/1/132/2/1 pri=0.00284854 epo=0 ts=1686115213.321064704,1 min=1686115213.230630928,0 seq=21} lock=true stat=PENDING rts=1686115213.230630928,0 wto=false gul=1686115213.730630928,0
        HINT: See: https://www.cockroachlabs.com/docs/v23.1/transaction-retry-error-reference.html#retry_serializable
[05:20:13] 	-- FAIL

@DrewKimball DrewKimball changed the title pkg/sql/logictest/tests/local-legacy-schema-changer/local-legacy-schema-changer_test: TestLogic_schema_change_in_txn failed logictest: schema_change_in_txn/rollback_mutations failed with TransactionRetryError Jun 7, 2023
@DrewKimball DrewKimball changed the title logictest: schema_change_in_txn/rollback_mutations failed with TransactionRetryError logictest: schema_change_in_txn failed with TransactionRetryError Jun 7, 2023
@DrewKimball
Copy link
Collaborator

Table 3 (the one with the retry error) is the descriptor table.

@blathers-crl blathers-crl bot added the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label Jun 13, 2023
ParkMyCar added a commit to MaterializeInc/materialize that referenced this issue Jun 15, 2023
### Motivation

Hopefully fixes:
https://github.com/MaterializeInc/materialize/issues/19931

There seems to be an error in the latest version of CockroachDB, see
similar issue: cockroachdb/cockroach#104464.

### Checklist

- [ ] This PR has adequate test coverage / QA involvement has been duly
considered.
- [ ] This PR has an associated up-to-date [design
doc](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/README.md),
is a design doc
([template](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/00000000_template.md)),
or is sufficiently small to not require a design.
  <!-- Reference the design in the description. -->
- [ ] If this PR evolves [an existing `$T ⇔ Proto$T`
mapping](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/command-and-response-binary-encoding.md)
(possibly in a backwards-incompatible way), then it is tagged with a
`T-proto` label.
- [ ] If this PR will require changes to cloud orchestration, there is a
companion cloud PR to account for those changes that is tagged with the
release-blocker label
([example](MaterializeInc/cloud#5021)).
<!-- Ask in #team-cloud on Slack if you need help preparing the cloud
PR. -->
- [x] This PR includes the following [user-facing behavior
changes](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/guide-changes.md#what-changes-require-a-release-note):
  - N/a
@ParkMyCar
Copy link

ParkMyCar commented Jun 15, 2023

Hey folks! We recently upgraded Cockroach in our CI to v23.1.1 and since then have seen sporadic failures similar to the one here. For example:

RETRY_SERIALIZABLE - failed preemptive refresh due to a conflict: committed value on key /Table/3/1/100/2/1): "sql txn" meta={id=88047081 key=/Table/3/1/652/2/1 pri=0.00575911 epo=0 ts=1686844605.975982896,1 min=1686844605.887256703,0 seq=20} lock=true stat=PENDING rts=1686844605.887256703,0 wto=false gul=1686844606.387256703,0

any thoughts on what might be causing this?


Edit: It seems like this issue might be related to #104728 ?

@DrewKimball
Copy link
Collaborator

Hi @ParkMyCar, this is more of an issue with the test than anything else, but you can refer to our docs to diagnose the retry error you're seeing: https://www.cockroachlabs.com/docs/stable/transaction-retry-error-reference.html#retry_serializable

@benesch
Copy link
Contributor

benesch commented Jun 15, 2023

Hey @DrewKimball! I’m the one who pointed @ParkMyCar at this issue based on the similarity in the implicated keys. I haven’t actually dug into the Git history for this test, but looking at its name it’s for the legacy schema changer, and presumably is therefore quite old?

What has my spidey sense tingling is that this test is doing a schema change and then a read in the same transaction, which is exactly what we do in Materialize CI when we initialize a new stash. And the conflict in both is on a key in the descriptor table.

Possible we’re barking up the wrong tree here, but I’m really wondering if there was a regression in v23.1 that’s leading to both the failure here and in our CI.

Does #104728 look like it’s related to this test failure? Would it be better to continue discussion on that issue?

@DrewKimball
Copy link
Collaborator

Hey @DrewKimball! I’m the one who pointed @ParkMyCar at this issue based on the similarity in the implicated keys.

Oh I see, I didn't notice it was on the same table, sorry about that. I think it makes sense to keep the discussion here. #104728 could be related, but there are multiple possible reasons why a SERIALIZABLE error could happen, so we can't say if that's it for sure yet.

@DrewKimball
Copy link
Collaborator

@ParkMyCar I looked into this a bit further - in this test failure the conflicting write was to the jobs table, whereas in the error you provided it was to the descriptor table, which indicates probable contending schema changes. Creating, altering, or dropping a schema in defaultdb would explain the conflicting write, as would changing privileges on the database. Is it possible your application is doing one of these things often enough to explain the retry errors?

@ParkMyCar
Copy link

Thanks for investigating further @DrewKimball! For each test we do create a new random schema, e.g. schema_{seed}, to provide some isolation between test cases, but we don't create the schema within a transaction.

But after creating the new schema, we do start a transaction, create tables, and then alter the gc.ttlseconds for a few of these tables, then finally attempt to commit the transaction. Could that be the culprit?

We're not changing privileges anywhere, we do set default_transaction_isolation to serializable though.

@DrewKimball
Copy link
Collaborator

Thanks for investigating further @DrewKimball! For each test we do create a new random schema, e.g. schema_{seed}, to provide some isolation between test cases, but we don't create the schema within a transaction.

But after creating the new schema, we do start a transaction, create tables, and then alter the gc.ttlseconds for a few of these tables, then finally attempt to commit the transaction. Could that be the culprit?

Are your tests run concurrently? Also, do you know where you're seeing these retry errors - schema creation or table creation? If the former and tests are run concurrently, it sounds likely that the schema setups for each test are contending with one another. If the latter, we'll have to keep digging because it's unexpected that creating/altering tables would touch the defaultdb descriptor.

@rafiss
Copy link
Collaborator

rafiss commented Jun 20, 2023

As far as deflaking the CRDB logictest, we can use the skip_on_retry directive for this test, similar to #100695

@rafiss
Copy link
Collaborator

rafiss commented Jun 20, 2023

@benesch @ParkMyCar To make it easier for us to help with your issue, would you mind opening a new Github issue with steps on how to reproduce the problem?

@ParkMyCar
Copy link

Hey @rafiss! I'm working on generating a minimum repro, my goal is to have something by tomorrow. At that point I'll open up a new issue and tag it here.

P.S. Your GitHub picture is great!

@yuzefovich yuzefovich removed the T-sql-queries SQL Queries Team label Jun 28, 2023
Xiang-Gu added a commit to Xiang-Gu/cockroach that referenced this issue Jul 13, 2023
Previously, we have `skip_on_retry` directive for logic test which, when
set, it skips the rest of test if a statement fails with
TransactionRetryError. However, it won't skip if the statement is
expected to fail with certain error message. This PR ensures that
whenever we have a TransactionRetryError and `skip_on_retry` is set, we
always skip the rest of the test, even if the stmt is expected to fail.

Informs cockroachdb#104464

Release note: None
@craig craig bot closed this as completed in b5e9cfd Jul 14, 2023
blathers-crl bot pushed a commit that referenced this issue Jul 14, 2023
Previously, we have `skip_on_retry` directive for logic test which, when
set, it skips the rest of test if a statement fails with
TransactionRetryError. However, it won't skip if the statement is
expected to fail with certain error message. This PR ensures that
whenever we have a TransactionRetryError and `skip_on_retry` is set, we
always skip the rest of the test, even if the stmt is expected to fail.

Informs #104464

Release note: None
msbutler pushed a commit to msbutler/cockroach that referenced this issue Jul 17, 2023
Previously, we have `skip_on_retry` directive for logic test which, when
set, it skips the rest of test if a statement fails with
TransactionRetryError. However, it won't skip if the statement is
expected to fail with certain error message. This PR ensures that
whenever we have a TransactionRetryError and `skip_on_retry` is set, we
always skip the rest of the test, even if the stmt is expected to fail.

Informs cockroachdb#104464

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants