Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: schemachange/mixed-versions failed [attempting to append refresh spans after the tracked timestamp has moved forward] #122274

Closed
cockroach-teamcity opened this issue Apr 12, 2024 · 9 comments · Fixed by #123111
Assignees
Labels
branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-sql-queries SQL Queries Team
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Apr 12, 2024

roachtest.schemachange/mixed-versions failed with artifacts on release-23.2 @ e61e140a75b2f7dde9f5410c1d90e557d009c437:

(mixedversion.go:592).Run: mixed-version test failure while running step 17 (run "run schemachange workload and validation in mixed version"): full command output in run_062406.146932602_n4_COCKROACHRANDOMSEED1.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/schemachange/mixed-versions/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=4
  • ROACHTEST_encrypted=false
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

/cc @cockroachdb/sql-foundations

This test on roachdash | Improve this report!

Jira issue: CRDB-37787

@cockroach-teamcity cockroach-teamcity added branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) labels Apr 12, 2024
@cockroach-teamcity cockroach-teamcity added this to the 23.2 milestone Apr 12, 2024
@cockroach-teamcity
Copy link
Member Author

roachtest.schemachange/mixed-versions failed with artifacts on release-23.2 @ 7d5f4ef0d54fe03f947c500c68f9859ec54376e7:

(mixedversion.go:592).Run: mixed-version test failure while running step 19 (run "run schemachange workload and validation in mixed version"): full command output in run_062716.670799808_n4_COCKROACHRANDOMSEED1.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/schemachange/mixed-versions/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=4
  • ROACHTEST_encrypted=false
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

This test on roachdash | Improve this report!

@annrpom
Copy link
Contributor

annrpom commented Apr 15, 2024

both errors 🤔

Error: ***UNEXPECTED ERROR; Failed to generate a random operation: Collect: CollectRows: "WITH \"descriptors\" AS 
(SELECT\n\t\tdescriptor.id,\n\t\t\"parentSchemaID\" AS schema_id,\n\t\tnamespace.name AS 
name,\n\t\tcrdb_internal.pb_to_json('desc', descriptor) AS descriptor\n\tFROM system.descriptor\n\tJOIN 
system.namespace ON namespace.id = descriptor.id\n\tWHERE \"parentID\" = (SELECT id FROM system.namespace 
WHERE name = current_database() AND \"parentID\" = 0)\n\t),\n\"enums\" AS (SELECT id, schema_id, name, descriptor-
>'type' AS descriptor FROM descriptors WHERE descriptor ? 'type'),\n\"enum_members\" AS (SELECT *, 
jsonb_array_elements(descriptor->'enumMembers') AS member FROM 
enums)\nSELECT\n\t\t\t\tquote_ident(schema_id::REGNAMESPACE::TEXT) || '.' || quote_ident(name) AS 
name,\n\t\t\t\tquote_literal(member->>'logicalRepresentation') AS value,\n\t\t\t\tCOALESCE(member->>'direction' = 
'REMOVE', false) AS dropping,\n\t\t\t\tCOALESCE(json_array_length(descriptor->'referencingDescriptorIds') > 0, false) AS 
has_references\n\t\t\tFROM enum_members\n\t" []: ERROR: internal error: attempting to append refresh spans after the 
tracked timestamp has moved forward. batchTimestamp: 1712903047.798542923,0 refreshedTimestamp: 
1712903048.151257948,1 (SQLSTATE XX000)
Error: ***UNEXPECTED ERROR; Failed to generate a random operation: Collect: CollectRows: "WITH \"descriptors\" AS 
(SELECT\n\t\tdescriptor.id,\n\t\t\"parentSchemaID\" AS schema_id,\n\t\tnamespace.name AS 
name,\n\t\tcrdb_internal.pb_to_json('desc', descriptor) AS descriptor\n\tFROM system.descriptor\n\tJOIN 
system.namespace ON namespace.id = descriptor.id\n\tWHERE \"parentID\" = (SELECT id FROM system.namespace 
WHERE name = current_database() AND \"parentID\" = 0)\n\t),\n\"enums\" AS (SELECT id, schema_id, name, descriptor-
>'type' AS descriptor FROM descriptors WHERE descriptor ? 'type'),\n\"enum_members\" AS (SELECT *, 
jsonb_array_elements(descriptor->'enumMembers') AS member FROM 
enums)\nSELECT\n\t\t\t\tquote_ident(schema_id::REGNAMESPACE::TEXT) || '.' || quote_ident(name) AS 
name,\n\t\t\t\tquote_literal(member->>'logicalRepresentation') AS value,\n\t\t\t\tCOALESCE(member->>'direction' = 
'REMOVE', false) AS dropping,\n\t\t\t\tCOALESCE(json_array_length(descriptor->'referencingDescriptorIds') > 0, false) AS 
has_references\n\t\t\tFROM enum_members\n\t" []: ERROR: internal error: attempting to append refresh spans after the 
tracked timestamp has moved forward. batchTimestamp: 1713076039.563133728,0 refreshedTimestamp: 
1713076039.640271771,1 (SQLSTATE XX000)

@fqazi
Copy link
Collaborator

fqazi commented Apr 16, 2024

@cockroachdb/kv We are running into the following inside this workload:

ERROR: internal error: attempting to append refresh spans after the 
tracked timestamp has moved forward. batchTimestamp: 1713076039.563133728,0 refreshedTimestamp: 

@fqazi fqazi removed the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label Apr 16, 2024
@rafiss rafiss changed the title roachtest: schemachange/mixed-versions failed roachtest: schemachange/mixed-versions failed [attempting to append refresh spans after the tracked timestamp has moved forward] Apr 22, 2024
@rafiss
Copy link
Collaborator

rafiss commented Apr 24, 2024

The error comes from here:

// assertRefreshSpansAtInvalidTimestamp returns an error if the timestamp at
// which a set of reads was performed is below the largest timestamp that this
// transaction has already refreshed to.
func (sr *txnSpanRefresher) assertRefreshSpansAtInvalidTimestamp(
readTimestamp hlc.Timestamp,
) error {
if readTimestamp.Less(sr.refreshedTimestamp) {
// This can happen with (illegal) concurrent txn use, but that's supposed to
// be detected by the gatekeeper interceptor.
return errors.AssertionFailedf("attempting to append refresh spans after the tracked"+
" timestamp has moved forward. batchTimestamp: %s refreshedTimestamp: %s",
errors.Safe(readTimestamp), errors.Safe(sr.refreshedTimestamp))
}
return nil
}

According to the comment, the gatekeeper interceptor should detect this error, but it looks like that isn't working, so checking if KV has any ideas why.

@nvanbenschoten
Copy link
Member

This is generally a symptom of misuse of the Txn API. We have seen it happen in the past if a RootTxn is used while there are outstanding LeafTxns.

@nvanbenschoten nvanbenschoten added T-sql-queries SQL Queries Team and removed T-kv KV Team labels Apr 25, 2024
@github-project-automation github-project-automation bot moved this to Triage in SQL Queries Apr 25, 2024
@yuzefovich
Copy link
Member

For reference, the query was:

WITH
	descriptors
		AS (
			SELECT
				descriptor.id,
				"parentSchemaID" AS schema_id,
				namespace.name AS name,
				crdb_internal.pb_to_json('desc', descriptor) AS descriptor
			FROM
				system.descriptor JOIN system.namespace ON namespace.id = descriptor.id
			WHERE
				"parentID" = (SELECT id FROM system.namespace WHERE name = current_database() AND "parentID" = 0)
		),
	enums AS (SELECT id, schema_id, name, descriptor->'type' AS descriptor FROM descriptors WHERE descriptor ? 'type'),
	enum_members AS (SELECT *, jsonb_array_elements(descriptor->'enumMembers') AS member FROM enums)
SELECT
	quote_ident(schema_id::REGNAMESPACE::STRING) || '.' || quote_ident(name) AS name,
	quote_literal(member->>'logicalRepresentation') AS value,
	COALESCE(member->>'direction' = 'REMOVE', false) AS dropping,
	COALESCE(json_array_length(descriptor->'referencingDescriptorIds') > 0, false) AS has_references
FROM
	enum_members;

for which we don't use the streamer API since we have a local plan node.

I doubt it's a regression (it's more likely that we just started running this query in the workload generator somewhat recently which exposed an existing bug), so I'm removing the release blocker label.

@yuzefovich yuzefovich removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Apr 25, 2024
@yuzefovich
Copy link
Member

Nathan in this has a hypothesis:

Maybe in the REGNAMESPACE cast, which I believe uses an internal executor in planner.ResolveOIDFromOID.

@yuzefovich yuzefovich self-assigned this Apr 25, 2024
@yuzefovich yuzefovich moved this from Triage to Active in SQL Queries Apr 25, 2024
@yuzefovich
Copy link
Member

I think I see the problem. (I reproduced the mixed-version state of the cluster when this problem happened, but I don't think it matters.) Here is the bundle I got
stmt-bundle-963463120295755777.zip

On 23.1 version we have the plan different from 23.2+ version. On 23.1 we have a main query within which we use the streamer API for the lookup join, so this forces us to use the LeafTxn for the main query. We also have cast to REGNAMESPACE within the main query which uses the internal executor, and we use planner's txn which is RootTxn to perform that internal query. Thus, we have a concurrent usage of LeafTxn in the main query and RootTxn in the internal query which is not allowed. We need to disable the usage of the Streamer API in such a case.

@yuzefovich
Copy link
Member

Hm, this might be quite tricky to solve properly. Effectively, we're running into #41992 where things like miscellaneous builtins can currently use the RootTxn at any point in time during query evaluation. I think for the time being I'll disable the streamer in this particular case (when we have a cast to Oid type), but it won't solve the problem more generally.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-sql-queries SQL Queries Team
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

6 participants