Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg/upgrade/upgrademanager/upgrademanager_test: TestMigrationFailure failed #106648

Closed
cockroach-teamcity opened this issue Jul 12, 2023 · 15 comments · Fixed by #108029
Closed

pkg/upgrade/upgrademanager/upgrademanager_test: TestMigrationFailure failed #106648

cockroach-teamcity opened this issue Jul 12, 2023 · 15 comments · Fixed by #108029
Assignees
Labels
A-disaster-recovery branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. skipped-test T-release Release Engineering & Automation Team
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Jul 12, 2023

pkg/upgrade/upgrademanager/upgrademanager_test.TestMigrationFailure failed with artifacts on master @ 87cd65352a9903c292d6b6a2e9856cc1f57bb267:

=== RUN   TestMigrationFailure
    test_log_scope.go:167: test logs captured to: /artifacts/tmp/_tmp/be5347fadc06714556335e3ce75bfe99/logTestMigrationFailure2966434211
    test_log_scope.go:81: use -show-logs to present logs inline
    manager_external_test.go:687: test will fail at version: 1000022.2-62
    manager_external_test.go:767: 
        	Error Trace:	pkg/upgrade/upgrademanager/upgrademanager_test_test/pkg/upgrade/upgrademanager/manager_external_test.go:767
        	            				pkg/upgrade/upgrademanager/upgrademanager_test_test/pkg/upgrade/upgrademanager/manager_external_test.go:784
        	Error:      	Not equal: 
        	            	expected: roachpb.Version{Major:1000022, Minor:2, Patch:0, Internal:61}
        	            	actual  : roachpb.Version{Major:1000022, Minor:2, Patch:0, Internal:60}
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -4,3 +4,3 @@
        	            	  Patch: (int32) 0,
        	            	- Internal: (int32) 61
        	            	+ Internal: (int32) 60
        	            	 }
        	Test:       	TestMigrationFailure
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/be5347fadc06714556335e3ce75bfe99/logTestMigrationFailure2966434211
--- FAIL: TestMigrationFailure (28.02s)
Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/sql-foundations

This test on roachdash | Improve this report!

Jira issue: CRDB-29654

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) labels Jul 12, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.2 milestone Jul 12, 2023
@rafiss
Copy link
Collaborator

rafiss commented Jul 12, 2023

seems to be the same as #105767

@rafiss rafiss added T-release Release Engineering & Automation Team and removed T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) labels Jul 12, 2023
@rafiss
Copy link
Collaborator

rafiss commented Jul 12, 2023

This test was recently added in 4f2eca6, so ccing @jeffswenson for any ideas.

@cockroach-teamcity
Copy link
Member Author

pkg/upgrade/upgrademanager/upgrademanager_test.TestMigrationFailure failed with artifacts on master @ 6a6ebcb950c050652ad8cded716dcd242e3a729c:

=== RUN   TestMigrationFailure
    test_log_scope.go:167: test logs captured to: /artifacts/tmp/_tmp/be5347fadc06714556335e3ce75bfe99/logTestMigrationFailure1447009331
    test_log_scope.go:81: use -show-logs to present logs inline
    manager_external_test.go:687: test will fail at version: 1000022.2-90
    manager_external_test.go:789: 
        	Error Trace:	pkg/upgrade/upgrademanager/upgrademanager_test_test/pkg/upgrade/upgrademanager/manager_external_test.go:789
        	Error:      	Received unexpected error:
        	            	pq: verifying precondition for version 1000023.1-2: query-invalid_objects: crdb-internal-jobs-table: system-jobs-scan: column "job_type" does not exist
        	Test:       	TestMigrationFailure
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/be5347fadc06714556335e3ce75bfe99/logTestMigrationFailure1447009331
--- FAIL: TestMigrationFailure (26.87s)
Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@jeffswenson jeffswenson self-assigned this Jul 12, 2023
@cockroach-teamcity
Copy link
Member Author

pkg/upgrade/upgrademanager/upgrademanager_test.TestMigrationFailure failed with artifacts on master @ c21cfc2823728fde7fa6aecc9506a1afa1f83191:

=== RUN   TestMigrationFailure
    test_log_scope.go:167: test logs captured to: /artifacts/tmp/_tmp/be5347fadc06714556335e3ce75bfe99/logTestMigrationFailure2048388363
    test_log_scope.go:81: use -show-logs to present logs inline
    manager_external_test.go:687: test will fail at version: 1000022.2-60
    manager_external_test.go:767: 
        	Error Trace:	pkg/upgrade/upgrademanager/upgrademanager_test_test/pkg/upgrade/upgrademanager/manager_external_test.go:767
        	            				pkg/upgrade/upgrademanager/upgrademanager_test_test/pkg/upgrade/upgrademanager/manager_external_test.go:784
        	Error:      	Not equal: 
        	            	expected: roachpb.Version{Major:1000022, Minor:2, Patch:0, Internal:59}
        	            	actual  : roachpb.Version{Major:1000022, Minor:2, Patch:0, Internal:58}
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -4,3 +4,3 @@
        	            	  Patch: (int32) 0,
        	            	- Internal: (int32) 59
        	            	+ Internal: (int32) 58
        	            	 }
        	Test:       	TestMigrationFailure
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/be5347fadc06714556335e3ce75bfe99/logTestMigrationFailure2048388363
--- FAIL: TestMigrationFailure (26.93s)
Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

pkg/upgrade/upgrademanager/upgrademanager_test.TestMigrationFailure failed with artifacts on master @ 2a70922c5ce3d30373adfc52a46f869f9685497c:

=== RUN   TestMigrationFailure
    test_log_scope.go:167: test logs captured to: /artifacts/tmp/_tmp/be5347fadc06714556335e3ce75bfe99/logTestMigrationFailure1830741605
    test_log_scope.go:81: use -show-logs to present logs inline
    manager_external_test.go:687: test will fail at version: 1000022.2-78
    manager_external_test.go:789: 
        	Error Trace:	pkg/upgrade/upgrademanager/upgrademanager_test_test/pkg/upgrade/upgrademanager/manager_external_test.go:789
        	Error:      	Received unexpected error:
        	            	pq: verifying precondition for version 1000023.1-2: query-invalid_objects: crdb-internal-jobs-table: system-jobs-scan: column "job_type" does not exist
        	Test:       	TestMigrationFailure
    panic.go:522: -- test log scope end --
--- FAIL: TestMigrationFailure (27.54s)
Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

pkg/upgrade/upgrademanager/upgrademanager_test.TestMigrationFailure failed with artifacts on master @ 7339e0f900eba75db210874612d6e5aa561c854a:

=== RUN   TestMigrationFailure
    test_log_scope.go:167: test logs captured to: /artifacts/tmp/_tmp/be5347fadc06714556335e3ce75bfe99/logTestMigrationFailure732820411
    test_log_scope.go:81: use -show-logs to present logs inline
    manager_external_test.go:687: test will fail at version: 1000022.2-78
    manager_external_test.go:789: 
        	Error Trace:	pkg/upgrade/upgrademanager/upgrademanager_test_test/pkg/upgrade/upgrademanager/manager_external_test.go:789
        	Error:      	Received unexpected error:
        	            	pq: verifying precondition for version 1000023.1-2: query-invalid_objects: crdb-internal-jobs-table: system-jobs-scan: column "job_type" does not exist
        	Test:       	TestMigrationFailure
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/be5347fadc06714556335e3ce75bfe99/logTestMigrationFailure732820411
--- FAIL: TestMigrationFailure (27.07s)
Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@jeffswenson
Copy link
Collaborator

I was looking into one of the tests with this error because it looked to me like it may be some test logic was wrong:

          Error Trace:  pkg/upgrade/upgrademanager/upgrademanager_test_test/pkg/upgrade/upgrademanager/manager_external_test.go:767
                              pkg/upgrade/upgrademanager/upgrademanager_test_test/pkg/upgrade/upgrademanager/manager_external_test.go:784
          Error:        Not equal:
                        expected: roachpb.Version{Major:1000022, Minor:2, Patch:0, Internal:61}
                        actual  : roachpb.Version{Major:1000022, Minor:2, Patch:0, Internal:60}
                        Diff:
                        --- Expected
                        +++ Actual
                        @@ -4,3 +4,3 @@
                          Patch: (int32) 0,
                        - Internal: (int32) 61
                        + Internal: (int32) 60
                         }

But digging into the server logs, it has the same "job_type" error.

W230712 00:11:51.951351 110611 upgrade/upgrademanager/manager.go:370 ⋮ [T10,nsql1,client=127.0.0.1:40310,hostssl,user=root,migration-mgr] 1317  error encountered during version upgrade: verifying precondition for version 1000023.1-2: ‹query-invalid_objects›: ‹crdb-internal-jobs-table›: ‹system-jobs-scan›: column ‹"job_type"› does not exist

So I'm beginning to suspect the test is actually correct and there is a bug in the test cluster infrastructure or a version gate.

@cockroach-teamcity
Copy link
Member Author

pkg/upgrade/upgrademanager/upgrademanager_test.TestMigrationFailure failed with artifacts on master @ 269b9e31c4befe53181340503663ae6e3e8ed07e:

=== RUN   TestMigrationFailure
    test_log_scope.go:167: test logs captured to: /artifacts/tmp/_tmp/be5347fadc06714556335e3ce75bfe99/logTestMigrationFailure3478277740
    test_log_scope.go:81: use -show-logs to present logs inline
    manager_external_test.go:687: test will fail at version: 1000022.2-86
    manager_external_test.go:767: 
        	Error Trace:	pkg/upgrade/upgrademanager/upgrademanager_test_test/pkg/upgrade/upgrademanager/manager_external_test.go:767
        	            				pkg/upgrade/upgrademanager/upgrademanager_test_test/pkg/upgrade/upgrademanager/manager_external_test.go:784
        	Error:      	Not equal: 
        	            	expected: roachpb.Version{Major:1000022, Minor:2, Patch:0, Internal:85}
        	            	actual  : roachpb.Version{Major:1000022, Minor:2, Patch:0, Internal:84}
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -4,3 +4,3 @@
        	            	  Patch: (int32) 0,
        	            	- Internal: (int32) 85
        	            	+ Internal: (int32) 84
        	            	 }
        	Test:       	TestMigrationFailure
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/be5347fadc06714556335e3ce75bfe99/logTestMigrationFailure3478277740
--- FAIL: TestMigrationFailure (32.64s)
Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

pkg/upgrade/upgrademanager/upgrademanager_test.TestMigrationFailure failed with artifacts on master @ 2627dc50694837984d7bf7d6c6f25e294ae36039:

=== RUN   TestMigrationFailure
    test_log_scope.go:167: test logs captured to: /artifacts/tmp/_tmp/be5347fadc06714556335e3ce75bfe99/logTestMigrationFailure4157702422
    test_log_scope.go:81: use -show-logs to present logs inline
    manager_external_test.go:687: test will fail at version: 1000022.2-56
    manager_external_test.go:767: 
        	Error Trace:	pkg/upgrade/upgrademanager/upgrademanager_test_test/pkg/upgrade/upgrademanager/manager_external_test.go:767
        	            				pkg/upgrade/upgrademanager/upgrademanager_test_test/pkg/upgrade/upgrademanager/manager_external_test.go:784
        	Error:      	Not equal: 
        	            	expected: roachpb.Version{Major:1000022, Minor:2, Patch:0, Internal:55}
        	            	actual  : roachpb.Version{Major:1000022, Minor:2, Patch:0, Internal:54}
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -4,3 +4,3 @@
        	            	  Patch: (int32) 0,
        	            	- Internal: (int32) 55
        	            	+ Internal: (int32) 54
        	            	 }
        	Test:       	TestMigrationFailure
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/be5347fadc06714556335e3ce75bfe99/logTestMigrationFailure4157702422
--- FAIL: TestMigrationFailure (28.25s)
Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@tbg
Copy link
Member

tbg commented Jul 13, 2023

Also hit in #106574

@jeffswenson
Copy link
Collaborator

I'll skip this for now. I believe I understand the root cause and I think the test is correct. The issue is version flags and AS OF SYSTEM time queries do not interact well.

jeffswenson added a commit to jeffswenson/cockroach that referenced this issue Jul 13, 2023
Skip TestMigrationFailure until the source of flakes can be fixed.

Release note: None
Part of: cockroachdb#106648
@jeffswenson
Copy link
Collaborator

The trigger for the flakes is #105832. It adds an upgrade precondition that uses an AS OF SYSTEM TIME query to look for invalid objects.

The actual bug is in the crdb_internal.jobs virtual table. The jobs logical table uses version gates to determine if the 'job_type' backfill has run. The problem is if a query is executed with an AS OF SYSTEM TIME clause that picks a transaction timestamp before the job_type migration, then crdb_internal.jobs will attempt to query the job_type column even though it doesn't exist at the transaction's timestamp.

My first thought for how to fix this is to have the crdb_internal.jobs virtual table consult the state of the descriptor instead of the state of the version when constructing the query. But that doesn't work, because the correct query depends on the state of the backfill, not the existence of the 'job_type' column in the descriptor. Since the query should only read the 'job_type' column after the backfill completes.

This is similar to an issue that I identified when working on the RBR system table migrations. I wrote a utility for reading the cluster version in a transaction. But I'm realizing now that the optimization in version guard to skip reading the version setting after the migration completes depends on query serializability and is broken if used within the context of an AS OF SYSTEM TIME transaction. That's actually okay for the current callers because it's only used by internal queries that do not use AS OF SYSTEM TIME, but it means we can't use it to fix the issue with the crdb_internal.jobs virtual table.

craig bot pushed a commit that referenced this issue Jul 13, 2023
106623: ui: fix timescale for rolling window r=maryliag a=maryliag

The change introduced on #105157 had an undesired
change on the Metrics page. We want to show the end period
of the select, but when the fixed window end for that this
caused the Metrics page to stop loading new data.
We want the metrics page to keep updating when any of the
rolling windows is selected, and we also want to know
the time a time period was requested, so it can be used
to provide more information on SQL Activity pages.

This commit remove the setting of the fixedWindowEnd, since
that was not the best approach and instead creates a value
on local storage for the requestTime, that can be used to create
the label for the timescale, without interferring on
Metrics page (or any other that have an automatic update).

Epic: none

Release note (bug fix): Fix the Metrics page that was not
updating automatically on rolling window options.

Release note (bug fix): Statement Diagnostics not always 
showing is now fixed and they show for the correct time period selected.

106757: upgrademanager: skip TestMigrationFailure r=JeffSwenson a=JeffSwenson

Skip TestMigrationFailure until the source of flakes can be fixed.

Release note: None
Part of: #106648

Co-authored-by: maryliag <[email protected]>
Co-authored-by: Jeff <[email protected]>
jeffswenson added a commit to jeffswenson/cockroach that referenced this issue Jul 13, 2023
This test reliably reproduces the source of flakes in cockroachdb#106648.

Release note: None
Part of: cockroachdb#106648
knz added a commit to knz/cockroach that referenced this issue Jul 17, 2023
This also removes `TODOTestTenantDisabled`.

(I have verified the test works, albeit still flaky due to cockroachdb#106648, by
temporarily removing the skip.)

Release note: None
knz added a commit to knz/cockroach that referenced this issue Jul 21, 2023
This also removes `TODOTestTenantDisabled`.

(I have verified the test works, albeit still flaky due to cockroachdb#106648, by
temporarily removing the skip.)

Release note: None
craig bot pushed a commit that referenced this issue Jul 21, 2023
106925: upgrademanager: simplify TestMigrationFailure r=yuzefovich a=knz

Informs #76378 
Epic: CRDB-18499

This also removes `TODOTestTenantDisabled`.

(I have verified the test works, albeit still flaky due to #106648, by temporarily removing the skip.)

Release note: None

107319: opt: add session setting for join elimination optimization r=DrewKimball a=DrewKimball

We recently added support for column remapping in the join elimination rules that allows columns from the eliminated input of the join to be mapped to equivalent columns from the preserved input. This allows joins to be eliminated in more cases - in particular, the self-join patterns that can arise from an `UPDATE ... FROM` statement where the table in the `FROM` clause is the same as the table being updated.

This patch adds a setting, `optimizer_use_improved_join_elimination`, which gates the column-remapping logic for the join-elimination rules. The plan is to backport the column-remapping changes to 23.1 behind this setting turned off by default.

Informs #102614

Release note: None

107392: go.mod: bump Pebble to 94f91669 r=RaduBerinde a=RaduBerinde

94f91669 db: rename Experimental.SharedStorage to RemoteStorage in Options
c62c9127 objstorage: rename objstorage/shared package to remote
692f3b61 vfs: move errorfs package to vfs dir to allow use in CockroachDB
5e550364 cmd/pebble: allow passing an explicit checkpoint directory path
9c3f337a cmd/replay: always log checkpoint initialization
0cd8f1b8 replay: calculate separate write stall metrics for each reason
f9d08867 cmd/pebble: use non-default block cache size in replay tool
d0e583f8 *: improve humanize format
5e76dfab db: minor update of comment for TargetByteDeletionRate
456a2a2a objstorage: rename Shared to Remote in the objstorage provider API

Release note (general change): Formatting of byte figures in Pebble logs has been improved; any tools that parse these logs might need updating.

Epic: none

Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Drew Kimball <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
@jeffswenson jeffswenson assigned adityamaru and unassigned jeffswenson Aug 1, 2023
@jeffswenson
Copy link
Collaborator

@adityamaru can you unskip this test once the fix to #106762 merges?

@adityamaru
Copy link
Contributor

#107570 seems to have fixed the same spot this test was failing at. 20 minutes of stressing and I don't see a failure yet.

@blathers-crl
Copy link

blathers-crl bot commented Aug 4, 2023

cc @cockroachdb/disaster-recovery

craig bot pushed a commit that referenced this issue Aug 4, 2023
108029: upgrademanager: unskip TestMigrationFailure r=msbutler a=adityamaru

The test no longer fails with our change in #107570.

Fixes: #106648
Fixes: #106762
Release note: None

108192: ccl/sqlproxyccl: deflake TestConnectionMigration r=darinpp a=jaylim-crl

Fixes #106885.

This test flake seems extremely rare, and it's unclear why it occurred in the
first place. The past 1000 runs (all of what TC has) have been successful.
Regardless, this commit attempts at deflaking TestConnectionMigration. Given
that some subtests transfer connections through `transferConnWithRetries`, it
is possible that the transfer was retried, causing the metric to be
incremented.

Release note: None

Epic: none

Co-authored-by: adityamaru <[email protected]>
Co-authored-by: Jay <[email protected]>
@craig craig bot closed this as completed in 9cd15a2 Aug 5, 2023
adityamaru added a commit that referenced this issue Aug 18, 2023
The test no longer fails with our change in

Fixes: #106648
Fixes: #106762
Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-disaster-recovery branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. skipped-test T-release Release Engineering & Automation Team
Projects
No open projects
Archived in project
Development

Successfully merging a pull request may close this issue.

5 participants