Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-23.1: sql,jobs: short-term fix for UndefinedColumn job_type error #108991

Conversation

adityamaru
Copy link
Contributor

@adityamaru adityamaru commented Aug 18, 2023

Please see individual commits for details. This is a backport of:
#107570
#108029
#108357

Fixes: #108834
Release justification: fixes a flaky mixed-version roachtest

In cockroachdb#106762 we noticed that if
a query is executed with an AS OF SYSTEM TIME clause that picks a transaction
timestamp before the job_type migration, then parts of the jobs
infrastructure will attempt to query the job_type column even though it
doesn't exist at the transaction's timestamp.

As a short term fix, when we encounter an `UndefinedColumn` error for
the `job_type` column in `crdb_internal.jobs` we
generate a synthetic retryable error so that the txn is pushed to a
higher timestamp at which the upgrade will have completed and the
`job_type` column will be visible. The longer term fix is being tracked
in cockroachdb#106764.

We are intentionally approaching this issue with a whack-a-mole
approach to stabilize the tests the are running into this issue. We
think time is better spent designing and investing in the longer term
solution that will be tracked in cockroachdb#106764.

Fixes: cockroachdb#107169
Informs: cockroachdb#106762
Release note: None
The test no longer fails with our change in

Fixes: cockroachdb#106648
Fixes: cockroachdb#106762
Release note: None
Similar to cockroachdb#107570
this is a short term fix for when an a query is executed with an AS OF SYSTEM TIME
picks a transaction timestamp before the job_info migration has run.
In which case parts of the jobs infrastructure will attempt to query
the job_info column even though it doesn't exist at the transaction's timestamp.

As a short term fix, when we encounter an UndefinedObject error for the job_info table
we generate a synthetic retryable error so that the txn is pushed to a higher timestamp
at which the upgrade will have completed and the job_info table will be visible.
The longer term fix is being tracked in cockroachdb#106764.

On master I can no longer reproduce the failure in cockroachdb#105032 but
on 23.1 with this change I can successfully run 30 iterations of the test
on a seed (-8690666577594439584) which previously saw occurrences
of this flake.

Fixes: cockroachdb#103239
Fixes: cockroachdb#105032

Release note: None
@adityamaru adityamaru requested review from dt and stevendanna August 18, 2023 13:39
@adityamaru adityamaru requested review from a team as code owners August 18, 2023 13:39
@blathers-crl
Copy link

blathers-crl bot commented Aug 18, 2023

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@blathers-crl
Copy link

blathers-crl bot commented Aug 18, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@adityamaru adityamaru merged commit 636ffdb into cockroachdb:release-23.1 Aug 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants