Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REGIONAL BY ROW data inconsistent after changing node's locality #84144

Closed
FireMasterK opened this issue Jul 9, 2022 · 9 comments · Fixed by #84305
Closed

REGIONAL BY ROW data inconsistent after changing node's locality #84144

FireMasterK opened this issue Jul 9, 2022 · 9 comments · Fixed by #84305
Assignees
Labels
A-disaster-recovery A-multiregion Related to multi-region branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-technical-advisory Caused a technical advisory O-community Originated from the community S-0-corruption-or-data-loss Unrecoverable corruption, data loss, or other catastrophic issues that can’t be fixed by upgrading. X-blathers-triaged blathers was able to find an owner

Comments

@FireMasterK
Copy link

FireMasterK commented Jul 9, 2022

Describe the problem

Please describe the issue you observed, and any steps we can take to reproduce it:

To Reproduce

  1. I had a few tables which were REGIONAL BY ROW.
  2. I restarted a few nodes, changing the --locality= parameter.
  3. Ran ALTER TABLE {table} SET LOCALITY REGIONAL BY TABLE IN PRIMARY REGION;
  4. The database appeared to be corrupt.
  5. See the following error:
ERROR: failed to construct index entries during backfill: error decoding 4 bytes: could not find [96] in enum "crdb_internal_region" representation PhysicalReps: [[80]]; LogicalReps: [eu] goroutine 54102 [running]:
runtime/debug.Stack()
        GOROOT/src/runtime/debug/stack.go:24 +0x65
github.com/cockroachdb/cockroach/pkg/sql/types.(*T).EnumGetIdxOfPhysical(0xc00a506780, {0xc00adac593, 0x45dde60, 0xd})
        github.com/cockroachdb/cockroach/pkg/sql/types/pkg/sql/types/types.go:2520 +0x234
github.com/cockroachdb/cockroach/pkg/sql/sem/tree.GetEnumComponentsFromPhysicalRep(0xc00a506780, {0xc00adac593, 0x0, 0xc00282be0e})
        github.com/cockroachdb/cockroach/pkg/sql/sem/tree/datum.go:4529 +0x27
github.com/cockroachdb/cockroach/pkg/sql/rowenc/keyside.Decode(0xc000a00d88, 0xc00a506780, {0xc00adac592, 0xa, 0xa}, 0x0)
        github.com/cockroachdb/cockroach/pkg/sql/rowenc/keyside/decode.go:267 +0x2417
github.com/cockroachdb/cockroach/pkg/sql/rowenc.(*EncDatum).EnsureDecoded(0xc0042fef38, 0x631e378, 0xc002bfd290)
        github.com/cockroachdb/cockroach/pkg/sql/rowenc/encoded_datum.go:251 +0x194
github.com/cockroachdb/cockroach/pkg/sql/row.(*Fetcher).NextRowDecodedInto(0xc0042ff2f0, {0x631e378, 0xc002bfd290}, {0xc00066bb20, 0x7, 0x0}, {{{0x76543210, 0x0}, 0x0}, {0x0, ...}, ...})
        github.com/cockroachdb/cockroach/pkg/sql/row/fetcher.go:1044 +0x1f7
github.com/cockroachdb/cockroach/pkg/sql/backfill.(*IndexBackfiller).BuildIndexEntriesChunk(0xc000a00c80, {0x631e378, 0xc002bfd290}, 0x203001, {0x644a8d8, 0xc000d3cf00}, {{0xc00adac190, 0x6, 0x8}, {0xc00adac198, ...}}, ...)
        github.com/cockroachdb/cockroach/pkg/sql/backfill/backfill.go:856 +0xb1c
github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*indexBackfiller).buildIndexEntryBatch.func1({0x631e378, 0xc002bfd290}, 0xc004b39200)
        github.com/cockroachdb/cockroach/pkg/sql/rowexec/indexbackfiller.go:435 +0xf2
github.com/cockroachdb/cockroach/pkg/kv.runTxn.func1({0x631e378, 0xc002bfd290}, 0xc0042ff678)
        github.com/cockroachdb/cockroach/pkg/kv/db.go:949 +0x27
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).exec(0xc004bfe2c0, {0x631e378, 0xc002bfd290}, 0xc0042ff980)
        github.com/cockroachdb/cockroach/pkg/kv/txn.go:993 +0xae
github.com/cockroachdb/cockroach/pkg/kv.runTxn({0x631e378, 0xc002bfd290}, 0xc0021a8750, 0x9121d9)
        github.com/cockroachdb/cockroach/pkg/kv/db.go:948 +0x5a
github.com/cockroachdb/cockroach/pkg/kv.(*DB).TxnWithAdmissionControl(0x631e378, {0x631e378, 0xc002bfd290}, 0xa, 0x0, 0x48d820)
        github.com/cockroachdb/cockroach/pkg/kv/db.go:911 +0x89
github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn(...)
        github.com/cockroachdb/cockroach/pkg/kv/db.go:890
github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*indexBackfiller).buildIndexEntryBatch(0xc000a00c80, {0x631e378, 0xc004b39200}, {{0xc00adac190, 0x6, 0x8}, {0xc00adac198, 0x2, 0x8}}, {0x17003d6871733671, ...})
        github.com/cockroachdb/cockroach/pkg/sql/rowexec/indexbackfiller.go:428 +0x2fc
github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*indexBackfiller).constructIndexEntries(0xc000a00c80, {0x631e378, 0xc004b39200}, 0xc0119bea20)
        github.com/cockroachdb/cockroach/pkg/sql/rowexec/indexbackfiller.go:134 +0x305
github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*indexBackfiller).runBackfill.func1({0x631e2d0, 0xc00bd5ae80})
        github.com/cockroachdb/cockroach/pkg/sql/rowexec/indexbackfiller.go:319 +0xcb
github.com/cockroachdb/cockroach/pkg/util/ctxgroup.Group.GoCtx.func1()
        github.com/cockroachdb/cockroach/pkg/util/ctxgroup/ctxgroup.go:169 +0x25
golang.org/x/sync/errgroup.(*Group).Go.func1()
        golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:57 +0x67
created by golang.org/x/sync/errgroup.(*Group).Go
        golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:54 +0x92

Expected behavior
The database to work as expected.

Additional data / screenshots
I can provide any information requested.

Environment:

  • CockroachDB version: 22.1
  • Server OS: Linux

Additional context
The database's data was inconsistent, backups didn't work, any REGIONAL BY ROW table was affected.

Jira issue: CRDB-17489

@FireMasterK FireMasterK added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Jul 9, 2022
@blathers-crl
Copy link

blathers-crl bot commented Jul 9, 2022

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

  • @cockroachdb/bulk-io (found keywords: backup)
  • @cockroachdb/kv (found keywords: kv)
  • @cockroachdb/sql-schema (found keywords: ALTER TABLE)

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl blathers-crl bot added A-disaster-recovery O-community Originated from the community X-blathers-triaged blathers was able to find an owner T-disaster-recovery labels Jul 9, 2022
@blathers-crl
Copy link

blathers-crl bot commented Jul 9, 2022

cc @cockroachdb/bulk-io

@blathers-crl blathers-crl bot added the T-sql-schema-deprecated Use T-sql-foundations instead label Jul 9, 2022
@dt dt removed T-disaster-recovery T-sql-schema-deprecated Use T-sql-foundations instead labels Jul 9, 2022
@knz knz added the A-multiregion Related to multi-region label Jul 11, 2022
@knz
Copy link
Contributor

knz commented Jul 11, 2022

Hi @FireMasterK . Thanks for filing the issue.
Until we figure a way forward, you should be able to restore your node locality flags to the previous value. That will restore functionality.

Also, could you explain to us why you changed the locality, and what you were trying to achieve with this?

@FireMasterK
Copy link
Author

Until we figure a way forward, you should be able to restore your node locality flags to the previous value. That will restore functionality.

Unfortunately, it looks like some data was lost when I did that. I did SELECT crdb_region, id FROM users;
I got some data for some rows but most of it is lost:

Logs from command
  eu          | 777930638806155278
  eu          | 777930855520108555
...
(1024 rows)
(error encountered after some results were delivered)
ERROR: internal error: unexpected error from the vectorized engine: could not find [96] in enum "crdb_internal_region" representation PhysicalReps: [[80]]; LogicalReps: [eu] goroutine 6661 [running]:
runtime/debug.Stack()
        GOROOT/src/runtime/debug/stack.go:24 +0x65
github.com/cockroachdb/cockroach/pkg/sql/types.(*T).EnumGetIdxOfPhysical(0xc006368f00, {0xc005dd3db3, 0x4a35560, 0xd})
        github.com/cockroachdb/cockroach/pkg/sql/types/pkg/sql/types/types.go:2520 +0x234
github.com/cockroachdb/cockroach/pkg/sql/sem/tree.GetEnumComponentsFromPhysicalRep(0xc006368f00, {0xc005dd3db3, 0x23cb000000000000, 0x0})
        github.com/cockroachdb/cockroach/pkg/sql/sem/tree/datum.go:4529 +0x27
github.com/cockroachdb/cockroach/pkg/sql/rowenc/keyside.Decode(0xc005d54090, 0xc006368f00, {0xc005dd3db2, 0x5, 0x0}, 0x0)
        github.com/cockroachdb/cockroach/pkg/sql/rowenc/keyside/decode.go:267 +0x2417
github.com/cockroachdb/cockroach/pkg/sql/colencoding.decodeTableKeyToCol(0xc005d71610, 0xc0022c0b80, 0xc00360e630, 0x19e, 0xc005ef87c0, {0xc005dd3db2, 0x460ef4, 0x2472270}, 0x0, 0x0, ...)
        github.com/cockroachdb/cockroach/pkg/sql/colencoding/key_encoding.go:211 +0xa96
github.com/cockroachdb/cockroach/pkg/sql/colencoding.DecodeKeyValsToCols(0x631e378, 0xc00360edb0, 0x0, {0xc003e00d80, 0x2, 0x10100002ff30000}, 0x0, {0xc004026de0, 0x2, 0x2}, ...)
        github.com/cockroachdb/cockroach/pkg/sql/colencoding/key_encoding.go:72 +0x3af
github.com/cockroachdb/cockroach/pkg/sql/colfetcher.(*cFetcher).NextBatch(0xc0022c0a80, {0x631e378, 0xc00360edb0})
        github.com/cockroachdb/cockroach/pkg/sql/colfetcher/cfetcher.go:756 +0x57d
github.com/cockroachdb/cockroach/pkg/sql/colfetcher.(*ColBatchScan).Next(0xc0048746e0)
        github.com/cockroachdb/cockroach/pkg/sql/colfetcher/colbatch_scan.go:107 +0x30
github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecutils.(*CancelChecker).Next(0xc005ef8980)
        github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecutils/cancel_checker.go:59 +0x30
github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecbase.(*simpleProjectOp).Next(0xc004027140)
        github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecbase/simple_project.go:125 +0x3f
github.com/cockroachdb/cockroach/pkg/sql/colflow.(*batchInfoCollector).next(...)
        github.com/cockroachdb/cockroach/pkg/sql/colflow/stats.go:99
github.com/cockroachdb/cockroach/pkg/sql/colexecerror.CatchVectorizedRuntimeError(0xc005ef89c0)
        github.com/cockroachdb/cockroach/pkg/sql/colexecerror/error.go:91 +0x62
github.com/cockroachdb/cockroach/pkg/sql/colflow.(*batchInfoCollector).Next(0xc001c081c0)
        github.com/cockroachdb/cockroach/pkg/sql/colflow/stats.go:107 +0x4f
github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecutils.(*deselectorOp).Next(0xc0040273e0)
        github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecutils/deselector.go:55 +0x3e
github.com/cockroachdb/cockroach/pkg/sql/colflow/colrpc.(*Outbox).sendBatches.func1()
        github.com/cockroachdb/cockroach/pkg/sql/colflow/colrpc/outbox.go:289 +0xbe
github.com/cockroachdb/cockroach/pkg/sql/colexecerror.CatchVectorizedRuntimeError(0x7f2d2fb93108)
        github.com/cockroachdb/cockroach/pkg/sql/colexecerror/error.go:91 +0x62
github.com/cockroachdb/cockroach/pkg/sql/colflow/colrpc.(*Outbox).sendBatches(0x0, {0x631e378, 0xc00360ea80}, {0x7f2d0156bb68, 0xc0023885f0}, 0x2540be400, 0x1)
        github.com/cockroachdb/cockroach/pkg/sql/colflow/colrpc/outbox.go:280 +0xe5
github.com/cockroachdb/cockroach/pkg/sql/colflow/colrpc.(*Outbox).runWithStream(0xc001c08380, {0x631e378, 0xc00360ea80}, {0x7f2d0156bb68, 0xc0023885f0}, 0xc0023880a0, 0xc002388470)
        github.com/cockroachdb/cockroach/pkg/sql/colflow/colrpc/outbox.go:414 +0x1cb
github.com/cockroachdb/cockroach/pkg/sql/colflow/colrpc.(*Outbox).Run(0xc001c08380, {0x631e378, 0xc00360e870}, {0x6274c60, 0xc000a68180}, 0x639c960, {{0x9d, 0x5c, 0x52, 0x81, ...}}, ...)
        github.com/cockroachdb/cockroach/pkg/sql/colflow/colrpc/outbox.go:222 +0x375
github.com/cockroachdb/cockroach/pkg/sql/colflow.(*vectorizedFlowCreator).setupRemoteOutputStream.func1({0x631e378, 0xc00360e870}, 0xc0028e07d0)
        github.com/cockroachdb/cockroach/pkg/sql/colflow/vectorized_flow.go:692 +0xa5
github.com/cockroachdb/cockroach/pkg/sql/colflow.(*vectorizedFlowCreatorHelper).accumulateAsyncComponent.func1.1()
        github.com/cockroachdb/cockroach/pkg/sql/colflow/vectorized_flow.go:1274 +0x32
created by github.com/cockroachdb/cockroach/pkg/sql/colflow.(*vectorizedFlowCreatorHelper).accumulateAsyncComponent.func1
        github.com/cockroachdb/cockroach/pkg/sql/colflow/vectorized_flow.go:1273 +0xef
SQLSTATE: XX000
DETAIL: stack trace:
github.com/cockroachdb/cockroach/pkg/sql/colexecerror/error.go:88: func1()
GOROOT/src/runtime/panic.go:1038: gopanic()
github.com/cockroachdb/cockroach/pkg/sql/colexecerror/error.go:192: InternalError()
github.com/cockroachdb/cockroach/pkg/sql/colfetcher/colbatch_scan.go:109: Next()
github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecutils/cancel_checker.go:59: Next()
github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecbase/simple_project.go:125: Next()
github.com/cockroachdb/cockroach/pkg/sql/colflow/stats.go:99: next()
github.com/cockroachdb/cockroach/pkg/sql/colexecerror/error.go:91: CatchVectorizedRuntimeError()
github.com/cockroachdb/cockroach/pkg/sql/colflow/stats.go:107: Next()
github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecutils/deselector.go:55: Next()
github.com/cockroachdb/cockroach/pkg/sql/colflow/colrpc/outbox.go:289: func1()
github.com/cockroachdb/cockroach/pkg/sql/colexecerror/error.go:91: CatchVectorizedRuntimeError()
github.com/cockroachdb/cockroach/pkg/sql/colflow/colrpc/outbox.go:280: sendBatches()
github.com/cockroachdb/cockroach/pkg/sql/colflow/colrpc/outbox.go:414: runWithStream()
github.com/cockroachdb/cockroach/pkg/sql/colflow/colrpc/outbox.go:222: Run()
github.com/cockroachdb/cockroach/pkg/sql/colflow/vectorized_flow.go:692: func1()
github.com/cockroachdb/cockroach/pkg/sql/colflow/vectorized_flow.go:1274: 1()
GOROOT/src/runtime/asm_amd64.s:1581: goexit()

The eu locality doesn't exist anymore, but the SQL statement still shows it.

If I alter the statement to just SELECT id FROM users;, I get all rows as expected.

Also, could you explain to us why you changed the locality, and what you were trying to achieve with this?

I had regions such as region=nl,zone=nl-ams, region=gb,zone=nl-cwl previously, I changed it to region=eu,zone=eu-ams and region=eu,zone=eu-ams respectively.

My goal was to group and home all Europe-based data together.

@knz knz added the S-0-corruption-or-data-loss Unrecoverable corruption, data loss, or other catastrophic issues that can’t be fixed by upgrading. label Jul 11, 2022
@ajstorm
Copy link
Collaborator

ajstorm commented Jul 11, 2022

When you changed to the eu localities, did you also add that region to the database (using an ADD REGION command)? If so, were some rows inserted during the time when eu was an added region? Did you drop any regions (using DROP REGION) at any time in this process?

@FireMasterK
Copy link
Author

Yes, I did, I tried that in an attempt to try disabling multi-region capabilities to see if that resolves the issue, but it did not.

I added eu as a primary region, and attempted to drop everything after that.

My show regions statement still shows eu.

show regions from database {db};
  database | region | primary | zones
-----------+--------+---------+--------
  {db}    | eu     |  true   | {}
(1 row)

show regions;
  region |         zones          | database_names | primary_region_of
---------+------------------------+----------------+--------------------
  ca     | {ca-yul,ca-yyz}        | {}             | {}
  gb     | {gb-cwl}               | {}             | {}
  in     | {in-bom}               | {}             | {}
  nl     | {nl-ams}               | {}             | {}
  us     | {us-ewr,us-lax,us-slc} | {}             | {}
(5 rows)

@arulajmani
Copy link
Collaborator

Thanks for reporting this @FireMasterK. I still haven't gotten to the bottom of this, but here's what I've found so far. Turns out, the situation you find yourself in has nothing to do with the --locality changes. Here's the simplest repro I've found so far:

statement ok
DROP DATABASE mr;

statement ok
CREATE DATABASE mr PRIMARY REGION "us-east-1" REGIONS "ca-central-1";

statement ok
USE mr;
CREATE TABLE kv(k INT PRIMARY KEY, v INT) LOCALITY REGIONAL BY TABLE IN PRIMARY REGION;

statement ok 
ALTER TABLE kv SET LOCALITY REGIONAL BY ROW;

statement ok
INSERT INTO kv (crdb_region, k, v) VALUES ('us-east-1', 1, 1);
INSERT INTO kv (crdb_region, k, v) VALUES ('ca-central-1', 2, 2)

statement ok <---- This should fail; but it doesn't.
ALTER DATABASE mr DROP REGION "ca-central-1";

Running the above repro through a debugger, it looks like we aren't tracking references to the table on the type descriptor. It's also worth noting that this only happens when you alter to a REGIONAL BY ROW table -- things would have worked as expected if we had created a REGIONAL BY ROW table to begin with!

It also seems like this is only ever an issue on release-22.1. We tried the above repro out on 21.2 cluster, and things worked (failed!) as expected.

@postamar
Copy link
Contributor

@ZhouXing19 can I take this issue?

@ZhouXing19 ZhouXing19 assigned postamar and unassigned ZhouXing19 Jul 12, 2022
craig bot pushed a commit that referenced this issue Jul 12, 2022
84305: sql: add missing back-reference in region enum type r=postamar a=postamar

This commit fixes a long-standing bug in which making an existing
multi-region table REGIONAL BY ROW would omit adding a back-reference to
the table in the region enum type. This puts the table at risk for
corruption in the face of schema changes to the region enum, which is
unaware of this dependency.

Fixes #84144.

Release note: None

84308: Revert "ccl: upgrade by-name sequence reference to by-ID during restore" r=Xiang-Gu a=Xiang-Gu

This reverts commit f498598 because
this PR introduced some dependencies that violates certain existing
dependency assertions.

Release note: None

Co-authored-by: Marius Posta <[email protected]>
Co-authored-by: Xiang Gu <[email protected]>
@postamar
Copy link
Contributor

I'm merging and backporting #84305 which fixes the type descriptor corruption that ultimately caused this error, but does nothing to fix clusters which have already been corrupted by an ALTER TABLE ... SET LOCALITY REGIONAL BY ROW: a subsequent DROP REGION like in the repro will happily succeed rather than fail like it should. I'm tracking fixing that in a separate issue: #84322

@craig craig bot closed this as completed in b665ccd Jul 13, 2022
postamar pushed a commit to postamar/cockroach that referenced this issue Jul 13, 2022
This commit fixes a long-standing bug in which making an existing
multi-region table REGIONAL BY ROW would omit adding a back-reference to
the table in the region enum type. This puts the table at risk for
corruption in the face of schema changes to the region enum, which is
unaware of this dependency.

Informs cockroachdb#84144.

Release note (bug fix): fixes bug where ALTER TABLE ... SET LOCALITY
REGIONAL BY ROW could leave the region enum type descriptor unaware of
a dependency on the altered table. This would, in turn, wrongly permit
a DROP REGION to succeed, rendering the table unusable. Note that this
fix does not help existing clusters which have already run such an ALTER
TABLE; see cockroachdb#84322 for that.
postamar pushed a commit to postamar/cockroach that referenced this issue Jul 13, 2022
This commit fixes a long-standing bug in which making an existing
multi-region table REGIONAL BY ROW would omit adding a back-reference to
the table in the region enum type. This puts the table at risk for
corruption in the face of schema changes to the region enum, which is
unaware of this dependency.

Fixes cockroachdb#84144.

This bug seems to have manifested itself only in 22.1 and later after
718222f was merged. We're applying this
patch to 21.2 out of prudence.

Release note: None
postamar pushed a commit to postamar/cockroach that referenced this issue Jul 13, 2022
This commit is featured only in the release-22.1 branch and
repairs missing table back-references in the region enum type descriptor
before performing a DROP REGION or DROP SUPER REGION.

Partially addresses cockroachdb#84322.

See also cockroachdb#84144.

Release note: None
@rytaft rytaft added C-technical-advisory Caused a technical advisory branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 labels Dec 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-disaster-recovery A-multiregion Related to multi-region branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-technical-advisory Caused a technical advisory O-community Originated from the community S-0-corruption-or-data-loss Unrecoverable corruption, data loss, or other catastrophic issues that can’t be fixed by upgrading. X-blathers-triaged blathers was able to find an owner
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants