Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

colfetcher: incorrect decoding of unique secondary indexes with multiple column families #66706

Closed
cockroach-teamcity opened this issue Jun 22, 2021 · 20 comments · Fixed by #68071
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.

Comments

@cockroach-teamcity
Copy link
Member

roachtest.tlp failed with artifacts on master @ 1b686aef9949c1c7ef930b55bd1fbc0ed2e8268a:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tlp/run_1
	tlp.go:125,test_runner.go:757: expected unpartitioned count 735 to equal partitioned count 344
		(1) attached stack trace
		  -- stack trace:
		  | main.runTLPQuery.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tlp.go:188
		  | main.runWithTimeout.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tlp.go:200
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (2) expected unpartitioned count 735 to equal partitioned count 344
		  | sql: SELECT count(*) FROM defaultdb.public.table4 AS tab_28174
		  | SELECT count(*) FROM (SELECT * FROM defaultdb.public.table4 AS tab_28174 WHERE tab_28174.col4_4 NOT LIKE st_ashexewkb('01010000A0E6100000F88820881F275D409438543BC09151C072501D885295F141'::GEOGRAPHY)::STRING UNION ALL SELECT * FROM defaultdb.public.table4 AS tab_28174 WHERE NOT (tab_28174.col4_4 NOT LIKE st_ashexewkb('01010000A0E6100000F88820881F275D409438543BC09151C072501D885295F141'::GEOGRAPHY)::STRING) UNION ALL SELECT * FROM defaultdb.public.table4 AS tab_28174 WHERE (tab_28174.col4_4 NOT LIKE st_ashexewkb('01010000A0E6100000F88820881F275D409438543BC09151C072501D885295F141'::GEOGRAPHY)::STRING) IS NULL)
		Error types: (1) *withstack.withStack (2) *errutil.leafError
Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh tlp

/cc @cockroachdb/sql-queries

This test on roachdash | Improve this report!

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jun 22, 2021
@mgartner mgartner self-assigned this Jun 22, 2021
@cockroach-teamcity
Copy link
Member Author

roachtest.tlp failed with artifacts on master @ eb7f3fe2991992472a68d701d65543f7a9c7fb56:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tlp/run_1
	tlp.go:127,test_runner.go:761: expected unpartitioned count 1 to equal partitioned count 2
		(1) attached stack trace
		  -- stack trace:
		  | main.runTLPQuery.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tlp.go:190
		  | main.runWithTimeout.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tlp.go:202
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (2) expected unpartitioned count 1 to equal partitioned count 2
		  | sql: SELECT count(*) FROM defaultdb.public.table3 AS tab_13438
		  | SELECT count(*) FROM (SELECT * FROM defaultdb.public.table3 AS tab_13438 WHERE ('bf867b7d-45e5-4f25-bc73-ade3cbc129e8' <= gen_random_uuid()::UUID) UNION ALL SELECT * FROM defaultdb.public.table3 AS tab_13438 WHERE NOT (('bf867b7d-45e5-4f25-bc73-ade3cbc129e8' <= gen_random_uuid()::UUID)) UNION ALL SELECT * FROM defaultdb.public.table3 AS tab_13438 WHERE (('bf867b7d-45e5-4f25-bc73-ade3cbc129e8' <= gen_random_uuid()::UUID)) IS NULL)
		Error types: (1) *withstack.withStack (2) *errutil.leafError
Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh tlp

/cc @cockroachdb/sql-queries

This test on roachdash | Improve this report!

@mgartner
Copy link
Collaborator

mgartner commented Jul 2, 2021

The last failure looks like a bug in the TLP generator. It should not be generating non-immutable predicates with gen_random_uuid().

@mgartner
Copy link
Collaborator

mgartner commented Jul 2, 2021

The last failure looks like a bug in the TLP generator. It should not be generating non-immutable predicates with gen_random_uuid().

This is fixed by #67194.

@mgartner
Copy link
Collaborator

Here is the log of SQL statements that resulted in the TLP failure: https://gist.github.com/mgartner/c4a99f034fefd97c55537a87cc5f89f1

@mgartner
Copy link
Collaborator

Well... this is interesting:

(735 rows)


Time: 17ms total (execution 8ms / network 9ms)

root@:26257/defaultdb> SELECT * FROM defaultdb.public.table4 AS tab_28174 WHERE tab_28174.col4_4 NOT LIKE st_ashexewkb('01010000A0E6100000F88820881F275D409438543BC09151C072501D885295F141'::GEOGRAPHY)::STRING
UNION ALL
SELECT * FROM defaultdb.public.table4 AS tab_28174 WHERE NOT (tab_28174.col4_4 NOT LIKE st_ashexewkb('01010000A0E6100000F88820881F275D409438543BC09151C072501D885295F141'::GEOGRAPHY)::STRING)
UNION ALL
SELECT * FROM defaultdb.public.table4 AS tab_28174 WHERE (tab_28174.col4_4 NOT LIKE st_ashexewkb('01010000A0E6100000F88820881F275D409438543BC09151C072501D885295F141'::GEOGRAPHY)::STRING) IS NULL;

...

(735 rows)
root@:26257/defaultdb> SELECT count(*) FROM (
SELECT * FROM defaultdb.public.table4 AS tab_28174 WHERE tab_28174.col4_4 NOT LIKE st_ashexewkb('01010000A0E6100000F88820881F275D409438543BC09151C072501D885295F141'::GEOGRAPHY)::STRING
UNION ALL
SELECT * FROM defaultdb.public.table4 AS tab_28174 WHERE NOT (tab_28174.col4_4 NOT LIKE st_ashexewkb('01010000A0E6100000F88820881F275D409438543BC09151C072501D885295F141'::GEOGRAPHY)::STRING)
UNION ALL
SELECT * FROM defaultdb.public.table4 AS tab_28174 WHERE (tab_28174.col4_4 NOT LIKE st_ashexewkb('01010000A0E6100000F88820881F275D409438543BC09151C072501D885295F141'::GEOGRAPHY)::STRING) IS NULL
);
  count
---------
    344
(1 row)

@mgartner
Copy link
Collaborator

Here's a partially minimized repro: https://gist.github.com/mgartner/3515443d50270d5d4232ce5753c2cb1b

@mgartner
Copy link
Collaborator

I managed to minimize this quite a bit. The following test will fail after a few rounds of stressing. It fails with both the computed column set as VIRTUAL or STORED.

statement ok
CREATE TABLE t (
  s STRING,
  t TIMESTAMP,
  v STRING AS (lower(CAST(t AS STRING))) VIRTUAL,
  UNIQUE (v, t, s)
);

statement ok
INSERT INTO t VALUES
(NULL,NULL),
(NULL,'2025-07-03 10:08:01.000984+00:00'),
(NULL,'2025-07-03 10:08:01.000984+00:00');

query I
SELECT count(*) FROM t
----
3

query I
SELECT count(*) FROM (
  SELECT * FROM t
    WHERE v = 'foo'
  UNION ALL
  SELECT * FROM t
    WHERE NOT (v = 'foo')
  UNION ALL
  SELECT * FROM t WHERE (v = 'foo') IS NULL
)
----
3

@yuzefovich
Copy link
Member

Looks like it fails only for the vectorized engine.

@mgartner
Copy link
Collaborator

Another clue: it only happens with more than 1 FAMILY. The non-deterministic nature of the failure comes from the test apparatus, which is good. This test fails on every run:

statement ok
SET vectorize=on

statement ok
CREATE TABLE t (
  s STRING,
  t TIMESTAMP,
  v STRING AS (lower(CAST(t AS STRING))) STORED,
  UNIQUE (v, t, s),
  FAMILY (s),
  FAMILY (t, v)
)

statement ok
INSERT INTO t VALUES
  (NULL, NULL),
  (NULL, '2025-07-03 10:08:01.000984+00:00'),
  (NULL, '2025-07-03 10:08:01.000984+00:00')

query I
SELECT count(*) FROM t
----
3

query I
SELECT count(*) FROM (
  SELECT * FROM t
    WHERE v = 'foo'
  UNION ALL
  SELECT * FROM t
    WHERE NOT (v = 'foo')
  UNION ALL
  SELECT * FROM t WHERE (v = 'foo') IS NULL
)
----
3

@mgartner
Copy link
Collaborator

mgartner commented Jul 15, 2021

The bug appears to exist on 21.1.3 but not 20.2.8.

EDIT: This problem exists on 20.1, 20.2, and 21.1. It does not exist on 19.2.

@mgartner
Copy link
Collaborator

mgartner commented Jul 15, 2021

Here's a simpler reproduction:

statement ok
SET vectorize=on

statement ok
CREATE TABLE t (
  s STRING,
  v STRING,
  UNIQUE INDEX u (v, s),
  FAMILY (s),
  FAMILY (v)
)

statement ok
INSERT INTO t VALUES
  (NULL, 'bar'),
  (NULL, 'bar')

query T
SELECT v FROM t@u WHERE NOT (v = 'foo')
----
bar
bar

@mgartner
Copy link
Collaborator

Some relevant code pointers:

This block is not executed although it should be because foundNull is false:

if foundNull && rf.table.isSecondaryIndex && rf.table.index.IsUnique() && rf.table.desc.NumFamilies() != 1 {
// We get the remaining bytes after the computed prefix, and then
// slice off the extra encoded columns from those bytes. We calculate
// how many bytes were sliced away, and then extend lastRowPrefix
// by that amount.
prefixLen := len(rf.machine.lastRowPrefix)
remainingBytes := rf.machine.nextKV.Key[prefixLen:]
origRemainingBytesLen := len(remainingBytes)
for i := 0; i < rf.table.index.NumKeySuffixColumns(); i++ {
var err error
// Slice off an extra encoded column from remainingBytes.
remainingBytes, err = rowenc.SkipTableKey(remainingBytes)
if err != nil {
return nil, err
}
}
rf.machine.lastRowPrefix = rf.machine.nextKV.Key[:prefixLen+(origRemainingBytesLen-len(remainingBytes))]
}

foundNull is false because we are not requiring the decoding of a nullable column:

indexOrds := rf.table.indexColOrdinals
if rf.traceKV {
indexOrds = rf.table.allIndexColOrdinals
}
key, matches, foundNull, err = colencoding.DecodeIndexKeyToCols(

The fix might be to require decoding of nullable columns indexed in unique secondary indexes by changing the logic that sets indexColOrdinals, maybe around here:

if cap(table.indexColOrdinals) >= nIndexCols {
table.indexColOrdinals = table.indexColOrdinals[:nIndexCols]
} else {
table.indexColOrdinals = make([]int, nIndexCols)
}
if cap(table.allIndexColOrdinals) >= nIndexCols {
table.allIndexColOrdinals = table.allIndexColOrdinals[:nIndexCols]
} else {
table.allIndexColOrdinals = make([]int, nIndexCols)
}

This commit may be related too: 1bcc7b8

@DrewKimball
Copy link
Collaborator

Are we for sure fetching these nullable columns? As in, they're encoded in the byte slice, but we just skip past them during decoding?

@yuzefovich
Copy link
Member

Yeah. I think this diff is what we want to do:

diff --git a/pkg/sql/colencoding/key_encoding.go b/pkg/sql/colencoding/key_encoding.go
index 7d0188fab1..9f6e80b04b 100644
--- a/pkg/sql/colencoding/key_encoding.go
+++ b/pkg/sql/colencoding/key_encoding.go
@@ -167,6 +167,8 @@ func DecodeKeyValsToCols(
                var err error
                i := indexColIdx[j]
                if i == -1 {
+                       _, isNull := encoding.DecodeIfNull(key)
+                       foundNull = isNull || foundNull
                        // Don't need the coldata - skip it.
                        key, err = rowenc.SkipTableKey(key)
                } else {

Checking whether we have a NULL value without fully decoding seems very cheap, so I'm thinking it's probably not worth plumbing the flag about unique secondary indexes with multiple column families to see whether we should be checking for NULL.

@mgartner
Copy link
Collaborator

I don't think we need to plumb a flag, we just need to change the logic where indexColOrdinals is set to include nullable columns if the index is a unique secondary index.

@yuzefovich
Copy link
Member

I don't think that's the best solution because that would make us fully decode those nullable columns which is an overkill in this case (unless I'm misunderstanding what indexColOrdinals does).

@mgartner
Copy link
Collaborator

Great point. On the other hand, your proposal would make us perform unnecessary null-decoding checks for unneeded columns in non-unique secondary indexes. I don't yet have intuition about which would be worse.

@yuzefovich
Copy link
Member

I think it would happen for the unneeded columns from the primary indexes too. I guess plumbing the flag plus the diff above might be the safest option.

@cockroach-teamcity
Copy link
Member Author

roachtest.tlp failed with artifacts on master @ 7df66cb1840c263270bd2b1a690c8e9a7c025333:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tlp/run_1
	tlp.go:128,test_runner.go:765: expected unpartitioned count 22 to equal partitioned count 20
		(1) attached stack trace
		  -- stack trace:
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runTLPQuery.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tlp.go:191
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runWithTimeout.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tlp.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (2) expected unpartitioned count 22 to equal partitioned count 20
		  | sql: SELECT count(*) FROM defaultdb.public.table1 AS tab_1933
		  | SELECT count(*) FROM (SELECT * FROM defaultdb.public.table1 AS tab_1933 WHERE tab_1933.col1_3 UNION ALL SELECT * FROM defaultdb.public.table1 AS tab_1933 WHERE NOT (tab_1933.col1_3) UNION ALL SELECT * FROM defaultdb.public.table1 AS tab_1933 WHERE (tab_1933.col1_3) IS NULL)
		Error types: (1) *withstack.withStack (2) *errutil.leafError
Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh tlp

/cc @cockroachdb/sql-queries

This test on roachdash | Improve this report!

mgartner added a commit to mgartner/cockroach that referenced this issue Jul 26, 2021
`colfetcher` must detect `NULL` values in unique secondary index keys on
tables with multiple column families in order to determine whether
consecutive KVs belongs to the same row or different rows. Previously,
only the columns in the key that were needed by the query were decoded
and checked for `NULL`. This caused incorrect query results when `NULL`
column values were not detected because those columns were not needed.

This commit fixes the issue by checking all columns for `NULL` when
decoding unique secondary index keys on tables with multiple column
families.

Fixes cockroachdb#66706

Release note (bug fix): A bug has been fix that caused incorrect query
results when querying tables with multiple column families and unique
secondary indexes. The bug only occurred if 1) vectorized execution was
enabled for the query, 2) the query scanned a unique secondary index
that contained columns from more than one column family, and 3) rows
fetched by the query contained NULL values for some of the indexed
columns. This bug was present since version 20.1.
@mgartner
Copy link
Collaborator

I've confirmed that the last failure report is the same bug.

craig bot pushed a commit that referenced this issue Jul 27, 2021
68071: colfetcher: fix NULL checks during unique index decoding r=mgartner a=mgartner

`colfetcher` must detect `NULL` values in unique secondary index keys on
tables with multiple column families in order to determine whether
consecutive KVs belongs to the same row or different rows. Previously,
only the columns in the key that were needed by the query were decoded
and checked for `NULL`. This caused incorrect query results when `NULL`
column values were not detected because those columns were not needed.

This commit fixes the issue by checking all columns for `NULL` when
decoding unique secondary index keys on tables with multiple column
families.

Fixes #66706

Release note (bug fix): A bug has been fix that caused incorrect query
results when querying tables with multiple column families and unique
secondary indexes. The bug only occurred if 1) vectorized execution was
enabled for the query, 2) the query scanned a unique secondary index
that contained columns from more than one column family, and 3) rows
fetched by the query contained NULL values for some of the indexed
columns. This bug was present since version 20.1.

Co-authored-by: Marcus Gartner <[email protected]>
@craig craig bot closed this as completed in e906b96 Jul 27, 2021
mgartner added a commit to mgartner/cockroach that referenced this issue Jul 29, 2021
`colfetcher` must detect `NULL` values in unique secondary index keys on
tables with multiple column families in order to determine whether
consecutive KVs belongs to the same row or different rows. Previously,
only the columns in the key that were needed by the query were decoded
and checked for `NULL`. This caused incorrect query results when `NULL`
column values were not detected because those columns were not needed.

This commit fixes the issue by checking all columns for `NULL` when
decoding unique secondary index keys on tables with multiple column
families.

Fixes cockroachdb#66706

Release note (bug fix): A bug has been fix that caused incorrect query
results when querying tables with multiple column families and unique
secondary indexes. The bug only occurred if 1) vectorized execution was
enabled for the query, 2) the query scanned a unique secondary index
that contained columns from more than one column family, and 3) rows
fetched by the query contained NULL values for some of the indexed
columns. This bug was present since version 20.1.
mgartner added a commit to mgartner/cockroach that referenced this issue Jul 29, 2021
`colfetcher` must detect `NULL` values in unique secondary index keys on
tables with multiple column families in order to determine whether
consecutive KVs belongs to the same row or different rows. Previously,
only the columns in the key that were needed by the query were decoded
and checked for `NULL`. This caused incorrect query results when `NULL`
column values were not detected because those columns were not needed.

This commit fixes the issue by checking all columns for `NULL` when
decoding unique secondary index keys on tables with multiple column
families.

Fixes cockroachdb#66706

Release note (bug fix): A bug has been fix that caused incorrect query
results when querying tables with multiple column families and unique
secondary indexes. The bug only occurred if 1) vectorized execution was
enabled for the query, 2) the query scanned a unique secondary index
that contained columns from more than one column family, and 3) rows
fetched by the query contained NULL values for some of the indexed
columns. This bug was present since version 20.1.
mgartner added a commit to mgartner/cockroach that referenced this issue Jul 29, 2021
`colfetcher` must detect `NULL` values in unique secondary index keys on
tables with multiple column families in order to determine whether
consecutive KVs belongs to the same row or different rows. Previously,
only the columns in the key that were needed by the query were decoded
and checked for `NULL`. This caused incorrect query results when `NULL`
column values were not detected because those columns were not needed.

This commit fixes the issue by checking all columns for `NULL` when
decoding unique secondary index keys on tables with multiple column
families.

Fixes cockroachdb#66706

Release note (bug fix): A bug has been fix that caused incorrect query
results when querying tables with multiple column families and unique
secondary indexes. The bug only occurred if 1) vectorized execution was
enabled for the query, 2) the query scanned a unique secondary index
that contained columns from more than one column family, and 3) rows
fetched by the query contained NULL values for some of the indexed
columns. This bug was present since version 20.1.
mgartner added a commit to mgartner/cockroach that referenced this issue Aug 11, 2021
`colfetcher` must detect `NULL` values in unique secondary index keys on
tables with multiple column families in order to determine whether
consecutive KVs belongs to the same row or different rows. Previously,
only the columns in the key that were needed by the query were decoded
and checked for `NULL`. This caused incorrect query results when `NULL`
column values were not detected because those columns were not needed.

This commit fixes the issue by checking all columns for `NULL` when
decoding unique secondary index keys on tables with multiple column
families.

Fixes cockroachdb#66706

Release note (bug fix): A bug has been fix that caused incorrect query
results when querying tables with multiple column families and unique
secondary indexes. The bug only occurred if 1) vectorized execution was
enabled for the query, 2) the query scanned a unique secondary index
that contained columns from more than one column family, and 3) rows
fetched by the query contained NULL values for some of the indexed
columns. This bug was present since version 20.1.
@mgartner mgartner changed the title roachtest: tlp failed colfetcher: incorrect decoding of unique secondary indexes with multiple column families Sep 1, 2021
@mgartner mgartner moved this to Done in SQL Queries Jul 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants