Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: -0 is not converted to 0, causing spurious tlp failure #77279

Closed
cockroach-teamcity opened this issue Mar 2, 2022 · 13 comments · Fixed by #79473
Closed

roachtest: -0 is not converted to 0, causing spurious tlp failure #77279

cockroach-teamcity opened this issue Mar 2, 2022 · 13 comments · Fixed by #79473
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-sql-queries SQL Queries Team

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Mar 2, 2022

roachtest.tlp failed with artifacts on master @ e151a24dc1d495bf694e1e1c3ee8acb518c1424e:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/tlp/run_1
	tlp.go:170,tlp.go:76,test_runner.go:779: expected unpartitioned and partitioned results to be equal
		(1) attached stack trace
		  -- stack trace:
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runTLPQuery.func2
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tlp.go:243
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runWithTimeout.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tlp.go:268
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (2) expected unpartitioned and partitioned results to be equal
		  |   []string{
		  | + 	"-0",
		  |   	"-0.08327121444050486",
		  |   	"-0.3325643758917869",
		  |   	... // 3 identical elements
		  |   	"-1.029317916464786",
		  |   	"-2.5910802100934904",
		  | - 	"0",
		  |   	"0.34983588080578065",
		  |   	"0.4833675094669227",
		  |   	... // 4 identical elements
		  |   }
		  | sql: SELECT DISTINCT tab_43377.col1_0 FROM defaultdb.public.table1 AS tab_43377
		  | (SELECT DISTINCT tab_43377.col1_0 FROM defaultdb.public.table1 AS tab_43377 WHERE (tab_43377.col1_2 <= tab_43377.col1_2)) UNION (SELECT DISTINCT tab_43377.col1_0 FROM defaultdb.public.table1 AS tab_43377 WHERE NOT ((tab_43377.col1_2 <= tab_43377.col1_2))) UNION (SELECT DISTINCT tab_43377.col1_0 FROM defaultdb.public.table1 AS tab_43377 WHERE ((tab_43377.col1_2 <= tab_43377.col1_2)) IS NULL)
		  | with args: []
		Error types: (1) *withstack.withStack (2) *errutil.leafError
Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/sql-queries

This test on roachdash | Improve this report!

Jira issue: CRDB-13514

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 2, 2022
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Mar 2, 2022
@cockroach-teamcity
Copy link
Member Author

roachtest.tlp failed with artifacts on master @ 0e1edf3ad9da29b1d3bb28fcfc0f293d700e3cb9:

		  |   	"-0.6010823668975593,NULL,2003783032,fwAUY\x11,0,3250010446,2014-07-"...,
		  | - 	"-0.6010823668975593,NULL,2003783032,fwAUY\x11,0,3250010446,2014-07-27 14:09:05.000551 +0000 +0000,NULL,1101431691,\x1b\x1c\x05n\x1eu,\x03#~,',NULL,0.1756910732812027,0,32121,NULL,p,NULL,4.681196838114059080E+25,false,46811968381140590800492426.9218368489017,\x02\x1e\x03wa),-0.911081"...,
		  |   	"-0.6010823668975593,NULL,2003783032,fwAUY\x11,0,3250010446,2014-07-"...,
		  |   	"-0.6010823668975593,NULL,2003783032,fwAUY\x11,0,3250010446,2014-07-"...,
		  |   	"-0.6010823668975593,NULL,2003783032,fwAUY\x11,0,3250010446,2014-07-"...,
		  | + 	"-1.612266124236562,#+z 7\x02nX,457251543,\u007fDF:\"\x0e,NULL,2503567087,1972-08-02 14:17:37.000088 +0000 +0000,3257923258,2153347998,\n\x1a=\x10\x1au \b,\x03#~,zP\x1b,\x17d*,#+z 7\x02nx,-0.8354926840577999,-0,32121,NULL,p,NULL,4.681196838114059080E+25,false,46811968381140590800492426.92183"...,
		  |   	"-1.612266124236562,#+z 7\x02nX,457251543,\u007fDF:\"\x0e,NULL,2503567087,197"...,
		  |   	"-1.612266124236562,#+z 7\x02nX,457251543,\u007fDF:\"\x0e,NULL,2503567087,197"...,
		  |   	"-1.612266124236562,#+z 7\x02nX,457251543,\u007fDF:\"\x0e,NULL,2503567087,197"...,
		  | - 	"-1.612266124236562,#+z 7\x02nX,457251543,\u007fDF:\"\x0e,NULL,2503567087,1972-08-02 14:17:37.000088 +0000 +0000,3257923258,2153347998,\n\x1a=\x10\x1au \b,\x03#~,zP\x1b,\x17d*,#+z 7\x02nx,-0.8354926840577999,0,32121,NULL,p,NULL,4.681196838114059080E+25,false,46811968381140590800492426.921836"...,
		  |   	"-1.612266124236562,#+z 7\x02nX,457251543,\u007fDF:\"\x0e,NULL,2503567087,197"...,
		  |   	"-1.612266124236562,#+z 7\x02nX,457251543,\u007fDF:\"\x0e,NULL,2503567087,197"...,
		  |   	"-1.612266124236562,#+z 7\x02nX,457251543,\u007fDF:\"\x0e,NULL,2503567087,197"...,
		  | + 	"0,NULL,NULL,\x13\a,3087847003,3837798169,1972-08-02 14:17:37.000088 +0000 +0000,1985245865,0,\x1b\x1c\x05n\x1eu,\x03#~,',NULL,0.776773440178762,-0,32121,NULL,p,NULL,4.681196838114059080E+25,false,46811968381140590800492426.9218368489017,\x02\x1e\x03wa),-0.9110816121101379,1.046212196"...,
		  |   	"0,NULL,NULL,\x13\a,3087847003,3837798169,1972-08-02 14:17:37.000088 "...,
		  |   	"0,NULL,NULL,\x13\a,3087847003,3837798169,1972-08-02 14:17:37.000088 "...,
		  |   	"0,NULL,NULL,\x13\a,3087847003,3837798169,1972-08-02 14:17:37.000088 "...,
		  | - 	"0,NULL,NULL,\x13\a,3087847003,3837798169,1972-08-02 14:17:37.000088 +0000 +0000,1985245865,0,\x1b\x1c\x05n\x1eu,\x03#~,',NULL,0.776773440178762,0,32121,NULL,p,NULL,4.681196838114059080E+25,false,46811968381140590800492426.9218368489017,\x02\x1e\x03wa),-0.9110816121101379,1.0462121963"...,
		  |   	"0,NULL,NULL,\x13\a,3087847003,3837798169,1972-08-02 14:17:37.000088 "...,
		  |   	"0,NULL,NULL,\x13\a,3087847003,3837798169,1972-08-02 14:17:37.000088 "...,
		  |   	"0,NULL,NULL,\x13\a,3087847003,3837798169,1972-08-02 14:17:37.000088 "...,
		  | + 	"0,NULL,NULL,P@,NULL,1733338272,1972-08-02 14:17:37.000088 +0000 +0000,NULL,1833907620,\n\x1a=\x10\x1au \b,\x03#~,zP\x1b,\x17d*,NULL,0.776773440178762,-0,32121,NULL,p,NULL,4.681196838114059080E+25,false,46811968381140590800492426.9218368489017,\x02\x1e\x03wa),-0.9110816121101379,1.0462"...,
		  |   	"0,NULL,NULL,P@,NULL,1733338272,1972-08-02 14:17:37.000088 +0000 "...,
		  |   	"0,NULL,NULL,P@,NULL,1733338272,1972-08-02 14:17:37.000088 +0000 "...,
		  |   	"0,NULL,NULL,P@,NULL,1733338272,1972-08-02 14:17:37.000088 +0000 "...,
		  | - 	"0,NULL,NULL,P@,NULL,1733338272,1972-08-02 14:17:37.000088 +0000 +0000,NULL,1833907620,\n\x1a=\x10\x1au \b,\x03#~,zP\x1b,\x17d*,NULL,0.776773440178762,0,32121,NULL,p,NULL,4.681196838114059080E+25,false,46811968381140590800492426.9218368489017,\x02\x1e\x03wa),-0.9110816121101379,1.04621"...,
		  |   	"0,NULL,NULL,P@,NULL,1733338272,1972-08-02 14:17:37.000088 +0000 "...,
		  |   	"0,NULL,NULL,P@,NULL,1733338272,1972-08-02 14:17:37.000088 +0000 "...,
		  |   	"0,NULL,NULL,P@,NULL,1733338272,1972-08-02 14:17:37.000088 +0000 "...,
		  | + 	"0.7252422224111514,NULL,NULL,\x1aYIX],NULL,1115311939,1972-08-02 14:17:37.000088 +0000 +0000,NULL,NULL,\n\x1a=\x10\x1au \b,\x03#~,zP\x1b,\x17d*,NULL,1.5020156625899135,-0,32121,NULL,p,NULL,4.681196838114059080E+25,false,46811968381140590800492426.9218368489017,\x02\x1e\x03wa),-0.91108161"...,
		  |   	"0.7252422224111514,NULL,NULL,\x1aYIX],NULL,1115311939,1972-08-02 14"...,
		  |   	"0.7252422224111514,NULL,NULL,\x1aYIX],NULL,1115311939,1972-08-02 14"...,
		  |   	"0.7252422224111514,NULL,NULL,\x1aYIX],NULL,1115311939,1972-08-02 14"...,
		  | - 	"0.7252422224111514,NULL,NULL,\x1aYIX],NULL,1115311939,1972-08-02 14:17:37.000088 +0000 +0000,NULL,NULL,\n\x1a=\x10\x1au \b,\x03#~,zP\x1b,\x17d*,NULL,1.5020156625899135,0,32121,NULL,p,NULL,4.681196838114059080E+25,false,46811968381140590800492426.9218368489017,\x02\x1e\x03wa),-0.911081612"...,
		  |   	"0.7252422224111514,NULL,NULL,\x1aYIX],NULL,1115311939,1972-08-02 14"...,
		  |   	"0.7252422224111514,NULL,NULL,\x1aYIX],NULL,1115311939,1972-08-02 14"...,
		  |   	"0.7252422224111514,NULL,NULL,\x1aYIX],NULL,1115311939,1972-08-02 14"...,
		  | + 	"NaN,M\x0e\x06w\x18}bE#,1030807826,\x12,2859333651,872033927,1989-08-06 19:33:34.000269 +0000 +0000,NULL,NULL,\x1b\x1c\x05n\x1eu,\x03#~,',m\x0e\x06w\x18}be#,NaN,-0,32121,NULL,p,NULL,4.681196838114059080E+25,false,46811968381140590800492426.9218368489017,\x02\x1e\x03wa),-0.9110816121101379,1.0462121963"...,
		  |   	"NaN,M\x0e\x06w\x18}bE#,1030807826,\x12,2859333651,872033927,1989-08-06 19:33"...,
		  |   	"NaN,M\x0e\x06w\x18}bE#,1030807826,\x12,2859333651,872033927,1989-08-06 19:33"...,
		  |   	"NaN,M\x0e\x06w\x18}bE#,1030807826,\x12,2859333651,872033927,1989-08-06 19:33"...,
		  | - 	"NaN,M\x0e\x06w\x18}bE#,1030807826,\x12,2859333651,872033927,1989-08-06 19:33:34.000269 +0000 +0000,NULL,NULL,\x1b\x1c\x05n\x1eu,\x03#~,',m\x0e\x06w\x18}be#,NaN,0,32121,NULL,p,NULL,4.681196838114059080E+25,false,46811968381140590800492426.9218368489017,\x02\x1e\x03wa),-0.9110816121101379,1.04621219635"...,
		  |   	"NaN,M\x0e\x06w\x18}bE#,1030807826,\x12,2859333651,872033927,1989-08-06 19:33"...,
		  |   	"NaN,M\x0e\x06w\x18}bE#,1030807826,\x12,2859333651,872033927,1989-08-06 19:33"...,
		  |   	"NaN,M\x0e\x06w\x18}bE#,1030807826,\x12,2859333651,872033927,1989-08-06 19:33"...,
		  |   }
		  | sql: SELECT * FROM defaultdb.public.table2 AS tab_1822 JOIN defaultdb.public.table3 AS tab_1823 ON true
		  | (SELECT * FROM defaultdb.public.table2 AS tab_1822 JOIN defaultdb.public.table3 AS tab_1823 ON tab_1823.col3_6) UNION ALL (SELECT * FROM defaultdb.public.table2 AS tab_1822 JOIN defaultdb.public.table3 AS tab_1823 ON NOT (tab_1823.col3_6)) UNION ALL (SELECT * FROM defaultdb.public.table2 AS tab_1822 JOIN defaultdb.public.table3 AS tab_1823 ON (tab_1823.col3_6) IS NULL)
		  | with args: []
		Error types: (1) *withstack.withStack (2) *errutil.leafError
Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@rytaft
Copy link
Collaborator

rytaft commented Mar 4, 2022

Neither of these failures seem to be a logic error, as the only difference in output is that one includes 0 while the other includes -0. We should figure out how to remove this source of spurious failures.

@rytaft rytaft removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Mar 4, 2022
@rytaft rytaft changed the title roachtest: tlp failed roachtest: -0 is not converted to 0, causing spurious tlp failure Mar 4, 2022
@michae2
Copy link
Collaborator

michae2 commented Apr 5, 2022

Latest one reduced down to:

CREATE TABLE table1 (
  col1_0  UUID,
  col1_1  DATE,
  col1_3  INT8,
  col1_4  BIT(48) NULL,
  col1_5  BOOL,
  col1_6  FLOAT4,
  col1_7  REGPROC,
  col1_8  BIT(9),
  col1_10 TIME,
  col1_12 STRING AS (lower(CAST(col1_10 AS STRING))) VIRTUAL,
  PRIMARY KEY (col1_6 DESC),
  UNIQUE (lower(CAST(col1_5 AS STRING)), col1_12 ASC) STORING(col1_1, col1_4, col1_8),
  UNIQUE (col1_12 DESC),
  INDEX (col1_1 ASC),
  INDEX (col1_8)
);

INSERT
  INTO table1 (col1_0, col1_1, col1_3, col1_5, col1_6)
VALUES ('a4ecdba0-6c5c-4427-b6d2-a856d72d1b29', '1979-10-30', 0, false, 0);

SELECT col1_6 FROM table1;
SELECT col1_6 FROM table1 WHERE NOT col1_5;

Interestingly we never insert -0 into the table.

@michae2
Copy link
Collaborator

michae2 commented Apr 5, 2022

So we have some kind of corruption:

[email protected]:26257/defaultdb> SELECT col1_6 FROM table1@table1_pkey;
  col1_6
----------
      -0
(1 row)


Time: 1ms total (execution 1ms / network 0ms)

[email protected]:26257/defaultdb> SELECT col1_6 FROM table1@table1_expr_col1_12_key;
  col1_6
----------
       0
(1 row)


Time: 1ms total (execution 1ms / network 0ms)

[email protected]:26257/defaultdb> SELECT col1_6 FROM table1@table1_col1_12_key;
  col1_6
----------
       0
(1 row)


Time: 1ms total (execution 1ms / network 0ms)

[email protected]:26257/defaultdb> SELECT col1_6 FROM table1@table1_col1_1_idx;
  col1_6
----------
       0
(1 row)


Time: 2ms total (execution 1ms / network 0ms)

[email protected]:26257/defaultdb> SELECT col1_6 FROM table1@table1_col1_8_idx;
  col1_6
----------
       0
(1 row)


Time: 1ms total (execution 1ms / network 0ms)

@michae2
Copy link
Collaborator

michae2 commented Apr 5, 2022

Reduced a little further:

CREATE TABLE table1 (
  col1_0  UUID,
  col1_1  DATE,
  col1_3  INT8,
  col1_4  BIT(48),
  col1_5  BOOL,
  col1_6  FLOAT4,
  col1_8  BIT(9),
  col1_10 TIME,
  col1_12 STRING,
  PRIMARY KEY (col1_6 DESC),
  UNIQUE (col1_12 ASC) STORING(col1_1, col1_4, col1_8),
  INDEX (col1_1 ASC)
);

INSERT
  INTO table1 (col1_0, col1_1, col1_3, col1_5, col1_6)
VALUES ('a4ecdba0-6c5c-4427-b6d2-a856d72d1b29', '1979-10-30', 0, false, 0);

SELECT col1_6 FROM table1@table1_pkey;

@yuzefovich
Copy link
Member

I couldn't stop myself from looking into this :)

One way to fix this is:

--- a/pkg/util/encoding/float.go
+++ b/pkg/util/encoding/float.go
@@ -93,5 +93,13 @@ func DecodeFloatAscending(buf []byte) ([]byte, float64, error) {
 // DecodeFloatDescending decodes floats encoded with EncodeFloatDescending.
 func DecodeFloatDescending(buf []byte) ([]byte, float64, error) {
        b, r, err := DecodeFloatAscending(buf)
-       return b, -r, err
+       if r != 0 {
+               // All values except for 0 and NaN were negated in
+               // EncodeFloatDescending, so we have to negate them back. Note that we
+               // don't need to check whether r is NaN since negating NaN gives back
+               // NaN too. Negative zero uses composite indexes to decode itself
+               // correctly.
+               r = -r
+       }
+       return b, r, err
 }

@yuzefovich
Copy link
Member

Simplest repro:

CREATE TABLE t (f FLOAT, PRIMARY KEY (f DESC));
INSERT INTO t VALUES (0);
SELECT * FROM t;

@michae2
Copy link
Collaborator

michae2 commented Apr 5, 2022

That was fast!

@mgartner
Copy link
Collaborator

mgartner commented Apr 5, 2022

Is it possible that this wasn't a TLP false-positive all along, and instead this bug?

@michae2
Copy link
Collaborator

michae2 commented Apr 5, 2022

Is it possible that this wasn't a TLP false-positive all along, and instead this bug?

I reduced the failure involving MAX to:

CREATE TABLE table1 (
  col1_0 FLOAT8 NOT NULL,
  PRIMARY KEY (col1_0 ASC),
  INDEX (col1_0 DESC),
  INDEX (col1_0 ASC),
  INDEX (col1_0 DESC)
);

INSERT INTO table1 (col1_0) VALUES (0);

And in this case I think one query uses an ascending index and the other a descending index (the bug only affects descending indexes).

So yeah, I think this bug was probably the underlying issue for all of these.

@mgartner
Copy link
Collaborator

mgartner commented Apr 5, 2022

That's great news! Shame on us for doubting TLP. 😉

michae2 added a commit to michae2/cockroach that referenced this issue Apr 5, 2022
Fixes: cockroachdb#77279

Our old friend -0 is back. This time the problem was in
DecodeFloatDescending, which was decoding both +0 and -0 as -0. Fix it
to decode both as +0. (-0 is then properly decoded as -0 when the value
part of the composite encoding is decoded.)

Note that the on-disk data was correct. Only the decoding was affected.

Release note (bug fix): Queries reading from an index or primary key on
FLOAT or REAL columns DESC would read -0 for every +0 value stored in the
index. Fix this to correctly read +0 for +0 and -0 for -0.

Co-authored-by: Yahor Yuzefovich <[email protected]>
@yuzefovich
Copy link
Member

Well, during the last collab session we did come up with a valid case when TLP could produce a false positive with 0 vs -0:

CREATE TABLE t (b BOOL, f FLOAT);
INSERT INTO t VALUES (false, 0), (true, -0);
SELECT max(f) FROM t; -- unpartitioned, returns 0
SELECT max(agg) FROM (SELECT max(f) AS agg FROM t WHERE b UNION ALL SELECT max(f) AS agg FROM t WHERE NOT b UNION ALL SELECT max(f) FROM t WHERE b IS NULL); -- partitioned, returns -0

@michae2
Copy link
Collaborator

michae2 commented Apr 6, 2022

@yuzefovich good point, thanks for reminding me. I'll go ahead and make changes to TLP, too.

craig bot pushed a commit that referenced this issue Apr 6, 2022
79085: ingesting,backupccl: fix bug in privileges of restored descriptors r=adityamaru a=adityamaru

Previously, when restoring a table descriptor as part of a non-cluster
backup, we were incorrectly generating its privileges without passing in
the parent schema's DefaultPrivilegeDescriptor. This change fixes that.

Additionally, this change rewrites the `restore-grants` datadriven test
to be more understandble. There were also some errors in the test that have
been fixed in this change. The expected behavior is that:

1) A cluster restore will just restore all privileges as was backed up.

2) A database restore will first generate ALL privileges for root and admin on the
database, and subsequent table, type, schema descriptors will inherit these privileges.

3) A table restore into an existing database will generate its set of privileges
based on the default privileges of the parentDB and parentSchema.

All restored schema objects are owned by the user running the restore.

All backed up privileges on these descriptors are wiped, since there is no
guarantee that a backed up user will be present in the restoring cluster.

Release note (bug fix): Privileges for restored table's were being generated
incorrectly without taking into consideration their parent schema's default privilege
descriptor.

79407: cloud: bump orchestrator to v21.2.8 r=aliher1911 a=celiala

Release note: None

Jira issue: CRDB-14811

79473: encoding: fix DecodeFloatDescending for positive zero r=yuzefovich,rytaft,mgartner a=michae2

Fixes: #77279

Our old friend -0 is back. This time the problem was in
DecodeFloatDescending, which was decoding both +0 and -0 as -0. Fix it
to decode both as +0. (-0 is then properly decoded as -0 when the value
part of the composite encoding is decoded.)

Note that the on-disk data was correct. Only the decoding was affected.

Release note (bug fix): Queries reading from an index or primary key on
FLOAT or REAL columns DESC would read -0 for every +0 value stored in the
index. Fix this to correctly read +0 for +0 and -0 for -0.

Co-authored-by: Yahor Yuzefovich <[email protected]>

Co-authored-by: Aditya Maru <[email protected]>
Co-authored-by: Celia La <[email protected]>
Co-authored-by: Michael Erickson <[email protected]>
@craig craig bot closed this as completed in b5bcb19 Apr 6, 2022
blathers-crl bot pushed a commit that referenced this issue Apr 6, 2022
Fixes: #77279

Our old friend -0 is back. This time the problem was in
DecodeFloatDescending, which was decoding both +0 and -0 as -0. Fix it
to decode both as +0. (-0 is then properly decoded as -0 when the value
part of the composite encoding is decoded.)

Note that the on-disk data was correct. Only the decoding was affected.

Release note (bug fix): Queries reading from an index or primary key on
FLOAT or REAL columns DESC would read -0 for every +0 value stored in the
index. Fix this to correctly read +0 for +0 and -0 for -0.

Co-authored-by: Yahor Yuzefovich <[email protected]>
michae2 added a commit that referenced this issue Apr 6, 2022
Fixes: #77279

Our old friend -0 is back. This time the problem was in
DecodeFloatDescending, which was decoding both +0 and -0 as -0. Fix it
to decode both as +0. (-0 is then properly decoded as -0 when the value
part of the composite encoding is decoded.)

Note that the on-disk data was correct. Only the decoding was affected.

Release note (bug fix): Queries reading from an index or primary key on
FLOAT or REAL columns DESC would read -0 for every +0 value stored in the
index. Fix this to correctly read +0 for +0 and -0 for -0.

Release justification: low-risk change to existing functionality.

Co-authored-by: Yahor Yuzefovich <[email protected]>
michae2 added a commit to michae2/cockroach that referenced this issue Apr 6, 2022
Assists: cockroachdb#77279

TLP compares results of two queries, one unpartitioned and one
partitioned three ways by a predicate. We were comparing TLP results by
printing values to strings. Unfortunately there are some values which
are equal according to SQL comparison, but print differently. And TLP,
being the nefarious chaos monkey that it is, tends to find these. For
example:

```sql
  0::FLOAT
  -0::FLOAT

  'hello' COLLATE "en-US-u-ks-level2"
  'HELLO' COLLATE "en-US-u-ks-level2"

  INTERVAL '1 day'
  INTERVAL '24 hours'
```

When equal-yet-different values like these are used in MAX or MIN
aggregations, or are the keys for GROUP BY bucketing, it is
nondeterministic which value will be chosen. (This is also true in
PostgreSQL.) And sometimes the two TLP queries choose differently. This
is not indicative of a bug, yet TLP fails.

So this patch teaches TLP to use SQL comparison instead of string
comparison. To do so, we wrap both queries in a bigger query:

```sql
  WITH unpart AS MATERIALIZED (
    <unpartitioned query>
  ), part AS MATERIALIZED (
    <partitioned query>
  ), undiff AS (
    TABLE unpart EXCEPT ALL TABLE part
  ), diff AS (
    TABLE part EXCEPT ALL TABLE unpart
  )
  SELECT (SELECT count(*) FROM undiff), (SELECT count(*) FROM diff)
```

Only if this query returns counts other than 0 (meaning the SQL
comparison detected a real mismatch) do we diff the printed results.

Release note: None
michae2 added a commit to michae2/cockroach that referenced this issue Apr 8, 2022
Assists: cockroachdb#77279

TLP compares results of two queries, one unpartitioned and one
partitioned three ways by a predicate. We were comparing TLP results by
printing values to strings. Unfortunately there are some values which
are equal according to SQL comparison, but print differently. And TLP,
being the nefarious chaos monkey that it is, tends to find these. For
example:

```sql
  0::FLOAT
  -0::FLOAT

  'hello' COLLATE "en-US-u-ks-level2"
  'HELLO' COLLATE "en-US-u-ks-level2"

  INTERVAL '1 day'
  INTERVAL '24 hours'
```

When equal-yet-different values like these are used in MAX or MIN
aggregations, or are the keys for GROUP BY bucketing, it is
nondeterministic which value will be chosen. (This is also true in
PostgreSQL.) And sometimes the two TLP queries choose differently. This
is not indicative of a bug, yet TLP fails.

So this patch teaches TLP to use SQL comparison instead of string
comparison. To do so, we wrap both queries in a bigger query:

```sql
  WITH unpart AS MATERIALIZED (
    <unpartitioned query>
  ), part AS MATERIALIZED (
    <partitioned query>
  ), undiff AS (
    TABLE unpart EXCEPT ALL TABLE part
  ), diff AS (
    TABLE part EXCEPT ALL TABLE unpart
  )
  SELECT (SELECT count(*) FROM undiff), (SELECT count(*) FROM diff)
```

Only if this query returns counts other than 0 (meaning the SQL
comparison detected a real mismatch) do we diff the printed results.

Release note: None
craig bot pushed a commit that referenced this issue Apr 11, 2022
76504: *: fix system config for tenant r=ajwerner a=ajwerner

Follow-on from #76279.

Fixes #75864.

This commit adds a mechanism to combine the system config data of the tenant
with the data provided over the GossipSubscription from the system tenant.
It then plumbs a version into the zone config methods. In the mixed version
state, the tenant uses the existing override from the system tenant. After
the span config infrastructure has been activated, the tenant uses the
overrides they've set. This affects, realistically, just the GC job, and,
to a lesser extent, the optimizer.

Release note: None

78595: Week -1: follow-up emails r=rail a=celiala

This PR adds follow-up emails for:

- the day before `PrepDate` (to be sent if only there are still blockers)
- the day of `PrepDate` (to be sent if only there are still blockers)
- the day after `PrepDate` (to be sent if only there are still blockers)

The `post-blockers` command should be used for all emails before and up to `PrepDate`:
```
# where `DAYS_BEFORE_PREP` is one of: many-days | day-before | day-of

SMTP_PASSWORD=$MY_SMTP_PWD SMTP_USER=$MY_SMTP_USER GITHUB_TOKEN=$MY_GITHUB_TOKEN \
$(bazel info bazel-bin)/pkg/cmd/release/release_/release post-blockers \
  --release-series 21.1 \
  --smtp-user $MY_SMTP_USER --smtp-host smtp.gmail.com --smtp-port 587 \
  --to $MY_TO_EMAIL \
  --template-dir ./pkg/cmd/release/templates \
  --prep-date "Apr 1, 2022" --publish-date "May 1, 2022" \
  --days-before-prep-date $DAYS_BEFORE_PREP
```

The `cancel-release-date` should be used for the day after `PrepDate` email:
```
SMTP_PASSWORD=$MY_SMTP_PWD SMTP_USER=$MY_SMTP_USER GITHUB_TOKEN=$MY_GITHUB_TOKEN \
$(bazel info bazel-bin)/pkg/cmd/release/release_/release cancel-release-date \
  --release-series 21.1 \
  --smtp-user $MY_SMTP_USER --smtp-host smtp.gmail.com --smtp-port 587 \
  --to $MY_TO_EMAIL \
  --template-dir ./pkg/cmd/release/templates \
  --publish-date "May 1, 2022" --next-publish-date "Jun 1, 2022"
```

### TeamCity changes

This will be paired with the following TeamCity jobs:
*  [Send release blockers - 78595](https://teamcity.cockroachdb.com/buildConfiguration/Internal_Cockroach_Release_Process_SendReleaseBlockers_78595?branch=78595&buildTypeTab=overview&mode=builds#all-projects) 
* [Cancel release date - 78595](https://teamcity.cockroachdb.com/buildConfiguration/Internal_Cockroach_Release_Process_CancelReleaseDate_78595?branch=78595&mode=builds#all-projects)

![image](https://user-images.githubusercontent.com/3051672/160359980-93ac438f-d8e6-4f6c-a0e0-00fab55b93bc.png)


#### Manual tests

##### Test 1: post-blockers --days-before-prep-date many-days

<img width="300" src="https://user-images.githubusercontent.com/3051672/160353978-12e8d2e9-1900-471b-b785-f247d19f474a.png">

##### Test 2: post-blockers --days-before-prep-date day-before

<img width="300" src="https://user-images.githubusercontent.com/3051672/160353943-8514a0d4-87f4-4ab7-8d67-78bb42f427c2.png">

##### Test 3: post-blockers --days-before-prep-date day-of

<img width="300" src="https://user-images.githubusercontent.com/3051672/160353904-d18e6d68-594c-48dc-8308-6f0c3621fe91.png">

##### Test 4: cancel-release-date

<img width="300" src="https://user-images.githubusercontent.com/3051672/160353668-1e2886f5-c39f-4d54-884d-f967ba06a114.png">


79551: roachtest: use SQL comparison in TLP r=michae2 a=michae2

Assists: #77279

TLP compares results of two queries, one unpartitioned and one
partitioned three ways by a predicate. We were comparing TLP results by
printing values to strings. Unfortunately there are some values which
are equal according to SQL comparison, but print differently. And TLP,
being the nefarious chaos monkey that it is, tends to find these. For
example:

```sql
  0::FLOAT
  -0::FLOAT

  'hello' COLLATE "en-US-u-ks-level2"
  'HELLO' COLLATE "en-US-u-ks-level2"

  INTERVAL '1 day'
  INTERVAL '24 hours'
```

When equal-yet-different values like these are used in MAX or MIN
aggregations, or are the keys for GROUP BY bucketing, it is
nondeterministic which value will be chosen. (This is also true in
PostgreSQL.) And sometimes the two TLP queries choose differently. This
is not indicative of a bug, yet TLP fails.

So this patch teaches TLP to use SQL comparison instead of string
comparison. To do so, we wrap both queries in a bigger query:

```sql
  WITH unpart AS MATERIALIZED (
    <unpartitioned query>
  ), part AS MATERIALIZED (
    <partitioned query>
  ), undiff AS (
    TABLE unpart EXCEPT ALL TABLE part
  ), diff AS (
    TABLE part EXCEPT ALL TABLE unpart
  )
  SELECT (SELECT count(*) FROM undiff), (SELECT count(*) FROM diff)
```

Only if this query returns counts other than 0 (meaning the SQL
comparison detected a real mismatch) do we diff the printed results.

Release note: None

79767: util/contextutil: clarify `RunWithTimeout` error message r=erikgrinaker a=erikgrinaker

`RunWithTimeout` gave an error message of this form:

```
operation "send-snapshot" timed out after 1m (took 1m): context deadline exceeded
```

However, this can be misleading because either the caller or callee can
set their own context timeout that is less than then one set by
`RunWithTimeout`, in which case it looks like:

```
operation "send-snapshot" timed out after 1m (took 1s): context deadline exceeded
```

This is even worse on old versions, where the "took 1s" is not included,
leading the reader to believe the operation actually took 1m when it
only took 1s.

This patch clarifies the error message somewhat:

```
operation "send-snapshot" timed out after 1s (given timeout 1m): context deadline exceeded
```

Resolves #79424.

Release note: None

79787: protectedts: fix panic when fetching protectedts records r=adityamaru a=adityamaru

Previously, we incorrectly assumed that all records in the
`system.protectedts_records` table would have a non NULL target
column. While this is enforced for all protected timestamp
records written in a 22.1+ cluster, this is not true for records
that might have been prior to the 22.1 cluster version being
finalized.

This change makes the method responsible for reading protectedts
records more defensive when encountering a NULL target column.

Fixes: #79684

Release note: None

79796: Publish cockroach-sql in publish bleeding edge r=rickystewart a=rail

Previously, the `cockroach-sql` tool was published only as a part of the
release workflow.

The first commit refactors the way how we publish and test. This is very similar to what we use in publish-provisional-artifacts.

The second commit adds the cockroach-sql binary to the list of artifacts to publish.

Release note: None

Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: Celia La <[email protected]>
Co-authored-by: Michael Erickson <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Aditya Maru <[email protected]>
Co-authored-by: Rail Aliiev <[email protected]>
blathers-crl bot pushed a commit that referenced this issue May 4, 2022
Assists: #77279

TLP compares results of two queries, one unpartitioned and one
partitioned three ways by a predicate. We were comparing TLP results by
printing values to strings. Unfortunately there are some values which
are equal according to SQL comparison, but print differently. And TLP,
being the nefarious chaos monkey that it is, tends to find these. For
example:

```sql
  0::FLOAT
  -0::FLOAT

  'hello' COLLATE "en-US-u-ks-level2"
  'HELLO' COLLATE "en-US-u-ks-level2"

  INTERVAL '1 day'
  INTERVAL '24 hours'
```

When equal-yet-different values like these are used in MAX or MIN
aggregations, or are the keys for GROUP BY bucketing, it is
nondeterministic which value will be chosen. (This is also true in
PostgreSQL.) And sometimes the two TLP queries choose differently. This
is not indicative of a bug, yet TLP fails.

So this patch teaches TLP to use SQL comparison instead of string
comparison. To do so, we wrap both queries in a bigger query:

```sql
  WITH unpart AS MATERIALIZED (
    <unpartitioned query>
  ), part AS MATERIALIZED (
    <partitioned query>
  ), undiff AS (
    TABLE unpart EXCEPT ALL TABLE part
  ), diff AS (
    TABLE part EXCEPT ALL TABLE unpart
  )
  SELECT (SELECT count(*) FROM undiff), (SELECT count(*) FROM diff)
```

Only if this query returns counts other than 0 (meaning the SQL
comparison detected a real mismatch) do we diff the printed results.

Release note: None
mgartner pushed a commit to mgartner/cockroach that referenced this issue May 31, 2022
Assists: cockroachdb#77279

TLP compares results of two queries, one unpartitioned and one
partitioned three ways by a predicate. We were comparing TLP results by
printing values to strings. Unfortunately there are some values which
are equal according to SQL comparison, but print differently. And TLP,
being the nefarious chaos monkey that it is, tends to find these. For
example:

```sql
  0::FLOAT
  -0::FLOAT

  'hello' COLLATE "en-US-u-ks-level2"
  'HELLO' COLLATE "en-US-u-ks-level2"

  INTERVAL '1 day'
  INTERVAL '24 hours'
```

When equal-yet-different values like these are used in MAX or MIN
aggregations, or are the keys for GROUP BY bucketing, it is
nondeterministic which value will be chosen. (This is also true in
PostgreSQL.) And sometimes the two TLP queries choose differently. This
is not indicative of a bug, yet TLP fails.

So this patch teaches TLP to use SQL comparison instead of string
comparison. To do so, we wrap both queries in a bigger query:

```sql
  WITH unpart AS MATERIALIZED (
    <unpartitioned query>
  ), part AS MATERIALIZED (
    <partitioned query>
  ), undiff AS (
    TABLE unpart EXCEPT ALL TABLE part
  ), diff AS (
    TABLE part EXCEPT ALL TABLE unpart
  )
  SELECT (SELECT count(*) FROM undiff), (SELECT count(*) FROM diff)
```

Only if this query returns counts other than 0 (meaning the SQL
comparison detected a real mismatch) do we diff the printed results.

Release note: None
@mgartner mgartner moved this to Done in SQL Queries Jul 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-sql-queries SQL Queries Team
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

6 participants