Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Order of evaluation of aggregations causes precision issues for float #95665

Closed
cockroach-teamcity opened this issue Jan 22, 2023 · 9 comments · Fixed by #106552
Closed

Order of evaluation of aggregations causes precision issues for float #95665

cockroach-teamcity opened this issue Jan 22, 2023 · 9 comments · Fixed by #106552
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-sql-queries SQL Queries Team
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Jan 22, 2023

roachtest.unoptimized-query-oracle/disable-rules=half/seed-multi-region failed with artifacts on master @ b21379bb56dd206e6f63cc7d07ca72e85db7a4c4:

test artifacts and logs in: /artifacts/unoptimized-query-oracle/disable-rules=half/seed-multi-region/run_1
(query_comparison_util.go:251).runOneRoundQueryComparison: failed to set random seed. 1026 statements run: dial tcp 34.138.224.32:26257: connect: connection refused

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/sql-queries

This test on roachdash | Improve this report!

Jira issue: CRDB-23664

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jan 22, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Jan 22, 2023
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Jan 22, 2023
@cockroach-teamcity
Copy link
Member Author

roachtest.unoptimized-query-oracle/disable-rules=half/seed-multi-region failed with artifacts on master @ dc974cd698364a4db5acb630162dd8b1856cfad6:

test artifacts and logs in: /artifacts/unoptimized-query-oracle/disable-rules=half/seed-multi-region/run_1
(query_comparison_util.go:251).runOneRoundQueryComparison: . 1043 statements run: expected unoptimized and optimized results to be equal
  []string{
  	"0",
- 	"9.39932155415957e-48",
+ 	"1.4148125887129337e-47",
  }
sql: SELECT
	stddev(tab53._float8::FLOAT8)::FLOAT8 AS co̎l154
FROM
	defaultdb.public.seed_mr_table@[0] AS "%9ftab52"
	JOIN defaultdb.public.seed_mr_table@[0] AS tab53 ON ("%9ftab52"._int4) = (tab53._int8)
GROUP BY
	tab53._float8
HAVING
	every(tab53._bool::BOOL)::BOOL

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@michae2
Copy link
Collaborator

michae2 commented Jan 24, 2023

It would be good to check that in both cases we're using the same stddev algorithm, one of the algorithms that's more numerically stable. (I think we have at least one implementation based on Welford-Knuth but maybe we have more than one implementation?)

@cucaroach
Copy link
Contributor

Second failure is dupe of #95264

@msirek
Copy link
Contributor

msirek commented Jan 24, 2023

I couldn't reconstruct the stddev issue on first attempt, but here's a simplified test case with SUM illustrating the order of evaluation problem:

create table t1 (a float8, b int, c int, index c_idx(c desc) storing (a, b));
insert into t1 values (1e4, 1, 0);
insert into t1 values (1.00000000000000001e-48, 1, 1);
insert into t1 values (-1e4, 1, 2);
select sum(a) from t1@t1_pkey;
  sum
-------
    0

select sum(f) from (values (1e4), (1.00000000000000001e-48), (-1e4)) v(f);
            sum
---------------------------
  1.00000000000000001E-48

If the order of insertion of rows 2 and 3 in the table is swapped, we get the result:

select sum(a) from t1@t1_pkey;
   sum
---------
  1e-48

Here's a test case where selecting from the primary key vs. an index yields different results:

create table t1 (a float8, b int, c int, index c_idx(c desc) storing (a, b));
insert into t1 values (1e4, 1, 0);
insert into t1 values (-1e4, 1, 2);
insert into t1 values (1e-12, 1, 1);
select sum(a) from t1@t1_pkey;
   sum
---------
  1e-12

select sum(a) from t1@c_idx;
           sum
--------------------------
  1.8189894035458565e-12

@msirek
Copy link
Contributor

msirek commented Jan 25, 2023

The same issue occurs on postgres:

test=# create table t1 (a float8, b int, c int);
CREATE TABLE
test=# insert into t1 values (1e4, 1, 0);
INSERT 0 1
test=# insert into t1 values (1.00000000000000001e-48, 1, 1);
INSERT 0 1
test=# insert into t1 values (-1e4, 1, 2);
INSERT 0 1
test=# select sum(a) from t1;
 sum
-----
   0
(1 row)

test=# delete from t1;
DELETE 3
test=# insert into t1 values (1e4, 1, 0);
INSERT 0 1
test=# insert into t1 values (-1e4, 1, 2);
INSERT 0 1
test=# insert into t1 values (1.00000000000000001e-48, 1, 1);
INSERT 0 1
test=# select sum(a) from t1;
  sum
-------
 1e-48

@msirek msirek removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Jan 27, 2023
@msirek
Copy link
Contributor

msirek commented Jan 27, 2023

Removing release-blocker label. Order of execution precision differences are nothing new, and are expected.

@msirek
Copy link
Contributor

msirek commented Jan 27, 2023

Precision differences could possibly be eliminated by using a different data type under the covers to perform the operation, like decimal. But this may use more memory and perform worse. Using float vs. decimal is a trade-off of performance vs. precision.

@msirek
Copy link
Contributor

msirek commented Jan 31, 2023

TLP may have a way to avoid this issue. Maybe this was not extended to unoptimized-query-oracle or costfuzz.

@msirek msirek changed the title roachtest: unoptimized-query-oracle/disable-rules=half/seed-multi-region failed Order of evaluation of aggregations causes precision issues for float Jan 31, 2023
@msirek msirek removed their assignment Jan 31, 2023
@rharding6373
Copy link
Collaborator

We discussed this family of issues in our collab session today. We decided this might be a good meta issue for tracking investigation into float precision in aggregation operations.

Options we discussed (in no particular order or recommendation):

  1. Only compare floats to some amount of precision.
  2. Don't use floats in unoptimized-query-oracle testing.
  3. Don't run queries with stddev or avg on floats in unoptimized-query-oracle testing.
  4. Constrain floats inserted into the test tables to a narrow range of precision.
  5. Use higher precision types (like decimal) internally for the sum and convert back after taking the average.

@rharding6373 rharding6373 self-assigned this Jun 29, 2023
rharding6373 added a commit to rharding6373/cockroach that referenced this issue Jul 10, 2023
Before this change unoptimized query oracle tests would compare results
using simple string comparison. However, due to floating point precision
limitations, it's possible for results with floating point to diverge
during the course of normal computation. This results in test failures
that are difficult to reproduce or determine whether they are expected
behavior.

This change utilizes existing floating point comparison functions used
by logic tests to match float values only to a specific precision. Like
the logic tests, we also have special handling for floats and decimals
under the s390x architecture (see cockroachdb#63244). In order to avoid costly
comparisons, we only check floating point precision if the naiive string
comparison approach fails and there are float or decimal types in the
result.

Epic: None
Fixes: cockroachdb#95665

Release note: None
rharding6373 added a commit to rharding6373/cockroach that referenced this issue Jul 17, 2023
Before this change unoptimized query oracle tests would compare results
using simple string comparison. However, due to floating point precision
limitations, it's possible for results with floating point to diverge
during the course of normal computation. This results in test failures
that are difficult to reproduce or determine whether they are expected
behavior.

This change utilizes existing floating point comparison functions used
by logic tests to match float values only to a specific precision. Like
the logic tests, we also have special handling for floats and decimals
under the s390x architecture (see cockroachdb#63244). In order to avoid costly
comparisons, we only check floating point precision if the naiive string
comparison approach fails and there are float or decimal types in the
result.

Epic: None
Fixes: cockroachdb#95665

Release note: None
rharding6373 added a commit to rharding6373/cockroach that referenced this issue Jul 18, 2023
Before this change unoptimized query oracle tests would compare results
using simple string comparison. However, due to floating point precision
limitations, it's possible for results with floating point to diverge
during the course of normal computation. This results in test failures
that are difficult to reproduce or determine whether they are expected
behavior.

This change utilizes existing floating point comparison functions used
by logic tests to match float values only to a specific precision. Like
the logic tests, we also have special handling for floats and decimals
under the s390x architecture (see cockroachdb#63244). In order to avoid costly
comparisons, we only check floating point precision if the naiive string
comparison approach fails and there are float or decimal types in the
result.

Epic: None
Fixes: cockroachdb#95665

Release note: None
craig bot pushed a commit that referenced this issue Jul 24, 2023
106552: tests: support float approximation in roachtest query comparison utils r=rharding6373 a=rharding6373

tests, logictest, floatcmp: refactor comparison test util functions
    
This commit moves some float comparison test util functions from
logictest into the floatcmp package. It also moves a query result
comparison function from the tlp file to query_comparison_util in the
tests package.
    
This commit also marks roachtests as testonly targets.
    
Epic: none
    
Release note: None


tests: support float approximation in roachtest query comparison utils
    
Before this change unoptimized query oracle tests would compare results
using simple string comparison. However, due to floating point precision
limitations, it's possible for results with floating point to diverge
during the course of normal computation. This results in test failures
that are difficult to reproduce or determine whether they are expected
behavior.
    
This change utilizes existing floating point comparison functions used
by logic tests to match float values only to a specific precision. Like
the logic tests, we also have special handling for floats and decimals
under the s390x architecture (see #63244). In order to avoid costly
comparisons, we only check floating point precision if the naiive string
comparison approach fails and there are float or decimal types in the
result.
    
Epic: None
Fixes: #95665
    
Release note: None

106607: pkg/util/log: introduce `fluent.sink.conn.errors` metric to log package r=knz a=abarganier

Logging is a critical subsystem within CRDB, but as things
are today, we have very little observability into logging
itself. For starters, we have no metrics in the logging
package at all!

This makes it difficult to observe things within the log
package. For example, if a fluent-server log sink fails
to connect to FluentBit, how can we tell? We get some STDOUT
message, but that's about it.

It's time to get some metrics into the log package.

Doing so is a bit of a tricky dance, because pretty much every
package in CRDB imports the logging package, meaning you
almost always get a circular dependency when trying to make
use of any library within pkg/util/log. Therefore, this PR 
injects metrics functionality into the logging package.
It does so via a new interface called `LogMetrics` with
functions that enable its users to modify metrics.

The implementation of the interface can live elsewhere,
such as the metrics package itself, whe circular dependencies
aren't such a pain. We can then inject the implementation
into the log package.

With all that plumbing done, the PR also introduces a new metric,
`fluent.sink.conn.errors` representing fluentbit connection errors 
whenever a fluent-server log sink fails to establish a connection.

We can see the metric represented below in a test, where no
fluent-server was running for a moment, before starting it.
Note that the connection errors level off once the server was 
started:
<img width="1018" alt="Screenshot 2023-07-11 at 1 32 57 PM" src="https://github.com/cockroachdb/cockroach/assets/8194877/ccacf84e-e585-4a76-98af-ed145629f9ef">

Release note (ops change): This patch introduces the counter
metric `fluent.sink.conn.errors` to the CockroachDB tsdb,
which is exposed to `/_status/vars` clients as
`fluent_sink_conn_errors`. The metric is incremented whenever
a `fluent-server` log sink fails to establish a connection to
the log sink pointed to by the `address` for the sink in the
provided log config.

Epic: CC-9681

Addresses: #102753

107453: Revert "github-pull-request-make: temporary workaround" r=postamar a=postamar

This reverts commit 2bd61c0.

Relates to #106920.

Co-authored-by: rharding6373 <[email protected]>
Co-authored-by: Alex Barganier <[email protected]>
Co-authored-by: Marius Posta <[email protected]>
@craig craig bot closed this as completed in c9999ae Jul 24, 2023
@mgartner mgartner moved this to Done in SQL Queries Jul 24, 2023
rharding6373 added a commit to rharding6373/cockroach that referenced this issue Sep 8, 2023
Before this change unoptimized query oracle tests would compare results
using simple string comparison. However, due to floating point precision
limitations, it's possible for results with floating point to diverge
during the course of normal computation. This results in test failures
that are difficult to reproduce or determine whether they are expected
behavior.

This change utilizes existing floating point comparison functions used
by logic tests to match float values only to a specific precision. Like
the logic tests, we also have special handling for floats and decimals
under the s390x architecture (see cockroachdb#63244). In order to avoid costly
comparisons, we only check floating point precision if the naiive string
comparison approach fails and there are float or decimal types in the
result.

Epic: None
Fixes: cockroachdb#95665

Release note: None
rharding6373 added a commit to rharding6373/cockroach that referenced this issue Sep 8, 2023
Before this change unoptimized query oracle tests would compare results
using simple string comparison. However, due to floating point precision
limitations, it's possible for results with floating point to diverge
during the course of normal computation. This results in test failures
that are difficult to reproduce or determine whether they are expected
behavior.

This change utilizes existing floating point comparison functions used
by logic tests to match float values only to a specific precision. Like
the logic tests, we also have special handling for floats and decimals
under the s390x architecture (see cockroachdb#63244). In order to avoid costly
comparisons, we only check floating point precision if the naiive string
comparison approach fails and there are float or decimal types in the
result.

Epic: None
Fixes: cockroachdb#95665

Release note: None
rharding6373 added a commit to rharding6373/cockroach that referenced this issue Sep 13, 2023
Before this change unoptimized query oracle tests would compare results
using simple string comparison. However, due to floating point precision
limitations, it's possible for results with floating point to diverge
during the course of normal computation. This results in test failures
that are difficult to reproduce or determine whether they are expected
behavior.

This change utilizes existing floating point comparison functions used
by logic tests to match float values only to a specific precision. Like
the logic tests, we also have special handling for floats and decimals
under the s390x architecture (see cockroachdb#63244). In order to avoid costly
comparisons, we only check floating point precision if the naiive string
comparison approach fails and there are float or decimal types in the
result.

Epic: None
Fixes: cockroachdb#95665

Release note: None
rharding6373 added a commit to rharding6373/cockroach that referenced this issue Sep 13, 2023
Before this change unoptimized query oracle tests would compare results
using simple string comparison. However, due to floating point precision
limitations, it's possible for results with floating point to diverge
during the course of normal computation. This results in test failures
that are difficult to reproduce or determine whether they are expected
behavior.

This change utilizes existing floating point comparison functions used
by logic tests to match float values only to a specific precision. Like
the logic tests, we also have special handling for floats and decimals
under the s390x architecture (see cockroachdb#63244). In order to avoid costly
comparisons, we only check floating point precision if the naiive string
comparison approach fails and there are float or decimal types in the
result.

Epic: None
Fixes: cockroachdb#95665

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-sql-queries SQL Queries Team
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

5 participants