Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sentry: error.go:20: unexpected error: forecasted histogram had first bucket with non-zero NumRange or DistinctRange: [{"name":"__forecast__","created_at":"2024-11-01 08:09:36.42883425 +0000 UTC","col... #134031

Closed
cockroach-sentry opened this issue Nov 1, 2024 · 4 comments · Fixed by #135310
Assignees
Labels
branch-master Failures and bugs on the master branch. branch-release-24.3 Used to mark GA and release blockers, technical advisories, and bugs for 24.3 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-sentry Originated from an in-the-wild panic report. T-sql-queries SQL Queries Team X-blathers-triaged blathers was able to find an owner

Comments

@cockroach-sentry
Copy link
Collaborator

cockroach-sentry commented Nov 1, 2024

This issue was auto filed by Sentry. It represents a crash or reported error on a live cluster with telemetry enabled.

Sentry Link: https://cockroach-labs.sentry.io/issues/6035499143/?referrer=webhooks_plugin

Panic Message:

error.go:20: unexpected error: forecasted histogram had first bucket with non-zero NumRange or DistinctRange: [{"name":"__forecast__","created_at":"2024-11-01 08:09:36.42883425 +0000 UTC","columns":["1"],"row_count":736,"distinct_count":736,"null_count":0,"avg_size":9,"histo_col_type":"INT8","histo_buckets":[{"num_eq":1,"num_range":0,"distinct_range":2.8421709430404e-13,"upper_bound":"_f899139df5e1059396431415e770c6dd"},{"num_eq":1,"num_range":2,"distinct_range":1.6666666666666665,"upper_bound":"_c9e1074f5b3f9fc8ea15d152add07294"},{"num_eq":1,"num_range":37,"distinct_range":37.332640488629686,"upper_bound":"_3d087c2780c5886823ecc7f25321ff48"},{"num_eq":1,"num_range":2,"distinct_range":2.000001418622736,"upper_bound":"_43bf65a6a5fd236610bd7761a249300b"},{"num_eq":1,"num_range":2,"distinct_range":1.999999861337837,"upper_bound":"_71f0b2e1d679fb722d72c71c208d5722"}],"histo_version":3},{"name":"__merged__","created_at":"2024-11-01 08:05:35.503656 +0000 UTC","columns":["1"],"row_count":701,"distinct_count":701,"null_count":0,"avg_size":9,"histo_col_type":"INT8","histo_buckets":[{"num_eq":1,"num_range":0,"distinct_range":2.8421709430404007e-13,"upper_bound":"_f899139df5e1059396431415e770c6dd"},{"num_eq":1,"num_range":2,"distinct_range":1.6666666666666667,"upper_bound":"_c9e1074f5b3f9fc8ea15d152add07294"},{"num_eq":1,"num_range":2,"distinct_range":2.3332740512770016,"upper_bound":"_3d087c2780c5886823ecc7f25321ff48"},{"num_eq":1,"num_range":2,"distinct_range":2.0000006690416345,"upper_bound":"_43bf65a6a5fd236610bd7761a249300b"},{"num_eq":1,"num_range":2,"distinct_range":1.999999229778512,"upper_bound":"_71f0b2e1d679fb722d72c71c208d5722"}],"histo_version":3},{"name":"__auto__","created_at":"2024-11-01 07:56:23.475349 +0000 UTC","columns":["1"],"row_count":633,"distinct_count":633,"null_count":0,"avg_size":9,"histo_col_type":"INT8","histo_buckets":[{"num_eq":0,"num_range":0,"distinct_range":0,"upper_bound":"_e12c22bb0312e7872c49884f8304d882"},{"num_eq":1,"num_range":0,"distinct_range":2.8421709430404007e-13,"upper_bound":"_f899139df5e1059396431415e770c6dd"},{"num_eq":1,"num_range":2,"distinct_range":1.6666666666666667,"upper_bound":"_c9e1074f5b3f9fc8ea15d152add07294"},{"num_eq":1,"num_range":2,"distinct_range":2.3332740512767742,"upper_bound":"_3d087c2780c5886823ecc7f25321ff48"},{"num_eq":1,"num_range":2,"distinct_range":2.0000006690416345,"upper_bound":"_43bf65a6a5fd236610bd7761a249300b"}],"histo_version":3},{"name":"__auto__","created_at":"2024-11-01 07:53:21.834414 +0000 UTC","columns":["1"],"row_count":611,"distinct_count":611,"null_count":0,"avg_size":9,"histo_col_type":"INT8","histo_buckets":[{"num_eq":0,"num_range":0,"distinct_range":0,"upper_bound":"_e12c22bb0312e7872c49884f8304d882"},{"num_eq":1,"num_range":0,"distinct_range":2.2737367544323206e-13,"upper_bound":"_f899139df5e1059396431415e770c6dd"},{"num_eq":1,"num_range":2,"distinct_range":1.6666666666666667,"upper_bound":"_c9e1074f5b3f9fc8ea15d152add07294"},{"num_eq":1,"num_range":2,"distinct_range":2.333193244846421,"upper_bound":"_3d087c2780c5886823ecc7f25321ff48"},{"num_eq":1,"num_range":2,"distinct_range":2.000000669039904,"upper_bound":"_43bf65a6a5fd236610bd7761a249300b"}],"histo_version":3},{"name":"__auto__","created_at":"2024-11-01 07:49:30.126659 +0000 UTC","columns":["1"],"row_count":578,"distinct_count":578,"null_count":0,"avg_size":9,"histo_col_type":"INT8","histo_buckets":[{"num_eq":0,"num_range":0,"distinct_range":0,"upper_bound":"_e12c22bb0312e7872c49884f8304d882"},{"num_eq":1,"num_range":0,"distinct_range":1.1368683772161603e-13,"upper_bound":"_f899139df5e1059396431415e770c6dd"},{"num_eq":1,"num_range":1,"distinct_range":1,"upper_bound":"_6974ce5ac660610b44d9b9fed0ff9548"},{"num_eq":1,"num_range":1,"distinct_range":1,"upper_bound":"_65b9eea6e1cc6bb9f0cd2a47751a186f"},{"num_eq":1,"num_range":1,"distinct_range":1.0000501921926213,"upper_bound":"_3d087c2780c5886823ecc7f25321ff48"}],"histo_version":3},{"name":"__auto__","created_at":"2024-11-01 07:49:14.485717 +0000 UTC","columns":["1"],"row_count":578,"distinct_count":578,"null_count":0,"avg_size":9,"histo_col_type":"INT8","histo_buckets":[{"num_eq":0,"num_range":0,"distinct_range":0,"upper_bound":"_e12c22bb0312e7872c49884f8304d882"},{"num_eq":1,"num_range":0,"distinct_range":1.1368683772161603e-13,"upper_bound":"_f899139df5e1059396431415e770c6dd"},{"num_eq":1,"num_range":1,"distinct_range":1,"upper_bound":"_6974ce5ac660610b44d9b9fed0ff9548"},{"num_eq":1,"num_range":1,"distinct_range":1,"upper_bound":"_65b9eea6e1cc6bb9f0cd2a47751a186f"},{"num_eq":1,"num_range":1,"distinct_range":1.0000501921926213,"upper_bound":"_3d087c2780c5886823ecc7f25321ff48"}],"histo_version":3},{"name":"__auto__","created_at":"2024-11-01 07:40:19.774636 +0000 UTC","columns":["1"],"row_count":501,"distinct_count":501,"null_count":0,"avg_size":9,"histo_col_type":"INT8","histo_buckets":[{"num_eq":1,"num_range":0,"distinct_range":0,"upper_bound":"_f899139df5e1059396431415e770c6dd"},{"num_eq":1,"num_range":1,"distinct_range":0.9999998900910911,"upper_bound":"_6974ce5ac660610b44d9b9fed0ff9548"},{"num_eq":1,"num_range":1,"distinct_range":0.9999998900910911,"upper_bound":"_65b9eea6e1cc6bb9f0cd2a47751a186f"},{"num_eq":1,"num_range":1,"distinct_range":0.9999998900910911,"upper_bound":"_3d087c2780c5886823ecc7f25321ff48"},{"num_eq":1,"num_range":1,"distinct_range":1.0000008743075084,"upper_bound":"_425e2429f4d57338f8285f749e19007e"}],"histo_version":3}]
(1) ×
Wraps: (2) issue #93892
Wraps: (3) attached stack trace
  -- stack trace:
  | github.com/cockroachdb/cockroach/pkg/util/errorutil.UnexpectedWithIssueErrorf
  | 	pkg/util/errorutil/error.go:21
  | [...repeated from below...]
Wraps: (4) unexpected error
Wraps: (5) attached stack trace
  -- stack trace:
  | github.com/cockroachdb/cockroach/pkg/util/errorutil.UnexpectedWithIssueErrorf
  | 	pkg/util/errorutil/error.go:20
  | github.com/cockroachdb/cockroach/pkg/sql/stats.forecastColumnStatistics
  | 	pkg/sql/stats/forecast.go:381
  | github.com/cockroachdb/cockroach/pkg/sql/stats.ForecastTableStatistics
  | 	pkg/sql/stats/forecast.go:148
  | github.com/cockroachdb/cockroach/pkg/sql/stats.(*TableStatisticsCache).getTableStatsFromDB
  | 	pkg/sql/stats/stats_cache.go:855
  | github.com/cockroachdb/cockroach/pkg/sql/stats.(*TableStatisticsCache).refreshCacheEntry.func1
  | 	pkg/sql/stats/stats_cache.go:502
  | github.com/cockroachdb/cockroach/pkg/sql/stats.(*TableStatisticsCache).refreshCacheEntry
  | 	pkg/sql/stats/stats_cache.go:504
  | github.com/cockroachdb/cockroach/pkg/sql/stats.(*TableStatisticsCache).refreshTableStats.func1
  | 	pkg/sql/stats/stats_cache.go:530
  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
  | 	pkg/util/stop/stopper.go:498
  | runtime.goexit
  | 	src/runtime/asm_amd64.s:1695
Wraps: (6) forecasted histogram had first bucket with non-zero NumRange or DistinctRange: [{"name":"__forecast__","created_at":"2024-11-01 08:09:36.42883425 +0000 UTC","columns":["1"],"row_count":736,"distinct_count":736,"null_count":0,"avg_size":9,"histo_col_type":"INT8","histo_buckets":[{"num_eq":1,"num_range":0,"distinct_range":2.8421709430404e-13,"upper_bound":"_f899139df5e1059396431415e770c6dd"},{"num_eq":1,"num_range":2,"distinct_range":1.6666666666666665,"upper_bound":"_c9e1074f5b3f9fc8ea15d152add07294"},{"num_eq":1,"num_range":37,"distinct_range":37.332640488629686,"upper_bound":"_3d087c2780c5886823ecc7f25321ff48"},{"num_eq":1,"num_range":2,"distinct_range":2.000001418622736,"upper_bound":"_43bf65a6a5fd236610bd7761a249300b"},{"num_eq":1,"num_range":2,"distinct_range":1.999999861337837,"upper_bound":"_71f0b2e1d679fb722d72c71c208d5722"}],"histo_version":3},{"name":"__merged__","created_at":"2024-11-01 08:05:35.503656 +0000 UTC","columns":["1"],"row_count":701,"distinct_count":701,"null_count":0,"avg_size":9,"histo_col_type":"INT8","histo_buckets":[{"num_eq":1,"num_range":0,"distinct_range":2.8421709430404007e-13,"upper_bound":"_f899139df5e1059396431415e770c6dd"},{"num_eq":1,"num_range":2,"distinct_range":1.6666666666666667,"upper_bound":"_c9e1074f5b3f9fc8ea15d152add07294"},{"num_eq":1,"num_range":2,"distinct_range":2.3332740512770016,"upper_bound":"_3d087c2780c5886823ecc7f253...
Stacktrace (expand for inline code snippets):

src/runtime/asm_amd64.s#L1694-L1696
pkg/util/stop/stopper.go#L497-L499
pkg/sql/stats/stats_cache.go#L529-L531
pkg/sql/stats/stats_cache.go#L503-L505
pkg/sql/stats/stats_cache.go#L501-L503
pkg/sql/stats/stats_cache.go#L854-L856
pkg/sql/stats/forecast.go#L147-L149
pkg/sql/stats/forecast.go#L380-L382
pkg/util/errorutil/error.go#L20-L22
src/runtime/asm_amd64.s#L1694-L1696
pkg/util/stop/stopper.go#L497-L499
pkg/sql/stats/stats_cache.go#L529-L531
pkg/sql/stats/stats_cache.go#L503-L505
pkg/sql/stats/stats_cache.go#L501-L503
pkg/sql/stats/stats_cache.go#L854-L856
pkg/sql/stats/forecast.go#L147-L149
pkg/sql/stats/forecast.go#L380-L382
pkg/util/errorutil/error.go#L19-L21

src/runtime/asm_amd64.s in runtime.goexit at line 1695
pkg/util/stop/stopper.go in pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2 at line 498
pkg/sql/stats/stats_cache.go in pkg/sql/stats.(*TableStatisticsCache).refreshTableStats.func1 at line 530
pkg/sql/stats/stats_cache.go in pkg/sql/stats.(*TableStatisticsCache).refreshCacheEntry at line 504
pkg/sql/stats/stats_cache.go in pkg/sql/stats.(*TableStatisticsCache).refreshCacheEntry.func1 at line 502
pkg/sql/stats/stats_cache.go in pkg/sql/stats.(*TableStatisticsCache).getTableStatsFromDB at line 855
pkg/sql/stats/forecast.go in pkg/sql/stats.ForecastTableStatistics at line 148
pkg/sql/stats/forecast.go in pkg/sql/stats.forecastColumnStatistics at line 381
pkg/util/errorutil/error.go in pkg/util/errorutil.UnexpectedWithIssueErrorf at line 21
src/runtime/asm_amd64.s in runtime.goexit at line 1695
pkg/util/stop/stopper.go in pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2 at line 498
pkg/sql/stats/stats_cache.go in pkg/sql/stats.(*TableStatisticsCache).refreshTableStats.func1 at line 530
pkg/sql/stats/stats_cache.go in pkg/sql/stats.(*TableStatisticsCache).refreshCacheEntry at line 504
pkg/sql/stats/stats_cache.go in pkg/sql/stats.(*TableStatisticsCache).refreshCacheEntry.func1 at line 502
pkg/sql/stats/stats_cache.go in pkg/sql/stats.(*TableStatisticsCache).getTableStatsFromDB at line 855
pkg/sql/stats/forecast.go in pkg/sql/stats.ForecastTableStatistics at line 148
pkg/sql/stats/forecast.go in pkg/sql/stats.forecastColumnStatistics at line 381
pkg/util/errorutil/error.go in pkg/util/errorutil.UnexpectedWithIssueErrorf at line 20

Tags

Tag Value
Command server
Environment v24.3.0-alpha.3
Go Version go1.22.5fips X:boringcrypto,nocoverageredesign
Platform linux amd64
Distribution CCL
Cockroach Release v24.3.0-alpha.2-861-gf9918d8f81a
Cockroach SHA f9918d8
# of CPUs 16
# of Goroutines 906

Jira issue: CRDB-43884

@cockroach-sentry cockroach-sentry added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-sentry Originated from an in-the-wild panic report. labels Nov 1, 2024
Copy link

blathers-crl bot commented Nov 1, 2024

Hi @cockroach-sentry, please add branch-* labels to identify which branch(es) this C-bug affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@yuzefovich yuzefovich added T-sql-queries SQL Queries Team X-blathers-triaged blathers was able to find an owner labels Nov 12, 2024
@github-project-automation github-project-automation bot moved this to Triage in SQL Queries Nov 12, 2024
@yuzefovich
Copy link
Member

dup of #93892

@github-project-automation github-project-automation bot moved this from Triage to Done in SQL Queries Nov 12, 2024
@michae2
Copy link
Collaborator

michae2 commented Nov 14, 2024

In this one, it looks like we have some extra distinct count, so we call addOuterBuckets, and then while merging partial stats we call stripOuterBuckets, and then we forecast, and end up with a first bucket with the extra distinct count in the forecasted histogram.

@michae2
Copy link
Collaborator

michae2 commented Nov 14, 2024

Here's a repro:

CREATE TABLE a (
  a INT PRIMARY KEY
) WITH (sql_stats_automatic_collection_enabled = false);

INSERT INTO a SELECT 2 * i * i FROM generate_series(0, 1000) s(i);
CREATE STATISTICS __auto__ FROM a;

INSERT INTO a VALUES (2000002);
SELECT pg_sleep(1);
CREATE STATISTICS __auto__ FROM a;

INSERT INTO a VALUES (2000004);
SELECT pg_sleep(1);
CREATE STATISTICS __auto__ FROM a;

INSERT INTO a VALUES (3000000);
SELECT pg_sleep(2);
CREATE STATISTICS __auto_partial__ FROM a USING EXTREMES;

SHOW STATISTICS FOR TABLE a WITH MERGE, FORECAST;

SELECT jsonb_array_elements(stat->'histo_buckets')
FROM (
SELECT jsonb_array_elements(statistics) AS stat
FROM [SHOW STATISTICS USING JSON FOR TABLE a WITH MERGE, FORECAST]
)
WHERE stat->'name' <@ '["__merged__", "__forecast__"]';

@michae2 michae2 reopened this Nov 15, 2024
@github-project-automation github-project-automation bot moved this from Done to Triage in SQL Queries Nov 15, 2024
@michae2 michae2 self-assigned this Nov 15, 2024
@michae2 michae2 moved this from Triage to Active in SQL Queries Nov 15, 2024
michae2 added a commit to michae2/cockroach that referenced this issue Nov 15, 2024
Near the end of `stats.(*histogram).adjustCounts`, if there is leftover
DistinctCount after adjusting each histogram bucket, we call
`addOuterBuckets`. This function creates two "outer" buckets that fully
cover the lower and upper ends of the domain with the remaining
DistinctCount.

These "outer" buckets interfere with partial stats, so we remove them
with `stripOuterBuckets` when merging partial stats. `stripOuterBuckets`
was simply slicing off the two buckets, but this could potentially leave
non-zero DistinctCount in the first bucket (which is added by
`addOuterBuckets`). `stripOuterBuckets` needs to also zero the
DistinctCount and NumRange of the first bucket to fully undo the actions
of `addOuterBuckets`.

We were only catching this when trying to forecast using the merged
histogram, which would sometimes produce a forecasted histogram with
non-zero DistinctRange in the first bucket, leading to the infamous
"histogram had first bucket with non-zero NumRange or DistinctRange"
Sentry failure.

Unfortunately, I don't think this bug fully explains all of the Sentry
failures that look like cockroachdb#93892 because many of those predate the
introduction of partial stats. I believe this specific failure started
with v24.3.0-alpha.0.

Informs: cockroachdb#93892
Fixes: cockroachdb#134031

(No release note because automatic collection of partial stats is not yet
enabled in any available release.)

Release note: None
michae2 added a commit to michae2/cockroach that referenced this issue Nov 16, 2024
Near the end of `stats.(*histogram).adjustCounts`, if there is leftover
DistinctCount after adjusting each histogram bucket, we call
`addOuterBuckets`. This function creates two "outer" buckets that fully
cover the lower and upper ends of the domain with the remaining
DistinctCount.

These "outer" buckets interfere with partial stats, so we remove them
with `stripOuterBuckets` when merging partial stats. `stripOuterBuckets`
was simply slicing off the two buckets, but this could potentially leave
non-zero DistinctCount in the first bucket (which is added by
`addOuterBuckets`). `stripOuterBuckets` needs to also zero the
DistinctCount and NumRange of the first bucket to fully undo the actions
of `addOuterBuckets`.

We were only catching this when trying to forecast using the merged
histogram, which would sometimes produce a forecasted histogram with
non-zero DistinctRange in the first bucket, leading to the infamous
"histogram had first bucket with non-zero NumRange or DistinctRange"
Sentry failure.

Unfortunately, I don't think this bug fully explains all of the Sentry
failures that look like cockroachdb#93892 because many of those predate the
introduction of partial stats. I believe this specific failure started
with v24.3.0-alpha.0.

Informs: cockroachdb#93892
Fixes: cockroachdb#134031

(No release note because automatic collection of partial stats is not yet
enabled in any available release.)

Release note: None
craig bot pushed a commit that referenced this issue Nov 16, 2024
135310: sql/stats: more fully undo addOuterBuckets in stripOuterBuckets r=yuzefovich a=michae2

Near the end of `stats.(*histogram).adjustCounts`, if there is leftover DistinctCount after adjusting each histogram bucket, we call `addOuterBuckets`. This function creates two "outer" buckets that fully cover the lower and upper ends of the domain with the remaining DistinctCount.

These "outer" buckets interfere with partial stats, so we remove them with `stripOuterBuckets` when merging partial stats. `stripOuterBuckets` was simply slicing off the two buckets, but this could potentially leave non-zero DistinctCount in the first bucket (which is added by `addOuterBuckets`). `stripOuterBuckets` needs to also zero the DistinctCount and NumRange of the first bucket to fully undo the actions of `addOuterBuckets`.

We were only catching this when trying to forecast using the merged histogram, which would sometimes produce a forecasted histogram with non-zero DistinctRange in the first bucket, leading to the infamous "histogram had first bucket with non-zero NumRange or DistinctRange" Sentry failure.

Unfortunately, I don't think this bug fully explains all of the Sentry failures that look like #93892 because many of those predate the introduction of partial stats. I believe this specific failure started with v24.3.0-alpha.0.

Informs: #93892
Fixes: #134031

(No release note because automatic collection of partial stats is not yet enabled in any available release.)

Release note: None

Co-authored-by: Michael Erickson <[email protected]>
@yuzefovich yuzefovich added branch-master Failures and bugs on the master branch. branch-release-24.3 Used to mark GA and release blockers, technical advisories, and bugs for 24.3 labels Nov 19, 2024
craig bot pushed a commit that referenced this issue Nov 20, 2024
135310: sql/stats: more fully undo addOuterBuckets in stripOuterBuckets r=yuzefovich,mgartner a=michae2

Near the end of `stats.(*histogram).adjustCounts`, if there is leftover DistinctCount after adjusting each histogram bucket, we call `addOuterBuckets`. This function creates two "outer" buckets that fully cover the lower and upper ends of the domain with the remaining DistinctCount.

These "outer" buckets interfere with partial stats, so we remove them with `stripOuterBuckets` when merging partial stats. `stripOuterBuckets` was simply slicing off the two buckets, but this could potentially leave non-zero DistinctCount in the first bucket (which is added by `addOuterBuckets`). `stripOuterBuckets` needs to also zero the DistinctCount and NumRange of the first bucket to fully undo the actions of `addOuterBuckets`.

We were only catching this when trying to forecast using the merged histogram, which would sometimes produce a forecasted histogram with non-zero DistinctRange in the first bucket, leading to the infamous "histogram had first bucket with non-zero NumRange or DistinctRange" Sentry failure.

Unfortunately, I don't think this bug fully explains all of the Sentry failures that look like #93892 because many of those predate the introduction of partial stats. I believe this specific failure started with v24.3.0-alpha.0.

Informs: #93892
Fixes: #134031

(No release note because automatic collection of partial stats is not yet enabled in any available release.)

Release note: None

Co-authored-by: Michael Erickson <[email protected]>
@craig craig bot closed this as completed in 90e311d Nov 20, 2024
@github-project-automation github-project-automation bot moved this from Active to Done in SQL Queries Nov 20, 2024
blathers-crl bot pushed a commit that referenced this issue Nov 20, 2024
Near the end of `stats.(*histogram).adjustCounts`, if there is leftover
DistinctCount after adjusting each histogram bucket, we call
`addOuterBuckets`. This function creates two "outer" buckets that fully
cover the lower and upper ends of the domain with the remaining
DistinctCount.

These "outer" buckets interfere with partial stats, so we remove them
with `stripOuterBuckets` when merging partial stats. `stripOuterBuckets`
was simply slicing off the two buckets, but this could potentially leave
non-zero DistinctCount in the first bucket (which is added by
`addOuterBuckets`). `stripOuterBuckets` needs to also zero the
DistinctCount and NumRange of the first bucket to fully undo the actions
of `addOuterBuckets`.

We were only catching this when trying to forecast using the merged
histogram, which would sometimes produce a forecasted histogram with
non-zero DistinctRange in the first bucket, leading to the infamous
"histogram had first bucket with non-zero NumRange or DistinctRange"
Sentry failure.

Unfortunately, I don't think this bug fully explains all of the Sentry
failures that look like #93892 because many of those predate the
introduction of partial stats. I believe this specific failure started
with v24.3.0-alpha.0.

Informs: #93892
Fixes: #134031

(No release note because automatic collection of partial stats is not yet
enabled in any available release.)

Release note: None
sambhav-jain-16 pushed a commit to sambhav-jain-16/cockroach that referenced this issue Nov 20, 2024
Near the end of `stats.(*histogram).adjustCounts`, if there is leftover
DistinctCount after adjusting each histogram bucket, we call
`addOuterBuckets`. This function creates two "outer" buckets that fully
cover the lower and upper ends of the domain with the remaining
DistinctCount.

These "outer" buckets interfere with partial stats, so we remove them
with `stripOuterBuckets` when merging partial stats. `stripOuterBuckets`
was simply slicing off the two buckets, but this could potentially leave
non-zero DistinctCount in the first bucket (which is added by
`addOuterBuckets`). `stripOuterBuckets` needs to also zero the
DistinctCount and NumRange of the first bucket to fully undo the actions
of `addOuterBuckets`.

We were only catching this when trying to forecast using the merged
histogram, which would sometimes produce a forecasted histogram with
non-zero DistinctRange in the first bucket, leading to the infamous
"histogram had first bucket with non-zero NumRange or DistinctRange"
Sentry failure.

Unfortunately, I don't think this bug fully explains all of the Sentry
failures that look like cockroachdb#93892 because many of those predate the
introduction of partial stats. I believe this specific failure started
with v24.3.0-alpha.0.

Informs: cockroachdb#93892
Fixes: cockroachdb#134031

(No release note because automatic collection of partial stats is not yet
enabled in any available release.)

Release note: None
sambhav-jain-16 pushed a commit to sambhav-jain-16/cockroach that referenced this issue Nov 20, 2024
Near the end of `stats.(*histogram).adjustCounts`, if there is leftover
DistinctCount after adjusting each histogram bucket, we call
`addOuterBuckets`. This function creates two "outer" buckets that fully
cover the lower and upper ends of the domain with the remaining
DistinctCount.

These "outer" buckets interfere with partial stats, so we remove them
with `stripOuterBuckets` when merging partial stats. `stripOuterBuckets`
was simply slicing off the two buckets, but this could potentially leave
non-zero DistinctCount in the first bucket (which is added by
`addOuterBuckets`). `stripOuterBuckets` needs to also zero the
DistinctCount and NumRange of the first bucket to fully undo the actions
of `addOuterBuckets`.

We were only catching this when trying to forecast using the merged
histogram, which would sometimes produce a forecasted histogram with
non-zero DistinctRange in the first bucket, leading to the infamous
"histogram had first bucket with non-zero NumRange or DistinctRange"
Sentry failure.

Unfortunately, I don't think this bug fully explains all of the Sentry
failures that look like cockroachdb#93892 because many of those predate the
introduction of partial stats. I believe this specific failure started
with v24.3.0-alpha.0.

Informs: cockroachdb#93892
Fixes: cockroachdb#134031

(No release note because automatic collection of partial stats is not yet
enabled in any available release.)

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. branch-release-24.3 Used to mark GA and release blockers, technical advisories, and bugs for 24.3 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-sentry Originated from an in-the-wild panic report. T-sql-queries SQL Queries Team X-blathers-triaged blathers was able to find an owner
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

3 participants