Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql/stats: change order of operations in histogram.adjustCounts #86493

Open
michae2 opened this issue Aug 19, 2022 · 1 comment
Open

sql/stats: change order of operations in histogram.adjustCounts #86493

michae2 opened this issue Aug 19, 2022 · 1 comment
Assignees
Labels
A-sql-table-stats Table statistics (and their automatic refresh). C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-queries SQL Queries Team

Comments

@michae2
Copy link
Collaborator

michae2 commented Aug 19, 2022

Follow up from #79872. This issue tracks this TODO in histogram.adjustCounts:

// TODO(michae2): Consider moving this section above the sections adjusting
// NumEq and NumRange for distinct counts. This would help the adjustments be
// less surprising in some cases.

We use histogram.adjustCounts to adjust an existing histogram to match a given row count and distinct count. It is used both by statistics collection and statistics forecasting. While working on statistics forecasting, I noticed that sometimes adjustCounts produces unexpected adjustments, and I believe this is because it accounts for the new distinct count before accounting for the new row count. If those were reversed, the adjustments would be better (at least in simple cases).

For example:

CREATE TABLE j (t TIME PRIMARY KEY) WITH (sql_stats_automatic_collection_enabled = false);
ALTER TABLE j INJECT STATISTICS '[
      {
          "avg_size": 1,
          "columns": [
              "t"
          ],
          "created_at": "2022-08-13 00:00:00.000000",
          "distinct_count": 6,
          "histo_buckets": [
              {
                  "distinct_range": 0,
                  "num_eq": 1,
                  "num_range": 0,
                  "upper_bound": "00:01:00"
              },
              {
                  "distinct_range": 2,
                  "num_eq": 1,
                  "num_range": 2,
                  "upper_bound": "00:02:00"
              },
              {
                  "distinct_range": 2,
                  "num_eq": 0,
                  "num_range": 2,
                  "upper_bound": "00:03:00"
              }
          ],
          "histo_col_type": "TIME",
          "histo_version": 2,
          "name": "__auto__",
          "null_count": 0,
          "row_count": 6
      },
      {
          "avg_size": 1,
          "columns": [
              "t"
          ],
          "created_at": "2022-08-14 00:00:00.000000",
          "distinct_count": 10,
          "histo_buckets": [
              {
                  "distinct_range": 0,
                  "num_eq": 2,
                  "num_range": 0,
                  "upper_bound": "00:01:00"
              },
              {
                  "distinct_range": 4,
                  "num_eq": 2,
                  "num_range": 4,
                  "upper_bound": "00:02:00"
              },
              {
                  "distinct_range": 4,
                  "num_eq": 0,
                  "num_range": 4,
                  "upper_bound": "00:03:00"
              }
          ],
          "histo_col_type": "TIME",
          "histo_version": 2,
          "name": "__auto__",
          "null_count": 0,
          "row_count": 12
      },
      {
          "avg_size": 1,
          "columns": [
              "t"
          ],
          "created_at": "2022-08-15 00:00:00.000000",
          "distinct_count": 14,
          "histo_buckets": [
              {
                  "distinct_range": 0,
                  "num_eq": 3,
                  "num_range": 0,
                  "upper_bound": "00:01:00"
              },
              {
                  "distinct_range": 6,
                  "num_eq": 3,
                  "num_range": 6,
                  "upper_bound": "00:02:00"
              },
              {
                  "distinct_range": 6,
                  "num_eq": 0,
                  "num_range": 6,
                  "upper_bound": "00:03:00"
              }
          ],
          "histo_col_type": "TIME",
          "histo_version": 2,
          "name": "__auto__",
          "null_count": 0,
          "row_count": 18
      }
]';
SELECT jsonb_pretty(stat->'histo_buckets')
FROM (
  SELECT jsonb_array_elements(statistics) AS stat
  FROM [SHOW STATISTICS USING JSON FOR TABLE j WITH FORECAST]
)
WHERE stat->>'name' = '__forecast__';

Because this is a TIME column we cannot currently forecast the histogram, so instead we call adjustCounts on the latest histogram with row count = 24 and distinct count = 18. I think the adjusted histogram should look like:

  [
      {
          "distinct_range": 0,
          "num_eq": 4,
          "num_range": 0,
          "upper_bound": "00:01:00"
      },
      {
          "distinct_range": 8,
          "num_eq": 4,
          "num_range": 8,
          "upper_bound": "00:02:00"
      },
      {
          "distinct_range": 8,
          "num_eq": 0,
          "num_range": 8,
          "upper_bound": "00:03:00"
      }
  ]

But instead, it currently is:

  [
      {
          "distinct_range": 0,
          "num_eq": 3,
          "num_range": 0,
          "upper_bound": "00:01:00"
      },
      {
          "distinct_range": 8,
          "num_eq": 3,
          "num_range": 9,
          "upper_bound": "00:02:00"
      },
      {
          "distinct_range": 8,
          "num_eq": 0,
          "num_range": 9,
          "upper_bound": "00:03:00"
      }
  ]

Jira issue: CRDB-18781

@michae2 michae2 added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-sql-table-stats Table statistics (and their automatic refresh). T-sql-queries SQL Queries Team labels Aug 19, 2022
@michae2 michae2 self-assigned this Aug 19, 2022
@mgartner mgartner moved this to Backlog (DO NOT ADD NEW ISSUES) in SQL Queries Jul 24, 2023

This comment was marked as resolved.

@mgartner mgartner moved this from Backlog (DO NOT ADD NEW ISSUES) to New Backlog in SQL Queries Apr 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-table-stats Table statistics (and their automatic refresh). C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-queries SQL Queries Team
Projects
Status: Backlog
Development

No branches or pull requests

1 participant