[Enhancement] merge full/sample statistics collect #52693

Open
wants to merge 20 commits into
base: main
Contributor

@Seaven Seaven commented Nov 7, 2024

Why I'm doing:

  1. Merge sample and full statistics so that both are saved in the column_statistics table.
  2. Improve the performance of full statistics collection.
  3. Improve some of the metrics in sample statistics.

What I'm doing:

This is the first PR.
Full/sample statistics collection process:

  1. Split count(1)/min/max out of the full statistics query and collect them with a meta query instead.
  2. Collect NDV and null count through the full statistics query only.
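
The split described above can be pictured as a routing table from metric to query type. This is a minimal sketch only: `MetricRoutingSketch`, `Metric`, and `Source` are hypothetical names used for illustration and are not classes from this PR.

```java
// Hypothetical illustration of the metric split; not code from this PR.
final class MetricRoutingSketch {
    enum Metric { ROW_COUNT, MIN, MAX, NDV, NULL_COUNT }

    enum Source { META_QUERY, FULL_QUERY }

    // count(1)/min/max are routed to the meta query;
    // NDV and null count stay on the full statistics query.
    static Source sourceOf(Metric metric) {
        switch (metric) {
            case ROW_COUNT:
            case MIN:
            case MAX:
                return Source.META_QUERY;
            default:
                return Source.FULL_QUERY;
        }
    }

    public static void main(String[] args) {
        for (Metric metric : Metric.values()) {
            System.out.println(metric + " -> " + sourceOf(metric));
        }
    }
}
```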

Serious issues:
For high-cardinality columns (NDV > 0.1%), sample statistics produce a severely distorted NDV:

  1. Unpartitioned table: the HLL-NDV value is capped at min(row_count * 0.1%, 20w), where 20w means 200,000.
  2. Partitioned table: the HLL-NDV value is capped at partition_nums * min(row_count * 0.1%, 20w).

We will handle this problem later, maybe by dropping HLL or by using another algorithm.
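
To make the two caps concrete, here is a small worked sketch of the arithmetic. It mirrors the formulas literally and assumes that "20w" means 200,000; the description does not say whether row_count is table-wide or per partition, so the sketch takes it as a plain parameter. `NdvCapSketch` and its method names are hypothetical.

```java
// Hypothetical helper mirroring the two formulas in the description.
final class NdvCapSketch {
    // Assumption: "20w" is 200,000.
    static final long MAX_ROWS = 200_000L;

    // min(row_count * 0.1%, 20w); integer division by 1000 stands in for 0.1%.
    static long capUnpartitioned(long rowCount) {
        return Math.min(rowCount / 1000, MAX_ROWS);
    }

    // partition_nums * min(row_count * 0.1%, 20w)
    static long capPartitioned(int partitionNums, long rowCount) {
        return partitionNums * capUnpartitioned(rowCount);
    }

    public static void main(String[] args) {
        System.out.println(capUnpartitioned(1_000_000L));
        System.out.println(capUnpartitioned(1_000_000_000L));
        System.out.println(capPartitioned(10, 1_000_000_000L));
    }
}
```

Under these assumptions, a column with a row_count of one billion can report an NDV of at most 200,000 in the unpartitioned case, however many distinct values it really holds, which is the distortion the description refers to.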

Code changes:

  1. Refactor some code in preparation for merging the sample/full statistics code later.
  2. Add HyperStatisticsJob to refactor the sample/full statistics job process.
  3. Update the column statistics query process to query only the column_statistics table.

The HyperStatisticsJob process is the same as FullStatisticsJob:

  1. Collect metrics by query and save them in batches in FE (refactored into HyperQueryJob and FullQueryJob; SampleQueryJob will be added later).
  2. Insert the batched values into column_statistics.
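
The two-step process above (collect by query, buffer in FE, then insert in batches) can be sketched roughly as follows. This is not the PR's implementation: `BatchCollectSketch`, its `QueryJob` interface, the `String` row type, and the `sink` callback standing in for the insert into column_statistics are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of the collect-then-batch-insert loop; not code from this PR.
final class BatchCollectSketch {
    // Stand-in for one query job; each run yields rows destined for column_statistics.
    interface QueryJob {
        List<String> run();
    }

    // Runs every job, buffers the results, and hands a batch to the sink
    // whenever the buffer reaches batchSize. Returns the number of flushes.
    static int collect(List<QueryJob> jobs, int batchSize, Consumer<List<String>> sink) {
        List<String> buffer = new ArrayList<>();
        int flushes = 0;
        for (QueryJob job : jobs) {
            buffer.addAll(job.run());
            if (buffer.size() >= batchSize) {
                sink.accept(new ArrayList<>(buffer)); // copy so the sink owns its batch
                buffer.clear();
                flushes++;
            }
        }
        if (!buffer.isEmpty()) {
            sink.accept(new ArrayList<>(buffer)); // flush the remainder
            flushes++;
        }
        return flushes;
    }

    public static void main(String[] args) {
        List<QueryJob> jobs = new ArrayList<>();
        jobs.add(() -> List.of("c1 stats", "c2 stats"));
        jobs.add(() -> List.of("c3 stats"));
        int flushes = collect(jobs, 2, batch -> System.out.println("insert " + batch));
        System.out.println("flushes = " + flushes);
    }
}
```

Buffering in FE keeps the number of insert statements small; the batch size is the knob that trades FE memory against insert frequency.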

Roadmap

Next steps:

  1. Support the ANALYZE statement on partitions.
  2. Update the sample statistics NDV algorithm.
  3. Remove the SampleStatisticsJob/FullStatisticsJob code.

Fixes #issue

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 3.3
    • 3.2
    • 3.1
    • 3.0
    • 2.5

@Seaven Seaven requested a review from a team as a code owner November 7, 2024 07:48
@mergify mergify bot assigned Seaven Nov 7, 2024
@Seaven Seaven changed the title from "[Enhancement] support meta statistics" to "[Enhancement] merge full/sample statistics collect" Nov 13, 2024
Signed-off-by: Seaven <[email protected]>
@Seaven Seaven requested a review from a team as a code owner November 14, 2024 08:58
@@ -2098,6 +2098,9 @@ public class Config extends ConfigBase {
"we would use sample statistics instead of full statistics")
public static double statistic_sample_collect_ratio_threshold_of_first_load = 0.1;

@ConfField(mutable = true)
public static boolean statistic_use_meta_statistics = true;
Contributor

Can you add a comment on it?

Contributor Author

It's a temporary config on main; I will remove it in the next PR.

import java.util.List;
import java.util.Map;

public class HyperStatisticsCollectJob extends StatisticsCollectJob {
Contributor

Hyper?
I guess it's actually a regular sample-like collection job.

Contributor Author

I think it is hyper, because both Full and Sample use it.


import java.util.List;

public class ColumnClassifier {
Contributor

Please add a comment here.

Contributor Author

@Seaven Seaven Nov 18, 2024

Refactored from com.starrocks.statistic.sample.ColumnSampleManager; only one of the two will be kept in the next PR. It classifies columns, because different column types need different collection methods.


public abstract class ColumnStats {

protected final String columnName;
Contributor

Consider adding columnId, since SR already supports renaming columns?

Contributor Author

@Seaven Seaven Nov 18, 2024

Refactored from com.starrocks.statistic.sample.ColumnStats; only one of the two will be kept in the next PR. I think adding columnId is complex work, so I won't change it in this PR.

import java.util.List;
import java.util.stream.Collectors;

public class SubFieldColumnStats extends PrimitiveTypeColumnStats {
Contributor

How is it used? How are they stored in memory if there are thousands of fields in a struct?

Contributor Author

Refactored from com.starrocks.statistic.sample.SubFieldColumnStats; only one of the two will be kept in the next PR. It's a utility class for generating statistics SQL; it is not stored in memory.

Signed-off-by: Seaven <[email protected]>
sonarcloud bot commented Nov 18, 2024

Quality Gate failed

Failed conditions
6.0% Duplication on New Code (required ≤ 3%)
B Reliability Rating on New Code (required ≥ A)

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)

[FE Incremental Coverage Report]

pass : 547 / 648 (84.41%)

file detail

path covered_line new_line coverage not_covered_line_detail
🔵 com/starrocks/statistic/base/ComplexTypeColumnStats.java 2 9 22.22% [28, 33, 38, 43, 48, 53, 58]
🔵 com/starrocks/statistic/sample/SampleInfo.java 7 17 41.18% [44, 45, 46, 47, 48, 49, 50, 51, 52, 53]
🔵 com/starrocks/qe/SessionVariable.java 2 4 50.00% [2583, 2584]
🔵 com/starrocks/statistic/StatisticExecutor.java 9 14 64.29% [145, 146, 167, 168, 169]
🔵 com/starrocks/statistic/hyper/MetaQueryJob.java 52 74 70.27% [66, 67, 68, 72, 73, 74, 109, 110, 111, 113, 114, 115, 117, 119, 120, 121, 122, 123, 124, 125, 126, 140]
🔵 com/starrocks/statistic/HyperStatisticsCollectJob.java 57 77 74.03% [66, 95, 96, 97, 110, 111, 112, 113, 115, 116, 117, 119, 148, 149, 150, 151, 152, 154, 159, 161]
🔵 com/starrocks/statistic/base/TabletSampler.java 14 18 77.78% [44, 45, 46, 70]
🔵 com/starrocks/statistic/base/PartitionSampler.java 54 63 85.71% [76, 106, 107, 108, 109, 110, 111, 113, 115]
🔵 com/starrocks/statistic/StatisticsCollectJobFactory.java 13 15 86.67% [120, 135]
🔵 com/starrocks/statistic/hyper/HyperQueryJob.java 124 135 91.85% [82, 83, 84, 86, 87, 88, 89, 94, 211, 219, 221]
🔵 com/starrocks/statistic/hyper/HyperStatisticSQLs.java 49 53 92.45% [35, 164, 169, 170]
🔵 com/starrocks/statistic/hyper/ConstQueryJob.java 31 33 93.94% [45, 59]
🔵 com/starrocks/statistic/hyper/SampleQueryJob.java 15 16 93.75% [48]
🔵 com/starrocks/statistic/base/SubFieldColumnStats.java 16 17 94.12% [66]
🔵 com/starrocks/statistic/base/ColumnClassifier.java 44 45 97.78% [83]
🔵 com/starrocks/statistic/base/PrimitiveTypeColumnStats.java 25 25 100.00% []
🔵 com/starrocks/common/Config.java 1 1 100.00% []
🔵 com/starrocks/qe/StmtExecutor.java 1 1 100.00% []
🔵 com/starrocks/statistic/base/ColumnStats.java 10 10 100.00% []
🔵 com/starrocks/statistic/hyper/FullQueryJob.java 21 21 100.00% []

[BE Incremental Coverage Report]

pass : 0 / 0 (0%)
