[ML] Use random sampler for field statistics table in Discover and Data visualizer #138953

qn895 · 2022-08-16T19:23:01Z

Summary

Follow-up to #136150. This PR replaces the previously used sampler aggregation with the new random sampler aggregation in the field statistics table. Changes include:

Removes the shard size controls in the Data visualizer, as well as references to samplerShardSize
Automatically picks the optimal probability for the field statistics table in Discover
For indices that have no default time field (that is, are not time based), it first makes a request that runs a value_count aggregation on any available field. Whether the field is populated is not important, as we only need to know the sample size to calculate the optimal probability.
Updates the logic so that the initial p will be 1e-5 (instead of the previous 1e-6)
Updates the threshold so that if the number of sampled docs is < 109, it proceeds with a plain aggregation without sampling.
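
The two-step flow in the list above (probe with a small initial probability, then either fall back to an unsampled aggregation or pick a larger probability) can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `nextProbability`, `minSampledDocs` and `targetSampleSize` are hypothetical names, and the PR does not spell out the formula it uses. The only external constraint assumed is Elasticsearch's rule that a `random_sampler` probability must be greater than 0 and less than 0.5, or exactly 1.

```typescript
// Initial probability used for the probing request (value taken from the PR summary).
const INITIAL_PROBABILITY = 1e-5;

/**
 * Hypothetical helper: decides which probability to use after the probing request.
 * Returns 1 to signal "run the aggregation without sampling".
 *
 * @param sampledDocCount  number of docs seen by the probing request
 * @param minSampledDocs   threshold below which sampling is skipped (assumption)
 * @param targetSampleSize number of docs we would like the sampler to see (assumption)
 */
function nextProbability(
  sampledDocCount: number,
  minSampledDocs: number,
  targetSampleSize: number
): number {
  // Too few docs in the probe: the index is small, so sampling is not worth it.
  if (sampledDocCount < minSampledDocs) {
    return 1;
  }

  // Estimate the total number of docs from the probe, then scale to the target.
  const estimatedTotal = sampledDocCount / INITIAL_PROBABILITY;
  const probability = targetSampleSize / estimatedTotal;

  // random_sampler only accepts 0 < p < 0.5 or exactly 1.
  return probability < 0.5 ? probability : 1;
}
```

For example, a probe that sees 1,000 docs at p = 1e-5 implies roughly 100 million docs in total, so a target of 50,000 sampled docs gives a probability of about 5e-4.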

Checklist

Delete any items that are not applicable to this PR.

Any text added follows EUI's writing guidelines, uses sentence case text and includes i18n support
Documentation was added for features that require explanation or tutorials
Unit or functional tests were updated or added to match the most common scenarios
Any UI touched in this PR is usable by keyboard only (learn more about keyboard accessibility)
Any UI touched in this PR does not create any new axe failures (run axe in browser: FF, Chrome)
If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the docker list
This renders correctly on smaller devices using a responsive layout. (You can test this in your browser)
This was checked for cross-browser compatibility

Risk Matrix

Delete this section if it is not applicable to this PR.

Before closing this PR, invite QA, stakeholders, and other developers to identify risks that should be tested prior to the change/feature release.

When forming the risk matrix, consider some of the following examples and how they may potentially impact the change:

Risk	Probability	Severity	Mitigation/Notes
Multiple Spaces—unexpected behavior in non-default Kibana Space.	Low	High	Integration tests will verify that all features are still supported in non-default Kibana Space and when user switches between spaces.
Multiple nodes—Elasticsearch polling might have race conditions when multiple Kibana nodes are polling for the same tasks.	High	Low	Tasks are idempotent, so executing them multiple times will not result in logical error, but will degrade performance. To test for this case we add plenty of unit tests around this logic and document manual testing procedure.
Code should gracefully handle cases when feature X or plugin Y are disabled.	Medium	High	Unit tests will verify that any feature flag or plugin combination still results in our service operational.
See more potential risk examples

For maintainers

This was checked for breaking API changes and was labeled appropriately

elasticmachine · 2022-08-23T21:00:47Z

Pinging @elastic/ml-ui (:ml)

kibana-ci · 2022-08-23T22:32:37Z

💛 Build succeeded, but was flaky

Failed CI Steps

FTR Configs #1

Test Failures

[job] [logs] FTR Configs #1 / machine learning - data visualizer field statistics in Discover when enabled with farequote index pattern displays the 'Field statistics' table content correctly

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id	before	after	diff
`dataVisualizer`	378	365	-13

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id	before	after	diff
`dataVisualizer`	565.7KB	556.2KB	-9.5KB
`discover`	462.3KB	462.3KB	+33.0B
total			-9.5KB

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id	before	after	diff
`dataVisualizer`	20.4KB	20.4KB	-45.0B

History

💚 Build #66870 succeeded b5766ee
💔 Build #66807 failed a200eba
💔 Build #65863 failed e775c8b
💔 Build #65545 failed 1148f84
💔 Build #65534 failed 2bd9db0

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @qn895

peteharverson · 2022-08-24T14:23:31Z

x-pack/plugins/data_visualizer/public/application/common/components/top_values/top_values.tsx

@@ -154,9 +147,9 @@ export const TopValues: FC<Props> = ({ stats, fieldFormat, barColor, compressed,
            <EuiText size="xs" textAlign={'center'}>
              <FormattedMessage
                id="xpack.dataVisualizer.dataGrid.field.topValues.calculatedFromSampleDescription"
-                defaultMessage="Calculated from sample of {topValuesSamplerShardSize} documents per shard"
+                defaultMessage="Calculated from sample of {topValuesSampleSize} documents"


Is this message still needed? If random sampler isn't turned on, then there is no 'sampling', is there? And if random sampling is being used, it mirrors the doc count displayed at the top.

peteharverson · 2022-08-25T11:21:56Z

I am seeing this error in the console quite often when switching data views, or changing the query / filter in the view:

Is there anything that can be done to suppress this?

jgowdyelastic · 2022-08-25T09:40:49Z

...zer/public/application/index_data_visualizer/embeddables/grid_embeddable/grid_embeddable.tsx

@@ -141,6 +172,7 @@ export const EmbeddableWrapper = ({
      showPreviewByDefault={input?.showPreviewByDefault}
      onChange={onOutputChange}
      loading={progress < 100}
+      totalCount={overallStats?.documentCountStats?.totalCount ?? 0}


The types suggest overallStats will always be defined.

jgowdyelastic · 2022-08-25T09:42:03Z

...a_visualizer/public/application/index_data_visualizer/hooks/use_data_visualizer_grid_data.ts

    ]
  );

  const { overallStats, progress: overallStatsProgress } = useOverallStats(
    fieldStatsRequest,
    lastRefresh,
    browserSessionSeed,
-    dataVisualizerListState.probability
+    input?.samplingMode === 'autoRandomSampler' ? null : dataVisualizerListState.probability


The types suggest input will always be defined.

jgowdyelastic · 2022-08-25T09:42:12Z

...a_visualizer/public/application/index_data_visualizer/hooks/use_data_visualizer_grid_data.ts

-    dataVisualizerListState
+    dataVisualizerListState,
+    (dataVisualizerListState.probability === null
+      ? overallStats?.documentCountStats?.probability


The types suggest overallStats will always be defined.

jgowdyelastic · 2022-08-25T09:48:58Z

...zer/public/application/index_data_visualizer/embeddables/grid_embeddable/grid_embeddable.tsx

+   * The preferred mode for sampling data for the field statistics
+   * default as 'autoRandomSampler'
+   */
+  samplingMode?: string;


Do we know the supported values for the sampling mode?
autoRandomSampler and what else?
This could be a union of all allowed types.
At the moment the code suggests it could just be:

samplingMode?: 'autoRandomSampler';
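
One way to keep such a union easy to extend is to derive it from a constant list, sketched below. Only `'autoRandomSampler'` appears in the PR; `SAMPLING_MODES`, `SamplingMode`, `isSamplingMode` and `GridEmbeddableInputSketch` are hypothetical names used for illustration.

```typescript
// Single source of truth for the supported modes. Only 'autoRandomSampler'
// is known from the PR; any further modes would be appended here.
const SAMPLING_MODES = ['autoRandomSampler'] as const;

// Union of all allowed values, derived from the list above.
type SamplingMode = typeof SAMPLING_MODES[number];

// Runtime guard for values arriving from untyped sources (e.g. saved state).
function isSamplingMode(value: unknown): value is SamplingMode {
  return typeof value === 'string' && SAMPLING_MODES.some((mode) => mode === value);
}

// Hypothetical stand-in for the embeddable input type touched in the diff.
interface GridEmbeddableInputSketch {
  /** The preferred mode for sampling data for the field statistics. */
  samplingMode?: SamplingMode;
}

const exampleInput: GridEmbeddableInputSketch = { samplingMode: 'autoRandomSampler' };
```

With this shape, a typo such as `'autoRandomSample'` is rejected at compile time instead of silently falling through a string comparison.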

jgowdyelastic · 2022-08-25T13:31:20Z

...ublic/application/index_data_visualizer/search_strategy/requests/build_random_sampler_agg.ts

+  aggs: any,
+  probability: number | null,
+  seed: number
+): Record<string, estypes.AggregationsAggregationContainer> {


It looks like there are a couple of versions of this type:
https://github.com/qn895/kibana/blob/75b2944216cda3f30dffc048972401a5e65e0af2/x-pack/plugins/data_visualizer/common/types/field_stats.ts#L238

https://github.com/qn895/kibana/blob/75b2944216cda3f30dffc048972401a5e65e0af2/x-pack/plugins/data_visualizer/common/utils/datafeed_utils.ts#L11

IMO the type Aggregation = Record<string, estypes.AggregationsAggregationContainer>; is better and matches how aggregations are described in the es client types.

If we clean these up and choose one type, this function could return Record<Aggregation>

jgowdyelastic · 2022-08-25T13:33:04Z

...ublic/application/index_data_visualizer/search_strategy/requests/build_random_sampler_agg.ts

+ * Wraps the supplied aggregations in a random sampler aggregation.
+ */
+export function buildRandomSamplerAggregation(
+  aggs: any,


Can this any be replaced with the correct type?
Looks like the same type Aggregation = Record<string, estypes.AggregationsAggregationContainer>; as the previous comment
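
A typed version of the wrapper might look like the sketch below. To keep it self-contained, `AggregationContainer` is a local stand-in for `estypes.AggregationsAggregationContainer`, and the `probability === 1` passthrough is an assumption; the diff elides the branch before the `return`, and the `null` case from the original signature is left out because the thread does not show how it is handled.

```typescript
// Local stand-in for estypes.AggregationsAggregationContainer (assumption,
// used only so this sketch compiles without the Elasticsearch client types).
type AggregationContainer = Record<string, unknown>;
type Aggregation = Record<string, AggregationContainer>;

/**
 * Wraps the supplied aggregations in a random sampler aggregation.
 */
function buildRandomSamplerAggregation(
  aggs: Aggregation,
  probability: number,
  seed: number
): Aggregation {
  // Assumption: a probability of 1 means "no sampling", so skip the wrapper.
  if (probability === 1) {
    return aggs;
  }

  return {
    sample: {
      // random_sampler takes the sampling probability and an optional seed.
      random_sampler: { probability, seed },
      aggs,
    },
  };
}
```

Replacing `any` with `Aggregation` lets the compiler reject callers that pass something other than a map of named aggregations.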

jgowdyelastic · 2022-08-25T13:35:03Z

...ublic/application/index_data_visualizer/search_strategy/requests/build_random_sampler_agg.ts

+  return {
+    sample: {
+      aggs,
+      // @ts-expect-error AggregationsAggregationContainer needs to be updated with random_sampler


Has an issue been raised in the elasticsearch-specification repo to correct these types?

jgowdyelastic · 2022-08-25T13:38:54Z

...ublic/application/index_data_visualizer/search_strategy/requests/build_random_sampler_agg.ts

+  }
+
+  return {
+    sample: {


Looks like sample could be pulled out here and made into a constant. This could then be used throughout the code where things like const aggsPath = ['sample']; are used.
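
As a rough sketch of that suggestion, the wrapper key can live in one constant that both the request builder and the response readers share. `RANDOM_SAMPLER_WRAPPER`, `getAggsPath` and `getNested` are hypothetical names, not identifiers from the PR.

```typescript
// Hypothetical shared constant for the name of the wrapping aggregation.
const RANDOM_SAMPLER_WRAPPER = 'sample';

// Path prefix to prepend when reading results out of a response.
function getAggsPath(isSampled: boolean): string[] {
  return isSampled ? [RANDOM_SAMPLER_WRAPPER] : [];
}

// Small helper that walks a nested object along a path of keys.
function getNested(source: unknown, path: string[]): unknown {
  return path.reduce<unknown>(
    (current, key) =>
      typeof current === 'object' && current !== null
        ? (current as Record<string, unknown>)[key]
        : undefined,
    source
  );
}

// Example response fragment: results are nested under the wrapper when sampled.
const sampledResponse = { [RANDOM_SAMPLER_WRAPPER]: { doc_count: 10, field_count: { value: 3 } } };
const fieldCount = getNested(sampledResponse, [...getAggsPath(true), 'field_count', 'value']);
```

If the wrapper is ever renamed, only the constant changes, and the request and response code cannot drift apart.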

peteharverson · 2022-08-25T14:35:04Z

..._visualizer/public/application/common/components/stats_table/data_visualizer_stats_table.tsx

@@ -221,7 +223,7 @@ export const DataVisualizerTable = <T extends DataVisualizerTableItem>({
          defaultMessage: 'Documents (%)',
        }),
        render: (value: number | undefined, item: DataVisualizerTableItem) => (
-          <DocumentStat config={item} showIcon={dimensions.showIcon} />
+          <DocumentStat config={item} showIcon={dimensions.showIcon} totalCount={totalCount} />


For a data view without a time field, you get Infinity % for percentages in the document stats and top values:
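
One plausible cause, not confirmed in this thread, is a division by a total count of 0: in JavaScript a positive number divided by 0 evaluates to `Infinity`. A guard along these lines would avoid rendering it; `getDocPercentage` is a hypothetical helper, not a function from the PR.

```typescript
/**
 * Hypothetical helper: percentage of documents containing a field.
 * Returns undefined when the total is missing, zero, or otherwise unusable,
 * so the caller can render a placeholder instead of "Infinity %".
 */
function getDocPercentage(count: number, totalCount: number): number | undefined {
  if (!Number.isFinite(totalCount) || totalCount <= 0) {
    return undefined;
  }
  return (count / totalCount) * 100;
}
```

Returning `undefined` rather than 0 keeps "no total available" distinguishable from "field present in 0% of documents".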

walterra · 2022-08-29T09:32:49Z

Regarding the console error Pete was seeing, we have some reference code in explain log rate spikes that shows how to handle this: https://github.com/elastic/kibana/blob/main/x-pack/plugins/aiops/public/hooks/use_document_count_stats.ts#L134

peteharverson · 2022-11-17T15:13:24Z

Closing as replaced by #144646.

Summary

This PR removes the beta badge for the Field statistics table.

Points of consideration for keeping the beta badge:

Easier for us to keep collecting more user feedback.
Potentially switching to using the new random sampler for aggregation for the field statistics table (#138953) in the next release. Currently, we are pausing this work to match up with the popover (#139072 and #140667) and to fine-tune the user experience/performance.

Points of consideration for removing the beta badge:

The field stats table has been available to users since 8.1, and has been in use within ML since 7.x. We should be defining clear criteria for when it can be moved to GA.

qn895 added 19 commits August 15, 2022 14:08

[ML] Use random sampler for field stats

0fa735a

[ML] Use random sampler for field stats 2

fb6e112

[ML] Fix count not matching in summary table

7b2b265

Add logic for Discover side management

5fbc10b

Clean up

e4f786e

Set probability to required

2bd9db0

Fix total count for file upload, numeric requests broken, fix types

1148f84

Merge remote-tracking branch 'upstream/main' into ml-dv-random-sampler-table

8f83d50

Fix time field

e775c8b

Fix time field

88f59ec

Separate start and cancel hooks

74c5b71

Remove samplerShardSize control

4a65785

Remove references to samplerShardSize

b1f9521

Remove references to topValuesSamplerSize

a72670c

Update tests

b5ff7b0

Merge remote-tracking branch 'upstream/main' into ml-dv-random-sampler-table

1a26cb1

Update tests

757c4a5

Fix logic for indices without time field, fix translations

3a2fff6

Merge branch 'ml-dv-random-sampler-table' of https://github.com/qn895/kibana into ml-dv-random-sampler-table

a200eba

qn895 changed the title ~~[ML] WIP - Use random sampler for field statistics table in Discover and Data visualizer~~ [ML] Use random sampler for field statistics table in Discover and Data visualizer Aug 23, 2022

qn895 self-assigned this Aug 23, 2022

qn895 added enhancement New value added to drive a business result :ml Feature:File and Index Data Viz ML file and index data visualizer v8.5.0 release_note:enhancement and removed enhancement New value added to drive a business result labels Aug 23, 2022

qn895 requested review from peteharverson, walterra and jgowdyelastic August 23, 2022 16:06

Fix linting

b5766ee

qn895 marked this pull request as ready for review August 23, 2022 21:00

qn895 requested review from a team as code owners August 23, 2022 21:00

qn895 added the ci:cloud-deploy Create or update a Cloud deployment label Aug 23, 2022

qn895 added 2 commits August 23, 2022 16:16

Add clarifying comment for threshold

5ab3c2d

Merge remote-tracking branch 'upstream/main' into ml-dv-random-sampler-table

75b2944

peteharverson reviewed Aug 24, 2022

jgowdyelastic reviewed Aug 25, 2022

peteharverson reviewed Aug 25, 2022

qn895 mentioned this pull request Sep 19, 2022

[ML] Remove beta badge for Field statistics table in Discover #140991

Merged

9 tasks

jughosta mentioned this pull request Sep 22, 2022

[Discover][Lens] Meta - Unified field list #137779

Closed

31 tasks

peteharverson added v8.6.0 and removed v8.5.0 labels Sep 22, 2022

peteharverson closed this Nov 17, 2022
