[ML] Add random sampler to Data visualizer document count chart #136150

qn895 · 2022-07-11T21:10:06Z

Summary

This PR addresses #136124 and uses the new random sampler in the Data visualizer document count chart.
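For readers unfamiliar with the aggregation, here is a minimal sketch of a search body that wraps a date histogram in a `random_sampler` aggregation. The helper name, the aggregation names (`sampler`, `eventRate`), and the example field and interval are assumptions for illustration and are not taken from this PR's code.

```typescript
// Hypothetical helper: builds a search body whose date histogram is
// computed on a random sample of documents. Names are illustrative only.
interface DocCountRequestParams {
  timeField: string;
  intervalMs: number;
  probability: number; // 1 means no sampling
  seed?: number; // optional, makes the sample reproducible
}

function buildDocCountRequestBody(params: DocCountRequestParams) {
  const { timeField, intervalMs, probability, seed } = params;
  return {
    size: 0, // only aggregations are needed, not the hits themselves
    aggs: {
      sampler: {
        random_sampler: {
          probability,
          ...(seed !== undefined ? { seed } : {}),
        },
        aggs: {
          eventRate: {
            date_histogram: {
              field: timeField,
              fixed_interval: `${intervalMs}ms`,
              min_doc_count: 0,
            },
          },
        },
      },
    },
  };
}

const body = buildDocCountRequestBody({
  timeField: '@timestamp',
  intervalMs: 3600000,
  probability: 0.001,
  seed: 42,
});
console.log(JSON.stringify(body, null, 2));
```

Nesting the histogram under the sampler is what lets one request shape serve both the sampled and the unsampled case, since only `probability` changes.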

It adds 3 options for sampling:

- If On (automatic use best %) is selected, it first runs a random sampler agg at a default probability of 0.000001. Then, depending on the initial response, it either:
  - Calculates the next best probability to use, calls the random sampler agg again with this calculated probability, and returns the result
  - Determines that the dataset is not suitable for sampling, and therefore calls the random sampler agg with probability = 1 (which is no sampling)
- If On (manually set %) is selected, it shows a slider. When the user first switches to this option, it suggests the last calculated best probability. Once the user picks a probability, it remembers this value for any subsequent queries (such as changing the time range or modifying the queries or filters).
- If Off is selected, it always runs at probability = 1 (which is no sampling)
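The two-step flow for the automatic option could be sketched roughly as below. Only the initial probability of 0.000001 comes from the description above. The target sample size and the formula are assumptions made for illustration, not the calculation used in this PR. The cutoff at 0.5 reflects the Elasticsearch documentation, which describes valid `random_sampler` probabilities as greater than 0 and less than 0.5, or exactly 1.

```typescript
// From the PR description: the probability used for the first probe query.
const INITIAL_PROBABILITY = 0.000001;
// Assumption for this sketch only: aim for roughly this many sampled docs.
const TARGET_SAMPLE_SIZE = 100000;

// Hypothetical helper: picks the probability for the second query from the
// number of documents the first (probe) query sampled.
function chooseProbability(
  sampledDocCount: number,
  initialProbability: number = INITIAL_PROBABILITY,
  targetSampleSize: number = TARGET_SAMPLE_SIZE
): number {
  // Nothing was sampled: treat the data set as not suitable for sampling.
  if (sampledDocCount <= 0) return 1;

  // Scale the sampled count back up to estimate the full data set size.
  const estimatedTotal = sampledDocCount / initialProbability;
  const candidate = targetSampleSize / estimatedTotal;

  // Probabilities of 0.5 or more are not usable, so fall back to no sampling.
  return candidate >= 0.5 ? 1 : candidate;
}

console.log(chooseProbability(0)); // probe sampled nothing, so no sampling
console.log(chooseProbability(40)); // large data set, so a small probability
```

Keeping the decision in a pure function like this makes it easy to unit test separately from the search requests.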

Screen.Recording.2022-07-20.at.14.17.39.mov

This preference is saved in local storage, so it is retained when switching between data views. The previously chosen probability is not retained.
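A minimal sketch of persisting only the sampling preference (and not the probability) might look like this. The storage key, the option values, and the default of automatic are hypothetical, and the in-memory store stands in for the browser's local storage so that the example runs anywhere.

```typescript
// Hypothetical option values; the real constants live in random_sampler.ts.
type RandomSamplerPreference = 'on_automatic' | 'on_manual' | 'off';

// Minimal subset of the Web Storage API used by this sketch.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Hypothetical storage key, not taken from the PR.
const PREFERENCE_KEY = 'dataVisualizer.randomSamplerPreference';

function savePreference(store: KeyValueStore, preference: RandomSamplerPreference): void {
  // Only the mode is stored; the chosen probability is deliberately left out.
  store.setItem(PREFERENCE_KEY, preference);
}

function loadPreference(store: KeyValueStore): RandomSamplerPreference {
  const raw = store.getItem(PREFERENCE_KEY);
  // Fall back to automatic (an assumed default) for missing or unknown values.
  return raw === 'on_automatic' || raw === 'on_manual' || raw === 'off' ? raw : 'on_automatic';
}

// In-memory stand-in; in a browser, window.localStorage fits the same interface.
function createMemoryStore(): KeyValueStore {
  const data = new Map<string, string>();
  return {
    getItem: (key) => data.get(key) ?? null,
    setItem: (key, value) => {
      data.set(key, value);
    },
  };
}

const demoStore = createMemoryStore();
savePreference(demoStore, 'off');
console.log(loadPreference(demoStore));
```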

Checklist

Any text added follows EUI's writing guidelines, uses sentence case text and includes i18n support

lcawl · 2022-07-12T22:58:02Z

...isualizer/public/application/common/components/document_count_content/total_count_header.tsx

+      <EuiIconTip
+        content={i18n.translate('xpack.dataVisualizer.searchPanel.randomSamplerMessage', {
+          defaultMessage:
+            'Random sampler is being used for the total document count and the chart. Values shown are estimated. Adjust the slider to a higher percentage for better accuracy, or 100% to exact values.',


I haven't downloaded this PR to test it, but here's a drive-by suggestion:

Suggested change

- 'Random sampler is being used for the total document count and the chart. Values shown are estimated. Adjust the slider to a higher percentage for better accuracy, or 100% to exact values.',
+ 'The chart and total document count use random sampler aggregations, which increase speed at the cost of accuracy. Adjust the accuracy with the slider. For exact values, set it to 100%.',

peteharverson · 2022-07-14T17:05:22Z

As discussed, here's my initial feedback from testing against larger APM data sets (approx 15M to 40M docs):

- Generally see a big performance improvement compared to the current approach (around 10x quicker).
- The sample size per shard control needs to be removed if this approach is extended to the table rows too.
- The probability slider has too much prominence. Consider putting it into a popover, perhaps behind some sort of advanced settings 'cog' icon?
- Do we need to allow the user to specify whether they want to automatically recalculate the probability if, for example, the time range or query / filter are changed? Maybe we need an 'automatic' mode switch in addition to the slider control?
- Reduce the precision used to display the approximate doc count. For example, an estimate of 14,977,640 seems too high a precision.
- Not related to the changes here, but running against large data sets highlights the need for some sort of loading indicators in the page (doc count chart and table).
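On the precision point, one simple way to present an approximate count is to round it to a few significant figures before formatting. This is only a sketch: the helper name is hypothetical, and the default of 3 figures is a choice made for this example.

```typescript
// Hypothetical helper: rounds a number to the given count of significant figures.
function roundToSignificantFigures(value: number, figures: number = 3): number {
  // Zero and non-finite values have nothing meaningful to round.
  if (!Number.isFinite(value) || value === 0) return value;
  // toPrecision returns a string such as "1.50e+7"; Number() parses it back.
  return Number(value.toPrecision(figures));
}

// The estimate quoted in the feedback, rounded and formatted for display.
console.log(roundToSignificantFigures(14977640).toLocaleString('en-US'));
```

Rounding before formatting keeps the thousands separators while avoiding digits that a sampled estimate cannot support.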

elasticmachine · 2022-07-20T20:56:21Z

Pinging @elastic/ml-ui (:ml)

...lizer/public/application/common/components/document_count_content/document_count_content.tsx

peteharverson · 2022-07-21T16:29:37Z

...isualizer/public/application/common/components/document_count_content/total_count_header.tsx

-    <EuiFlexItem>
-      <EuiText size="s" data-test-subj="dataVisualizerTotalDocCountHeader">
+    <EuiFlexItem grow={false} style={{ flexDirection: 'row' }}>
+      <EuiText size="s" data-test-subj="dataVisualizerTotalDocCountHeader" textAlign="center">


How about adding an info icon and tooltip here when sampling is being used, placed between the count and the gear button, to say that random sampling is being used and that this is an approximate document count?

peteharverson · 2022-07-21T16:41:04Z

...lizer/public/application/common/components/document_count_content/document_count_content.tsx

+                    'xpack.dataVisualizer.randomSamplerSettingsPopUp.infoCalloutMessage',
+                    {
+                      defaultMessage:
+                        'Random sampler is being used for the total document count and the chart. Pick a higher percentage for better accuracy, or "Off" for no sampling.',


I think this message should change depending on what option you have selected. For example, if it is set to Off, say that random sampling can be turned on for the total document count and chart to increase performance although some accuracy will be lost.

If set to On - automatic, then something like, Random sampling is being used for the total document count and the chart. The probability used in the aggregation will be automatically set to balance accuracy and speed.

If set to On - manual, then something like, Random sampling is being used for the total document count and the chart. A lower percentage probability will increase performance, but some accuracy will be lost.

I agree, if it's possible to customize those messages, it would be great!
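The per-option message suggested in this thread could be sketched as a simple switch. The message strings follow the wording proposed above. Only `ON_AUTOMATIC` appears in the quoted diff, so the other option names and all three values are assumptions, and the `i18n.translate` wrapping is omitted to keep the example self-contained.

```typescript
// Only ON_AUTOMATIC is visible in the diff; the rest are assumed for this sketch.
const RANDOM_SAMPLER_OPTION = {
  ON_AUTOMATIC: 'on_automatic',
  ON_MANUAL: 'on_manual',
  OFF: 'off',
} as const;

type RandomSamplerOption = (typeof RANDOM_SAMPLER_OPTION)[keyof typeof RANDOM_SAMPLER_OPTION];

// Hypothetical helper: returns the callout text for the selected option.
function getCalloutMessage(option: RandomSamplerOption): string {
  switch (option) {
    case RANDOM_SAMPLER_OPTION.ON_AUTOMATIC:
      return 'Random sampling is being used for the total document count and the chart. The probability used in the aggregation will be automatically set to balance accuracy and speed.';
    case RANDOM_SAMPLER_OPTION.ON_MANUAL:
      return 'Random sampling is being used for the total document count and the chart. A lower percentage probability will increase performance, but some accuracy will be lost.';
    case RANDOM_SAMPLER_OPTION.OFF:
    default:
      return 'Random sampling can be turned on for the total document count and chart to increase performance although some accuracy will be lost.';
  }
}

console.log(getCalloutMessage(RANDOM_SAMPLER_OPTION.OFF));
```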

peteharverson · 2022-07-21T16:44:34Z

...plugins/data_visualizer/public/application/index_data_visualizer/constants/random_sampler.ts

+  {
+    value: RANDOM_SAMPLER_OPTION.ON_AUTOMATIC,
+    text: i18n.translate('xpack.dataVisualizer.randomSamplerPreference.onAutomaticLabel', {
+      defaultMessage: 'On (automatic use best %)',


The text for the 'On' options needs some tweaking, I think. Perhaps "On (automatic configuration)" and "On (manual configuration)"? @lcawl any suggestions?

This is all about balancing speed against accuracy. We want to encourage the user to leave it as 'automatic'.

Yes, I think if we can avoid using "%" in the label (and thus avoid having to explain what that percent actually means), that'd be simpler. Maybe even as simple as "On (automatic)" and "On (manual)"

peteharverson · 2022-07-22T11:41:23Z

...isualizer/public/application/common/components/document_count_content/total_count_header.tsx

+        <EuiIconTip
+          content={i18n.translate('xpack.dataVisualizer.searchPanel.randomSamplerMessage', {
+            defaultMessage:
+              'Random sampler is being used for the total document count and the chart. Values shown are estimated.',


What about using approximate rather than estimated, e.g. Approximate counts are shown.

Updated here 5959f47

peteharverson

Tested latest changes, including the cloud instance with up to 94M docs, and LGTM.

lcawl

Added some text suggestions, but otherwise LGTM

lcawl · 2022-07-22T16:03:46Z

...lizer/public/application/common/components/document_count_content/document_count_content.tsx

+          'xpack.dataVisualizer.randomSamplerSettingsPopUp.onManualCalloutMessage',
+          {
+            defaultMessage:
+              'Random sampling can be turned on for the total document count and chart to increase speed although some accuracy will be lost.',


Not mandatory, here's another version of that sentence where we start with the "why":

Suggested change

- 'Random sampling can be turned on for the total document count and chart to increase speed although some accuracy will be lost.',
+ 'To increase speed, turn on random sampling for the total document count and chart. Some accuracy will be lost.',

Updated here 5959f47

lcawl · 2022-07-22T16:07:27Z

...lizer/public/application/common/components/document_count_content/document_count_content.tsx

+          'xpack.dataVisualizer.randomSamplerSettingsPopUp.onAutomaticCalloutMessage',
+          {
+            defaultMessage:
+              'Random sampling is being used for the total document count and the chart. The probability used in the aggregation will be automatically set to balance accuracy and speed.',


Not mandatory, but here's a slightly shorter suggestion:

Suggested change

- 'Random sampling is being used for the total document count and the chart. The probability used in the aggregation will be automatically set to balance accuracy and speed.',
+ 'The total document count and chart use random sampler aggregations. The probability is automatically set to balance accuracy and speed.',

Updated here 5959f47

lcawl · 2022-07-22T16:09:02Z

...lizer/public/application/common/components/document_count_content/document_count_content.tsx

+      default:
+        return i18n.translate('xpack.dataVisualizer.randomSamplerSettingsPopUp.offCalloutMessage', {
+          defaultMessage:
+            'Random sampling is being used for the total document count and the chart. A lower percentage probability will increase performance, but some accuracy will be lost.',


To match the other suggestion:

Suggested change

- 'Random sampling is being used for the total document count and the chart. A lower percentage probability will increase performance, but some accuracy will be lost.',
+ 'The total document count and chart use random sampler aggregations. A lower percentage probability increases performance, but some accuracy is lost.',

Updated here 5959f47

lcawl · 2022-07-22T16:11:41Z

...isualizer/public/application/common/components/document_count_content/total_count_header.tsx

+        <EuiIconTip
+          content={i18n.translate('xpack.dataVisualizer.searchPanel.randomSamplerMessage', {
+            defaultMessage:
+              'Random sampler is being used for the total document count and the chart. Values shown are estimated.',


To align with my other suggestions and to front-load the most important info:

Suggested change

- 'Random sampler is being used for the total document count and the chart. Values shown are estimated.',
+ 'Approximate values are shown in the total document count and chart, which use random sampler aggregations.',

Updated here 5959f47

alvarezmelissa87 · 2022-07-22T19:13:55Z

...izer/public/application/index_data_visualizer/search_strategy/requests/get_document_stats.ts

 ): DocumentCountStats | undefined => {
  if (!body) return undefined;

-  const totalCount = (body.hits.total as estypes.SearchTotalHits).value ?? body.hits.total ?? 0;
+  let totalCount = 0;


Why does 'totalCount' need to be set to 0 here?

We are updating the totalCount by adding the count in dataForTime later on as well.
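The accumulation described in this reply could look roughly like the sketch below: `totalCount` starts at 0 and grows as each histogram bucket is copied into `dataForTime`. The bucket shape and the helper name are assumptions. Only `totalCount` and `dataForTime` are names from the discussion.

```typescript
// Assumed shape of one date histogram bucket in the search response.
interface HistogramBucket {
  key: number; // bucket start time in epoch milliseconds
  doc_count: number;
}

// Hypothetical helper: builds the time-to-count map and sums the counts.
function summarizeBuckets(buckets: HistogramBucket[]): {
  dataForTime: Record<number, number>;
  totalCount: number;
} {
  const dataForTime: Record<number, number> = {};
  let totalCount = 0; // starts at 0 and accumulates per bucket
  for (const bucket of buckets) {
    dataForTime[bucket.key] = bucket.doc_count;
    totalCount += bucket.doc_count;
  }
  return { dataForTime, totalCount };
}

const summary = summarizeBuckets([
  { key: 1657497600000, doc_count: 120 },
  { key: 1657501200000, doc_count: 80 },
]);
console.log(summary.totalCount);
```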

qn895 · 2022-07-22T19:31:16Z

@elasticmachine merge upstream

kibana-ci · 2022-07-22T20:35:23Z

💚 Build Succeeded

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id	before	after	diff
`dataVisualizer`	366	376	+10

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id	before	after	diff
`dataVisualizer`	547.3KB	561.5KB	+14.2KB

Unknown metric groups

ESLint disabled line counts

id	before	after	diff
`dataVisualizer`	35	38	+3

Total ESLint disabled count

id	before	after	diff
`dataVisualizer`	35	38	+3

History

💚 Build #59971 succeeded 5959f47
💚 Build #59713 succeeded 747a466
💚 Build #59576 succeeded c52b0f2
💔 Build #59544 failed 728677e76e6a6a4011ddc78c546586285404f23d
💚 Build #59232 succeeded 3d484af
💔 Build #58919 failed 0eca082

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @qn895

qn895 added 3 commits July 11, 2022 16:08

[ML] Show meta code block with random sampler info

a50c76c

[ML] Show EUI slider

b8a7369

[ML] Improve logic for slider, clean up

3620a9e

qn895 added the :ml label Jul 11, 2022

qn895 self-assigned this Jul 11, 2022

qn895 added Feature:File and Index Data Viz ML file and index data visualizer release_note:feature Makes this part of the condensed release notes v8.4.0 ci:deploy-cloud labels Jul 11, 2022

qn895 added 3 commits July 12, 2022 10:32

[ML] Update translation

479dff4

[ML] Fix types

3abb16a

[ML] Fix missing isDefined check when previously switching initial va…

ae7eae9

…lue from undefined to null

lcawl reviewed Jul 12, 2022

qn895 added 13 commits July 14, 2022 13:15

Merge remote-tracking branch 'upstream/main' into ml-dv-random-sample…

995d48d

…r-two-step-queries

Merge remote-tracking branch 'upstream/main' into ml-dv-random-sample…

b02dbda

…r-two-step-queries

[ML] Reduce precision for approximate doc count to 3 significant figures

cdafa5f

[ML] Change to step instead of slider in popup

f99264c

[ML] Use random sampler in field stats aggs

c8c5ab6

[ML] Change to slider, format to 100 for ease of use

03137fa

[ML] Add loading indicator

a2be4d6

Revert "[ML] Use random sampler in field stats aggs"

5cca576

This reverts commit c8c5ab6

[ML] Move search bar up

1f68deb

[ML] Add preference whether to auto pick probability

7ac26a7

Merge remote-tracking branch 'upstream/main' into ml-dv-random-sample…

0eca082

…r-two-step-queries

[ML] Update seed to reflect session, update logic for switching betwe…

3d484af

…en the modes

[ML] Add spinner for doc count label

2d5b0ce

qn895 changed the title ~~[ML] WIP - Add random sampler to Data visualizer document count chart~~ [ML] Add random sampler to Data visualizer document count chart Jul 20, 2022

qn895 marked this pull request as ready for review July 20, 2022 20:56

qn895 requested a review from a team as a code owner July 20, 2022 20:56

qn895 requested review from walterra and peteharverson July 21, 2022 14:41

qn895 force-pushed the ml-dv-random-sampler-two-step-queries branch from 29b2526 to 728677e Compare July 21, 2022 14:42

Merge upstream/main into branch

c52b0f2

qn895 force-pushed the ml-dv-random-sampler-two-step-queries branch from 728677e to c52b0f2 Compare July 21, 2022 15:36

peteharverson reviewed Jul 21, 2022

Add custom info call out messages

747a466

peteharverson reviewed Jul 22, 2022

peteharverson approved these changes Jul 22, 2022

lcawl approved these changes Jul 22, 2022

Update texts with latest suggestions

5959f47

alvarezmelissa87 reviewed Jul 22, 2022

Merge branch 'main' into ml-dv-random-sampler-two-step-queries

39607f4

qn895 enabled auto-merge (squash) July 22, 2022 19:31

qn895 merged commit 812dce0 into elastic:main Jul 22, 2022

kibanamachine added the backport:skip This commit does not require backporting label Jul 22, 2022

qn895 mentioned this pull request Jul 27, 2022

[ML] Use random sampler for aggregations for Data Visualizer document count chart #136124

Closed

qn895 deleted the ml-dv-random-sampler-two-step-queries branch August 1, 2022 15:43

tylersmalley added ci:cloud-deploy Create or update a Cloud deployment and removed ci:deploy-cloud labels Aug 17, 2022

qn895 mentioned this pull request Aug 23, 2022

[ML] Use random sampler for field statistics table in Discover and Data visualizer #138953

Closed

9 tasks

qn895 mentioned this pull request Sep 30, 2022

[ML] Fix Index data visualizer doc count when time field is not defined #142409

Merged

9 tasks
