Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Add random sampler to Data visualizer document count chart #136150

Merged
merged 23 commits into from
Jul 22, 2022

Conversation

qn895
Copy link
Member

@qn895 qn895 commented Jul 11, 2022

Summary

This PR addresses #136124 and uses the new random sampler in the Data visualizer document count chart.

It adds 3 options for sampling:
Screen Shot 2022-07-20 at 14 18 43
Screen Shot 2022-07-20 at 14 18 53
Screen Shot 2022-07-20 at 14 18 57

  • If On (automatic use best %) is selected, it will first initially run a random sampler agg at a default probability of 0.000001. Then, depending on the result of the initial response, it will either:
    • Calculate the next best probability to use, proceed to call the random sampler agg with this calculated best probability, and return result
    • Determine the dataset is not suitable for sampling, and therefore will call the random sampler agg with probability = 1 (which is no sampling)
  • If On (manually set %) is selected, it will show a slider. When user first switch to this option, it will first suggest the last calculated best probability. Once the user picks the probability, it will remember this value for any subsequent queries (like changing time range, modifying the queries or filters).
  • If Off is selected, it will always run at probability = 1 (which is no sampling)
Screen.Recording.2022-07-20.at.14.17.39.mov
  • This preference is saved in the local storage, so switching between data view will retain the preference. It will not retain the previously chosen probability.

Checklist

@qn895 qn895 added the :ml label Jul 11, 2022
@qn895 qn895 self-assigned this Jul 11, 2022
@qn895 qn895 added Feature:File and Index Data Viz ML file and index data visualizer release_note:feature Makes this part of the condensed release notes v8.4.0 ci:deploy-cloud labels Jul 11, 2022
<EuiIconTip
content={i18n.translate('xpack.dataVisualizer.searchPanel.randomSamplerMessage', {
defaultMessage:
'Random sampler is being used for the total document count and the chart. Values shown are estimated. Adjust the slider to a higher percentage for better accuracy, or 100% to exact values.',
Copy link
Contributor

@lcawl lcawl Jul 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't downloaded this PR to test it, but here's a drive-by suggestion:

Suggested change
'Random sampler is being used for the total document count and the chart. Values shown are estimated. Adjust the slider to a higher percentage for better accuracy, or 100% to exact values.',
'The chart and total document count use random sampler aggregations, which increase speed at the cost of accuracy. Adjust the accuracy with the slider. For exact values, set it to 100%.',

@peteharverson
Copy link
Contributor

As discussed, here's my initial feedback from testing against larger APM data sets (approx 15M to 40M docs):

  • generally see a big performance improvement compared to the current approach (around 10x quicker)
  • the sample size per shard control needs to be removed if this approach is extended to the table rows too
  • the probability slider has too much prominence. Consider putting it into a popover perhaps behind some sort of advanced settings 'cog' icon?
  • do we need to allow the user to specify whether they want to automatically recalculate the probability if for example the time range or query / filter are changed? Maybe need a 'automatic' mode switch in addition to the slider control?
  • reduce the precision used to display the approximate doc count - for example an estimate of 14,977,640 seems too high precision
  • not related to the changes here, but running against large data sets highlights the need for some sort of loading indicators in the page (doc count chart and table)

@qn895 qn895 changed the title [ML] WIP - Add random sampler to Data visualizer document count chart [ML] Add random sampler to Data visualizer document count chart Jul 20, 2022
@qn895 qn895 marked this pull request as ready for review July 20, 2022 20:56
@qn895 qn895 requested a review from a team as a code owner July 20, 2022 20:56
@elasticmachine
Copy link
Contributor

Pinging @elastic/ml-ui (:ml)

@qn895 qn895 force-pushed the ml-dv-random-sampler-two-step-queries branch from 29b2526 to 728677e Compare July 21, 2022 14:42
@qn895 qn895 force-pushed the ml-dv-random-sampler-two-step-queries branch from 728677e to c52b0f2 Compare July 21, 2022 15:36
<EuiFlexItem>
<EuiText size="s" data-test-subj="dataVisualizerTotalDocCountHeader">
<EuiFlexItem grow={false} style={{ flexDirection: 'row' }}>
<EuiText size="s" data-test-subj="dataVisualizerTotalDocCountHeader" textAlign="center">
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding an info icon and tooltip here when sampling is being, placed between the count and the gear button, to say that random sampling is being and that this is an approximate document count?

image

'xpack.dataVisualizer.randomSamplerSettingsPopUp.infoCalloutMessage',
{
defaultMessage:
'Random sampler is being used for the total document count and the chart. Pick a higher percentage for better accuracy, or "Off" for no sampling.',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this message should change depending on what option you have selected. For example, if it is set to Off, say that random sampling can be turned on for the total document count and chart to increase performance although some accuracy will be lost.

If set to On - automatic, then something like, Random sampling is being used for the total document count and the chart. The probability used in the aggregation will be automatically set to balance accuracy and speed.

If set to On - manual, then something like, Random sampling is being used for the total document count and the chart. A lower percentage probability will increase performance, but some accuracy will be lost.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, if it's possible to customize those messages, it would be great!

{
value: RANDOM_SAMPLER_OPTION.ON_AUTOMATIC,
text: i18n.translate('xpack.dataVisualizer.randomSamplerPreference.onAutomaticLabel', {
defaultMessage: 'On (automatic use best %)',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The text for the 'On' options needs some tweaking I think. On (automatic configuration) On (manual configuration) ? @lcawl any suggestions?

This is all about balancing speed against accuracy. We want to encourage the user to leave it as 'automatic'.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think if we can avoid using "%" in the label (and thus avoid having to explain what that percent actually means), that'd be simpler. Maybe even as simple as "On (automatic)" and "On (manual)"

<EuiIconTip
content={i18n.translate('xpack.dataVisualizer.searchPanel.randomSamplerMessage', {
defaultMessage:
'Random sampler is being used for the total document count and the chart. Values shown are estimated.',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about using approximate rather than estimated, e.g. Approximate counts are shown.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated here 5959f47

Copy link
Contributor

@peteharverson peteharverson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested latest changes, including the cloud instance with up to 94M docs, and LGTM.

Copy link
Contributor

@lcawl lcawl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some text suggestions, but otherwise LGTM

'xpack.dataVisualizer.randomSamplerSettingsPopUp.onManualCalloutMessage',
{
defaultMessage:
'Random sampling can be turned on for the total document count and chart to increase speed although some accuracy will be lost.',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not mandatory, here's another version of that sentence where we start with the "why":

Suggested change
'Random sampling can be turned on for the total document count and chart to increase speed although some accuracy will be lost.',
'To increase speed, turn on random sampling for the total document count and chart. Some accuracy will be lost.',

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated here 5959f47

'xpack.dataVisualizer.randomSamplerSettingsPopUp.onAutomaticCalloutMessage',
{
defaultMessage:
'Random sampling is being used for the total document count and the chart. The probability used in the aggregation will be automatically set to balance accuracy and speed.',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not mandatory, but here's a slightly shorter suggestion:

Suggested change
'Random sampling is being used for the total document count and the chart. The probability used in the aggregation will be automatically set to balance accuracy and speed.',
'The total document count and chart use random sampler aggregations. The probability is automatically set to balance accuracy and speed.',

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated here 5959f47

default:
return i18n.translate('xpack.dataVisualizer.randomSamplerSettingsPopUp.offCalloutMessage', {
defaultMessage:
'Random sampling is being used for the total document count and the chart. A lower percentage probability will increase performance, but some accuracy will be lost.',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To match the other suggestion:

Suggested change
'Random sampling is being used for the total document count and the chart. A lower percentage probability will increase performance, but some accuracy will be lost.',
'The total document count and chart use random sampler aggregations. A lower percentage probability increases performance, but some accuracy is lost.',

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated here 5959f47

<EuiIconTip
content={i18n.translate('xpack.dataVisualizer.searchPanel.randomSamplerMessage', {
defaultMessage:
'Random sampler is being used for the total document count and the chart. Values shown are estimated.',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To align with my other suggestions and to front-load the most important info:

Suggested change
'Random sampler is being used for the total document count and the chart. Values shown are estimated.',
'Approximate values are shown in the total document count and chart, which use random sampler aggregations.',

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated here 5959f47

): DocumentCountStats | undefined => {
if (!body) return undefined;

const totalCount = (body.hits.total as estypes.SearchTotalHits).value ?? body.hits.total ?? 0;
let totalCount = 0;
Copy link
Contributor

@alvarezmelissa87 alvarezmelissa87 Jul 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does 'totalCount' need to be set to 0 here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are updating the totalCount by adding the count in dataForTime later on as well.

@qn895
Copy link
Member Author

qn895 commented Jul 22, 2022

@elasticmachine merge upstream

@qn895 qn895 enabled auto-merge (squash) July 22, 2022 19:31
@kibana-ci
Copy link
Collaborator

💚 Build Succeeded

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id before after diff
dataVisualizer 366 376 +10

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
dataVisualizer 547.3KB 561.5KB +14.2KB
Unknown metric groups

ESLint disabled line counts

id before after diff
dataVisualizer 35 38 +3

Total ESLint disabled count

id before after diff
dataVisualizer 35 38 +3

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @qn895

@qn895 qn895 merged commit 812dce0 into elastic:main Jul 22, 2022
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label Jul 22, 2022
@qn895 qn895 deleted the ml-dv-random-sampler-two-step-queries branch August 1, 2022 15:43
@tylersmalley tylersmalley added ci:cloud-deploy Create or update a Cloud deployment and removed ci:deploy-cloud labels Aug 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting ci:cloud-deploy Create or update a Cloud deployment Feature:File and Index Data Viz ML file and index data visualizer :ml release_note:feature Makes this part of the condensed release notes v8.4.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants