[ML] Use random sampler for field statistics table in Discover and Data visualizer #138953

Closed
wants to merge 22 commits into from


@qn895 (Member) commented Aug 16, 2022

Summary

Follow-up to #136150. This PR replaces the previously used sampler aggregation with the new random sampler aggregation in the field statistics table. Changes include:

  • Removes the Shard size controls in the Data visualizer, as well as references to samplerShardSize
  • Automatically picks the optimal probability for the field statistics table in Discover
  • For indices that have no default time field or are not time based, it first makes a request for the value_count aggregation of any available field. Whether the field is populated is not important, as we only need to know the sample size to calculate the optimal probability.
  • Updates the logic so that the initial p is 1e-5 (instead of the previous 1e-6)
  • Updates the threshold so that if the number of sampled docs is < 109, it proceeds with a plain aggregation without sampling.
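
The probability selection described in the list above could look roughly like the following. This is a minimal sketch, not the PR's implementation: `pickSamplingProbability`, `targetSampleSize` and `minDocsForSampling` are hypothetical names, and scaling the probability to reach a target sample size is an assumption. The one constraint taken from Elasticsearch itself is that a `random_sampler` probability must be greater than 0 and below 0.5, or exactly 1.

```typescript
// Hypothetical helper: the names and the selection rule are illustrative
// assumptions, not code from this PR.
export function pickSamplingProbability(
  totalDocs: number,
  targetSampleSize: number,
  minDocsForSampling: number
): number {
  // Small indices: skip sampling and run the plain aggregation.
  if (totalDocs < minDocsForSampling || totalDocs <= targetSampleSize) {
    return 1;
  }
  const p = targetSampleSize / totalDocs;
  // random_sampler accepts probabilities below 0.5, or exactly 1.
  return p >= 0.5 ? 1 : p;
}
```

A return value of 1 means "do not sample", so the caller can send the aggregations unwrapped in that case.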

Screen Shot 2022-08-23 at 11 02 28

Screen Shot 2022-08-23 at 11 02 49

Checklist

Delete any items that are not applicable to this PR.

Risk Matrix

Delete this section if it is not applicable to this PR.

Before closing this PR, invite QA, stakeholders, and other developers to identify risks that should be tested prior to the change/feature release.

When forming the risk matrix, consider some of the following examples and how they may potentially impact the change:

| Risk | Probability | Severity | Mitigation/Notes |
|------|-------------|----------|------------------|
| Multiple Spaces: unexpected behavior in non-default Kibana Space. | Low | High | Integration tests will verify that all features are still supported in non-default Kibana Space and when user switches between spaces. |
| Multiple nodes: Elasticsearch polling might have race conditions when multiple Kibana nodes are polling for the same tasks. | High | Low | Tasks are idempotent, so executing them multiple times will not result in logical error, but will degrade performance. To test for this case we add plenty of unit tests around this logic and document manual testing procedure. |
| Code should gracefully handle cases when feature X or plugin Y are disabled. | Medium | High | Unit tests will verify that any feature flag or plugin combination still results in our service operational. |
See more potential risk examples

For maintainers

@qn895 qn895 changed the title [ML] WIP - Use random sampler for field statistics table in Discover and Data visualizer [ML] Use random sampler for field statistics table in Discover and Data visualizer Aug 23, 2022
@qn895 qn895 self-assigned this Aug 23, 2022
@qn895 qn895 added enhancement New value added to drive a business result :ml Feature:File and Index Data Viz ML file and index data visualizer v8.5.0 release_note:enhancement and removed enhancement New value added to drive a business result labels Aug 23, 2022
@qn895 qn895 marked this pull request as ready for review August 23, 2022 21:00
@qn895 qn895 requested review from a team as code owners August 23, 2022 21:00
@elasticmachine
Contributor

Pinging @elastic/ml-ui (:ml)

@qn895 qn895 added the ci:cloud-deploy Create or update a Cloud deployment label Aug 23, 2022
@kibana-ci
Collaborator

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #1 / machine learning - data visualizer field statistics in Discover when enabled with farequote index pattern displays the 'Field statistics' table content correctly

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

| id | before | after | diff |
|----|--------|-------|------|
| dataVisualizer | 378 | 365 | -13 |

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

| id | before | after | diff |
|----|--------|-------|------|
| dataVisualizer | 565.7KB | 556.2KB | -9.5KB |
| discover | 462.3KB | 462.3KB | +33.0B |
| total | | | -9.5KB |

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

| id | before | after | diff |
|----|--------|-------|------|
| dataVisualizer | 20.4KB | 20.4KB | -45.0B |

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @qn895

@@ -154,9 +147,9 @@ export const TopValues: FC<Props> = ({ stats, fieldFormat, barColor, compressed,
<EuiText size="xs" textAlign={'center'}>
<FormattedMessage
id="xpack.dataVisualizer.dataGrid.field.topValues.calculatedFromSampleDescription"
-defaultMessage="Calculated from sample of {topValuesSamplerShardSize} documents per shard"
+defaultMessage="Calculated from sample of {topValuesSampleSize} documents"

Is this message still needed? If the random sampler isn't turned on, then there is no 'sampling', is there? And if random sampling is being used, it mirrors the doc count displayed at the top.

image

@peteharverson
Contributor

I am seeing this error in the console quite often when switching data views, or changing the query / filter in the view:

image

Is there anything that can be done to suppress this?

@@ -141,6 +172,7 @@ export const EmbeddableWrapper = ({
showPreviewByDefault={input?.showPreviewByDefault}
onChange={onOutputChange}
loading={progress < 100}
totalCount={overallStats?.documentCountStats?.totalCount ?? 0}

The types suggest overallStats will always be defined.

]
);

const { overallStats, progress: overallStatsProgress } = useOverallStats(
fieldStatsRequest,
lastRefresh,
browserSessionSeed,
-dataVisualizerListState.probability
+input?.samplingMode === 'autoRandomSampler' ? null : dataVisualizerListState.probability

The types suggest input will always be defined.

-dataVisualizerListState
+dataVisualizerListState,
(dataVisualizerListState.probability === null
? overallStats?.documentCountStats?.probability

The types suggest overallStats will always be defined.

* The preferred mode for sampling data for the field statistics
* default as 'autoRandomSampler'
*/
samplingMode?: string;

Do we know the supported values for the sampling mode?
autoRandomSampler and what else?
This could be a union of all allowed values.
At the moment the code suggests it could just be:

samplingMode?: 'autoRandomSampler';

aggs: any,
probability: number | null,
seed: number
): Record<string, estypes.AggregationsAggregationContainer> {

It looks like there are a couple of versions of this type:
https://github.com/qn895/kibana/blob/75b2944216cda3f30dffc048972401a5e65e0af2/x-pack/plugins/data_visualizer/common/types/field_stats.ts#L238

https://github.com/qn895/kibana/blob/75b2944216cda3f30dffc048972401a5e65e0af2/x-pack/plugins/data_visualizer/common/utils/datafeed_utils.ts#L11

IMO the type Aggregation = Record<string, estypes.AggregationsAggregationContainer>; is better and matches how aggregations are described in the es client types.

If we clean these up and choose one type, this function could return Record<Aggregation>

* Wraps the supplied aggregations in a random sampler aggregation.
*/
export function buildRandomSamplerAggregation(
aggs: any,

Can this any be replaced with the correct type?
It looks like the same type Aggregation = Record<string, estypes.AggregationsAggregationContainer>; as in the previous comment.

return {
sample: {
aggs,
// @ts-expect-error AggregationsAggregationContainer needs to be updated with random_sampler

Has an issue been raised in the elasticsearch-specification repo to correct these types?

}

return {
sample: {

Looks like sample could be pulled out here and made into a constant. This could then be used throughout the code where things like const aggsPath = ['sample']; are used.
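
Taken together, the review suggestions on this function (a single `Aggregation` type instead of `any`, and a shared constant for the `sample` key) might be sketched as follows. This is an illustration under assumptions, not the PR's code: `AggregationContainer` is a local stand-in for `estypes.AggregationsAggregationContainer` so the snippet is self-contained, `RANDOM_SAMPLER_WRAPPER` is a hypothetical constant name, and returning the aggregations unwrapped when `probability` is `null` or at least 1 is a guess at what the early return in the diff does.

```typescript
// Local stand-in for estypes.AggregationsAggregationContainer so the sketch
// is self-contained; real code would import the type from the ES client.
type AggregationContainer = Record<string, unknown>;
type Aggregation = Record<string, AggregationContainer>;

// Hypothetical constant: single source of truth for the wrapper's name.
export const RANDOM_SAMPLER_WRAPPER = 'sample';
export const aggsPath = [RANDOM_SAMPLER_WRAPPER];

/**
 * Wraps the supplied aggregations in a random sampler aggregation.
 */
export function buildRandomSamplerAggregation(
  aggs: Aggregation,
  probability: number | null,
  seed: number
): Aggregation {
  // Assumption: no usable probability (null, or 1 and above) means no sampling.
  if (probability === null || probability >= 1) {
    return aggs;
  }
  return {
    [RANDOM_SAMPLER_WRAPPER]: {
      random_sampler: { probability, seed },
      aggs,
    },
  };
}
```

With the key held in one constant, code that reads the response can build its path from `aggsPath` rather than repeating the string literal.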

@@ -221,7 +223,7 @@ export const DataVisualizerTable = <T extends DataVisualizerTableItem>({
defaultMessage: 'Documents (%)',
}),
render: (value: number | undefined, item: DataVisualizerTableItem) => (
-<DocumentStat config={item} showIcon={dimensions.showIcon} />
+<DocumentStat config={item} showIcon={dimensions.showIcon} totalCount={totalCount} />

For a data view without a time field, you get Infinity % for percentages in the document stats and top values:
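
An Infinity % value of this kind usually comes from dividing by a total count of zero. Assuming that is the cause here (the thread does not confirm it), a guard along these lines would avoid it. `formatDocPercent` is a hypothetical helper, not a function from the PR, and the `'-'` fallback is an arbitrary choice.

```typescript
// Hypothetical formatter; the component's real logic is not shown in the thread.
export function formatDocPercent(count: number, totalCount: number): string {
  // Guard against a zero, negative or non-finite denominator.
  if (!Number.isFinite(totalCount) || totalCount <= 0) {
    return '-';
  }
  const percent = (count / totalCount) * 100;
  // Round to one decimal place for display.
  return `${Math.round(percent * 10) / 10}%`;
}
```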

image

@walterra
Contributor

Regarding the console error Pete was seeing, we have some reference code in explain log rate spikes that shows how to handle this: https://github.com/elastic/kibana/blob/main/x-pack/plugins/aiops/public/hooks/use_document_count_stats.ts#L134

@peteharverson
Contributor

Closing as replaced by #144646.

qn895 added a commit that referenced this pull request Dec 15, 2022
## Summary

This PR removes the beta badge for the Field statistics table. 

<img width="1791" alt="Screen Shot 2022-09-19 at 12 22 30"
src="https://user-images.githubusercontent.com/43350163/191076625-9489eaa0-2488-4a5a-b737-e32724d3bffc.png">

Points of consideration for keeping the beta badge:
- Easier for us to keep collecting more user feedback.
- Potentially switching to [using the new random sampler for aggregation
for the field statistics
table](#138953) in the next
release. Currently, we are pausing this work to match up with the
popover (#139072 and
#140667) and to fine-tune the user
experience/performance.

Points of consideration for removing the beta badge:
- The field stats table has been available to users since 8.1, and has
been in use within ML since 7.x. We should be defining clear criteria
for when it can be moved to GA.

### Checklist

Delete any items that are not applicable to this PR.

- [ ] Any text added follows [EUI's writing
guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses
sentence case text and includes [i18n
support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)
- [ ]
[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)
was added for features that require explanation or tutorials
- [ ] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
- [ ] Any UI touched in this PR is usable by keyboard only (learn more
about [keyboard accessibility](https://webaim.org/techniques/keyboard/))
- [ ] Any UI touched in this PR does not create any new axe failures
(run axe in browser:
[FF](https://addons.mozilla.org/en-US/firefox/addon/axe-devtools/),
[Chrome](https://chrome.google.com/webstore/detail/axe-web-accessibility-tes/lhdoppojpmngadmnindnejefpokejbdd?hl=en-US))
- [ ] If a plugin configuration key changed, check if it needs to be
allowlisted in the cloud and added to the [docker
list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker)
- [ ] This renders correctly on smaller devices using a responsive
layout. (You can test this [in your
browser](https://www.browserstack.com/guide/responsive-testing-on-local-server))
- [ ] This was checked for [cross-browser
compatibility](https://www.elastic.co/support/matrix#matrix_browsers)


### Risk Matrix

Delete this section if it is not applicable to this PR.

Before closing this PR, invite QA, stakeholders, and other developers to
identify risks that should be tested prior to the change/feature
release.

When forming the risk matrix, consider some of the following examples
and how they may potentially impact the change:

| Risk | Probability | Severity | Mitigation/Notes |
|---------------------------|-------------|----------|-------------------------|
| Multiple Spaces&mdash;unexpected behavior in non-default Kibana Space. | Low | High | Integration tests will verify that all features are still supported in non-default Kibana Space and when user switches between spaces. |
| Multiple nodes&mdash;Elasticsearch polling might have race conditions when multiple Kibana nodes are polling for the same tasks. | High | Low | Tasks are idempotent, so executing them multiple times will not result in logical error, but will degrade performance. To test for this case we add plenty of unit tests around this logic and document manual testing procedure. |
| Code should gracefully handle cases when feature X or plugin Y are disabled. | Medium | High | Unit tests will verify that any feature flag or plugin combination still results in our service operational. |
| [See more potential risk examples](https://github.com/elastic/kibana/blob/main/RISK_MATRIX.mdx) |


### For maintainers

- [ ] This was checked for breaking API changes and was [labeled
appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)
Labels
ci:cloud-deploy Create or update a Cloud deployment Feature:File and Index Data Viz ML file and index data visualizer :ml release_note:enhancement v8.6.0

6 participants