[ML] Adds sampled % of documents & cardinality for text fields for Data visualizer/Field stats & fix missing bucket in doc count chart #172378

qn895 · 2023-12-01T17:34:47Z

Summary

This PR adds sampled % of documents and cardinality for text fields in Data visualizer/Field stats. Previously, text fields did not show any computed % or cardinality, because text fields are not aggregatable in Elasticsearch.

This PR fetches a sample of 1000 documents from Elasticsearch and computes the approximate count % and cardinality based on that sample.
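
As a rough illustration of the idea, the sketch below derives an approximate document percentage and cardinality for one field from a small sample of hits. The names `SampledDoc`, `SampledFieldStats` and `computeSampledStats`, and the assumed shape of the sampled documents, are hypothetical and are not taken from the PR's implementation.

```typescript
// Hypothetical shape: each sampled hit maps field names to arrays of values,
// similar to the `fields` section of an Elasticsearch search hit.
type SampledDoc = Record<string, unknown[]>;

interface SampledFieldStats {
  count: number; // sampled docs that contain the field
  percentage: number; // count / sample size, in the range [0, 1]
  cardinality: number; // distinct values seen in the sample
}

function computeSampledStats(docs: SampledDoc[], fieldName: string): SampledFieldStats {
  const distinct = new Set<string>();
  let count = 0;
  for (const doc of docs) {
    const values = doc[fieldName];
    if (values === undefined || values.length === 0) continue;
    count++;
    // Serialize each value so that equal values collapse to one Set entry.
    for (const v of values) distinct.add(JSON.stringify(v));
  }
  return {
    count,
    percentage: docs.length > 0 ? count / docs.length : 0,
    cardinality: distinct.size,
  };
}

const sample: SampledDoc[] = [
  { message: ['disk full'] },
  { message: ['disk full'] },
  { message: ['login ok'] },
  { other: [1] },
];
const stats = computeSampledStats(sample, 'message');
console.log(stats); // { count: 3, percentage: 0.75, cardinality: 2 }
```

Because the numbers come from the sample only, they are estimates for the whole index, which is why the UI marks them as sampled.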

It also shows a tooltip message indicating that text fields are using a much smaller sample:

This PR also fixes an issue where the first bucket was missing from the doc count chart. See [ML] Data visualizer: Missing first bar in doc count histogram #172355. This happens when the selected start time is slightly later than the start of the first bucket. With this change, that bucket is no longer filtered out when it is only partially covered by the selected time range.

For example, selecting a start time of 18:04 would previously wipe out the first bucket at 18:00.
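
A minimal sketch of the 18:04 versus 18:00 case follows. The `Bucket` shape, the one-hour interval, and both filter functions are assumptions made for illustration; the PR's actual filtering code may differ.

```typescript
interface Bucket {
  key: number; // bucket start, epoch milliseconds
  docCount: number;
}

// Strict filter: drops any bucket whose start precedes the selected range.
function filterStrict(buckets: Bucket[], start: number, end: number): Bucket[] {
  return buckets.filter((b) => b.key >= start && b.key <= end);
}

// Lenient filter: keeps a bucket if any part of its interval overlaps the range.
function filterOverlapping(
  buckets: Bucket[],
  intervalMs: number,
  start: number,
  end: number
): Bucket[] {
  return buckets.filter((b) => b.key + intervalMs > start && b.key <= end);
}

const HOUR = 60 * 60 * 1000; // assumed bucket interval
const t1800 = Date.UTC(2023, 11, 1, 18, 0);
const buckets: Bucket[] = [
  { key: t1800, docCount: 12 },
  { key: t1800 + HOUR, docCount: 30 },
];
const selectedStart = Date.UTC(2023, 11, 1, 18, 4);
const selectedEnd = Date.UTC(2023, 11, 1, 20, 0);

console.log(filterStrict(buckets, selectedStart, selectedEnd).length); // 1
console.log(filterOverlapping(buckets, HOUR, selectedStart, selectedEnd).length); // 2
```

The strict comparison loses the 18:00 bucket because its key is earlier than 18:04, while the overlap check keeps it as a partial bucket.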

Before:

After:

Checklist

Delete any items that are not applicable to this PR.

Any text added follows EUI's writing guidelines, uses sentence case text and includes i18n support
Documentation was added for features that require explanation or tutorials
Unit or functional tests were updated or added to match the most common scenarios
Flaky Test Runner was used on any tests changed
Any UI touched in this PR is usable by keyboard only (learn more about keyboard accessibility)
Any UI touched in this PR does not create any new axe failures (run axe in browser: FF, Chrome)
If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the docker list
This renders correctly on smaller devices using a responsive layout. (You can test this in your browser)
This was checked for cross-browser compatibility

For maintainers

This was checked for breaking API changes and was labeled appropriately

elasticmachine · 2023-12-01T17:34:50Z

Pinging @elastic/ml-ui (:ml)

szabosteve

Left two minor suggestions.

szabosteve · 2023-12-05T10:22:49Z

...ublic/application/common/components/stats_table/components/field_data_row/document_stats.tsx

+    type === SUPPORTED_FIELD_TYPES.TEXT ? (
+      <FormattedMessage
+        id="xpack.dataVisualizer.sampledPercentageForTextFieldsMsg"
+        defaultMessage="The % of documents for text fields is sampled and calculated from {sampledDocumentsFormatted} sample {sampledDocuments, plural, one {record} other {records}}."


I think it could be a bit simpler.

Suggested change

defaultMessage="The % of documents for text fields is sampled and calculated from {sampledDocumentsFormatted} sample {sampledDocuments, plural, one {record} other {records}}."

defaultMessage="The % of documents for text fields is calculated from a sample of {sampledDocumentsFormatted} {sampledDocuments, plural, one {record} other {records}}."

Updated here e6facd0
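
For reference, here is a small sketch of how the suggested message resolves for different sample sizes. `renderSampledMessage` is a hypothetical helper that imitates the ICU plural selection with the built-in `Intl` APIs; it is not how `FormattedMessage` is implemented.

```typescript
// Hypothetical helper that mimics the plural selection in the suggested defaultMessage.
function renderSampledMessage(sampledDocuments: number, locale = 'en-US'): string {
  // Corresponds to the {sampledDocumentsFormatted} placeholder.
  const sampledDocumentsFormatted = new Intl.NumberFormat(locale).format(sampledDocuments);
  // Corresponds to {sampledDocuments, plural, one {record} other {records}}.
  const category = new Intl.PluralRules(locale).select(sampledDocuments);
  const noun = category === 'one' ? 'record' : 'records';
  return `The % of documents for text fields is calculated from a sample of ${sampledDocumentsFormatted} ${noun}.`;
}

console.log(renderSampledMessage(1));
console.log(renderSampledMessage(1000));
```

With the `en-US` locale this yields "a sample of 1 record." and "a sample of 1,000 records." respectively.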

szabosteve · 2023-12-05T10:23:48Z

...blic/application/common/components/stats_table/components/field_data_row/distinct_values.tsx

+    type === SUPPORTED_FIELD_TYPES.TEXT ? (
+      <FormattedMessage
+        id="xpack.dataVisualizer.sampledCardinalityForTextFieldsMsg"
+        defaultMessage="The cardinality for text fields is sampled and calculated from {sampledDocumentsFormatted} sample {sampledDocuments, plural, one {record} other {records}}."


As above.

Suggested change

defaultMessage="The cardinality for text fields is sampled and calculated from {sampledDocumentsFormatted} sample {sampledDocuments, plural, one {record} other {records}}."

defaultMessage="The cardinality for text fields is calculated from a sample of {sampledDocumentsFormatted} {sampledDocuments, plural, one {record} other {records}}."

Updated here e6facd0

peteharverson

Tested and LGTM.

Testing this for the Field Statistics tab, I noticed that there's an issue with the sample count shown for aggregatable fields in the expanded row, where it incorrectly reports 0 records:

szabosteve

UI text LGTM!

jgowdyelastic

Added some minor comments, but overall LGTM

jgowdyelastic · 2023-12-05T16:30:30Z

...blic/application/common/components/stats_table/components/field_data_row/distinct_values.tsx

+import { ES_FIELD_TYPES, KBN_FIELD_TYPES } from '@kbn/field-types';
+import { SUPPORTED_FIELD_TYPES } from '../../../../../../../common/constants';
+import { useDataVisualizerKibana } from '../../../../../kibana_context';
+import { FieldDataRowProps } from '../../types';


nit

Suggested change

import { FieldDataRowProps } from '../../types';

import type { FieldDataRowProps } from '../../types';

Updated here 6e88086

jgowdyelastic · 2023-12-05T16:36:30Z

...blic/application/common/components/stats_table/components/field_data_row/distinct_values.tsx

+    },
+  } = useDataVisualizerKibana();
+
+  const cardinality = config?.stats?.cardinality;


Suggested change

const cardinality = config?.stats?.cardinality;

const cardinality = stats?.cardinality;

Updated here 6e88086

jgowdyelastic · 2023-12-05T16:38:12Z

...ublic/application/common/components/stats_table/components/field_data_row/document_stats.tsx

+    <EuiText size={'xs'}>
+      {fieldFormats
+        .getDefaultInstance(KBN_FIELD_TYPES.NUMBER, [ES_FIELD_TYPES.INTEGER])
+        .convert(valueCount)}{' '}


Is the space char {' '} needed here?

For this one, yes, since it's separating two different values

jgowdyelastic · 2023-12-05T16:42:03Z

.../plugins/data_visualizer/public/application/index_data_visualizer/hooks/use_overall_stats.ts

          value.forEach((resp, idx) => {
+            if (idx === 0 && isNonAggregatableSampledDocs(resp)) {
+              const docs = resp.rawResponse.hits.hits.map((d) =>
+                getProcessedFields(d.fields ?? {})


Rather than calling getProcessedFields with an empty object, it might be neater to make the check beforehand, e.g.:

d.fields ? getProcessedFields(d.fields) : {}

Updated here 6e88086
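
A small sketch comparing the two forms discussed in this thread. The `getProcessedFields` body below is a hypothetical stand-in (its unwrapping behaviour is assumed purely for illustration), as are the `HitFields` and `Hit` types.

```typescript
type HitFields = Record<string, unknown[]>;

// Hypothetical stand-in for the real getProcessedFields helper.
// Assumption for illustration only: it unwraps single-element arrays.
function getProcessedFields(fields: HitFields): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [name, values] of Object.entries(fields)) {
    out[name] = values.length === 1 ? values[0] : values;
  }
  return out;
}

interface Hit {
  fields?: HitFields;
}

const hits: Hit[] = [{ fields: { message: ['hello'] } }, {}];

// Original form: always calls the helper, passing {} when fields is missing.
const viaFallback = hits.map((d) => getProcessedFields(d.fields ?? {}));

// Suggested form: skips the call entirely when fields is missing.
const viaCheck = hits.map((d) => (d.fields ? getProcessedFields(d.fields) : {}));

console.log(JSON.stringify(viaFallback) === JSON.stringify(viaCheck)); // true
```

Under this stand-in both forms produce the same result; the suggested form simply avoids a function call for hits without `fields`.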

qn895 · 2023-12-05T19:04:50Z

@elasticmachine merge upstream

kibana-ci · 2023-12-05T20:23:00Z

💚 Build Succeeded

Buildkite Build
Commit: 075d5a4

Metrics [docs]

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id	before	after	diff
`dataVisualizer`	616.2KB	618.6KB	+2.4KB

Unknown metric groups

References to deprecated APIs

id	before	after	diff
`dataVisualizer`	21	22	+1

History

💔 Build #181906 failed 6e88086
💚 Build #181461 succeeded adb6ae0
💔 Build #181187 failed 12bc9f7
💔 Build #181025 failed bbed315
💔 Build #181017 failed b767c85

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @qn895

qn895 added 3 commits December 1, 2023 11:07

Add sampled % of documents & cardinality for text fields

f4ce944

Change to constants

ed88d3b

Update i18n ids

b767c85

qn895 added :ml v8.12.0 labels Dec 1, 2023

qn895 requested review from alvarezmelissa87 and peteharverson December 1, 2023 17:34

qn895 self-assigned this Dec 1, 2023

qn895 requested a review from a team as a code owner December 1, 2023 17:34

qn895 added release_note:enhancement Feature:File and Index Data Viz ML file and index data visualizer labels Dec 1, 2023

qn895 added 4 commits December 1, 2023 12:40

Fix type

bbed315

Fix isNonAggregatableSampledDocs check

40cfce9

Fix formatting

00f1eab

Fix bug with doc count chart not showing first bucket if selected tim…

12bc9f7

…e is slightly above

qn895 changed the title ~~[ML] Add sampled % of documents & cardinality for text fields for Data visualizer/Field stats~~ [ML] Add sampled % of documents & cardinality for text fields for Data visualizer/Field stats & fix missing bucket in doc count chart Dec 4, 2023

qn895 requested a review from szabosteve December 4, 2023 03:15

qn895 added 2 commits December 4, 2023 10:52

Update more docCountFormatted in functional tests

1940769

Merge remote-tracking branch 'upstream/main' into ml-dv-stats-for-non…

adb6ae0

…-aggregatable

szabosteve reviewed Dec 5, 2023

peteharverson approved these changes Dec 5, 2023

qn895 added 2 commits December 5, 2023 10:09

Update translations

e6facd0

Merge remote-tracking branch 'upstream/main' into ml-dv-stats-for-non…

fbe174b

…-aggregatable

szabosteve approved these changes Dec 5, 2023

jgowdyelastic approved these changes Dec 5, 2023

Fix comments

6e88086

qn895 enabled auto-merge (squash) December 5, 2023 17:28

Merge branch 'main' into ml-dv-stats-for-non-aggregatable

075d5a4

qn895 merged commit 681b793 into elastic:main Dec 5, 2023
36 checks passed

kibanamachine added the backport:skip This commit does not require backporting label Dec 5, 2023

This was referenced Dec 6, 2023

[ML] Data visualizer: Missing first bar in doc count histogram #172355

Closed

[ML] Shows some stats for non-aggregatable fields in Field statistics #160280

Closed
