[ML] Adds sampled % of documents & cardinality for text fields for Data visualizer/Field stats & fix missing bucket in doc count chart #172378

Merged
merged 13 commits into from
Dec 5, 2023

Conversation

qn895
Member

@qn895 qn895 commented Dec 1, 2023

Summary

  1. This PR adds the sampled % of documents and cardinality for text fields in Data visualizer/Field stats. Previously, text fields did not show any computed % or cardinality, because text fields are not aggregatable in Elasticsearch.

This PR fetches a sample of 1000 documents from Elasticsearch and computes the approximate document count % and cardinality from that sample.

image

It also shows a tooltip message indicating that text fields are using a much smaller sample:

image
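
The sampling approach described in this item can be sketched roughly as follows. This is an illustrative sketch only, not the implementation in this PR: `SampledDoc` and `sampledStatsForField` are hypothetical names, and it assumes each sampled hit exposes its field values as arrays keyed by field name.

```typescript
// Hypothetical shape of one sampled hit: field name -> array of values.
type SampledDoc = Record<string, unknown[]>;

interface SampledFieldStats {
  sampledDocs: number; // size of the sample
  count: number; // sampled docs that contain the field
  percent: number; // count / sampledDocs, in the range [0, 1]
  cardinality: number; // distinct values seen in the sample
}

// Hypothetical helper: approximate stats for a non-aggregatable field,
// computed on the client from a small sample of documents.
function sampledStatsForField(docs: SampledDoc[], fieldName: string): SampledFieldStats {
  const distinct = new Set<string>();
  let count = 0;
  for (const doc of docs) {
    const values = doc[fieldName];
    if (Array.isArray(values) && values.length > 0) {
      count++;
      // Serialize so that non-string values are compared by content.
      values.forEach((v) => distinct.add(JSON.stringify(v)));
    }
  }
  return {
    sampledDocs: docs.length,
    count,
    percent: docs.length > 0 ? count / docs.length : 0,
    cardinality: distinct.size,
  };
}
```

Because both numbers come from the sample rather than from the full index, they are approximations, which is what the tooltip is meant to communicate.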
  2. This PR also fixes an issue where the first bucket was missing from the doc count chart. See [ML] Data visualizer: Missing first bar in doc count histogram #172355. This happens when the selected time differs slightly from the first timestamp. With this change, the first bucket's data is no longer filtered out when the bucket is only partially covered.

For example, setting the selected time to 18:04 previously wiped out the first bucket at 18:00.

Before:
image

After:

image
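
One way to picture the doc count chart fix described above: instead of comparing bucket keys against the exact selected start time, compare them against the start of the bucket that contains that time. The sketch below is an assumption about the general technique, not the code from this PR. `bucketStart`, `visibleBuckets`, and the 10-minute interval are hypothetical, and it assumes fixed-width buckets aligned to the epoch in UTC.

```typescript
interface Bucket {
  key: number; // bucket start time in epoch milliseconds
  docCount: number;
}

// Round a timestamp down to the start of the bucket that contains it.
// Assumes fixed-width buckets aligned to the epoch.
function bucketStart(timestampMs: number, intervalMs: number): number {
  return Math.floor(timestampMs / intervalMs) * intervalMs;
}

// Keep every bucket that overlaps [fromMs, toMs], including a first bucket
// that starts before fromMs and is therefore only partially covered.
function visibleBuckets(
  buckets: Bucket[],
  fromMs: number,
  toMs: number,
  intervalMs: number
): Bucket[] {
  const firstKey = bucketStart(fromMs, intervalMs);
  return buckets.filter((b) => b.key >= firstKey && b.key <= toMs);
}

// Example with an assumed 10-minute interval: a range starting at 18:04
// still includes the bucket that starts at 18:00.
const interval = 10 * 60 * 1000;
const at = (h: number, m: number): number => Date.UTC(2023, 11, 1, h, m);
const buckets: Bucket[] = [
  { key: at(18, 0), docCount: 5 },
  { key: at(18, 10), docCount: 7 },
];
const kept = visibleBuckets(buckets, at(18, 4), at(18, 15), interval);
console.log(kept.map((b) => new Date(b.key).toISOString()));
```

A filter that compares `b.key >= fromMs` directly would drop the 18:00 bucket here, which matches the symptom in the "Before" screenshot.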


@qn895 qn895 self-assigned this Dec 1, 2023
@qn895 qn895 requested a review from a team as a code owner December 1, 2023 17:34
@elasticmachine
Contributor

Pinging @elastic/ml-ui (:ml)

@qn895 qn895 changed the title from "[ML] Add sampled % of documents & cardinality for text fields for Data visualizer/Field stats" to "[ML] Add sampled % of documents & cardinality for text fields for Data visualizer/Field stats & fix missing bucket in doc count chart" on Dec 4, 2023
@qn895 qn895 requested a review from szabosteve December 4, 2023 03:15
Contributor

@szabosteve szabosteve left a comment

Left two minor suggestions.

type === SUPPORTED_FIELD_TYPES.TEXT ? (
<FormattedMessage
id="xpack.dataVisualizer.sampledPercentageForTextFieldsMsg"
defaultMessage="The % of documents for text fields is sampled and calculated from {sampledDocumentsFormatted} sample {sampledDocuments, plural, one {record} other {records}}."
Contributor

I think it could be a bit simpler.

Suggested change
defaultMessage="The % of documents for text fields is sampled and calculated from {sampledDocumentsFormatted} sample {sampledDocuments, plural, one {record} other {records}}."
defaultMessage="The % of documents for text fields is calculated from a sample of {sampledDocumentsFormatted} {sampledDocuments, plural, one {record} other {records}}."

Member Author

Updated here e6facd0

type === SUPPORTED_FIELD_TYPES.TEXT ? (
<FormattedMessage
id="xpack.dataVisualizer.sampledCardinalityForTextFieldsMsg"
defaultMessage="The cardinality for text fields is sampled and calculated from {sampledDocumentsFormatted} sample {sampledDocuments, plural, one {record} other {records}}."
Contributor

@szabosteve szabosteve Dec 5, 2023

As above.

Suggested change
defaultMessage="The cardinality for text fields is sampled and calculated from {sampledDocumentsFormatted} sample {sampledDocuments, plural, one {record} other {records}}."
defaultMessage="The cardinality for text fields is calculated from a sample of {sampledDocumentsFormatted} {sampledDocuments, plural, one {record} other {records}}."

Member Author

Updated here e6facd0

Contributor

@peteharverson peteharverson left a comment

Tested and LGTM.

Testing this for the Field Statistics tab, I noticed that there's an issue with the sample count shown for aggregatable fields in the expanded row, where it incorrectly reports 0 records:

image

Contributor

@szabosteve szabosteve left a comment

UI text LGTM!

Member

@jgowdyelastic jgowdyelastic left a comment

Added some minor comments, but overall LGTM

import { ES_FIELD_TYPES, KBN_FIELD_TYPES } from '@kbn/field-types';
import { SUPPORTED_FIELD_TYPES } from '../../../../../../../common/constants';
import { useDataVisualizerKibana } from '../../../../../kibana_context';
import { FieldDataRowProps } from '../../types';
Member

nit

Suggested change
import { FieldDataRowProps } from '../../types';
import type { FieldDataRowProps } from '../../types';

Member Author

Updated here 6e88086

},
} = useDataVisualizerKibana();

const cardinality = config?.stats?.cardinality;
Member

Suggested change
const cardinality = config?.stats?.cardinality;
const cardinality = stats?.cardinality;

Member Author

Updated here 6e88086

<EuiText size={'xs'}>
{fieldFormats
.getDefaultInstance(KBN_FIELD_TYPES.NUMBER, [ES_FIELD_TYPES.INTEGER])
.convert(valueCount)}{' '}
Member

Is the space char {' '} needed here?

Member Author

For this one, yes, since it's separating two different values

value.forEach((resp, idx) => {
if (idx === 0 && isNonAggregatableSampledDocs(resp)) {
const docs = resp.rawResponse.hits.hits.map((d) =>
getProcessedFields(d.fields ?? {})
Member

rather than calling getProcessedFields with an empty object, it might be neater to make the check beforehand.
e.g.

d.fields ? getProcessedFields(d.fields) : {}
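
The guard-first form suggested here can be sketched as follows. `processFields` is a hypothetical stand-in, since the behavior of `getProcessedFields` is not shown in this thread, and the `Hit` shape is likewise an assumption.

```typescript
type Fields = Record<string, unknown[]>;

interface Hit {
  fields?: Fields; // may be absent on a hit
}

// Hypothetical stand-in for getProcessedFields: unwraps single-value arrays.
function processFields(fields: Fields): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  Object.keys(fields).forEach((name) => {
    const values = fields[name];
    out[name] = values.length === 1 ? values[0] : values;
  });
  return out;
}

// Guard first, so the helper is only called when there is something to process.
function docsFromHits(hits: Hit[]): Array<Record<string, unknown>> {
  return hits.map((d) => (d.fields ? processFields(d.fields) : {}));
}
```

Both forms yield an empty object for a hit without `fields`; the guard simply makes that case explicit at the call site.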

Member Author

Updated here 6e88086

@qn895 qn895 enabled auto-merge (squash) December 5, 2023 17:28
@qn895
Member Author

qn895 commented Dec 5, 2023

@elasticmachine merge upstream

@qn895 qn895 merged commit 681b793 into elastic:main Dec 5, 2023
36 checks passed
@kibana-ci
Collaborator

💚 Build Succeeded

Metrics [docs]

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

| id | before | after | diff |
| --- | --- | --- | --- |
| dataVisualizer | 616.2KB | 618.6KB | +2.4KB |

Unknown metric groups

References to deprecated APIs

| id | before | after | diff |
| --- | --- | --- | --- |
| dataVisualizer | 21 | 22 | +1 |

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @qn895

@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label Dec 5, 2023
@szabosteve szabosteve changed the title from "[ML] Add sampled % of documents & cardinality for text fields for Data visualizer/Field stats & fix missing bucket in doc count chart" to "[ML] Adds sampled % of documents & cardinality for text fields for Data visualizer/Field stats & fix missing bucket in doc count chart" on Dec 8, 2023
Labels
backport:skip This commit does not require backporting Feature:File and Index Data Viz ML file and index data visualizer :ml release_note:enhancement v8.12.0
7 participants