[ML] Add sampled % of documents & cardinality for text fields for Data visualizer/Field stats & fix missing bucket in doc count chart (#172378)

## Summary

1. **This PR adds sampled % of documents & cardinality for text fields
for Data visualizer/Field stats**. Previously, text fields did not show
any computed % or cardinality, because text fields are not
aggregatable in Elasticsearch.

This PR fetches a sample of 1000 documents from Elasticsearch and computes
the approximate count % and cardinality from that sample.

<img width="1480" alt="image"
src="https://github.com/elastic/kibana/assets/43350163/8a4e5ddf-36b8-4ca2-90a2-f67ad4a7822c">

It also shows a tooltip message indicating that text fields are using a
much smaller sample:

<img width="1137" alt="image"
src="https://github.com/elastic/kibana/assets/43350163/666e53be-19d8-4eaf-b946-997f3c30b33f">
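
The sampling approach described above can be sketched roughly as follows. This is a minimal, self-contained illustration rather than the code in this PR: `SampledDoc`, `SampledFieldStats`, and `computeSampledStats` are hypothetical names, and the percentage here is taken relative to the number of sampled documents, which is an assumption about how the figure might be derived.

```typescript
// Hypothetical shape of a sampled document: field name -> flattened value.
type SampledDoc = Record<string, string>;

interface SampledFieldStats {
  count: number; // sampled documents that contain the field
  cardinality: number; // distinct values seen in the sample
  sampleCount: number; // number of documents in the sample
  percent: number; // count / sampleCount * 100 (0 for an empty sample)
}

// Walk the sample once per field, counting presence and collecting
// distinct values in a Set. The result is only an approximation of
// the true index-wide statistics.
function computeSampledStats(
  docs: SampledDoc[],
  fields: string[]
): Record<string, SampledFieldStats> {
  const result: Record<string, SampledFieldStats> = {};
  for (const field of fields) {
    const unique = new Set<string>();
    let count = 0;
    for (const doc of docs) {
      const value = doc[field];
      if (value !== undefined) {
        count += 1;
        unique.add(value);
      }
    }
    result[field] = {
      count,
      cardinality: unique.size,
      sampleCount: docs.length,
      percent: docs.length > 0 ? (count / docs.length) * 100 : 0,
    };
  }
  return result;
}

// Tiny made-up sample: three documents carry `message`, one carries `note`.
const sample: SampledDoc[] = [
  { message: 'login ok' },
  { message: 'login failed' },
  { message: 'login ok', note: 'retry' },
  {},
];
const sampledStats = computeSampledStats(sample, ['message', 'note']);
```

Because the statistics come from a bounded sample, they can under-report cardinality for fields with many distinct values, which is why the tooltip calls out the sample size.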





2. **This PR also fixes an issue with the first bucket "missing" in the
doc count chart**. See #172355.
This happens when the selected start time does not line up exactly with
the first bucket's timestamp. With this PR, that bucket is no longer
filtered out when the selected range covers it only partially.

For example, selecting a start time of 18:04 previously wiped out the
first bucket at 18:00.

Before:
<img width="1137" alt="image"
src="https://github.com/elastic/kibana/assets/43350163/3dd4a2b7-84f6-40bb-aa77-a8eae14ba8bb">

After:

<img width="1137" alt="image"
src="https://github.com/elastic/kibana/assets/43350163/00c29100-f90e-4477-9374-e1366cea5b7c">
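
The effect can be reproduced with a small sketch. It assumes hourly buckets (the interval is not stated above) and uses a plain filter as a stand-in for a fixed x domain; `bucketStart`, `clipToDomain`, and the sample values are hypothetical and do not model the chart library itself.

```typescript
interface ChartPoint {
  time: number; // bucket key: the start of the bucket, in epoch milliseconds
  value: number; // document count in the bucket
}

// Floor a timestamp to the start of the bucket that contains it.
function bucketStart(timestampMs: number, intervalMs: number): number {
  return Math.floor(timestampMs / intervalMs) * intervalMs;
}

// Stand-in for a strict x domain: keep only buckets whose key lies
// inside [min, max]. A bucket that starts before `min` is dropped even
// if most of its documents fall inside the selected range.
function clipToDomain(points: ChartPoint[], min: number, max: number): ChartPoint[] {
  return points.filter((p) => p.time >= min && p.time <= max);
}

const MINUTE = 60 * 1000;
const HOUR = 60 * MINUTE;

const t1800 = Date.UTC(2023, 11, 5, 18, 0); // 18:00 UTC
const selectedEarliest = t1800 + 4 * MINUTE; // user selects 18:04
const selectedLatest = t1800 + 3 * HOUR;

const points: ChartPoint[] = [
  // A document at 18:05 lands in the bucket keyed 18:00.
  { time: bucketStart(t1800 + 5 * MINUTE, HOUR), value: 12 },
  { time: t1800 + HOUR, value: 30 },
  { time: t1800 + 2 * HOUR, value: 25 },
];

// With strict clipping the 18:00 bucket disappears, because its key
// (18:00) is earlier than the selected start (18:04).
const clipped = clipToDomain(points, selectedEarliest, selectedLatest);
```

Leaving the domain unset, as the diff below does by removing `xDomain`, lets the chart derive its extent from the data, so the partial first bucket stays visible.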

### Checklist

Delete any items that are not applicable to this PR.

- [ ] Any text added follows [EUI's writing
guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses
sentence case text and includes [i18n
support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)
- [ ]
[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)
was added for features that require explanation or tutorials
- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
- [ ] [Flaky Test
Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was
used on any tests changed
- [ ] Any UI touched in this PR is usable by keyboard only (learn more
about [keyboard accessibility](https://webaim.org/techniques/keyboard/))
- [ ] Any UI touched in this PR does not create any new axe failures
(run axe in browser:
[FF](https://addons.mozilla.org/en-US/firefox/addon/axe-devtools/),
[Chrome](https://chrome.google.com/webstore/detail/axe-web-accessibility-tes/lhdoppojpmngadmnindnejefpokejbdd?hl=en-US))
- [ ] If a plugin configuration key changed, check if it needs to be
allowlisted in the cloud and added to the [docker
list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker)
- [ ] This renders correctly on smaller devices using a responsive
layout. (You can test this [in your
browser](https://www.browserstack.com/guide/responsive-testing-on-local-server))
- [ ] This was checked for [cross-browser
compatibility](https://www.elastic.co/support/matrix#matrix_browsers)


### Risk Matrix

Delete this section if it is not applicable to this PR.

Before closing this PR, invite QA, stakeholders, and other developers to
identify risks that should be tested prior to the change/feature
release.

When forming the risk matrix, consider some of the following examples
and how they may potentially impact the change:

| Risk | Probability | Severity | Mitigation/Notes |
|---------------------------|-------------|----------|-------------------------|
| Multiple Spaces&mdash;unexpected behavior in non-default Kibana Space. | Low | High | Integration tests will verify that all features are still supported in non-default Kibana Space and when user switches between spaces. |
| Multiple nodes&mdash;Elasticsearch polling might have race conditions when multiple Kibana nodes are polling for the same tasks. | High | Low | Tasks are idempotent, so executing them multiple times will not result in logical error, but will degrade performance. To test for this case we add plenty of unit tests around this logic and document manual testing procedure. |
| Code should gracefully handle cases when feature X or plugin Y are disabled. | Medium | High | Unit tests will verify that any feature flag or plugin combination still results in our service operational. |
| [See more potential risk examples](https://github.com/elastic/kibana/blob/main/RISK_MATRIX.mdx) |


### For maintainers

- [ ] This was checked for breaking API changes and was [labeled
appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)
qn895 authored Dec 5, 2023
1 parent b722cbd commit 681b793
Showing 9 changed files with 235 additions and 53 deletions.
@@ -81,11 +81,6 @@ export const DocumentCountChart: FC<Props> = ({
}
);

const xDomain = {
min: timeRangeEarliest,
max: timeRangeLatest,
};

const adjustedChartPoints = useMemo(() => {
// Display empty chart when no data in range
if (chartPoints.length < 1) return [{ time: timeRangeEarliest, value: 0 }];
@@ -149,7 +144,6 @@ export const DocumentCountChart: FC<Props> = ({
}}
>
<Settings
xDomain={xDomain}
onBrushEnd={onBrushEnd as BrushEndListener}
onElementClick={onElementClick}
theme={chartTheme}
@@ -5,21 +5,67 @@
* 2.0.
*/

import { EuiIcon, EuiText } from '@elastic/eui';
import { EuiIcon, EuiText, EuiToolTip } from '@elastic/eui';

import React from 'react';
import { FormattedMessage } from '@kbn/i18n-react';
import { ES_FIELD_TYPES, KBN_FIELD_TYPES } from '@kbn/field-types';
import { SUPPORTED_FIELD_TYPES } from '../../../../../../../common/constants';
import { useDataVisualizerKibana } from '../../../../../kibana_context';
import type { FieldDataRowProps } from '../../types';

interface Props {
cardinality?: number;
interface Props extends FieldDataRowProps {
showIcon?: boolean;
}

export const DistinctValues = ({ cardinality, showIcon }: Props) => {
if (cardinality === undefined) return null;
export const DistinctValues = ({ showIcon, config }: Props) => {
const { stats, type } = config;
const {
services: {
data: { fieldFormats },
},
} = useDataVisualizerKibana();

const cardinality = stats?.cardinality;

if (cardinality === undefined || stats === undefined) return null;

const { sampleCount } = stats;

const tooltipContent =
type === SUPPORTED_FIELD_TYPES.TEXT ? (
<FormattedMessage
id="xpack.dataVisualizer.sampledCardinalityForTextFieldsMsg"
defaultMessage="The cardinality for text fields is calculated from a sample of {sampledDocumentsFormatted} {sampledDocuments, plural, one {record} other {records}}."
values={{
sampledDocuments: sampleCount,
sampledDocumentsFormatted: (
<strong>
{fieldFormats
.getDefaultInstance(KBN_FIELD_TYPES.NUMBER, [ES_FIELD_TYPES.INTEGER])
.convert(sampleCount)}
</strong>
),
}}
/>
) : null;

const icon = showIcon ? (
type === SUPPORTED_FIELD_TYPES.TEXT ? (
<EuiToolTip content={tooltipContent}>
<EuiIcon type="partial" size={'m'} className={'columnHeader__icon'} />
</EuiToolTip>
) : (
<EuiIcon type="database" size={'m'} className={'columnHeader__icon'} />
)
) : null;

const content = <EuiText size={'xs'}>{cardinality}</EuiText>;

return (
<>
{showIcon ? <EuiIcon type="database" size={'m'} className={'columnHeader__icon'} /> : null}
<EuiText size={'xs'}>{cardinality}</EuiText>
{icon}
<EuiToolTip content={tooltipContent}>{content}</EuiToolTip>
</>
);
};
@@ -5,11 +5,13 @@
* 2.0.
*/

import { EuiIcon, EuiText } from '@elastic/eui';
import { EuiIcon, EuiText, EuiToolTip } from '@elastic/eui';

import React from 'react';
import { ES_FIELD_TYPES, KBN_FIELD_TYPES } from '@kbn/field-types';
import { roundToDecimalPlace } from '@kbn/ml-number-utils';
import { FormattedMessage } from '@kbn/i18n-react';
import { SUPPORTED_FIELD_TYPES } from '../../../../../../../common/constants';
import { useDataVisualizerKibana } from '../../../../../kibana_context';
import { isIndexBasedFieldVisConfig } from '../../../../../../../common/types/field_vis_config';
import type { FieldDataRowProps } from '../../types/field_data_row';
@@ -19,7 +21,7 @@ interface Props extends FieldDataRowProps {
totalCount?: number;
}
export const DocumentStat = ({ config, showIcon, totalCount }: Props) => {
const { stats } = config;
const { stats, type } = config;
const {
services: {
data: { fieldFormats },
@@ -40,15 +42,47 @@ export const DocumentStat = ({ config, showIcon, totalCount }: Props) => {
? `(${roundToDecimalPlace((valueCount / total) * 100)}%)`
: null;

const content = (
<EuiText size={'xs'}>
{fieldFormats
.getDefaultInstance(KBN_FIELD_TYPES.NUMBER, [ES_FIELD_TYPES.INTEGER])
.convert(valueCount)}{' '}
{docsPercent}
</EuiText>
);

const tooltipContent =
type === SUPPORTED_FIELD_TYPES.TEXT ? (
<FormattedMessage
id="xpack.dataVisualizer.sampledPercentageForTextFieldsMsg"
defaultMessage="The % of documents for text fields is calculated from a sample of {sampledDocumentsFormatted} {sampledDocuments, plural, one {record} other {records}}."
values={{
sampledDocuments: sampleCount,
sampledDocumentsFormatted: (
<strong>
{fieldFormats
.getDefaultInstance(KBN_FIELD_TYPES.NUMBER, [ES_FIELD_TYPES.INTEGER])
.convert(sampleCount)}
</strong>
),
}}
/>
) : null;

const icon = showIcon ? (
type === SUPPORTED_FIELD_TYPES.TEXT ? (
<EuiToolTip content={tooltipContent}>
<EuiIcon type="partial" size={'m'} className={'columnHeader__icon'} />
</EuiToolTip>
) : (
<EuiIcon type="document" size={'m'} className={'columnHeader__icon'} />
)
) : null;

return valueCount !== undefined ? (
<>
{showIcon ? <EuiIcon type="document" size={'m'} className={'columnHeader__icon'} /> : null}
<EuiText size={'xs'}>
{fieldFormats
.getDefaultInstance(KBN_FIELD_TYPES.NUMBER, [ES_FIELD_TYPES.INTEGER])
.convert(valueCount)}{' '}
{docsPercent}
</EuiText>
{icon}
<EuiToolTip content={tooltipContent}>{content}</EuiToolTip>
</>
) : null;
};
@@ -275,9 +275,7 @@ export const DataVisualizerTable = <T extends DataVisualizerTableItem>({
);
}

return (
<DistinctValues cardinality={item?.stats?.cardinality} showIcon={dimensions.showIcon} />
);
return <DistinctValues config={item} showIcon={dimensions.showIcon} />;
},
sortable: (item: DataVisualizerTableItem) => item?.stats?.cardinality,
align: LEFT_ALIGNMENT as HorizontalAlignment,
@@ -15,13 +15,16 @@ import type {
ISearchOptions,
} from '@kbn/data-plugin/common';
import { extractErrorProperties } from '@kbn/ml-error-utils';
import { getProcessedFields } from '@kbn/ml-data-grid';
import { useDataVisualizerKibana } from '../../kibana_context';
import {
AggregatableFieldOverallStats,
checkAggregatableFieldsExistRequest,
checkNonAggregatableFieldExistsRequest,
getSampleOfDocumentsForNonAggregatableFields,
isAggregatableFieldOverallStats,
isNonAggregatableFieldOverallStats,
isNonAggregatableSampledDocs,
NonAggregatableFieldOverallStats,
processAggregatableFieldsExistResponse,
processNonAggregatableFieldsExistResponse,
@@ -128,6 +131,26 @@ export function useOverallStats<TParams extends OverallStatsSearchStrategyParams
probability
);

const nonAggregatableFieldsExamplesObs = data.search
.search<IKibanaSearchRequest, IKibanaSearchResponse>(
{
params: getSampleOfDocumentsForNonAggregatableFields(
nonAggregatableFields,
index,
searchQuery,
timeFieldName,
earliest,
latest,
runtimeFieldMap
),
},
searchOptions
)
.pipe(
map((resp) => {
return resp as IKibanaSearchResponse;
})
);
const nonAggregatableFieldsObs = nonAggregatableFields.map((fieldName: string) =>
data.search
.search<IKibanaSearchRequest, IKibanaSearchResponse>(
@@ -190,14 +213,29 @@ export function useOverallStats<TParams extends OverallStatsSearchStrategyParams

const sub = rateLimitingForkJoin<
AggregatableFieldOverallStats | NonAggregatableFieldOverallStats | undefined
>([...aggregatableOverallStatsObs, ...nonAggregatableFieldsObs], MAX_CONCURRENT_REQUESTS);
>(
[
nonAggregatableFieldsExamplesObs,
...aggregatableOverallStatsObs,
...nonAggregatableFieldsObs,
],
MAX_CONCURRENT_REQUESTS
);

searchSubscription$.current = sub.subscribe({
next: (value) => {
const aggregatableOverallStatsResp: AggregatableFieldOverallStats[] = [];
const nonAggregatableOverallStatsResp: NonAggregatableFieldOverallStats[] = [];

let sampledNonAggregatableFieldsExamples: Array<{ [key: string]: string }> | undefined;
value.forEach((resp, idx) => {
if (idx === 0 && isNonAggregatableSampledDocs(resp)) {
const docs = resp.rawResponse.hits.hits.map((d) =>
d.fields ? getProcessedFields(d.fields) : {}
);

sampledNonAggregatableFieldsExamples = docs;
}
if (isAggregatableFieldOverallStats(resp)) {
aggregatableOverallStatsResp.push(resp);
}
@@ -214,9 +252,27 @@ export function useOverallStats<TParams extends OverallStatsSearchStrategyParams
aggregatableFields
);

const nonAggregatableFieldsCount: number[] = new Array(nonAggregatableFields.length).fill(
0
);
const nonAggregatableFieldsUniqueCount = nonAggregatableFields.map(
() => new Set<string>()
);
if (sampledNonAggregatableFieldsExamples) {
sampledNonAggregatableFieldsExamples.forEach((doc) => {
nonAggregatableFields.forEach((field, fieldIdx) => {
if (doc.hasOwnProperty(field)) {
nonAggregatableFieldsCount[fieldIdx] += 1;
nonAggregatableFieldsUniqueCount[fieldIdx].add(doc[field]!);
}
});
});
}
const nonAggregatableOverallStats = processNonAggregatableFieldsExistResponse(
nonAggregatableOverallStatsResp,
nonAggregatableFields
nonAggregatableFields,
nonAggregatableFieldsCount,
nonAggregatableFieldsUniqueCount
);

setOverallStats({
@@ -114,6 +114,15 @@ export function isNonAggregatableFieldOverallStats(
return isPopulatedObject(arg, ['rawResponse']);
}

export function isNonAggregatableSampledDocs(
arg: unknown
): arg is IKibanaSearchResponse<estypes.SearchResponse<unknown>> {
return (
isPopulatedObject(arg, ['rawResponse']) &&
(arg.rawResponse as estypes.SearchResponse).hasOwnProperty('hits')
);
}

export const processAggregatableFieldsExistResponse = (
responses: AggregatableFieldOverallStats[] | undefined,
aggregatableFields: OverallStatsSearchStrategyParams['aggregatableFields'],
@@ -204,6 +213,10 @@ export const checkNonAggregatableFieldExistsRequest = (
const size = 0;
const filterCriteria = buildBaseFilterCriteria(timeFieldName, earliestMs, latestMs, query);

if (Array.isArray(filterCriteria)) {
filterCriteria.push({ exists: { field } });
}

const searchBody = {
query: {
bool: {
@@ -212,9 +225,6 @@ export const checkNonAggregatableFieldExistsRequest = (
},
...(isPopulatedObject(runtimeMappings) ? { runtime_mappings: runtimeMappings } : {}),
};
if (Array.isArray(filterCriteria)) {
filterCriteria.push({ exists: { field } });
}

return {
index,
@@ -227,9 +237,40 @@ export const checkNonAggregatableFieldExistsRequest = (
};
};

const DEFAULT_DOCS_SAMPLE_OF_TEXT_FIELDS_SIZE = 1000;

export const getSampleOfDocumentsForNonAggregatableFields = (
nonAggregatableFields: string[],
dataViewTitle: string,
query: Query['query'],
timeFieldName: string | undefined,
earliestMs: number | undefined,
latestMs: number | undefined,
runtimeMappings?: estypes.MappingRuntimeFields
): estypes.SearchRequest => {
const index = dataViewTitle;
const filterCriteria = buildBaseFilterCriteria(timeFieldName, earliestMs, latestMs, query);

return {
index,
body: {
fields: nonAggregatableFields.map((fieldName) => fieldName),
query: {
bool: {
filter: filterCriteria,
},
},
...(isPopulatedObject(runtimeMappings) ? { runtime_mappings: runtimeMappings } : {}),
size: DEFAULT_DOCS_SAMPLE_OF_TEXT_FIELDS_SIZE,
},
};
};

export const processNonAggregatableFieldsExistResponse = (
results: IKibanaSearchResponse[] | undefined,
nonAggregatableFields: string[]
nonAggregatableFields: string[],
nonAggregatableFieldsCount: number[],
nonAggregatableFieldsUniqueCount: Array<Set<string>>
) => {
const stats = {
nonAggregatableExistsFields: [] as NonAggregatableField[],
@@ -238,12 +279,17 @@ export const processNonAggregatableFieldsExistResponse = (

if (!results || nonAggregatableFields.length === 0) return stats;

nonAggregatableFields.forEach((fieldName) => {
nonAggregatableFields.forEach((fieldName, fieldIdx) => {
const foundField = results.find((r) => r.rawResponse.fieldName === fieldName);
const existsInDocs = foundField !== undefined && foundField.rawResponse.hits.total > 0;
const fieldData: NonAggregatableField = {
fieldName,
existsInDocs,
stats: {
count: nonAggregatableFieldsCount[fieldIdx],
cardinality: nonAggregatableFieldsUniqueCount[fieldIdx].size,
sampleCount: DEFAULT_DOCS_SAMPLE_OF_TEXT_FIELDS_SIZE,
},
};
if (existsInDocs === true) {
stats.nonAggregatableExistsFields.push(fieldData);