[8.x] [Auto Import] CSV format support (#194386) (#196090)
# Backport

This will backport the following commits from `main` to `8.x`:
- [[Auto Import] CSV format support (#194386)](#194386)

<!--- Backport version: 9.4.3 -->

### Questions?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Ilya
Nikokoshev","email":"[email protected]"},"sourceCommit":{"committedDate":"2024-10-14T10:24:58Z","message":"[Auto
Import] CSV format support (#194386)\n\n## Release
Notes\r\n\r\nAutomatic Import can now create integrations for logs in
the CSV format.\r\nOwing to the maturity of log format support, we thus
remove the verbiage\r\nabout requiring the JSON/NDJSON format.\r\n\r\n##
Summary\r\n\r\n**Added: the CSV feature**\r\n\r\nThe issue is
#194342 \r\n\r\nWhen the user
adds a log sample whose format is recognized as CSV by the\r\nLLM, we
now parse the samples and insert
the\r\n[csv](https://www.elastic.co/guide/en/elasticsearch/reference/current/csv-processor.html)\r\nprocessor
into the generated pipeline.\r\n\r\nIf the header is present, we use it
for the field names and add
a\r\n[drop](https://www.elastic.co/guide/en/elasticsearch/reference/current/drop-processor.html)\r\nprocessor
that removes a header from the document stream by comparing\r\nthe
values to the header values.\r\n\r\nIf the header is missing, we ask the
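
For illustration only, a minimal sketch of the two processors this step adds to the generated pipeline, assuming a hypothetical sample whose header row is `ts,src_ip,msg` (the actual generated pipeline will differ in detail):

```yaml
processors:
  - csv:
      # Parse the raw line into the fields named by the header
      # (or proposed by the LLM when the header is missing).
      field: message
      target_fields:
        - ts
        - src_ip
        - msg
  - drop:
      # Drop the header row itself: when it is parsed, every target
      # field holds its own column name.
      if: ctx.ts == 'ts' && ctx.src_ip == 'src_ip' && ctx.msg == 'msg'
```
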
If the header is missing, we ask the LLM to generate a list of column names, providing some context such as the package and data stream titles.

Should the header or the LLM suggestion prove unsuitable for a specific column, we fall back to `column1`, `column2`, and so on. To avoid duplicate column names, we append suffixes such as `_2` as necessary.
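
These naming rules can be sketched as follows (a hypothetical helper for illustration, not the code from this PR):

```ts
// Build final column names from header values or LLM suggestions.
function toColumnNames(candidates: Array<string | undefined>): string[] {
  const used = new Set<string>();
  return candidates.map((candidate, index) => {
    // Fall back to `column1`, `column2`, ... when no usable name exists.
    const base = candidate?.trim() || `column${index + 1}`;
    // Append `_2`, `_3`, ... to avoid duplicate column names.
    let name = base;
    for (let n = 2; used.has(name); n++) {
      name = `${base}_${n}`;
    }
    used.add(name);
    return name;
  });
}

// toColumnNames(['ip', 'ip', undefined]) -> ['ip', 'ip_2', 'column3']
```
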
If the format appears to be CSV but the `csv` processor fails, we bubble up an error using the recently introduced `ErrorThatHandlesItsOwnResponse` class. This also provides the first example of passing additional attributes of an error (in this case, the original CSV error) back to the client; the error message is composed on the client side.

**Removed: supported formats message**

The message that asked the user to upload logs in `JSON/NDJSON format` is removed in this PR:

<img width="741" alt="image" src="https://github.com/user-attachments/assets/34d571c3-b12c-44a1-98e3-d7549160be12">

**Refactoring**

The refactoring makes the "→JSON" conversion process more uniform across the different chains and centralizes processor definitions in `.../server/util/processors.ts`.

The log format chain now expects the LLM to follow the `SamplesFormat` schema when providing the information, rather than an ad-hoc format.

When testing, the `fail` method is [not supported in `jest`](https://stackoverflow.com/a/54244479/23968144), so it is removed.

See the PR for examples and follow-up.

Co-authored-by: Elastic Machine <[email protected]>

Co-authored-by: Ilya Nikokoshev <[email protected]>
kibanamachine and ilyannn authored Oct 14, 2024
1 parent 7a80e6f commit 6378ff3
Showing 47 changed files with 853 additions and 132 deletions.
@@ -14,6 +14,8 @@ export const logFormatDetectionTestState = {
exAnswer: 'testanswer',
packageName: 'testPackage',
dataStreamName: 'testDatastream',
packageTitle: 'Test Title',
dataStreamTitle: 'Test Datastream Title',
finalized: false,
samplesFormat: { name: SamplesFormatName.Values.structured },
header: true,
@@ -19,6 +19,8 @@ import { z } from '@kbn/zod';
import {
PackageName,
DataStreamName,
PackageTitle,
DataStreamTitle,
LogSamples,
Connector,
LangSmithOptions,
@@ -29,6 +31,8 @@ export type AnalyzeLogsRequestBody = z.infer<typeof AnalyzeLogsRequestBody>;
export const AnalyzeLogsRequestBody = z.object({
packageName: PackageName,
dataStreamName: DataStreamName,
packageTitle: PackageTitle,
dataStreamTitle: DataStreamTitle,
logSamples: LogSamples,
connectorId: Connector,
langSmithOptions: LangSmithOptions.optional(),
@@ -22,11 +22,17 @@ paths:
- connectorId
- packageName
- dataStreamName
- packageTitle
- dataStreamTitle
properties:
packageName:
$ref: "../model/common_attributes.schema.yaml#/components/schemas/PackageName"
dataStreamName:
$ref: "../model/common_attributes.schema.yaml#/components/schemas/DataStreamName"
packageTitle:
$ref: "../model/common_attributes.schema.yaml#/components/schemas/PackageTitle"
dataStreamTitle:
$ref: "../model/common_attributes.schema.yaml#/components/schemas/DataStreamTitle"
logSamples:
$ref: "../model/common_attributes.schema.yaml#/components/schemas/LogSamples"
connectorId:
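
For illustration, a request body that satisfies the updated schema could look like this (all values are hypothetical):

```yaml
packageName: my_package
dataStreamName: my_data_stream
packageTitle: My Package
dataStreamTitle: My Data Stream
logSamples:
  - "2024-10-14T10:24:58Z,203.0.113.7,login failed"
connectorId: my-connector-id
```
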
@@ -0,0 +1,41 @@
/*
* Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
* or more contributor license agreements. Licensed under the Elastic License
* 2.0; you may not use this file except in compliance with the Elastic License
* 2.0.
*/

import type { GenerationErrorCode } from '../constants';

// Errors raised by the generation process should provide information through this interface.
export interface GenerationErrorBody {
message: string;
attributes: GenerationErrorAttributes;
}

export function isGenerationErrorBody(obj: unknown | undefined): obj is GenerationErrorBody {
return (
typeof obj === 'object' &&
obj !== null &&
'message' in obj &&
typeof obj.message === 'string' &&
'attributes' in obj &&
obj.attributes !== undefined &&
isGenerationErrorAttributes(obj.attributes)
);
}

export interface GenerationErrorAttributes {
errorCode: GenerationErrorCode;
underlyingMessages: string[] | undefined;
}

export function isGenerationErrorAttributes(obj: unknown): obj is GenerationErrorAttributes {
return (
typeof obj === 'object' &&
obj !== null &&
'errorCode' in obj &&
typeof obj.errorCode === 'string' &&
(!('underlyingMessages' in obj) || Array.isArray(obj.underlyingMessages))
);
}
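
As an illustration of the interface and guards above, a client receiving an unknown error body could narrow it like this (the body shown is a hypothetical example):

```ts
// A hypothetical error body returned by the generation endpoint.
const body: unknown = {
  message: 'Cannot parse the samples',
  attributes: {
    errorCode: 'unparseable-csv-data',
    underlyingMessages: ['Mismatched quote in row 3'],
  },
};

if (isGenerationErrorBody(body)) {
  // Narrowed to GenerationErrorBody: attributes are now typed.
  console.log(body.attributes.errorCode, body.attributes.underlyingMessages);
}
```
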
@@ -96,6 +96,8 @@ export const getRelatedRequestMock = (): RelatedRequestBody => ({
export const getAnalyzeLogsRequestBody = (): AnalyzeLogsRequestBody => ({
dataStreamName: 'test-data-stream-name',
packageName: 'test-package-name',
packageTitle: 'Test package title',
dataStreamTitle: 'Test data stream title',
connectorId: 'test-connector-id',
logSamples: rawSamples,
});
@@ -31,6 +31,18 @@ export const PackageName = z.string().min(1);
export type DataStreamName = z.infer<typeof DataStreamName>;
export const DataStreamName = z.string().min(1);

/**
* Package title for the integration to be built.
*/
export type PackageTitle = z.infer<typeof PackageTitle>;
export const PackageTitle = z.string().min(1);

/**
* DataStream title for the integration to be built.
*/
export type DataStreamTitle = z.infer<typeof DataStreamTitle>;
export const DataStreamTitle = z.string().min(1);

/**
* String form of the input logsamples.
*/
@@ -86,6 +98,14 @@ export const SamplesFormat = z.object({
* For some formats, specifies whether the samples can be multiline.
*/
multiline: z.boolean().optional(),
/**
* For CSV format, specifies whether the samples have a header row. For other formats, specifies the presence of header in each row.
*/
header: z.boolean().optional(),
/**
* For CSV format, specifies the column names proposed by the LLM.
*/
columns: z.array(z.string()).optional(),
/**
* For a JSON format, describes how to get to the sample array from the root of the JSON.
*/
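
For illustration, a value the log format chain might now return for CSV samples with a header row (assuming `csv` is among the `SamplesFormatName` values; field values are hypothetical):

```ts
const csvSamplesFormat = SamplesFormat.parse({
  name: 'csv',
  header: true,
  // Column names taken from the header (or proposed by the LLM).
  columns: ['ts', 'src_ip', 'msg'],
});
```
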
@@ -16,6 +16,16 @@ components:
minLength: 1
description: DataStream name for the integration to be built.

PackageTitle:
type: string
minLength: 1
description: Package title for the integration to be built.

DataStreamTitle:
type: string
minLength: 1
description: DataStream title for the integration to be built.

LogSamples:
type: array
items:
@@ -66,6 +76,14 @@ components:
multiline:
type: boolean
description: For some formats, specifies whether the samples can be multiline.
header:
type: boolean
description: For CSV format, specifies whether the samples have a header row. For other formats, specifies the presence of header in each row.
columns:
type: array
description: For CSV format, specifies the column names proposed by the LLM.
items:
type: string
json_path:
type: array
description: For a JSON format, describes how to get to the sample array from the root of the JSON.
3 changes: 2 additions & 1 deletion x-pack/plugins/integration_assistant/common/constants.ts
@@ -30,8 +30,9 @@ export const MINIMUM_LICENSE_TYPE: LicenseType = 'enterprise';

// ErrorCodes

export enum ErrorCode {
export enum GenerationErrorCode {
RECURSION_LIMIT = 'recursion-limit',
RECURSION_LIMIT_ANALYZE_LOGS = 'recursion-limit-analyze-logs',
UNSUPPORTED_LOG_SAMPLES_FORMAT = 'unsupported-log-samples-format',
UNPARSEABLE_CSV_DATA = 'unparseable-csv-data',
}
3 changes: 1 addition & 2 deletions x-pack/plugins/integration_assistant/common/index.ts
@@ -27,10 +27,9 @@ export type {
Integration,
Pipeline,
Docs,
SamplesFormat,
LangSmithOptions,
} from './api/model/common_attributes.gen';
export { SamplesFormatName } from './api/model/common_attributes.gen';
export { SamplesFormat, SamplesFormatName } from './api/model/common_attributes.gen';
export type { ESProcessorItem } from './api/model/processor_attributes.gen';
export type { CelInput } from './api/model/cel_input_attributes.gen';

@@ -105,6 +105,8 @@ describe('GenerationModal', () => {
it('should call runAnalyzeLogsGraph with correct parameters', () => {
expect(mockRunAnalyzeLogsGraph).toHaveBeenCalledWith({
...defaultRequest,
packageTitle: 'Mocked Integration title',
dataStreamTitle: 'Mocked Data Stream Title',
logSamples: integrationSettingsNonJSON.logSamples ?? [],
});
});
@@ -82,7 +82,7 @@ export const GenerationModal = React.memo<GenerationModalProps>(
{error ? (
<EuiFlexItem>
<EuiCallOut
title={i18n.GENERATION_ERROR(progressText[progress])}
title={i18n.GENERATION_ERROR_TITLE(progressText[progress])}
color="danger"
iconType="alert"
data-test-subj="generationErrorCallout"
@@ -318,9 +318,6 @@ export const SampleLogsInput = React.memo<SampleLogsInputProps>(({ integrationSe
<EuiText size="s" textAlign="center">
{i18n.LOGS_SAMPLE_DESCRIPTION}
</EuiText>
<EuiText size="xs" color="subdued" textAlign="center">
{i18n.LOGS_SAMPLE_DESCRIPTION_2}
</EuiText>
</>
}
onChange={onChangeLogsSample}
@@ -6,7 +6,8 @@
*/

import { i18n } from '@kbn/i18n';
import { ErrorCode } from '../../../../../../common/constants';
import { GenerationErrorCode } from '../../../../../../common/constants';
import type { GenerationErrorAttributes } from '../../../../../../common/api/generation_error';

export const INTEGRATION_NAME_TITLE = i18n.translate(
'xpack.integrationAssistant.step.dataStream.integrationNameTitle',
@@ -109,12 +110,6 @@ export const LOGS_SAMPLE_DESCRIPTION = i18n.translate(
defaultMessage: 'Drag and drop a file or Browse files.',
}
);
export const LOGS_SAMPLE_DESCRIPTION_2 = i18n.translate(
'xpack.integrationAssistant.step.dataStream.logsSample.description2',
{
defaultMessage: 'JSON/NDJSON format',
}
);
export const LOGS_SAMPLE_TRUNCATED = (maxRows: number) =>
i18n.translate('xpack.integrationAssistant.step.dataStream.logsSample.truncatedWarning', {
values: { maxRows },
@@ -188,7 +183,7 @@ export const PROGRESS_RELATED_GRAPH = i18n.translate(
defaultMessage: 'Generating related fields',
}
);
export const GENERATION_ERROR = (progressStep: string) =>
export const GENERATION_ERROR_TITLE = (progressStep: string) =>
i18n.translate('xpack.integrationAssistant.step.dataStream.generationError', {
values: { progressStep },
defaultMessage: 'An error occurred during: {progressStep}',
@@ -198,24 +193,44 @@ export const RETRY = i18n.translate('xpack.integrationAssistant.step.dataStream.
defaultMessage: 'Retry',
});

export const ERROR_TRANSLATION: Record<ErrorCode, string> = {
[ErrorCode.RECURSION_LIMIT_ANALYZE_LOGS]: i18n.translate(
export const GENERATION_ERROR_TRANSLATION: Record<
GenerationErrorCode,
string | ((attributes: GenerationErrorAttributes) => string)
> = {
[GenerationErrorCode.RECURSION_LIMIT_ANALYZE_LOGS]: i18n.translate(
'xpack.integrationAssistant.errors.recursionLimitAnalyzeLogsErrorMessage',
{
defaultMessage:
'Please verify the format of log samples is correct and try again. Try with a fewer samples if error persists.',
}
),
[ErrorCode.RECURSION_LIMIT]: i18n.translate(
[GenerationErrorCode.RECURSION_LIMIT]: i18n.translate(
'xpack.integrationAssistant.errors.recursionLimitReached',
{
defaultMessage: 'Max attempts exceeded. Please try again.',
}
),
[ErrorCode.UNSUPPORTED_LOG_SAMPLES_FORMAT]: i18n.translate(
[GenerationErrorCode.UNSUPPORTED_LOG_SAMPLES_FORMAT]: i18n.translate(
'xpack.integrationAssistant.errors.unsupportedLogSamples',
{
defaultMessage: 'Unsupported log format in the samples.',
}
),
[GenerationErrorCode.UNPARSEABLE_CSV_DATA]: (attributes) => {
if (
attributes.underlyingMessages !== undefined &&
attributes.underlyingMessages?.length !== 0
) {
return i18n.translate('xpack.integrationAssistant.errors.uparseableCSV.withReason', {
values: {
reason: attributes.underlyingMessages[0],
},
defaultMessage: `Cannot parse the samples as the CSV data (reason: {reason}). Please check the provided samples.`,
});
} else {
return i18n.translate('xpack.integrationAssistant.errors.uparseableCSV.withoutReason', {
defaultMessage: `Cannot parse the samples as the CSV data. Please check the provided samples.`,
});
}
},
};
@@ -16,6 +16,7 @@ import {
type EcsMappingRequestBody,
type RelatedRequestBody,
} from '../../../../../../common';
import { isGenerationErrorBody } from '../../../../../../common/api/generation_error';
import {
runCategorizationGraph,
runEcsGraph,
@@ -26,7 +27,6 @@ import { useKibana } from '../../../../../common/hooks/use_kibana';
import type { State } from '../../state';
import * as i18n from './translations';
import { useTelemetry } from '../../../telemetry';
import type { ErrorCode } from '../../../../../../common/constants';
import type { AIConnector, IntegrationSettings } from '../../types';

export type OnComplete = (result: State['result']) => void;
@@ -46,6 +46,18 @@ interface RunGenerationProps {
setProgress: (progress: ProgressItem) => void;
}

// If the result is classified as a generation error, produce an error message
// as defined in the i18n file. Otherwise, return undefined.
function generationErrorMessage(body: unknown | undefined): string | undefined {
if (!isGenerationErrorBody(body)) {
return;
}

const errorCode = body.attributes.errorCode;
const translation = i18n.GENERATION_ERROR_TRANSLATION[errorCode];
return typeof translation === 'function' ? translation(body.attributes) : translation;
}

interface GenerationResults {
pipeline: Pipeline;
docs: Docs;
@@ -96,12 +108,7 @@ export const useGeneration = ({
error: originalErrorMessage,
});

let errorMessage = originalErrorMessage;
const errorCode = e.body?.attributes?.errorCode as ErrorCode | undefined;
if (errorCode != null) {
errorMessage = i18n.ERROR_TRANSLATION[errorCode];
}
setError(errorMessage);
setError(generationErrorMessage(e.body) ?? originalErrorMessage);
} finally {
setIsRequesting(false);
}
@@ -145,6 +152,9 @@
const analyzeLogsRequest: AnalyzeLogsRequestBody = {
packageName: integrationSettings.name ?? '',
dataStreamName: integrationSettings.dataStreamName ?? '',
packageTitle: integrationSettings.title ?? integrationSettings.name ?? '',
dataStreamTitle:
integrationSettings.dataStreamTitle ?? integrationSettings.dataStreamName ?? '',
logSamples: integrationSettings.logSamples ?? [],
connectorId: connector.id,
langSmithOptions: getLangSmithOptions(),