[ML] File upload: Adds support for PDF files #186956

Merged
jgowdyelastic merged 81 commits into elastic:main on Aug 22, 2024

@jgowdyelastic jgowdyelastic commented Jun 26, 2024

Also adds support for txt, rtf, doc, docx, xls, xlsx, ppt, pptx, odt, ods, and odp files.

Adds the ability to automatically add a semantic text field to the mappings and a copy_to processor to duplicate the field. This is needed because the mappings generated for the attachment processor add a nested attachment.content field, which cannot be used as a semantic text field.

After a successful import, a link to Search's Playground app is shown. Navigating there lets the user instantly query the newly uploaded file.

2024-07-24.20-21-45.2024-07-24.20_22_53.mp4
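
As a rough illustration of the semantic text change described above, the sketch below builds mappings and an ingest pipeline in which the extracted text is duplicated into a top-level `semantic_text` field. The function name `addSemanticTextField`, the field names `data` and `content`, and the inference endpoint id `my-elser-endpoint` are hypothetical. Using a `set` processor with `copy_from` is only one assumed way to duplicate the field; the output this PR actually generates may differ.

```typescript
// Hypothetical sketch: names and values are illustrative assumptions,
// not the code added in this PR.
interface FieldMapping {
  type?: string;
  inference_id?: string;
  properties?: Record<string, FieldMapping>;
}

interface Mappings {
  properties: Record<string, FieldMapping>;
}

type Processor = Record<string, Record<string, unknown>>;

interface IngestPipeline {
  description: string;
  processors: Processor[];
}

// Returns copies of the mappings and pipeline with a top-level semantic_text
// field that receives a duplicate of the nested source field.
function addSemanticTextField(
  mappings: Mappings,
  pipeline: IngestPipeline,
  sourceField: string, // e.g. 'attachment.content'
  targetField: string, // e.g. 'content'
  inferenceId: string // assumed id of an existing inference endpoint
): { mappings: Mappings; pipeline: IngestPipeline } {
  return {
    mappings: {
      properties: {
        ...mappings.properties,
        [targetField]: { type: 'semantic_text', inference_id: inferenceId },
      },
    },
    pipeline: {
      ...pipeline,
      processors: [
        ...pipeline.processors,
        // Copy the extracted text into the top-level field.
        { set: { field: targetField, copy_from: sourceField } },
      ],
    },
  };
}

const result = addSemanticTextField(
  { properties: { attachment: { properties: { content: { type: 'text' } } } } },
  {
    description: 'Ingest pipeline for an uploaded PDF',
    processors: [{ attachment: { field: 'data', remove_binary: true } }],
  },
  'attachment.content',
  'content',
  'my-elser-endpoint'
);

console.log(JSON.stringify(result, null, 2));
```

The inputs are left unmodified, so the same base mappings can be reused when the user chooses not to add the semantic text field.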

@jgowdyelastic jgowdyelastic added the ci:cloud-deploy Create or update a Cloud deployment label Jun 26, 2024
@jgowdyelastic
Member Author

/ci

@jgowdyelastic jgowdyelastic added the ci:cloud-redeploy Always create a new Cloud deployment label Jun 26, 2024

@serenachou

> Neither of these plugins are touched in this PR and so I'd suggest any work to add the dev console should be done in a follow up PR.

Thank you for the guidance, this makes perfect sense!

@serenachou

> I've added an info callout about the auto semantic text field which is shown when importing these new file types.

Really excited to see this call out, this is fantastic!

Here is some text I'd propose, since I believe Istvan is on vacation right now, to keep it in sync with what we put on the Index Mappings page:
image

semantic_text field type now available!
Using the semantic_text field type enables better semantic search on the uploaded file content. In the "Advanced" tab, add an additional field and choose 'semantic_text' to get started.

@jeffvestal

> Really excited to see this call out, this is fantastic!

+1 having this call out and the ability to add the semantic_text field easily is awesome!

Contributor

@nreese nreese left a comment

kibana-gis changes LGTM
code review only

    tags: ['access:fileUpload:analyzeFile'],
    body: {
      accepts: ['application/json'],
      maxBytes: MAX_FILE_SIZE_BYTES,
Contributor

Should this be MAX_TIKA_FILE_SIZE_BYTES?

Member Author

@jgowdyelastic jgowdyelastic Aug 20, 2024

Good spot, thanks. Updated in e756a3f
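
To illustrate the point raised in this review thread, here is a minimal sketch of choosing a byte limit by upload type. The constant values, the extension set, and the helpers `maxBytesFor` and `isWithinLimit` are hypothetical; the real constants live in the fileUpload plugin and may have different values. The extension list mirrors the formats named in the PR description, and treating all of them as going through the attachment processor is an assumption.

```typescript
// Hypothetical values for illustration only; not the plugin's real limits.
const MAX_FILE_SIZE_BYTES = 100 * 1024 * 1024;
const MAX_TIKA_FILE_SIZE_BYTES = 50 * 1024 * 1024;

// Formats named in the PR description (assumed to use the Tika-based limit).
const TIKA_EXTENSIONS = new Set<string>([
  'pdf', 'txt', 'rtf', 'doc', 'docx', 'xls', 'xlsx', 'ppt', 'pptx', 'odt', 'ods', 'odp',
]);

// Lower-cased extension without the dot, or '' when there is none.
function extensionOf(fileName: string): string {
  const dot = fileName.lastIndexOf('.');
  return dot === -1 ? '' : fileName.slice(dot + 1).toLowerCase();
}

// Picks the limit that applies to the given file.
function maxBytesFor(fileName: string): number {
  return TIKA_EXTENSIONS.has(extensionOf(fileName))
    ? MAX_TIKA_FILE_SIZE_BYTES
    : MAX_FILE_SIZE_BYTES;
}

function isWithinLimit(fileName: string, sizeBytes: number): boolean {
  return sizeBytes <= maxBytesFor(fileName);
}

const sixtyMb = 60 * 1024 * 1024;
console.log(isWithinLimit('report.pdf', sixtyMb));
console.log(isWithinLimit('flights.json', sixtyMb));
```

With these assumed limits, the same 60 MB payload is rejected for a PDF but accepted for a JSON file, which is why the route for the new file types needs the matching constant.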

@jgowdyelastic
Member Author

jgowdyelastic commented Aug 20, 2024

@serenachou I've gone with a combination of your suggestion and my original text, with a link out to the semantic_text documentation.

image

@kibana-ci
Collaborator

kibana-ci commented Aug 21, 2024

💛 Build succeeded, but was flaky

Failed CI Steps

Metrics [docs]

Module Count

Fewer modules lead to a faster build time

| id | before | after | diff |
| --- | --- | --- | --- |
| dataVisualizer | 801 | 811 | +10 |
| fileUpload | 217 | 218 | +1 |
| total | | | +11 |

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run `node scripts/build_api_docs --plugin [yourplugin] --stats comments` for more detailed information.

| id | before | after | diff |
| --- | --- | --- | --- |
| fileUpload | 84 | 88 | +4 |

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

| id | before | after | diff |
| --- | --- | --- | --- |
| dataVisualizer | 799.8KB | 811.0KB | +11.2KB |
| fileUpload | 951.7KB | 951.7KB | +2.0B |
| total | | | +11.2KB |

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

| id | before | after | diff |
| --- | --- | --- | --- |
| dataVisualizer | 24.4KB | 24.4KB | +12.0B |
| fileUpload | 13.8KB | 14.8KB | +1.1KB |
| total | | | +1.1KB |

Unknown metric groups

API count

| id | before | after | diff |
| --- | --- | --- | --- |
| fileUpload | 84 | 88 | +4 |

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @jgowdyelastic

Contributor

@alvarezmelissa87 alvarezmelissa87 left a comment

Latest changes LGTM ⚡

Contributor

@peteharverson peteharverson left a comment

Latest change LGTM. Tested with a selection of pdf, txt and docx files.

@jgowdyelastic jgowdyelastic merged commit 3177b03 into elastic:main Aug 22, 2024
25 checks passed
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label Aug 22, 2024
jgowdyelastic added a commit that referenced this pull request Sep 25, 2024
With some datasets the find structure api will not generate an ingest
pipeline. A recent
[change](#186956) to how we catch
and display errors during file upload means an upload with no pipeline
now produces an error which aborts the upload.
Previously all pipeline creation errors were ignored and hidden from the
user.

This PR changes the file upload endpoint to allow it to receive
no ingest pipeline and also changes the UI to not display the pipeline
creation step during upload.

This file can be used to test the fix.
https://github.com/elastic/eland/blob/main/tests/flights.json.gz
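
The behavior described in this commit message can be sketched roughly as follows. The `ImportRequest` shape, the step names, and `planImportSteps` are hypothetical; they only illustrate skipping the pipeline creation step when the find structure API returned no ingest pipeline, and are not the code from the fix.

```typescript
// Hypothetical sketch: types and step names are illustrative assumptions.
interface IngestPipeline {
  description?: string;
  processors: Array<Record<string, unknown>>;
}

interface ImportRequest {
  index: string;
  // Undefined when no ingest pipeline was generated for the dataset.
  ingestPipeline?: IngestPipeline;
}

type ImportStep = 'createIndex' | 'createPipeline' | 'uploadData';

// Lists the steps to run (and to show in the UI) for a given request.
function planImportSteps(request: ImportRequest): ImportStep[] {
  const steps: ImportStep[] = ['createIndex'];
  if (request.ingestPipeline !== undefined) {
    // Only create, and display, the pipeline step when there is a pipeline.
    steps.push('createPipeline');
  }
  steps.push('uploadData');
  return steps;
}

console.log(planImportSteps({ index: 'flights' }));
console.log(planImportSteps({ index: 'logs', ingestPipeline: { processors: [] } }));
```

Treating a missing pipeline as a valid input, rather than as a creation error, lets the upload proceed without hiding genuine pipeline errors.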
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Sep 25, 2024
With some datasets the find structure api will not generate an ingest
pipeline. A recent
[change](elastic#186956) to how we catch
and display errors during file upload means an upload with no pipeline
now produces an error which aborts the upload.
Previously all pipeline creation errors were ignored and hidden from the
user.

This PR changes the file upload endpoint to allow it to receive
no ingest pipeline and also changes the UI to not display the pipeline
creation step during upload.

This file can be used to test the fix.
https://github.com/elastic/eland/blob/main/tests/flights.json.gz

(cherry picked from commit ee1a147)
kibanamachine added a commit that referenced this pull request Sep 25, 2024
# Backport

This will backport the following commits from `main` to `8.x`:
- [[ML] Fix file upload with no ingest pipeline
(#193744)](#193744)

<!--- Backport version: 9.4.3 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)


Co-authored-by: James Gowdy <[email protected]>
Labels
- backport:skip (This commit does not require backporting)
- ci:build-cloud-image
- ci:project-deploy-elasticsearch (Create an Elasticsearch Serverless project)
- Feature:File and Index Data Viz (ML file and index data visualizer)
- Feature:File Upload
- :ml
- release_note:feature (Makes this part of the condensed release notes)
- v8.16.0