Knowledge - Ingestion failure when adding lot of files to assistant. #491

Closed
sangee2004 opened this issue Sep 11, 2024 · 9 comments
Labels
bug Something isn't working knowledge

@sangee2004

sangee2004 commented Sep 11, 2024

Electron build - 18eaca7

Steps to reproduce the problem:

  1. Create an assistant and add all demo Knowledge files.
  2. Notice that ingestion of some of the files fails.

The following error message is presented in the UI:
ingestionerror.log

After I quit the assistant and reopened it in edit mode, I saw fewer files, and the ingestion error message was now very different.
Screenshot 2024-09-10 at 5 22 07 PM

@cjellick
Contributor

I don't understand the crux of this issue. Is it that ingestion shouldn't fail? Is it how failures are presented? Is it how retries are handled? Is it that files go missing if they fail?

@sangee2004
Author

sangee2004 commented Sep 11, 2024

The issue is the ingestion failure itself.

As a follow-up, I expected that entering edit mode for an assistant with knowledge files would attempt to re-ingest the files. I wanted to see whether re-ingesting the same set of files would give better results, but I see an entirely different error message in this case.

@sangee2004
Author

This issue is now also seen when ingesting a smaller set of files (which previously took about 1 minute to ingest).
Screenshot 2024-09-11 at 3 26 54 PM

After 7-8 minutes, I see the following error presented to the user:

Screenshot 2024-09-11 at 3 32 42 PM

@sangee2004
Author

Tested with the latest Electron build - d27f9f6f9

With this build, @iwilltry42 was able to identify the file that caused the ingestion failure: a temporary docx file, "~$I_1230_Midas_250_1-24-23_redline.docx".

Once this file was removed, I was able to ingest all other files successfully.

Screenshot 2024-09-13 at 1 37 55 PM

Leaving this issue open to track automatically skipping such files during ingestion.
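
To illustrate what skipping such files could look like, here is a minimal sketch of a pre-ingestion filter. It assumes that Microsoft Office lock files can be recognised by the `~$` prefix seen in the file name above. The helper names (`baseName`, `shouldSkip`) and the extra ignored names are hypothetical and are not the patterns used by the knowledge tool.

```typescript
// Hypothetical filter: decides whether a path should be skipped before ingestion.
// The prefixes and names are illustrative assumptions, not the knowledge tool's
// actual ignore patterns.
const IGNORED_PREFIXES: string[] = ["~$"]; // Microsoft Office temporary lock files
const IGNORED_NAMES: Set<string> = new Set([".DS_Store", "Thumbs.db"]);

// Return the last path segment, accepting both "/" and "\" separators.
function baseName(path: string): string {
  return path.split(/[\\/]/).pop() ?? "";
}

// True when the file name matches one of the ignore rules.
function shouldSkip(path: string): boolean {
  const name = baseName(path);
  if (IGNORED_NAMES.has(name)) {
    return true;
  }
  return IGNORED_PREFIXES.some((prefix) => name.startsWith(prefix));
}

const candidateFiles: string[] = [
  "data/local/~$I_1230_Midas_250_1-24-23_redline.docx",
  "data/local/2023 Sponsorship Contract_SLC.pdf",
];

// Only files that do not match an ignore rule are handed to ingestion.
const toIngest: string[] = candidateFiles.filter((f) => !shouldSkip(f));
console.log(toIngest);
```

Matching on the base name rather than the full path keeps the rule independent of where the dataset directory lives.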

@iwilltry42
Contributor

gptscript-ai/knowledge#124 adds extra ignore file patterns, so such files will be ignored in the future (once it is merged).
It's also worth mentioning that we do not abort the run if the ingestion of a single file fails: we log the error and continue ingesting all other files. Once everything is done, the command returns an error if any error occurred during the run, but all other files are still ingested correctly.
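
The "log, continue, fail at the end" behaviour described in this comment can be sketched as follows. This is only an illustration under stated assumptions: the knowledge tool itself is a separate binary, and `IngestFn`, `ingestAll`, and `fakeIngest` are hypothetical names. The loop is synchronous for brevity, whereas the real ingestion appears to process files concurrently.

```typescript
// Hypothetical per-file ingest function; it throws when a file cannot be ingested.
type IngestFn = (path: string) => void;

interface RunSummary {
  succeeded: string[];
  failed: { path: string; message: string }[];
}

// Ingest every file, recording failures instead of stopping at the first one.
function ingestAll(paths: string[], ingest: IngestFn): RunSummary {
  const summary: RunSummary = { succeeded: [], failed: [] };
  for (const path of paths) {
    try {
      ingest(path);
      summary.succeeded.push(path);
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);
      // Log the failure and keep going so one bad file does not block the rest.
      console.error(`ingestion failed for ${path}: ${message}`);
      summary.failed.push({ path, message });
    }
  }
  return summary;
}

// Stand-in ingest step that rejects one kind of file, for demonstration only.
const fakeIngest: IngestFn = (path) => {
  if (path.startsWith("~$")) {
    throw new Error("unsupported file");
  }
};

const summary = ingestAll(["a.pdf", "~$b.docx", "c.pdf"], fakeIngest);

// The run as a whole is reported as failed if any single file failed,
// even though every other file was ingested.
const exitCode = summary.failed.length > 0 ? 1 : 0;
console.log(summary.succeeded, exitCode);
```

A caller that only looks at the exit code sees the whole run as failed, which matches the error shown in the UI even though most files were ingested.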

@iwilltry42
Contributor

The change mentioned above should land in desktop with #526.

@sangee2004
Author

Tested with the latest version of the knowledge tool - v0.4.14-rc.11

When testing this scenario, we see the following error:

2024-09-19T21:34:50.753Z [server] [ERROR] RangeError [ERR_CHILD_PROCESS_STDIO_MAXBUFFER]: stderr maxBuffer length exceeded
    at Socket.onChildStderr (node:child_process:519:14)
    at Socket.emit (node:events:519:28)
    at addChunk (node:internal/streams/readable:559:12)
    at readableAddChunkPushByteMode (node:internal/streams/readable:510:3)
    at Readable.push (node:internal/streams/readable:390:5)
    at Pipe.onStreamRead (node:internal/stream_base_commons:191:23)
    at Pipe.callbackTrampoline (node:internal/async_hooks:130:17) {
  code: 'ERR_CHILD_PROCESS_STDIO_MAXBUFFER',
  cmd: '/Users/sangeethahariharan/acorn/desktop/bin/knowledge ingest --prune --dataset 985 ./data',
  stdout: '',
  stderr: '2024/09/19 14:34:22 INFO Created dataset id=985\n' +
    '2024/09/19 14:34:22 INFO Pruned files count=0 basePath=./data\n' +
    '2024/09/19 14:34:22 INFO Starting document loader flow=ingestion rootPath=./data filepath="data/local/2023 Acorn Labs, Inc. - Efile Forms.pdf" phase=parse stage=documentloader status=starting\n' +
    '2024/09/19 14:34:22 INFO Starting document loader flow=ingestion rootPath=./data filepath=data/local/20220802102722514.pdf phase=parse stage=documentloader status=starting\n' +
    '2024/09/19 14:34:22 INFO Starting document loader flow=ingestion rootPath=./data filepath="data/local/2023 Acorn Labs, Inc. - Tax Return.pdf" phase=parse stage=documentloader status=starting\n' +
    '2024/09/19 14:34:22 INFO Starting document loader flow=ingestion rootPath=./data filepath="data/local/2022Credit card authorization Form_signed.pdf" phase=parse stage=documentloader status=starting\n' +
    '2024/09/19 14:34:22 INFO Starting document loader flow=ingestion rootPath=./data filepath="data/local/2022 Acorn Labs, LLC - Tax Return.pdf" phase=parse stage=documentloader status=starting\n' +
    '2024/09/19 14:34:22 INFO Starting document loader flow=ingestion rootPath=./data filepath="data/local/2022-08-13 - Horizons - Proposal EOR (Egypt) Acorn IO.pdf" phase=parse stage=documentloader status=starting\n' +
    '2024/09/19 14:34:22 INFO Starting document loader flow=ingestion rootPath=./data filepath="data/local/2022 Acorn Labs, LLC - Payment Voucher.pdf" phase=parse stage=documentloader status=starting\n' +
    '2024/09/19 14:34:22 INFO Starting document loader flow=ingestion rootPath=./data filepath="data/local/20220802102811649 (1).pdf" phase=parse stage=documentloader status=starting\n' +
    '2024/09/19 14:34:22 INFO Starting document loader flow=ingestion rootPath=./data filepath="data/local/2023 Sponsorship Contract_SLC.pdf" phase=parse stage=documentloader status=starting\n' +
    '2024/09/19 14:34:22 INFO Starting document loader flow=ingestion rootPath=./data filepath="data/local/2023 Acorn Labs, Inc. - Payment Vouchers.pdf" phase=parse stage=documentloader status=starting\n' +
    '2024/09/19 14:34:22 INFO Loaded documents flow=ingestion rootPath=./data filepath="data/local/2022 Acorn Labs, LLC - Payment Voucher.pdf" phase=parse stage=documentloader status=completed num_documents=2\n' +
    '2024/09/19 14:34:22 INFO Starting text splitter flow=ingestion rootPath=./data filepath="data/local/2022 Acorn Labs, LLC - Payment Voucher.pdf" phase=parse stage=textsplitter num_documents=2 status=starting\n' +
    '2024/09/19 14:34:22 INFO Loaded documents flow=ingestion rootPath=./data filepath="data/local/2023 Sponsorship Contract_SLC.pdf" phase=parse stage=documentloader status=completed num_documents=2\n' +
    '2024/09/19 14:34:22 INFO Starting text splitter flow=ingestion rootPath=./data filepath="data/local/2023 Sponsorship Contract_SLC.pdf" phase=parse stage=textsplitter num_documents=2 status=starting\n' +
    '2024/09/19 14:34:22 INFO Loaded documents flow=ingestion rootPath=./data filepath="data/local/2022Credit card authorization Form_signed.pdf" phase=parse stage=documentloader status=completed num_documents=1\n' +
    '2024/09/19 14:34:22 INFO Starting text splitter flow=ingestion rootPath=./data filepath="data/local/2022Credit card authorization Form_signed.pdf" phase=parse stage=textsplitter num_documents=1 status=starting\n' +
    '2024/09/19 14:34:22 INFO Loaded documents flow=ingestion rootPath=./data filepath="data/local/2023 Acorn Labs, Inc. - Payment Vouchers.pdf" phase=parse stage=documentloader status=completed num_documents=8\n' +
    '2024/09/19 14:34:22 INFO Starting text splitter flow=ingestion rootPath=./data filepath="data/local/2023 Acorn Labs, Inc. - Payment Vouchers.pdf" phase=parse stage=textsplitter num_documents=8 status=starting\n' +
    '2024/09/19 14:34:22 INFO Loaded documents flow=ingestion rootPath=./data filepath="data/local/2023 Acorn Labs, Inc. - Efile Forms.pdf" phase=parse stage=documentloader status=completed num_documents=3\n' +
    '2024/09/19 14:34:22 INFO Starting text splitter flow=ingestion rootPath=./data filepath="data/local/2023 Acorn Labs, Inc. - Efile Forms.pdf" phase=parse stage=textsplitter num_documents=3 status=starting\n' +
    '2024/09/19 14:34:22 INFO Loaded documents flow=ingestion rootPath=./data filepath="data/local/2022-08-13 - Horizons - Proposal EOR (Egypt) Acorn IO.pdf" phase=parse stage=documentloader status=completed num_documents=4\n' +
    '2024/09/19 14:34:22 INFO Starting text splitter flow=ingestion rootPath=./data filepath="data/local/2022-08-13 - Horizons - Proposal EOR (Egypt) Acorn IO.pdf" phase=parse stage=textsplitter num_documents=4 status=starting\n' +
    '2024/09/19 14:34:22 INFO Split documents flow=ingestion rootPath=./data filepath="data/local/2023 Sponsorship Contract_SLC.pdf" phase=parse stage=textsplitter num_documents=2 status=completed new_num_documents=3\n' +
    '2024/09/19 14:34:22 INFO Starting document transformers flow=ingestion rootPath=./data filepath="data/local/2023 Sponsorship Contract_SLC.pdf" phase=parse stage=transformer num_documents=3 num_transformers=1 status=starting\n' +
    '2024/09/19 14:34:22 INFO Running transformer flow=ingestion rootPath=./data filepath="data/local/2023 Sponsorship Contract_SLC.pdf" transformer=extra_metadata progress=1/1 progress_unit=transformations\n' +
    '2024/09/19 14:34:22 INFO Transformed documents flow=ingestion rootPath=./data filepath="data/local/2023 Sponsorship Contract_SLC.pdf" transformer=extra_metadata progress=1/1 progress_unit=transformations status=completed num_documents=3\n' +
    '2024/09/19 14:34:22 INFO Transformed documents flow=ingestion rootPath=./data filepath="data/local/2023 Sponsorship Contract_SLC.pdf" phase=parse stage=transformer num_documents=3 num_transformers=1 status=completed new_num_documents=3\n' +
    '2024/09/19 14:34:22 INFO Adding documents to collection (generating embeddings) flow=ingestion rootPath=./data filepath="data/local/2023 Sponsorship Contract_SLC.pdf" phase=store num_documents=3 stage=vectorstore vectorstore=chromem-go status=starting\n' +
    '2024/09/19 14:34:22 INFO Creating embedding flow=ingestion rootPath=./data filepath="data/local/2023 Sponsorship Contract_SLC.pdf" phase=store num_documents=3 stage=embedding status=starting\n' +
    '2024/09/19 14:34:22 INFO Creating embedding flow=ingestion rootPath=./data filepath="data/local/2023 Sponsorship Contract_SLC.pdf" phase=store num_documents=3 stage=embedding status=starting\n' +
    '2024/09/19 14:34:22 INFO Creating embedding flow=ingestion rootPath=./data filepath="data/local/2023 Sponsorship Contract_SLC.pdf" phase=store num_documents=3 stage=embedding status=starting\n' +
    '2024/09/19 14:34:22 INFO Loaded documents flow=ingestion rootPath=./data filepath="data/local/20220802102811649 (1).pdf" phase=parse stage=documentloader status=completed num_documents=2\n' +
    '2024/09/19 14:34:22 INFO Starting text splitter flow=ingestion rootPath=./data filepath="data/local/20220802102811649 (1).pdf" phase=parse stage=textsplitter num_documents=2 status=starting\n' +
    'warning: cannot load object (367 0 R) into cache\n' +
    '2024/09/19 14:34:22 INFO Loaded documents flow=ingestion rootPath=./data filepath=data/local/20220802102722514.pdf phase=parse stage=documentloader status=completed num_documents=1\n' +
    '2024/09/19 14:34:22 INFO Starting text splitter flow=ingestion rootPath=./data filepath=data/local/20220802102722514.pdf phase=parse stage=textsplitter num_documents=1 status=starting\n' +
    '2024/09/19 14:34:22 INFO Split documents flow=ingestion rootPath=./data filepath="data/local/2022 Acorn Labs, LLC - Payment Voucher.pdf" phase=parse stage=textsplitter num_documents=2 status=completed new_num_documents=4\n' +
    '2024/09/19 14:34:22 INFO Starting document transformers flow=ingestion rootPath=./data filepath="data/local/2022 Acorn Labs, LLC - Payment Voucher.pdf" phase=parse stage=transformer num_documents=4 num_transformers=1 status=starting\n' +
    '2024/09/19 14:34:22 INFO Running transformer flow=ingestion rootPath=./data filepath="data/local/2022 Acorn Labs, LLC - Payment Voucher.pdf" transformer=extra_metadata progress=1/1 progress_unit=transformations\n' +
    '2024/09/19 14:34:22 INFO Transformed documents flow=ingestion rootPath=./data filepath="data/local/2022 Acorn Labs, LLC - Payment Voucher.pdf" transformer=extra_metadata progress=1/1 progress_unit=transformations status=completed num_documents=4\n' +
    '2024/09/19 14:34:22 INFO Transformed documents flow=ingestion rootPath=./data filepath="data/local/2022 Acorn Labs, LLC - Payment Voucher.pdf" phase=parse stage=transformer num_documents=4 num_transformers=1 status=completed new_num_documents=4\n' +
    '2024/09/19 14:34:22 INFO Adding documents to collection (generating embeddings) flow=ingestion rootPath=./data filepath="data/local/2022 Acorn Labs, LLC - Payment Voucher.pdf" phase=store num_documents=4 stage=vectorstore vectorstore=chromem-go status=starting\n' +
    '2024/09/19 14:34:22 INFO Creating embedding flow=ingestion rootPath=./data filepath="data/local/2022 Acorn Labs, LLC - Payment Voucher.pdf" phase=store num_documents=4 stage=embedding status=starting\n' +
    '2024/09/19 14:34:22 INFO Creating embedding flow=ingestion rootPath=./data filepath="data/local/2022 Acorn Labs, LLC - Payment Voucher.pdf" phase=store num_documents=4 stage=embedding status=starting\n' +
    '2024/09/19 14:34:22 INFO Creating embedding flow=ingestion rootPath=./data filepath="data/local/2022 Acorn Labs, LLC - Payment Voucher.pdf" phase=store num_documents=4 stage=embedding status=starting\n' +
    '2024/09/19 14:34:22 INFO Creating embedding flow=ingestion rootPath=./data filepath="data/local/2022 Acorn Labs, LLC - Payment Voucher.pdf" phase=store num_documents=4 stage=embedding status=starting\n' +
    'warning: ... repeated 16 times...\n' +
    'warning: cannot load object (365 0 R) into cache\n' +
    'warning: ... repeated 8 times...\n' +
    'warning: cannot load object (366 0 R) into cache\n' +
    '2024/09/19 14:34:22 INFO Split documents flow=ingestion rootPath=./data filepath="data/local/2022-08-13 - Horizons - Proposal EOR (Egypt) Acorn IO.pdf" phase=parse stage=textsplitter num_documents=4 status=completed new_num_documents=6\n' +
    '2024/09/19 14:34:22 INFO Starting document transformers flow=ingestion rootPath=./data filepath="data/local/2022-08-13 - Horizons - Proposal EOR (Egypt) A'... 1038576 more characters,
  digest: '1209520691'
}
2024-09-19T21:34:50.755Z [client] [ERROR] Error: An error occurred in the Server Components render. The specific message is omitted in production builds to avoid leaking sensitive details. A digest property is included on this error instance which may provide additional details about the nature of the error.

The UI shows the following error:

Image
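
For context, `ERR_CHILD_PROCESS_STDIO_MAXBUFFER` is raised by Node.js when the output captured from a child process exceeds the `maxBuffer` option of `child_process.exec` or `execFile`. One possible mitigation, which is not necessarily how the desktop app handled it, is to stream the child's stderr and keep only a bounded tail of it. The `TailBuffer` class below is a hypothetical sketch of that idea.

```typescript
// Hypothetical bounded buffer: keeps only the last `maxChars` characters
// of the text pushed into it, and counts what was discarded.
class TailBuffer {
  private text: string = "";
  private dropped: number = 0;
  private readonly maxChars: number;

  constructor(maxChars: number) {
    this.maxChars = maxChars;
  }

  // Append a chunk and trim from the front if the limit is exceeded.
  push(chunk: string): void {
    this.text += chunk;
    const overflow = this.text.length - this.maxChars;
    if (overflow > 0) {
      this.text = this.text.slice(overflow);
      this.dropped += overflow;
    }
  }

  // Number of characters discarded so far.
  get droppedChars(): number {
    return this.dropped;
  }

  toString(): string {
    return this.text;
  }
}

// Demonstration with a deliberately tiny limit.
const tail = new TailBuffer(16);
tail.push("INFO start\n");
tail.push("ERROR boom\n");
console.log(JSON.stringify(tail.toString()), tail.droppedChars);
```

With `child_process.spawn`, each `data` event on `child.stderr` could be passed to `tail.push(chunk.toString())`, so verbose INFO logging from a large ingestion run would no longer overflow a fixed capture buffer while the most recent lines remain available for error reporting.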

@sangee2004
Author

Tested with a build from 01bb57f3c6fc4.

This issue is no longer seen.

I was able to ingest 500+ local files successfully along with ~$I_1230_Midas_250_1-24-23_redline.docx. No error was reported in the UI during ingestion.

Image
