gather chunks before uploading docs to index for correct ID count #1159
Motivation and Context
Description
When there are multiple data paths (`data_path`) specified in config.json, the `upload_documents_to_index` function originally executes once per data path, with IDs starting from 0 each time. As a result, chunks from later data paths always collide with the IDs already assigned to earlier ones. This issue was found when some data went missing from the search index that was created. The fix is simple: don't upload until the chunks from all data paths are ready and gathered into one array, so IDs are assigned exactly once across the whole set.
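A minimal sketch of the change, for illustration only: `upload_documents_to_index` is the function named above, while the `chunk_directory` helper and the config shape are assumptions standing in for the repo's actual chunking code.

```python
def process_data_paths(config: dict) -> None:
    all_chunks = []

    # Gather chunks from every data path first, instead of uploading
    # inside the loop as before.
    for data_path in config["data_paths"]:
        # chunk_directory is a hypothetical stand-in for the existing
        # chunking step that produces documents for the index.
        all_chunks.extend(chunk_directory(data_path))

    # IDs are now assigned in one pass over 0..len(all_chunks)-1,
    # so chunks from different data paths can no longer collide.
    upload_documents_to_index(all_chunks, config["index_name"])
```

With a single upload call at the end, the ID counter covers the combined set of chunks, which is why the missing-document symptom disappears.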
Contribution Checklist
For frontend changes, I have pulled the latest code from main, built the frontend, and committed all static files: does not apply