feature: implement CreateNlpBatchesFromIndexTask and BatchNlpTask #1597

Open: wants to merge 12 commits into main from feature/batch-nlp-task

@ClemDoum (Contributor) commented Oct 23, 2024

TODO

PR description

Implement batch processing for NER. This change is made in the context of #1452, as batch processing is necessary for spaCy.

Notes

In this PR we chose not to implement PipelineTask. Instead, we rely entirely on the task bus to distribute the batches across workers.
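
To illustrate the idea of letting a shared queue spread batches across workers, here is a minimal sketch. It is not the datashare task bus: `SharedQueueSketch`, `processAll` and the in-memory `LinkedBlockingQueue` are hypothetical stand-ins, and the "NER work" is a placeholder that only records document ids.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a task bus: one shared queue, several competing workers.
final class SharedQueueSketch {

    // Starts nWorkers threads that pull batches from a single queue until it is empty,
    // then returns the ids of all documents that were processed.
    static Set<String> processAll(List<List<String>> batches, int nWorkers) {
        BlockingQueue<List<String>> queue = new LinkedBlockingQueue<>(batches);
        Set<String> processed = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(nWorkers);
        for (int i = 0; i < nWorkers; i++) {
            pool.submit(() -> {
                List<String> batch;
                // poll() returns null once the queue is drained, which ends the worker.
                while ((batch = queue.poll()) != null) {
                    processed.addAll(batch); // placeholder for the real NER work
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return processed;
    }
}
```

Each batch is taken by exactly one worker, so no coordinating pipeline object is needed in this sketch; which worker handles which batch is left to the queue.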

Changes

datashare-api

Added

  • added the Searcher sort(String field, SortOrder order) method to Indexer.Searcher to sort search results, so that documents can be returned grouped by language (to avoid model reloads)
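
Why sorting by language helps can be shown with a small in-memory sketch. It does not use the actual Indexer.Searcher API: `LanguageSortSketch`, `Doc`, `sortByLanguage` and `countModelLoads` are hypothetical names, and a "model load" is assumed to happen each time the language changes between consecutive documents.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical in-memory stand-in; not the datashare Indexer.Searcher API.
final class LanguageSortSketch {

    static final class Doc {
        final String id;
        final String language;

        Doc(String id, String language) {
            this.id = id;
            this.language = language;
        }
    }

    // Returns a copy of docs sorted by language. List.sort is stable,
    // so the original order is preserved within each language.
    static List<Doc> sortByLanguage(List<Doc> docs) {
        List<Doc> sorted = new ArrayList<>(docs);
        sorted.sort(Comparator.comparing((Doc d) -> d.language));
        return sorted;
    }

    // Counts how many times a consumer reading docs in order would switch language,
    // assuming one model load per switch (including the first document).
    static int countModelLoads(List<Doc> docs) {
        int loads = 0;
        String current = null;
        for (Doc d : docs) {
            if (!d.language.equals(current)) {
                loads++;
                current = d.language;
            }
        }
        return loads;
    }
}
```

With documents interleaved as en, fr, en, fr, the unsorted order switches language on every document, while the sorted order switches once per language.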

datashare-app

Added

  • added the CreateNlpBatchesFromIndexTask task, which scans the index for documents sorted by language. Documents are then added in batches to the BatchNlpTask queue, where workers process them to perform NER batch by batch
  • added the BatchNlpTask, which consumes documents in batches, fetches them from the index and performs the NLP task (NER only)
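
The batching step described above could look roughly like the following sketch. It is not the code from this PR: `NlpBatchSketch`, `Doc`, `Batch` and `createBatches` are hypothetical, and the rule that a batch never mixes languages is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of grouping language-sorted documents into batches.
final class NlpBatchSketch {

    static final class Doc {
        final String id;
        final String language;

        Doc(String id, String language) {
            this.id = id;
            this.language = language;
        }
    }

    static final class Batch {
        final String language;
        final List<String> docIds = new ArrayList<>();

        Batch(String language) {
            this.language = language;
        }
    }

    // Walks documents already sorted by language and starts a new batch whenever
    // the language changes or the current batch has reached batchSize.
    static List<Batch> createBatches(List<Doc> sortedDocs, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<Batch> batches = new ArrayList<>();
        Batch current = null;
        for (Doc doc : sortedDocs) {
            boolean languageChanged = current != null && !current.language.equals(doc.language);
            boolean full = current != null && current.docIds.size() >= batchSize;
            if (current == null || languageChanged || full) {
                current = new Batch(doc.language);
                batches.add(current);
            }
            current.docIds.add(doc.id);
        }
        return batches;
    }
}
```

Only document ids are placed in a batch here, which matches the description of the consumer fetching the documents from the index itself; the actual payload used in the PR may differ.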

@ClemDoum force-pushed the feature/batch-nlp-task branch 3 times, most recently from 5520b8c to 06941ae, November 7, 2024 09:29
@ClemDoum mentioned this pull request Nov 7, 2024
@ClemDoum marked this pull request as ready for review November 7, 2024 09:30
@ClemDoum force-pushed the feature/batch-nlp-task branch 2 times, most recently from 70042eb to 820b524, November 7, 2024 11:07
@ClemDoum requested a review from a team November 7, 2024 11:15
@ClemDoum changed the title from "feature: implement BatchEnqueueFromIndexTask.java" to "feature: implement CreateNlpBatchesFromIndexTask and BatchNlpTask" Nov 7, 2024
@ClemDoum removed the request for review from a team November 7, 2024 16:34
@ClemDoum marked this pull request as draft November 7, 2024 16:35
@ClemDoum force-pushed the feature/batch-nlp-task branch 2 times, most recently from 7c365c8 to ec856be, November 12, 2024 12:19
@ClemDoum self-assigned this Nov 19, 2024
@ClemDoum marked this pull request as ready for review November 20, 2024 14:18