Commit
feat: implement text chunking processor with fixed token length and delimiter algorithm (opensearch-project#607)
* implement chunking processor and fixed token length Signed-off-by: yuye-aws <[email protected]> * initialize node client for document chunking processor Signed-off-by: yuye-aws <[email protected]> * initialize document chunking processor with analysis registry Signed-off-by: yuye-aws <[email protected]> * chunker factory create with analysis registry Signed-off-by: yuye-aws <[email protected]> * implement tokenizer in fixed token length algorithm with analysis registry Signed-off-by: yuye-aws <[email protected]> * add max token count parsing logic Signed-off-by: yuye-aws <[email protected]> * bug fix for non-existing index Signed-off-by: yuye-aws <[email protected]> * change error log Signed-off-by: yuye-aws <[email protected]> * implement evenly chunk Signed-off-by: yuye-aws <[email protected]> * unit tests for chunker factory Signed-off-by: yuye-aws <[email protected]> * unit tests for chunker factory Signed-off-by: yuye-aws <[email protected]> * add error message for chunker factory tests Signed-off-by: yuye-aws <[email protected]> * resolve comments Signed-off-by: yuye-aws <[email protected]> * Revert "implement evenly chunk" This reverts commit 93dd2f4.
Signed-off-by: yuye-aws <[email protected]> * add default value logic back Signed-off-by: yuye-aws <[email protected]> * implement unit test for fixed token length chunker Signed-off-by: yuye-aws <[email protected]> * add test cases in unit test for fixed token length chunker Signed-off-by: yuye-aws <[email protected]> * support map type as an input Signed-off-by: yuye-aws <[email protected]> * support map type as an input Signed-off-by: yuye-aws <[email protected]> * bug fix for map type Signed-off-by: yuye-aws <[email protected]> * bug fix for map type Signed-off-by: yuye-aws <[email protected]> * bug fix for map type in document chunking processor Signed-off-by: yuye-aws <[email protected]> * remove system out println Signed-off-by: yuye-aws <[email protected]> * add delimiter chunker Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * add UT for delimiter chunker Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * add delimiter chunker processor Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * add more UTs Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * add more UTs Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * basic unit tests for document chunking processor Signed-off-by: yuye-aws <[email protected]> * fix tests for getProcessors in neural search Signed-off-by: yuye-aws <[email protected]> * add unit tests with string, map and nested map type for document chunking processor Signed-off-by: yuye-aws <[email protected]> * add unit tests for parameter valdiation in document chunking processor Signed-off-by: yuye-aws <[email protected]> * add back deleted xml file Signed-off-by: yuye-aws <[email protected]> * restore xml file Signed-off-by: yuye-aws <[email protected]> * integration tests for document chunking processor Signed-off-by: yuye-aws <[email protected]> * add 
back Run_Neural_Search.xml Signed-off-by: yuye-aws <[email protected]> * restore Run_Neural_Search.xml Signed-off-by: yuye-aws <[email protected]> * add changelog Signed-off-by: yuye-aws <[email protected]> * update integration test for cascade processor Signed-off-by: yuye-aws <[email protected]> * add max chunk limit Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * remove useless and apply spotless Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * update error message Signed-off-by: yuye-aws <[email protected]> * change field UT Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * remove useless and apply spotless Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * change logic of max chunk number Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * add max chunk limit into fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * Support list<list<string>> type in embedding and extract validation logic to common class Signed-off-by: zane-neo <[email protected]> Signed-off-by: yuye-aws <[email protected]> * fix unit tests for inference processor Signed-off-by: yuye-aws <[email protected]> * implement unit tests for unit tests with max_chunk_limit in fixed token length Signed-off-by: yuye-aws <[email protected]> * constructor for inference processor Signed-off-by: yuye-aws <[email protected]> * use inference processor Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * draft code for extending inference processor with document chunking processor Signed-off-by: yuye-aws <[email protected]> * api refactor for document chunking processor Signed-off-by: yuye-aws <[email protected]> * remove nested list key for chunking processor Signed-off-by: yuye-aws <[email protected]> * remove unused function Signed-off-by: yuye-aws <[email 
protected]> * remove processor validator Signed-off-by: yuye-aws <[email protected]> * remove processor validator Signed-off-by: yuye-aws <[email protected]> * Revert InferenceProcessor.java Signed-off-by: Yuye Zhu <[email protected]> Signed-off-by: yuye-aws <[email protected]> * revert changes in text embedding and sparse encoding processor Signed-off-by: yuye-aws <[email protected]> * implement chunk with map in document chunking processor Signed-off-by: yuye-aws <[email protected]> * add default delimiter value Signed-off-by: Lu <[email protected]> Signed-off-by: yuye-aws <[email protected]> * implement max chunk logic in document chunking processor Signed-off-by: yuye-aws <[email protected]> * add initial value for max chunk limit in document chunking processor Signed-off-by: yuye-aws <[email protected]> * bug fix in chunking processor: allow 0 max_chunk_limit Signed-off-by: yuye-aws <[email protected]> * implement overlap rate with big decimal Signed-off-by: yuye-aws <[email protected]> * update max chunk limit in delimiter Signed-off-by: yuye-aws <[email protected]> * update parameter setting for fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * update max chunk limit implementation in chunking processor Signed-off-by: yuye-aws <[email protected]> * fix unit tests for fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * spotless apply for document chunking processor Signed-off-by: yuye-aws <[email protected]> * initialize current chunk count Signed-off-by: yuye-aws <[email protected]> * parameter validation for max chunk limit Signed-off-by: yuye-aws <[email protected]> * fix integration tests Signed-off-by: yuye-aws <[email protected]> * fix current UT Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * change delimiter UT Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * remove delimiter useless code Signed-off-by: xinyual <[email 
protected]> Signed-off-by: yuye-aws <[email protected]> * add more UT Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * add UT for list inside map Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * add UT for list inside map Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * update unit tests for chunking processor Signed-off-by: yuye-aws <[email protected]> * add more unit tests for chunking processor Signed-off-by: yuye-aws <[email protected]> * resolve code review comments Signed-off-by: yuye-aws <[email protected]> * add java doc Signed-off-by: yuye-aws <[email protected]> * update java doc Signed-off-by: yuye-aws <[email protected]> * update java doc Signed-off-by: yuye-aws <[email protected]> * fix import order Signed-off-by: yuye-aws <[email protected]> * update java doc Signed-off-by: yuye-aws <[email protected]> * fix java doc error Signed-off-by: yuye-aws <[email protected]> * fix update ut for fixed token length chunker Signed-off-by: yuye-aws <[email protected]> * resolve code review comments Signed-off-by: yuye-aws <[email protected]> * resolve code review comments Signed-off-by: yuye-aws <[email protected]> * resolve code review comments Signed-off-by: yuye-aws <[email protected]> * resolve code review comments Signed-off-by: yuye-aws <[email protected]> * implement chunk count wrapper for max chunk limit Signed-off-by: yuye-aws <[email protected]> * rename variable end to nextDelimiterPosition Signed-off-by: yuye-aws <[email protected]> * adjust method place Signed-off-by: yuye-aws <[email protected]> * update java doc for fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * reanme interface name and fixed token length algorithm name Signed-off-by: yuye-aws <[email protected]> * update fixed token length algorithm configuration for integration tests Signed-off-by: yuye-aws <[email protected]> * make delimiter member 
variables static Signed-off-by: yuye-aws <[email protected]> * remove redundant set field value in execute method Signed-off-by: yuye-aws <[email protected]> * resolve code review comments Signed-off-by: yuye-aws <[email protected]> * add integration tests with more tokenizers Signed-off-by: yuye-aws <[email protected]> * bug fix: unit test failure due to invalid tokenizer Signed-off-by: yuye-aws <[email protected]> * bug fix: token concatenation in fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * update chunker interface Signed-off-by: yuye-aws <[email protected]> * track chunkCount within function Signed-off-by: yuye-aws <[email protected]> * bug fix: allow white space as the delimiter Signed-off-by: yuye-aws <[email protected]> * fix fixed length chunker Signed-off-by: xinyual <[email protected]> * fix delimiter chunker Signed-off-by: xinyual <[email protected]> * fix chunker factory Signed-off-by: xinyual <[email protected]> * fix UTs Signed-off-by: xinyual <[email protected]> * fix UT and chunker factory Signed-off-by: xinyual <[email protected]> * move analysis_registry to non-runtime parameters Signed-off-by: xinyual <[email protected]> * fix Uts Signed-off-by: xinyual <[email protected]> * avoid java doc change Signed-off-by: xinyual <[email protected]> * move validate to commonUtlis Signed-off-by: xinyual <[email protected]> * remove useless function Signed-off-by: xinyual <[email protected]> * change java doc Signed-off-by: xinyual <[email protected]> * fix Document process ut Signed-off-by: xinyual <[email protected]> * fixed token length: re-implement with start and end offset Signed-off-by: yuye-aws <[email protected]> * update exception message Signed-off-by: yuye-aws <[email protected]> * fix document chunking processor IT Signed-off-by: yuye-aws <[email protected]> * bug fix: adjust start, end content position in fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * update changelog for 2.x release 
Signed-off-by: yuye-aws <[email protected]> * rename processor Signed-off-by: yuye-aws <[email protected]> * update default delimiter to be \n\n Signed-off-by: yuye-aws <[email protected]> * remove change log in 3.0 unreleased Signed-off-by: yuye-aws <[email protected]> * fix IT failure due to chunking processor rename Signed-off-by: yuye-aws <[email protected]> * update javadoc for text chunking processor factory Signed-off-by: yuye-aws <[email protected]> * adjust functions in chunker interface Signed-off-by: yuye-aws <[email protected]> * move algorithm name definition to concrete chunker class Signed-off-by: yuye-aws <[email protected]> * update string formatted message for text chunking processor Signed-off-by: yuye-aws <[email protected]> * update string formatted message for chunker factory Signed-off-by: yuye-aws <[email protected]> * update string formatted message for chunker parameter validator Signed-off-by: yuye-aws <[email protected]> * update java doc for delimiter algorithm Signed-off-by: yuye-aws <[email protected]> * support range double in chunker parameter validator Signed-off-by: yuye-aws <[email protected]> * update string formatted message for fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * update sneaky throw with text chunking processor it Signed-off-by: yuye-aws <[email protected]> * add word tokenizer restriction for fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * update error message for multiple algorithms in text chunking processor Signed-off-by: yuye-aws <[email protected]> * add comment in text chunking processor Signed-off-by: yuye-aws <[email protected]> * validate max chunk limit with util parameter class Signed-off-by: yuye-aws <[email protected]> * update comments Signed-off-by: yuye-aws <[email protected]> * update comments Signed-off-by: yuye-aws <[email protected]> * update java doc Signed-off-by: yuye-aws <[email protected]> * update java doc Signed-off-by: yuye-aws 
<[email protected]> * make parameter final Signed-off-by: yuye-aws <[email protected]> * implement a map from chunker name to constuctor function in chunker factory Signed-off-by: yuye-aws <[email protected]> * bug fix in chunker factory Signed-off-by: yuye-aws <[email protected]> * remove get all chunkers in chunker factory Signed-off-by: yuye-aws <[email protected]> * remove type check for parameter check for max token count Signed-off-by: yuye-aws <[email protected]> * remove type check for parameter check for analysis registry Signed-off-by: yuye-aws <[email protected]> * implement parser and validator Signed-off-by: yuye-aws <[email protected]> * update comment Signed-off-by: yuye-aws <[email protected]> * provide fixed token length as the default algorithm Signed-off-by: yuye-aws <[email protected]> * adjust exception message Signed-off-by: yuye-aws <[email protected]> * adjust exception message Signed-off-by: yuye-aws <[email protected]> * use object nonnull and require nonnull Signed-off-by: yuye-aws <[email protected]> * apply final to ingest document and chunk count Signed-off-by: yuye-aws <[email protected]> * merge parameter validator into the parser Signed-off-by: yuye-aws <[email protected]> * assign positive default value for max chunk limit Signed-off-by: yuye-aws <[email protected]> * validate supported chunker algorithm in text chunking processor Signed-off-by: yuye-aws <[email protected]> * update parameter setting of max chunk limit Signed-off-by: yuye-aws <[email protected]> * add unit test with non list of string Signed-off-by: yuye-aws <[email protected]> * add unit test with null input Signed-off-by: yuye-aws <[email protected]> * add unit test for tokenization excpetion in fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * tune method name in text chunking processor unit test Signed-off-by: yuye-aws <[email protected]> * tune method name in delimiter algorithm unit test Signed-off-by: yuye-aws <[email protected]> * 
add unit test for overlap rate too small in fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * tune method modifier for all classes Signed-off-by: yuye-aws <[email protected]> * tune code Signed-off-by: yuye-aws <[email protected]> * tune code Signed-off-by: yuye-aws <[email protected]> * tune exception type in parameter parser Signed-off-by: yuye-aws <[email protected]> * tune comment Signed-off-by: yuye-aws <[email protected]> * tune comment Signed-off-by: yuye-aws <[email protected]> * include max chunk limit in both algorithms Signed-off-by: yuye-aws <[email protected]> * tune comment Signed-off-by: yuye-aws <[email protected]> * allow 0 for max chunk limit Signed-off-by: yuye-aws <[email protected]> * update runtime max chunk limit in text chunking processor Signed-off-by: yuye-aws <[email protected]> * tune code for chunker Signed-off-by: yuye-aws <[email protected]> * implement test for multiple field max chunk limit exceed Signed-off-by: yuye-aws <[email protected]> * tune methods name in text chunking proceesor unit tests Signed-off-by: yuye-aws <[email protected]> * add unit tests for both algorithms with max chunk limit Signed-off-by: yuye-aws <[email protected]> * optimize code Signed-off-by: yuye-aws <[email protected]> * extract max chunk limit check to util class Signed-off-by: yuye-aws <[email protected]> * resolve code review comments Signed-off-by: yuye-aws <[email protected]> * fix unit tests Signed-off-by: yuye-aws <[email protected]> * bug fix: only update runtime max chunk limit when enabled Signed-off-by: yuye-aws <[email protected]> --------- Signed-off-by: yuye-aws <[email protected]> Signed-off-by: xinyual <[email protected]> Signed-off-by: zane-neo <[email protected]> Signed-off-by: Yuye Zhu <[email protected]> Signed-off-by: Lu <[email protected]> Co-authored-by: xinyual <[email protected]> Co-authored-by: zane-neo <[email protected]> Co-authored-by: Lu <[email protected]>