Generalize chunk data component (#757)
Fixes [#14](ml6team/fondant-usecase-RAG#14)

Adds the different chunking strategies available in LangChain.

Affected pipelines must be updated after a new release.
PhilippeMoussalli authored Jan 8, 2024
1 parent 665805d commit b7962a4
Showing 5 changed files with 152 additions and 28 deletions.
17 changes: 13 additions & 4 deletions components/chunk_text/README.md
@@ -7,6 +7,13 @@ Component that chunks text into smaller segments
This component takes a body of text and chunks it into smaller segments. The id of the returned dataset
consists of the id of the original document followed by the chunk index.

Different chunking strategies can be used. The default is to use the "recursive" strategy which
recursively splits the text into smaller chunks until the chunk size is reached.

More information about the different chunking strategies can be found here:
- https://python.langchain.com/docs/modules/data_connection/document_transformers/
- https://www.pinecone.io/learn/chunking-strategies/
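
As a rough sketch of what the default "recursive" strategy does under the hood (illustrative only; it assumes `langchain` is installed, and the text and sizes are made up):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
docs = splitter.create_documents(
    [
        "Fondant pipelines read documents, chunk them, embed the chunks, "
        "and index them into a vector store for retrieval-augmented generation."
    ]
)
for doc in docs:
    # Each chunk stays under the configured size, splitting on paragraphs,
    # lines, and words before falling back to characters.
    print(len(doc.page_content), repr(doc.page_content))
```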


<a id="chunk_text#inputs_outputs"></a>
## Inputs / outputs
@@ -36,8 +43,9 @@ The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| chunk_strategy | str | The strategy to use for chunking the text. One of ['RecursiveCharacterTextSplitter', 'HTMLHeaderTextSplitter', 'CharacterTextSplitter', 'Language', 'MarkdownHeaderTextSplitter', 'MarkdownTextSplitter', 'SentenceTransformersTokenTextSplitter', 'LatexTextSplitter', 'SpacyTextSplitter', 'TokenTextSplitter', 'NLTKTextSplitter', 'PythonCodeTextSplitter', 'character', 'NLTK', 'SpaCy'] | RecursiveCharacterTextSplitter |
| chunk_kwargs | dict | The arguments to pass to the chunking strategy | / |
| language_text_splitter | str | The programming language to use for splitting text into sentences if "language" is selected as the splitter. Check https://python.langchain.com/docs/modules/data_connection/document_transformers/code_splitter for more information on supported languages. | / |

<a id="chunk_text#usage"></a>
## Usage
@@ -56,8 +64,9 @@ dataset = dataset.apply(
"chunk_text",
arguments={
# Add arguments
# "chunk_size": 0,
# "chunk_overlap": 0,
# "chunk_strategy": "RecursiveCharacterTextSplitter",
# "chunk_kwargs": {},
# "language_text_splitter": ,
},
)
```
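
A filled-in call might look as follows (a sketch; the `chunk_kwargs` values are illustrative, not defaults enforced by the component):

```python
dataset = dataset.apply(
    "chunk_text",
    arguments={
        "chunk_strategy": "RecursiveCharacterTextSplitter",
        "chunk_kwargs": {"chunk_size": 512, "chunk_overlap": 32},
    },
)
```
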
31 changes: 25 additions & 6 deletions components/chunk_text/fondant_component.yaml
@@ -4,7 +4,13 @@ description: |
This component takes a body of text and chunks it into smaller segments. The id of the returned dataset
consists of the id of the original document followed by the chunk index.
Different chunking strategies can be used. The default is to use the "recursive" strategy which
recursively splits the text into smaller chunks until the chunk size is reached.
More information about the different chunking strategies can be found here:
- https://python.langchain.com/docs/modules/data_connection/document_transformers/
- https://www.pinecone.io/learn/chunking-strategies/
image: fndnt/chunk_text:dev
tags:
- Text processing
@@ -22,9 +28,22 @@ produces:
previous_index: original_document_id

args:
  chunk_strategy:
    description: The strategy to use for chunking the text. One of
      ['RecursiveCharacterTextSplitter', 'HTMLHeaderTextSplitter', 'CharacterTextSplitter',
      'Language', 'MarkdownHeaderTextSplitter', 'MarkdownTextSplitter',
      'SentenceTransformersTokenTextSplitter', 'LatexTextSplitter', 'SpacyTextSplitter',
      'TokenTextSplitter', 'NLTKTextSplitter', 'PythonCodeTextSplitter', 'character',
      'NLTK', 'SpaCy']
    type: str
    default: RecursiveCharacterTextSplitter
  chunk_kwargs:
    description: The arguments to pass to the chunking strategy
    type: dict
    default: {}
  language_text_splitter:
    description: The programming language to use for splitting text into sentences if "language"
      is selected as the splitter. Check https://python.langchain.com/docs/modules/data_connection/document_transformers/code_splitter
      for more information on supported languages.
    type: str
    default: None
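
As a sketch of how these arguments combine for code-aware chunking (the call mirrors the README usage; the sizes and language value are illustrative):

```python
dataset = dataset.apply(
    "chunk_text",
    arguments={
        "chunk_strategy": "Language",
        "chunk_kwargs": {"chunk_size": 256, "chunk_overlap": 0},
        "language_text_splitter": "python",
    },
)
```
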
111 changes: 100 additions & 11 deletions components/chunk_text/src/main.py
@@ -11,37 +11,126 @@

import pandas as pd
from fondant.component import PandasTransformComponent
from langchain.text_splitter import (
CharacterTextSplitter,
HTMLHeaderTextSplitter,
Language,
LatexTextSplitter,
MarkdownHeaderTextSplitter,
MarkdownTextSplitter,
NLTKTextSplitter,
PythonCodeTextSplitter,
RecursiveCharacterTextSplitter,
SentenceTransformersTokenTextSplitter,
SpacyTextSplitter,
TextSplitter,
TokenTextSplitter,
)

logger = logging.getLogger(__name__)


class ChunkTextComponent(PandasTransformComponent):
"""Component that chunks text into smaller segments.."""
"""Component that chunks text into smaller segments.
More information about the different chunking strategies can be here:
- https://python.langchain.com/docs/modules/data_connection/document_transformers/
- https://www.pinecone.io/learn/chunking-strategies/.
"""

def __init__(
self,
*,
chunk_strategy: t.Optional[str],
chunk_kwargs: t.Optional[dict],
language_text_splitter: t.Optional[str],
**kwargs,
):
"""
Args:
chunk_strategy: The strategy to use for chunking. One of
['RecursiveCharacterTextSplitter', 'HTMLHeaderTextSplitter', 'CharacterTextSplitter',
'Language', 'MarkdownHeaderTextSplitter', 'MarkdownTextSplitter',
'SentenceTransformersTokenTextSplitter', 'LatexTextSplitter', 'SpacyTextSplitter',
'TokenTextSplitter', 'NLTKTextSplitter', 'PythonCodeTextSplitter', 'character',
'NLTK', 'SpaCy']
chunk_kwargs: Keyword arguments to pass to the chunker class.
language_text_splitter: The programming language to use for splitting text into
sentences if "language" is selected as the splitter. Check
https://python.langchain.com/docs/modules/data_connection/document_transformers/
code_splitter
for more information on supported languages.
kwargs: Unhandled keyword arguments passed in by Fondant.
"""
        self.chunk_strategy = chunk_strategy
        self.chunk_kwargs = chunk_kwargs
        self.language_text_splitter = language_text_splitter
        self.chunker = self._get_chunker_class(chunk_strategy)

def _get_chunker_class(self, chunk_strategy: t.Optional[str]) -> TextSplitter:
"""
        Retrieve the chunker class that corresponds to the given strategy name.

Args:
chunk_strategy: The strategy to use for chunking. One of
['RecursiveCharacterTextSplitter', 'HTMLHeaderTextSplitter', 'CharacterTextSplitter',
'Language', 'MarkdownHeaderTextSplitter', 'MarkdownTextSplitter',
'SentenceTransformersTokenTextSplitter', 'LatexTextSplitter', 'SpacyTextSplitter',
'TokenTextSplitter', 'NLTKTextSplitter', 'PythonCodeTextSplitter', 'character',
'NLTK', 'SpaCy', 'recursive'].
"""
class_dict = {
"RecursiveCharacterTextSplitter": RecursiveCharacterTextSplitter,
"HTMLHeaderTextSplitter": HTMLHeaderTextSplitter,
"CharacterTextSplitter": CharacterTextSplitter,
"Language": Language,
"MarkdownHeaderTextSplitter": MarkdownHeaderTextSplitter,
"MarkdownTextSplitter": MarkdownTextSplitter,
"SentenceTransformersTokenTextSplitter": SentenceTransformersTokenTextSplitter,
"LatexTextSplitter": LatexTextSplitter,
"SpacyTextSplitter": SpacyTextSplitter,
"TokenTextSplitter": TokenTextSplitter,
"NLTKTextSplitter": NLTKTextSplitter,
"PythonCodeTextSplitter": PythonCodeTextSplitter,
}

supported_chunk_strategies = list(class_dict.keys())

if chunk_strategy not in supported_chunk_strategies:
msg = f"Chunk strategy must be one of: {supported_chunk_strategies}"
raise ValueError(
msg,
)

if chunk_strategy == "Language":
supported_languages = [e.value for e in Language]

if self.language_text_splitter is None:
msg = (
f"Language text splitter must be specified when using Language"
f" chunking strategy, choose from: {supported_languages}"
)
raise ValueError(
msg,
)

if self.language_text_splitter not in supported_languages:
msg = f"Language text splitter must be one of: {supported_languages}"
raise ValueError(
msg,
)

return RecursiveCharacterTextSplitter.from_language(
language=Language(self.language_text_splitter),
**self.chunk_kwargs,
)

return class_dict[chunk_strategy](**self.chunk_kwargs)

def chunk_text(self, row) -> t.List[t.Tuple]:
# Multi-index df has id under the name attribute
doc_id = row.name
text_data = row["text"]
docs = self.chunker.create_documents([text_data])

return [
(doc_id, f"{doc_id}_{chunk_id}", chunk.page_content)
for chunk_id, chunk in enumerate(docs)
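
For reference, the "Language" branch above roughly amounts to the following direct LangChain usage, and the per-row tuples mirror the id scheme described in the README (a sketch with made-up sizes, input, and document id):

```python
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

# Equivalent of chunk_strategy="Language", language_text_splitter="python",
# chunk_kwargs={"chunk_size": 256, "chunk_overlap": 0}
chunker = RecursiveCharacterTextSplitter.from_language(
    language=Language("python"),
    chunk_size=256,
    chunk_overlap=0,
)

doc_id = "doc_1"  # hypothetical original document id
docs = chunker.create_documents(["def foo():\n    return 1\n\n\ndef bar():\n    return 2\n"])

# Same (original_id, new_id, text) tuples that chunk_text builds per row
chunks = [
    (doc_id, f"{doc_id}_{chunk_id}", doc.page_content)
    for chunk_id, doc in enumerate(docs)
]
```
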
5 changes: 3 additions & 2 deletions components/chunk_text/tests/chunk_text_test.py
@@ -17,8 +17,9 @@ def test_transform():
)

component = ChunkTextComponent(
chunk_strategy="RecursiveCharacterTextSplitter",
chunk_kwargs={"chunk_size": 50, "chunk_overlap": 20},
language_text_splitter=None,
)

output_dataframe = component.transform(input_dataframe)
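
As a quick illustrative extension of this test (not part of the original file; the sample text is made up), the configured chunker can also be exercised directly:

```python
docs = component.chunker.create_documents(["Lorem ipsum dolor sit amet, " * 10])
# With chunk_kwargs={"chunk_size": 50, ...}, every chunk stays within the limit.
assert all(len(doc.page_content) <= 50 for doc in docs)
```
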
16 changes: 11 additions & 5 deletions components/index_aws_opensearch/README.md
@@ -1,11 +1,14 @@
# Index AWS OpenSearch

<a id="index_aws_opensearch#description"></a>
## Description
Component that takes embeddings of text snippets and indexes them into AWS OpenSearch vector database.

<a id="index_aws_opensearch#inputs_outputs"></a>
## Inputs / outputs

<a id="index_aws_opensearch#consumes"></a>
### Consumes
**This component consumes:**

- text: string
@@ -14,12 +17,13 @@ Component that takes embeddings of text snippets and indexes them into AWS OpenS




<a id="index_aws_opensearch#produces"></a>
### Produces


**This component does not produce data.**

<a id="index_aws_opensearch#arguments"></a>
## Arguments

The component takes the following arguments to alter its behavior:
@@ -35,7 +39,8 @@ The component takes the following arguments to alter its behavior:
| verify_certs | bool | A boolean flag indicating whether to verify SSL certificates when connecting to the OpenSearch cluster. | True |
| pool_maxsize | int | The maximum size of the connection pool to the AWS OpenSearch cluster. | 20 |

<a id="index_aws_opensearch#usage"></a>
## Usage

You can add this component to your pipeline using the following code:

@@ -65,6 +70,7 @@ dataset.write(
)
```

<a id="index_aws_opensearch#testing"></a>
## Testing

You can run the tests using docker with BuildKit. From this directory, run:
