Generalize chunk data component (#757)
Fixes [#14](ml6team/fondant-usecase-RAG#14)

Adds the different chunking strategies available in LangChain.

Affected pipelines must be updated after a new release.
PhilippeMoussalli authored Jan 8, 2024
1 parent 665805d commit b7962a4
Showing 5 changed files with 152 additions and 28 deletions.
17 changes: 13 additions & 4 deletions components/chunk_text/README.md
@@ -7,6 +7,13 @@ Component that chunks text into smaller segments
This component takes a body of text and chunks it into smaller segments. The id of the returned dataset
consists of the id of the original document followed by the chunk index.

Different chunking strategies can be used. The default is to use the "recursive" strategy which
recursively splits the text into smaller chunks until the chunk size is reached.

More information about the different chunking strategies can be found here:
- https://python.langchain.com/docs/modules/data_connection/document_transformers/
- https://www.pinecone.io/learn/chunking-strategies/
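
As a rough sketch of what the default "recursive" strategy does under the hood (illustrative only; it assumes `langchain` is installed, and the text and sizes are made up):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
docs = splitter.create_documents(
    [
        "Fondant pipelines read documents, chunk them, embed the chunks, "
        "and index them into a vector store for retrieval-augmented generation."
    ]
)
for doc in docs:
    # Each chunk stays under the configured size, splitting on paragraphs,
    # lines, and words before falling back to characters.
    print(len(doc.page_content), repr(doc.page_content))
```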


<a id="chunk_text#inputs_outputs"></a>
## Inputs / outputs
@@ -36,8 +43,9 @@ The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| chunk_strategy | str | The strategy to use for chunking the text. One of ['RecursiveCharacterTextSplitter', 'HTMLHeaderTextSplitter', 'CharacterTextSplitter', 'Language', 'MarkdownHeaderTextSplitter', 'MarkdownTextSplitter', 'SentenceTransformersTokenTextSplitter', 'LatexTextSplitter', 'SpacyTextSplitter', 'TokenTextSplitter', 'NLTKTextSplitter', 'PythonCodeTextSplitter', 'character', 'NLTK', 'SpaCy'] | RecursiveCharacterTextSplitter |
| chunk_kwargs | dict | The arguments to pass to the chunking strategy | / |
| language_text_splitter | str | The programming language to use for splitting text into sentences if "language" is selected as the splitter. Check https://python.langchain.com/docs/modules/data_connection/document_transformers/code_splitter for more information on supported languages. | / |

<a id="chunk_text#usage"></a>
## Usage
@@ -56,8 +64,9 @@ dataset = dataset.apply(
"chunk_text",
arguments={
# Add arguments
# "chunk_size": 0,
# "chunk_overlap": 0,
# "chunk_strategy": "RecursiveCharacterTextSplitter",
# "chunk_kwargs": {},
# "language_text_splitter": ,
},
)
```
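
A filled-in call might look as follows (a sketch; the `chunk_kwargs` values are illustrative, not defaults enforced by the component):

```python
dataset = dataset.apply(
    "chunk_text",
    arguments={
        "chunk_strategy": "RecursiveCharacterTextSplitter",
        "chunk_kwargs": {"chunk_size": 512, "chunk_overlap": 32},
    },
)
```
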
31 changes: 25 additions & 6 deletions components/chunk_text/fondant_component.yaml
@@ -4,7 +4,13 @@ description: |
This component takes a body of text and chunks it into smaller segments. The id of the returned dataset
consists of the id of the original document followed by the chunk index.
Different chunking strategies can be used. The default is to use the "recursive" strategy which
recursively splits the text into smaller chunks until the chunk size is reached.
More information about the different chunking strategies can be found here:
- https://python.langchain.com/docs/modules/data_connection/document_transformers/
- https://www.pinecone.io/learn/chunking-strategies/
image: fndnt/chunk_text:dev
tags:
- Text processing
@@ -22,9 +28,22 @@ produces:
previous_index: original_document_id

args:
  chunk_strategy:
    description: The strategy to use for chunking the text. One of
      ['RecursiveCharacterTextSplitter', 'HTMLHeaderTextSplitter', 'CharacterTextSplitter',
      'Language', 'MarkdownHeaderTextSplitter', 'MarkdownTextSplitter',
      'SentenceTransformersTokenTextSplitter', 'LatexTextSplitter', 'SpacyTextSplitter',
      'TokenTextSplitter', 'NLTKTextSplitter', 'PythonCodeTextSplitter', 'character',
      'NLTK', 'SpaCy']
    type: str
    default: RecursiveCharacterTextSplitter
  chunk_kwargs:
    description: The arguments to pass to the chunking strategy
    type: dict
    default: {}
  language_text_splitter:
    description: The programming language to use for splitting text into sentences if "language"
      is selected as the splitter. Check https://python.langchain.com/docs/modules/data_connection/document_transformers/code_splitter
      for more information on supported languages.
    type: str
    default: None
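
As a sketch of how these arguments combine for code-aware chunking (the call mirrors the README usage; the sizes and language value are illustrative):

```python
dataset = dataset.apply(
    "chunk_text",
    arguments={
        "chunk_strategy": "Language",
        "chunk_kwargs": {"chunk_size": 256, "chunk_overlap": 0},
        "language_text_splitter": "python",
    },
)
```
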
111 changes: 100 additions & 11 deletions components/chunk_text/src/main.py
@@ -11,37 +11,126 @@

import pandas as pd
from fondant.component import PandasTransformComponent
from langchain.text_splitter import (
CharacterTextSplitter,
HTMLHeaderTextSplitter,
Language,
LatexTextSplitter,
MarkdownHeaderTextSplitter,
MarkdownTextSplitter,
NLTKTextSplitter,
PythonCodeTextSplitter,
RecursiveCharacterTextSplitter,
SentenceTransformersTokenTextSplitter,
SpacyTextSplitter,
TextSplitter,
TokenTextSplitter,
)

logger = logging.getLogger(__name__)


class ChunkTextComponent(PandasTransformComponent):
"""Component that chunks text into smaller segments.."""
"""Component that chunks text into smaller segments.
More information about the different chunking strategies can be here:
- https://python.langchain.com/docs/modules/data_connection/document_transformers/
- https://www.pinecone.io/learn/chunking-strategies/.
"""

def __init__(
self,
*,
chunk_strategy: t.Optional[str],
chunk_kwargs: t.Optional[dict],
language_text_splitter: t.Optional[str],
**kwargs,
):
"""
Args:
chunk_strategy: The strategy to use for chunking. One of
['RecursiveCharacterTextSplitter', 'HTMLHeaderTextSplitter', 'CharacterTextSplitter',
'Language', 'MarkdownHeaderTextSplitter', 'MarkdownTextSplitter',
'SentenceTransformersTokenTextSplitter', 'LatexTextSplitter', 'SpacyTextSplitter',
'TokenTextSplitter', 'NLTKTextSplitter', 'PythonCodeTextSplitter', 'character',
'NLTK', 'SpaCy']
chunk_kwargs: Keyword arguments to pass to the chunker class.
language_text_splitter: The programming language to use for splitting text into
sentences if "language" is selected as the splitter. Check
https://python.langchain.com/docs/modules/data_connection/document_transformers/
code_splitter
for more information on supported languages.
kwargs: Unhandled keyword arguments passed in by Fondant.
"""
        self.chunk_strategy = chunk_strategy
        self.chunk_kwargs = chunk_kwargs
        self.language_text_splitter = language_text_splitter
        self.chunker = self._get_chunker_class(chunk_strategy)

def _get_chunker_class(self, chunk_strategy: t.Optional[str]) -> TextSplitter:
"""
        Retrieve the chunker class that corresponds to the given strategy name.

Args:
chunk_strategy: The strategy to use for chunking. One of
['RecursiveCharacterTextSplitter', 'HTMLHeaderTextSplitter', 'CharacterTextSplitter',
'Language', 'MarkdownHeaderTextSplitter', 'MarkdownTextSplitter',
'SentenceTransformersTokenTextSplitter', 'LatexTextSplitter', 'SpacyTextSplitter',
'TokenTextSplitter', 'NLTKTextSplitter', 'PythonCodeTextSplitter', 'character',
'NLTK', 'SpaCy', 'recursive'].
"""
class_dict = {
"RecursiveCharacterTextSplitter": RecursiveCharacterTextSplitter,
"HTMLHeaderTextSplitter": HTMLHeaderTextSplitter,
"CharacterTextSplitter": CharacterTextSplitter,
"Language": Language,
"MarkdownHeaderTextSplitter": MarkdownHeaderTextSplitter,
"MarkdownTextSplitter": MarkdownTextSplitter,
"SentenceTransformersTokenTextSplitter": SentenceTransformersTokenTextSplitter,
"LatexTextSplitter": LatexTextSplitter,
"SpacyTextSplitter": SpacyTextSplitter,
"TokenTextSplitter": TokenTextSplitter,
"NLTKTextSplitter": NLTKTextSplitter,
"PythonCodeTextSplitter": PythonCodeTextSplitter,
}

supported_chunk_strategies = list(class_dict.keys())

if chunk_strategy not in supported_chunk_strategies:
msg = f"Chunk strategy must be one of: {supported_chunk_strategies}"
raise ValueError(
msg,
)

if chunk_strategy == "Language":
supported_languages = [e.value for e in Language]

if self.language_text_splitter is None:
msg = (
f"Language text splitter must be specified when using Language"
f" chunking strategy, choose from: {supported_languages}"
)
raise ValueError(
msg,
)

if self.language_text_splitter not in supported_languages:
msg = f"Language text splitter must be one of: {supported_languages}"
raise ValueError(
msg,
)

return RecursiveCharacterTextSplitter.from_language(
language=Language(self.language_text_splitter),
**self.chunk_kwargs,
)

return class_dict[chunk_strategy](**self.chunk_kwargs)

def chunk_text(self, row) -> t.List[t.Tuple]:
# Multi-index df has id under the name attribute
doc_id = row.name
text_data = row["text"]
docs = self.chunker.create_documents([text_data])

return [
(doc_id, f"{doc_id}_{chunk_id}", chunk.page_content)
for chunk_id, chunk in enumerate(docs)
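
For reference, the "Language" branch above roughly amounts to the following direct LangChain usage, and the per-row tuples mirror the id scheme described in the README (a sketch with made-up sizes, input, and document id):

```python
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

# Equivalent of chunk_strategy="Language", language_text_splitter="python",
# chunk_kwargs={"chunk_size": 256, "chunk_overlap": 0}
chunker = RecursiveCharacterTextSplitter.from_language(
    language=Language("python"),
    chunk_size=256,
    chunk_overlap=0,
)

doc_id = "doc_1"  # hypothetical original document id
docs = chunker.create_documents(["def foo():\n    return 1\n\n\ndef bar():\n    return 2\n"])

# Same (original_id, new_id, text) tuples that chunk_text builds per row
chunks = [
    (doc_id, f"{doc_id}_{chunk_id}", doc.page_content)
    for chunk_id, doc in enumerate(docs)
]
```
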
5 changes: 3 additions & 2 deletions components/chunk_text/tests/chunk_text_test.py
@@ -17,8 +17,9 @@ def test_transform():
)

component = ChunkTextComponent(
chunk_strategy="RecursiveCharacterTextSplitter",
chunk_kwargs={"chunk_size": 50, "chunk_overlap": 20},
language_text_splitter=None,
)

output_dataframe = component.transform(input_dataframe)
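
As a quick illustrative extension of this test (not part of the original file; the sample text is made up), the configured chunker can also be exercised directly:

```python
docs = component.chunker.create_documents(["Lorem ipsum dolor sit amet, " * 10])
# With chunk_kwargs={"chunk_size": 50, ...}, every chunk stays within the limit.
assert all(len(doc.page_content) <= 50 for doc in docs)
```
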
16 changes: 11 additions & 5 deletions components/index_aws_opensearch/README.md
@@ -1,11 +1,14 @@
# Index AWS OpenSearch

<a id="index_aws_opensearch#description"></a>
## Description
Component that takes embeddings of text snippets and indexes them into AWS OpenSearch vector database.

<a id="index_aws_opensearch#inputs_outputs"></a>
## Inputs / outputs

<a id="index_aws_opensearch#consumes"></a>
### Consumes
**This component consumes:**

- text: string
@@ -14,12 +17,13 @@ Component that takes embeddings of text snippets and indexes them into AWS OpenS




<a id="index_aws_opensearch#produces"></a>
### Produces


**This component does not produce data.**

<a id="index_aws_opensearch#arguments"></a>
## Arguments

The component takes the following arguments to alter its behavior:
@@ -35,7 +39,8 @@ The component takes the following arguments to alter its behavior:
| verify_certs | bool | A boolean flag indicating whether to verify SSL certificates when connecting to the OpenSearch cluster. | True |
| pool_maxsize | int | The maximum size of the connection pool to the AWS OpenSearch cluster. | 20 |

<a id="index_aws_opensearch#usage"></a>
## Usage

You can add this component to your pipeline using the following code:

@@ -65,6 +70,7 @@ dataset.write(
)
```

<a id="index_aws_opensearch#testing"></a>
## Testing

You can run the tests using docker with BuildKit. From this directory, run:
