Add load_with_llamahub component (#719)
This component is being ported from the use case repo [here](https://github.com/ml6team/fondant-usecase-RAG/tree/main/src/components/load_with_llamahub). I split the inclusion and the changes I needed to make in separate commits so it's easy to review.
1 parent 589e327 · commit 13ba6dc · 10 changed files with 332 additions and 1 deletion.
**`Dockerfile`** (new file, 29 additions):

```dockerfile
FROM --platform=linux/amd64 python:3.8-slim as base

# System dependencies
RUN apt-get update && \
    apt-get upgrade -y && \
    apt-get install git -y

# Install requirements
COPY requirements.txt /
RUN pip3 install --no-cache-dir -r requirements.txt

# Install Fondant
# This is split from other requirements to leverage caching
ARG FONDANT_VERSION=main
RUN pip3 install fondant[component,aws,azure,gcp]@git+https://github.com/ml6team/fondant@${FONDANT_VERSION}

# Set the working directory to the component folder
WORKDIR /component
COPY src/ src/

FROM base as test
COPY tests/ tests/
RUN pip3 install --no-cache-dir -r tests/requirements.txt
RUN python -m pytest tests

FROM base
WORKDIR /component/src
ENTRYPOINT ["fondant", "execute", "main"]
```
**README** (new file, 56 additions):

# Load with LlamaHub

### Description
Load data using a LlamaHub loader. For available loaders, check the
[LlamaHub](https://llamahub.ai/).

### Inputs / outputs

**This component consumes no data.**

**This component produces no data.**

### Arguments

The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| loader_class | str | The name of the LlamaIndex loader class to use. Make sure to provide the name and not the id. The name is passed to `llama_index.download_loader` to download the specified loader. | / |
| loader_kwargs | str | Keyword arguments to pass when instantiating the loader class. Check the documentation of the loader to see which arguments it accepts. | / |
| load_kwargs | str | Keyword arguments to pass to the `.load()` method of the loader. Check the documentation of the loader to see which arguments it accepts. | / |
| additional_requirements | list | Some loaders require additional dependencies to be installed. You can specify those here. Use a format accepted by `pip install`, e.g. "pypdf" or "pypdf==3.17.1". Unfortunately, additional requirements for LlamaIndex loaders are not documented well, but if a dependency is missing, a clear error message will be thrown. | / |
| n_rows_to_load | int | Optional argument that defines the number of rows to load. Useful for testing pipeline runs on a small scale. | / |
| index_column | str | Column to set the index to in the load component. If not specified, a default globally unique index will be set. | / |

### Usage

You can add this component to your pipeline using the following code:

```python
from fondant.pipeline import Pipeline


pipeline = Pipeline(...)

dataset = pipeline.read(
    "load_with_llamahub",
    arguments={
        # Add arguments
        # "loader_class": ,
        # "loader_kwargs": ,
        # "load_kwargs": ,
        # "additional_requirements": [],
        # "n_rows_to_load": 0,
        # "index_column": ,
    }
)
```

### Testing

You can run the tests using docker with BuildKit. From this directory, run:
```
docker build . --target test
```
**`fondant_component.yaml`** (component spec, new file, 47 additions):

```yaml
name: Load with LlamaHub
description: |
  Load data using a LlamaHub loader. For available loaders, check the
  [LlamaHub](https://llamahub.ai/).
image: fndnt/load_with_llamahub:dev
tags:
  - Data loading

produces:
  additionalProperties: true

args:
  loader_class:
    description: |
      The name of the LlamaIndex loader class to use. Make sure to provide the name and not the
      id. The name is passed to `llama_index.download_loader` to download the specified loader.
    type: str
  loader_kwargs:
    description: |
      Keyword arguments to pass when instantiating the loader class. Check the documentation of
      the loader to check which arguments it accepts.
    type: str
  load_kwargs:
    description: |
      Keyword arguments to pass to the `.load()` method of the loader. Check the documentation of
      the loader to check which arguments it accepts.
    type: str
  additional_requirements:
    description: |
      Some loaders require additional dependencies to be installed. You can specify those here.
      Use a format accepted by `pip install`. Eg. "pypdf" or "pypdf==3.17.1". Unfortunately
      additional requirements for LlamaIndex loaders are not documented well, but if a dependency
      is missing, a clear error message will be thrown.
    type: list
    default: []
  n_rows_to_load:
    description: |
      Optional argument that defines the number of rows to load. Useful for testing pipeline runs
      on a small scale
    type: int
    default: None
  index_column:
    description: |
      Column to set index to in the load component, if not specified a default globally unique
      index will be set
    type: str
    default: None
```
**`requirements.txt`** (new file, 1 addition):

```
llama-index==0.9.9
```
**`src/main.py`** (new file, 110 additions):

```python
import logging
import subprocess
import sys
import typing as t
from collections import defaultdict

import dask.dataframe as dd
import pandas as pd
from fondant.component import DaskLoadComponent
from fondant.core.component_spec import ComponentSpec
from llama_index import download_loader

logger = logging.getLogger(__name__)


class LlamaHubReader(DaskLoadComponent):
    def __init__(
        self,
        spec: ComponentSpec,
        *,
        loader_class: str,
        loader_kwargs: dict,
        load_kwargs: dict,
        additional_requirements: t.List[str],
        n_rows_to_load: t.Optional[int] = None,
        index_column: t.Optional[str] = None,
    ) -> None:
        """
        Args:
            spec: the component spec
            loader_class: The name of the LlamaIndex loader class to use
            loader_kwargs: Keyword arguments to pass when instantiating the loader class
            load_kwargs: Keyword arguments to pass to the `.load()` method of the loader
            additional_requirements: Additional Python requirements to install
            n_rows_to_load: optional argument that defines the number of rows to load.
                Useful for testing pipeline runs on a small scale.
            index_column: Column to set index to in the load component, if not specified a default
                globally unique index will be set.
        """
        self.n_rows_to_load = n_rows_to_load
        self.index_column = index_column
        self.spec = spec

        self.install_additional_requirements(additional_requirements)

        loader_cls = download_loader(loader_class)
        self.loader = loader_cls(**loader_kwargs)
        self.load_kwargs = load_kwargs

    @staticmethod
    def install_additional_requirements(additional_requirements: t.List[str]):
        for requirement in additional_requirements:
            subprocess.check_call(  # nosec
                [sys.executable, "-m", "pip", "install", requirement],
            )

    def set_df_index(self, dask_df: dd.DataFrame) -> dd.DataFrame:
        if self.index_column is None:
            logger.info(
                "Index column not specified, setting a globally unique index",
            )

            def _set_unique_index(dataframe: pd.DataFrame, partition_info=None):
                """Function that sets a unique index based on the partition and row number."""
                dataframe["id"] = 1
                dataframe["id"] = (
                    str(partition_info["number"])
                    + "_"
                    + (dataframe.id.cumsum()).astype(str)
                )
                dataframe.index = dataframe.pop("id")
                return dataframe

            def _get_meta_df() -> pd.DataFrame:
                meta_dict = {"id": pd.Series(dtype="object")}
                for field_name, field in self.spec.produces.items():
                    meta_dict[field_name] = pd.Series(
                        dtype=pd.ArrowDtype(field.type.value),
                    )
                return pd.DataFrame(meta_dict).set_index("id")

            meta = _get_meta_df()
            dask_df = dask_df.map_partitions(_set_unique_index, meta=meta)
        else:
            logger.info(f"Setting `{self.index_column}` as index")
            dask_df = dask_df.set_index(self.index_column, drop=True)

        return dask_df

    def load(self) -> dd.DataFrame:
        try:
            documents = self.loader.lazy_load_data(**self.load_kwargs)
        except NotImplementedError:
            documents = self.loader.load_data(**self.load_kwargs)

        doc_dict = defaultdict(list)
        for d, document in enumerate(documents):
            for column in self.spec.produces:
                if column == "text":
                    doc_dict["text"].append(document.text)
                else:
                    doc_dict[column].append(document.metadata.get(column))

            if d == self.n_rows_to_load:
                break

        dask_df = dd.from_dict(doc_dict, npartitions=1)

        dask_df = self.set_df_index(dask_df)
        return dask_df
```
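The `load()` method above collects document fields into per-column lists before handing them to Dask. A stripped-down mirror of that loop (plain dicts standing in for LlamaIndex `Document` objects; `collect_rows` is a hypothetical helper, not part of the component) makes the `n_rows_to_load` early-stop behavior easy to see — note that the check runs after appending, so up to `n_rows_to_load + 1` documents end up in the result:

```python
from collections import defaultdict


def collect_rows(documents, columns, n_rows_to_load=None):
    # Mirrors the collection loop in LlamaHubReader.load(): gather each
    # produced column into a list, stopping once the row budget is reached.
    doc_dict = defaultdict(list)
    for d, document in enumerate(documents):
        for column in columns:
            doc_dict[column].append(document.get(column))
        if d == n_rows_to_load:
            break
    return dict(doc_dict)


docs = [{"text": f"doc {i}"} for i in range(10)]
rows = collect_rows(docs, ["text"], n_rows_to_load=2)
# The break fires after appending document d == 2, so three rows survive.
```

With `n_rows_to_load=None` (the default), `d == None` is never true and every document is loaded.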
**Component test** (in `tests/`, new file, 35 additions):

```python
from pathlib import Path

import yaml
from fondant.core.component_spec import ComponentSpec

from src.main import LlamaHubReader


def test_arxiv_reader():
    """Test the component with the ArxivReader.

    This test requires a stable internet connection, both to download the loader, and to download
    the papers from Arxiv.
    """
    with open(Path(__file__).with_name("fondant_component.yaml")) as f:
        spec = yaml.safe_load(f)
        spec = ComponentSpec(spec)

    component = LlamaHubReader(
        spec=spec,
        loader_class="ArxivReader",
        loader_kwargs={},
        load_kwargs={
            "search_query": "jeff dean",
            "max_results": 5,
        },
        additional_requirements=["pypdf"],
        n_rows_to_load=None,
        index_column=None,
    )

    output_dataframe = component.load().compute()

    assert len(output_dataframe) > 0
    assert output_dataframe.columns.tolist() == ["text", "URL", "Title of this paper"]
```
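The test above passes `index_column=None`, so the component's default globally unique index applies. Its scheme — partition number, underscore, cumulative row count within the partition — can be sketched without Dask (`unique_ids` is a hypothetical helper mirroring `_set_unique_index`, not part of the component):

```python
def unique_ids(n_rows: int, partition_number: int) -> list:
    # Mirrors _set_unique_index: a column of ones is cumulatively summed
    # (yielding 1..n within each partition) and prefixed with the partition
    # number, so ids stay unique across the whole Dask dataframe.
    return [f"{partition_number}_{i}" for i in range(1, n_rows + 1)]


unique_ids(3, 0)  # ids for a first partition of three rows
```

Because the partition number is baked into each id, two partitions of equal length can never collide, which is what makes the index globally unique.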
**`components/load_with_llamahub/tests/fondant_component.yaml`** (new file, 50 additions):

```yaml
name: Load with LlamaHub
description: |
  Load data using a LlamaHub loader. For available loaders, check the
  [LlamaHub](https://llamahub.ai/).
image: ghcr.io/ml6team/load_with_llamahub:dev

produces:
  text:
    type: string
  URL:
    type: string
  Title of this paper:
    type: string

args:
  loader_class:
    description: |
      The name of the LlamaIndex loader class to use. Make sure to provide the name and not the
      id. The name is passed to `llama_index.download_loader` to download the specified loader.
    type: str
  loader_kwargs:
    description: |
      Keyword arguments to pass when instantiating the loader class. Check the documentation of
      the loader to check which arguments it accepts.
    type: str
  load_kwargs:
    description: |
      Keyword arguments to pass to the `.load()` method of the loader. Check the documentation of
      the loader to check which arguments it accepts.
    type: str
  additional_requirements:
    description: |
      Some loaders require additional dependencies to be installed. You can specify those here.
      Use a format accepted by `pip install`. Eg. "pypdf" or "pypdf==3.17.1". Unfortunately
      additional requirements for LlamaIndex loaders are not documented well, but if a dependency
      is missing, a clear error message will be thrown.
    type: list
    default: []
  n_rows_to_load:
    description: |
      Optional argument that defines the number of rows to load. Useful for testing pipeline runs
      on a small scale
    type: int
    default: None
  index_column:
    description: |
      Column to set index to in the load component, if not specified a default globally unique
      index will be set
    type: str
    default: None
```
**pytest configuration** (in `tests/`, new file, 2 additions):

```ini
[pytest]
pythonpath = ../src
```
**`tests/requirements.txt`** (new file, 1 addition):

```
pytest==7.4.2
```