Add load_with_llamahub component (#719)
This component is being ported from the use case repo
[here](https://github.com/ml6team/fondant-usecase-RAG/tree/main/src/components/load_with_llamahub).

I split the inclusion and the changes I needed to make into separate
commits to make review easier.
RobbeSneyders authored Dec 12, 2023
1 parent 589e327 commit 13ba6dc
Showing 10 changed files with 332 additions and 1 deletion.
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -68,5 +68,5 @@ repos:
name: Generate component READMEs
language: python
entry: python scripts/component_readme/generate_readme.py
files: ^components/.*/fondant_component.yaml
files: ^components/[^/]*/fondant_component.yaml
additional_dependencies: ["fondant@git+https://github.com/ml6team/fondant@main", "Jinja2==3.1.2"]
29 changes: 29 additions & 0 deletions components/load_with_llamahub/Dockerfile
@@ -0,0 +1,29 @@
FROM --platform=linux/amd64 python:3.8-slim as base

# System dependencies
RUN apt-get update && \
apt-get upgrade -y && \
apt-get install git -y

# Install requirements
COPY requirements.txt /
RUN pip3 install --no-cache-dir -r requirements.txt

# Install Fondant
# This is split from other requirements to leverage caching
ARG FONDANT_VERSION=main
RUN pip3 install fondant[component,aws,azure,gcp]@git+https://github.com/ml6team/fondant@${FONDANT_VERSION}

# Set the working directory to the component folder
WORKDIR /component
COPY src/ src/

FROM base as test
COPY tests/ tests/
RUN pip3 install --no-cache-dir -r tests/requirements.txt
RUN python -m pytest tests

FROM base
WORKDIR /component/src
ENTRYPOINT ["fondant", "execute", "main"]

56 changes: 56 additions & 0 deletions components/load_with_llamahub/README.md
@@ -0,0 +1,56 @@
# Load with LlamaHub

### Description
Load data using a LlamaHub loader. For available loaders, check the
[LlamaHub](https://llamahub.ai/).


### Inputs / outputs

**This component consumes no data.**

**This component produces no data.**

### Arguments

The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| loader_class | str | The name of the LlamaIndex loader class to use. Make sure to provide the name and not the id. The name is passed to `llama_index.download_loader` to download the specified loader. | / |
| loader_kwargs | str | Keyword arguments to pass when instantiating the loader class. Check the documentation of the loader to check which arguments it accepts. | / |
| load_kwargs | str | Keyword arguments to pass to the `.load()` method of the loader. Check the documentation of the loader to check which arguments it accepts. | / |
| additional_requirements | list | Some loaders require additional dependencies to be installed. You can specify those here. Use a format accepted by `pip install`. Eg. "pypdf" or "pypdf==3.17.1". Unfortunately additional requirements for LlamaIndex loaders are not documented well, but if a dependency is missing, a clear error message will be thrown. | / |
| n_rows_to_load | int | Optional argument that defines the number of rows to load. Useful for testing pipeline runs on a small scale | / |
| index_column | str | Column to set index to in the load component, if not specified a default globally unique index will be set | / |

### Usage

You can add this component to your pipeline using the following code:

```python
from fondant.pipeline import Pipeline


pipeline = Pipeline(...)

dataset = pipeline.read(
"load_with_llamahub",
arguments={
# Add arguments
# "loader_class": ,
# "loader_kwargs": ,
# "load_kwargs": ,
# "additional_requirements": [],
# "n_rows_to_load": 0,
# "index_column": ,
}
)
```
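For a concrete starting point, a filled-in version of the snippet above could look like the following. The argument values are taken from this commit's own test (`tests/component_test.py`), which exercises the `ArxivReader` loader; they are an illustration, not a recommendation:

```python
from fondant.pipeline import Pipeline


pipeline = Pipeline(...)

dataset = pipeline.read(
    "load_with_llamahub",
    arguments={
        "loader_class": "ArxivReader",
        "loader_kwargs": {},
        "load_kwargs": {
            "search_query": "jeff dean",
            "max_results": 5,
        },
        # ArxivReader needs pypdf to parse the downloaded papers
        "additional_requirements": ["pypdf"],
    },
)
```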

### Testing

You can run the tests using docker with BuildKit. From this directory, run:
```
docker build . --target test
```
47 changes: 47 additions & 0 deletions components/load_with_llamahub/fondant_component.yaml
@@ -0,0 +1,47 @@
name: Load with LlamaHub
description: |
Load data using a LlamaHub loader. For available loaders, check the
[LlamaHub](https://llamahub.ai/).
image: fndnt/load_with_llamahub:dev
tags:
- Data loading

produces:
additionalProperties: true

args:
loader_class:
description: |
The name of the LlamaIndex loader class to use. Make sure to provide the name and not the
id. The name is passed to `llama_index.download_loader` to download the specified loader.
type: str
loader_kwargs:
description: |
Keyword arguments to pass when instantiating the loader class. Check the documentation of
the loader to check which arguments it accepts.
type: str
load_kwargs:
description: |
Keyword arguments to pass to the `.load()` method of the loader. Check the documentation of
the loader to check which arguments it accepts.
type: str
additional_requirements:
description: |
Some loaders require additional dependencies to be installed. You can specify those here.
Use a format accepted by `pip install`. Eg. "pypdf" or "pypdf==3.17.1". Unfortunately
additional requirements for LlamaIndex loaders are not documented well, but if a dependency
is missing, a clear error message will be thrown.
type: list
default: []
n_rows_to_load:
description: |
Optional argument that defines the number of rows to load. Useful for testing pipeline runs
on a small scale
type: int
default: None
index_column:
description: |
Column to set index to in the load component, if not specified a default globally unique
index will be set
type: str
default: None
1 change: 1 addition & 0 deletions components/load_with_llamahub/requirements.txt
@@ -0,0 +1 @@
llama-index==0.9.9
110 changes: 110 additions & 0 deletions components/load_with_llamahub/src/main.py
@@ -0,0 +1,110 @@
import logging
import subprocess
import sys
import typing as t
from collections import defaultdict

import dask.dataframe as dd
import pandas as pd
from fondant.component import DaskLoadComponent
from fondant.core.component_spec import ComponentSpec
from llama_index import download_loader

logger = logging.getLogger(__name__)


class LlamaHubReader(DaskLoadComponent):
def __init__(
self,
spec: ComponentSpec,
*,
loader_class: str,
loader_kwargs: dict,
load_kwargs: dict,
additional_requirements: t.List[str],
n_rows_to_load: t.Optional[int] = None,
index_column: t.Optional[str] = None,
) -> None:
"""
Args:
spec: the component spec
loader_class: The name of the LlamaIndex loader class to use
loader_kwargs: Keyword arguments to pass when instantiating the loader class
load_kwargs: Keyword arguments to pass to the `.load()` method of the loader
additional_requirements: Additional Python requirements to install
n_rows_to_load: optional argument that defines the number of rows to load.
Useful for testing pipeline runs on a small scale.
index_column: Column to set index to in the load component, if not specified a default
globally unique index will be set.
"""
self.n_rows_to_load = n_rows_to_load
self.index_column = index_column
self.spec = spec

self.install_additional_requirements(additional_requirements)

loader_cls = download_loader(loader_class)
self.loader = loader_cls(**loader_kwargs)
self.load_kwargs = load_kwargs

@staticmethod
def install_additional_requirements(additional_requirements: t.List[str]):
for requirement in additional_requirements:
subprocess.check_call( # nosec
[sys.executable, "-m", "pip", "install", requirement],
)

def set_df_index(self, dask_df: dd.DataFrame) -> dd.DataFrame:
if self.index_column is None:
logger.info(
"Index column not specified, setting a globally unique index",
)

def _set_unique_index(dataframe: pd.DataFrame, partition_info=None):
"""Function that sets a unique index based on the partition and row number."""
dataframe["id"] = 1
dataframe["id"] = (
str(partition_info["number"])
+ "_"
+ (dataframe.id.cumsum()).astype(str)
)
dataframe.index = dataframe.pop("id")
return dataframe

def _get_meta_df() -> pd.DataFrame:
meta_dict = {"id": pd.Series(dtype="object")}
for field_name, field in self.spec.produces.items():
meta_dict[field_name] = pd.Series(
dtype=pd.ArrowDtype(field.type.value),
)
return pd.DataFrame(meta_dict).set_index("id")

meta = _get_meta_df()
dask_df = dask_df.map_partitions(_set_unique_index, meta=meta)
else:
logger.info(f"Setting `{self.index_column}` as index")
dask_df = dask_df.set_index(self.index_column, drop=True)

return dask_df

def load(self) -> dd.DataFrame:
try:
documents = self.loader.lazy_load_data(**self.load_kwargs)
except NotImplementedError:
documents = self.loader.load_data(**self.load_kwargs)

doc_dict = defaultdict(list)
for d, document in enumerate(documents):
for column in self.spec.produces:
if column == "text":
doc_dict["text"].append(document.text)
else:
doc_dict[column].append(document.metadata.get(column))

if d == self.n_rows_to_load:
break

dask_df = dd.from_dict(doc_dict, npartitions=1)

dask_df = self.set_df_index(dask_df)
return dask_df
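The globally unique index built by `_set_unique_index` above can be illustrated in plain pandas. In this minimal sketch, the function name `set_unique_index` is illustrative, and the `partition_number` argument stands in for Dask's `partition_info["number"]`:

```python
import pandas as pd


def set_unique_index(df: pd.DataFrame, partition_number: int) -> pd.DataFrame:
    # Same pattern as _set_unique_index: combine the partition number with a
    # per-row cumulative count to build a globally unique string index.
    df = df.copy()
    df["id"] = 1
    df["id"] = str(partition_number) + "_" + df["id"].cumsum().astype(str)
    df.index = df.pop("id")
    return df


df = pd.DataFrame({"text": ["a", "b", "c"]})
indexed = set_unique_index(df, partition_number=0)
print(indexed.index.tolist())  # → ['0_1', '0_2', '0_3']
```

Because every partition knows its own number, each partition can index its rows independently, without any cross-partition coordination.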
35 changes: 35 additions & 0 deletions components/load_with_llamahub/tests/component_test.py
@@ -0,0 +1,35 @@
from pathlib import Path

import yaml
from fondant.core.component_spec import ComponentSpec

from src.main import LlamaHubReader


def test_arxiv_reader():
"""Test the component with the ArxivReader.
This test requires a stable internet connection, both to download the loader, and to download
the papers from Arxiv.
"""
with open(Path(__file__).with_name("fondant_component.yaml")) as f:
spec = yaml.safe_load(f)
spec = ComponentSpec(spec)

component = LlamaHubReader(
spec=spec,
loader_class="ArxivReader",
loader_kwargs={},
load_kwargs={
"search_query": "jeff dean",
"max_results": 5,
},
additional_requirements=["pypdf"],
n_rows_to_load=None,
index_column=None,
)

output_dataframe = component.load().compute()

assert len(output_dataframe) > 0
assert output_dataframe.columns.tolist() == ["text", "URL", "Title of this paper"]
50 changes: 50 additions & 0 deletions components/load_with_llamahub/tests/fondant_component.yaml
@@ -0,0 +1,50 @@
name: Load with LlamaHub
description: |
Load data using a LlamaHub loader. For available loaders, check the
[LlamaHub](https://llamahub.ai/).
image: ghcr.io/ml6team/load_with_llamahub:dev

produces:
text:
type: string
URL:
type: string
Title of this paper:
type: string

args:
loader_class:
description: |
The name of the LlamaIndex loader class to use. Make sure to provide the name and not the
id. The name is passed to `llama_index.download_loader` to download the specified loader.
type: str
loader_kwargs:
description: |
Keyword arguments to pass when instantiating the loader class. Check the documentation of
the loader to check which arguments it accepts.
type: str
load_kwargs:
description: |
Keyword arguments to pass to the `.load()` method of the loader. Check the documentation of
the loader to check which arguments it accepts.
type: str
additional_requirements:
description: |
Some loaders require additional dependencies to be installed. You can specify those here.
Use a format accepted by `pip install`. Eg. "pypdf" or "pypdf==3.17.1". Unfortunately
additional requirements for LlamaIndex loaders are not documented well, but if a dependency
is missing, a clear error message will be thrown.
type: list
default: []
n_rows_to_load:
description: |
Optional argument that defines the number of rows to load. Useful for testing pipeline runs
on a small scale
type: int
default: None
index_column:
description: |
Column to set index to in the load component, if not specified a default globally unique
index will be set
type: str
default: None
2 changes: 2 additions & 0 deletions components/load_with_llamahub/tests/pytest.ini
@@ -0,0 +1,2 @@
[pytest]
pythonpath = ../src
1 change: 1 addition & 0 deletions components/load_with_llamahub/tests/requirements.txt
@@ -0,0 +1 @@
pytest==7.4.2
