Add load_with_llamahub component (#719)
This component is being ported from the use case repo [here](https://github.com/ml6team/fondant-usecase-RAG/tree/main/src/components/load_with_llamahub). I split the inclusion and the changes I needed to make in separate commits so it's easy to review.
1 parent 589e327 · commit 13ba6dc · 10 changed files with 332 additions and 1 deletion.
**`Dockerfile`** (new file, 29 additions):

```dockerfile
FROM --platform=linux/amd64 python:3.8-slim as base

# System dependencies
RUN apt-get update && \
    apt-get upgrade -y && \
    apt-get install git -y

# Install requirements
COPY requirements.txt /
RUN pip3 install --no-cache-dir -r requirements.txt

# Install Fondant
# This is split from other requirements to leverage caching
ARG FONDANT_VERSION=main
RUN pip3 install fondant[component,aws,azure,gcp]@git+https://github.com/ml6team/fondant@${FONDANT_VERSION}

# Set the working directory to the component folder
WORKDIR /component
COPY src/ src/

FROM base as test
COPY tests/ tests/
RUN pip3 install --no-cache-dir -r tests/requirements.txt
RUN python -m pytest tests

FROM base
WORKDIR /component/src
ENTRYPOINT ["fondant", "execute", "main"]
```
**README** (new file, 56 additions):

# Load with LlamaHub

### Description
Load data using a LlamaHub loader. For available loaders, check the
[LlamaHub](https://llamahub.ai/).

### Inputs / outputs

**This component consumes no data.**

**This component produces no data.**

### Arguments

The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| loader_class | str | The name of the LlamaIndex loader class to use. Make sure to provide the name and not the id. The name is passed to `llama_index.download_loader` to download the specified loader. | / |
| loader_kwargs | str | Keyword arguments to pass when instantiating the loader class. Check the documentation of the loader to see which arguments it accepts. | / |
| load_kwargs | str | Keyword arguments to pass to the `.load()` method of the loader. Check the documentation of the loader to see which arguments it accepts. | / |
| additional_requirements | list | Some loaders require additional dependencies to be installed. You can specify those here. Use a format accepted by `pip install`, e.g. "pypdf" or "pypdf==3.17.1". Unfortunately, additional requirements for LlamaIndex loaders are not documented well, but if a dependency is missing, a clear error message will be thrown. | / |
| n_rows_to_load | int | Optional argument that defines the number of rows to load. Useful for testing pipeline runs on a small scale. | / |
| index_column | str | Column to set the index to in the load component. If not specified, a default globally unique index will be set. | / |

### Usage

You can add this component to your pipeline using the following code:

```python
from fondant.pipeline import Pipeline


pipeline = Pipeline(...)

dataset = pipeline.read(
    "load_with_llamahub",
    arguments={
        # Add arguments
        # "loader_class": ,
        # "loader_kwargs": ,
        # "load_kwargs": ,
        # "additional_requirements": [],
        # "n_rows_to_load": 0,
        # "index_column": ,
    }
)
```

### Testing

You can run the tests using docker with BuildKit. From this directory, run:
```
docker build . --target test
```
**`fondant_component.yaml`** (component spec, new file, 47 additions):

```yaml
name: Load with LlamaHub
description: |
  Load data using a LlamaHub loader. For available loaders, check the
  [LlamaHub](https://llamahub.ai/).
image: fndnt/load_with_llamahub:dev
tags:
  - Data loading

produces:
  additionalProperties: true

args:
  loader_class:
    description: |
      The name of the LlamaIndex loader class to use. Make sure to provide the name and not the
      id. The name is passed to `llama_index.download_loader` to download the specified loader.
    type: str
  loader_kwargs:
    description: |
      Keyword arguments to pass when instantiating the loader class. Check the documentation of
      the loader to check which arguments it accepts.
    type: str
  load_kwargs:
    description: |
      Keyword arguments to pass to the `.load()` method of the loader. Check the documentation of
      the loader to check which arguments it accepts.
    type: str
  additional_requirements:
    description: |
      Some loaders require additional dependencies to be installed. You can specify those here.
      Use a format accepted by `pip install`. Eg. "pypdf" or "pypdf==3.17.1". Unfortunately
      additional requirements for LlamaIndex loaders are not documented well, but if a dependency
      is missing, a clear error message will be thrown.
    type: list
    default: []
  n_rows_to_load:
    description: |
      Optional argument that defines the number of rows to load. Useful for testing pipeline runs
      on a small scale
    type: int
    default: None
  index_column:
    description: |
      Column to set index to in the load component, if not specified a default globally unique
      index will be set
    type: str
    default: None
```
**`requirements.txt`** (new file, 1 addition):

```
llama-index==0.9.9
```
**`src/main.py`** (new file, 110 additions):

```python
import logging
import subprocess
import sys
import typing as t
from collections import defaultdict

import dask.dataframe as dd
import pandas as pd
from fondant.component import DaskLoadComponent
from fondant.core.component_spec import ComponentSpec
from llama_index import download_loader

logger = logging.getLogger(__name__)


class LlamaHubReader(DaskLoadComponent):
    def __init__(
        self,
        spec: ComponentSpec,
        *,
        loader_class: str,
        loader_kwargs: dict,
        load_kwargs: dict,
        additional_requirements: t.List[str],
        n_rows_to_load: t.Optional[int] = None,
        index_column: t.Optional[str] = None,
    ) -> None:
        """
        Args:
            spec: the component spec
            loader_class: The name of the LlamaIndex loader class to use
            loader_kwargs: Keyword arguments to pass when instantiating the loader class
            load_kwargs: Keyword arguments to pass to the `.load()` method of the loader
            additional_requirements: Additional Python requirements to install
            n_rows_to_load: optional argument that defines the number of rows to load.
                Useful for testing pipeline runs on a small scale.
            index_column: Column to set index to in the load component, if not specified a default
                globally unique index will be set.
        """
        self.n_rows_to_load = n_rows_to_load
        self.index_column = index_column
        self.spec = spec

        self.install_additional_requirements(additional_requirements)

        loader_cls = download_loader(loader_class)
        self.loader = loader_cls(**loader_kwargs)
        self.load_kwargs = load_kwargs

    @staticmethod
    def install_additional_requirements(additional_requirements: t.List[str]):
        for requirement in additional_requirements:
            subprocess.check_call(  # nosec
                [sys.executable, "-m", "pip", "install", requirement],
            )

    def set_df_index(self, dask_df: dd.DataFrame) -> dd.DataFrame:
        if self.index_column is None:
            logger.info(
                "Index column not specified, setting a globally unique index",
            )

            def _set_unique_index(dataframe: pd.DataFrame, partition_info=None):
                """Function that sets a unique index based on the partition and row number."""
                dataframe["id"] = 1
                dataframe["id"] = (
                    str(partition_info["number"])
                    + "_"
                    + (dataframe.id.cumsum()).astype(str)
                )
                dataframe.index = dataframe.pop("id")
                return dataframe

            def _get_meta_df() -> pd.DataFrame:
                meta_dict = {"id": pd.Series(dtype="object")}
                for field_name, field in self.spec.produces.items():
                    meta_dict[field_name] = pd.Series(
                        dtype=pd.ArrowDtype(field.type.value),
                    )
                return pd.DataFrame(meta_dict).set_index("id")

            meta = _get_meta_df()
            dask_df = dask_df.map_partitions(_set_unique_index, meta=meta)
        else:
            logger.info(f"Setting `{self.index_column}` as index")
            dask_df = dask_df.set_index(self.index_column, drop=True)

        return dask_df

    def load(self) -> dd.DataFrame:
        try:
            documents = self.loader.lazy_load_data(**self.load_kwargs)
        except NotImplementedError:
            documents = self.loader.load_data(**self.load_kwargs)

        doc_dict = defaultdict(list)
        for d, document in enumerate(documents):
            for column in self.spec.produces:
                if column == "text":
                    doc_dict["text"].append(document.text)
                else:
                    doc_dict[column].append(document.metadata.get(column))

            if d == self.n_rows_to_load:
                break

        dask_df = dd.from_dict(doc_dict, npartitions=1)

        dask_df = self.set_df_index(dask_df)
        return dask_df
```
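The `load()` method above collects document fields into per-column lists before handing them to Dask. A stripped-down mirror of that loop (plain dicts standing in for LlamaIndex `Document` objects; `collect_rows` is a hypothetical helper, not part of the component) makes the `n_rows_to_load` early-stop behavior easy to see — note that the check runs after appending, so up to `n_rows_to_load + 1` documents end up in the result:

```python
from collections import defaultdict


def collect_rows(documents, columns, n_rows_to_load=None):
    # Mirrors the collection loop in LlamaHubReader.load(): gather each
    # produced column into a list, stopping once the row budget is reached.
    doc_dict = defaultdict(list)
    for d, document in enumerate(documents):
        for column in columns:
            doc_dict[column].append(document.get(column))
        if d == n_rows_to_load:
            break
    return dict(doc_dict)


docs = [{"text": f"doc {i}"} for i in range(10)]
rows = collect_rows(docs, ["text"], n_rows_to_load=2)
# The break fires after appending document d == 2, so three rows survive.
```

With `n_rows_to_load=None` (the default), `d == None` is never true and every document is loaded.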
**Component test** (in `tests/`, new file, 35 additions):

```python
from pathlib import Path

import yaml
from fondant.core.component_spec import ComponentSpec

from src.main import LlamaHubReader


def test_arxiv_reader():
    """Test the component with the ArxivReader.

    This test requires a stable internet connection, both to download the loader, and to download
    the papers from Arxiv.
    """
    with open(Path(__file__).with_name("fondant_component.yaml")) as f:
        spec = yaml.safe_load(f)
        spec = ComponentSpec(spec)

    component = LlamaHubReader(
        spec=spec,
        loader_class="ArxivReader",
        loader_kwargs={},
        load_kwargs={
            "search_query": "jeff dean",
            "max_results": 5,
        },
        additional_requirements=["pypdf"],
        n_rows_to_load=None,
        index_column=None,
    )

    output_dataframe = component.load().compute()

    assert len(output_dataframe) > 0
    assert output_dataframe.columns.tolist() == ["text", "URL", "Title of this paper"]
```
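The test above passes `index_column=None`, so the component's default globally unique index applies. Its scheme — partition number, underscore, cumulative row count within the partition — can be sketched without Dask (`unique_ids` is a hypothetical helper mirroring `_set_unique_index`, not part of the component):

```python
def unique_ids(n_rows: int, partition_number: int) -> list:
    # Mirrors _set_unique_index: a column of ones is cumulatively summed
    # (yielding 1..n within each partition) and prefixed with the partition
    # number, so ids stay unique across the whole Dask dataframe.
    return [f"{partition_number}_{i}" for i in range(1, n_rows + 1)]


unique_ids(3, 0)  # ids for a first partition of three rows
```

Because the partition number is baked into each id, two partitions of equal length can never collide, which is what makes the index globally unique.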
**`components/load_with_llamahub/tests/fondant_component.yaml`** (new file, 50 additions):

```yaml
name: Load with LlamaHub
description: |
  Load data using a LlamaHub loader. For available loaders, check the
  [LlamaHub](https://llamahub.ai/).
image: ghcr.io/ml6team/load_with_llamahub:dev

produces:
  text:
    type: string
  URL:
    type: string
  Title of this paper:
    type: string

args:
  loader_class:
    description: |
      The name of the LlamaIndex loader class to use. Make sure to provide the name and not the
      id. The name is passed to `llama_index.download_loader` to download the specified loader.
    type: str
  loader_kwargs:
    description: |
      Keyword arguments to pass when instantiating the loader class. Check the documentation of
      the loader to check which arguments it accepts.
    type: str
  load_kwargs:
    description: |
      Keyword arguments to pass to the `.load()` method of the loader. Check the documentation of
      the loader to check which arguments it accepts.
    type: str
  additional_requirements:
    description: |
      Some loaders require additional dependencies to be installed. You can specify those here.
      Use a format accepted by `pip install`. Eg. "pypdf" or "pypdf==3.17.1". Unfortunately
      additional requirements for LlamaIndex loaders are not documented well, but if a dependency
      is missing, a clear error message will be thrown.
    type: list
    default: []
  n_rows_to_load:
    description: |
      Optional argument that defines the number of rows to load. Useful for testing pipeline runs
      on a small scale
    type: int
    default: None
  index_column:
    description: |
      Column to set index to in the load component, if not specified a default globally unique
      index will be set
    type: str
    default: None
```
**pytest configuration** (in `tests/`, new file, 2 additions):

```ini
[pytest]
pythonpath = ../src
```
**`tests/requirements.txt`** (new file, 1 addition):

```
pytest==7.4.2
```