From e87f490a0ee82bfa96f0f9ff3a0b1bc19d825b47 Mon Sep 17 00:00:00 2001
From: Matthias Richter
Date: Fri, 5 Apr 2024 09:54:54 +0200
Subject: [PATCH] Update dataset documentation (#918)

First part of documentation changes related to the new dataset interface.
---
 docs/dataset.md  | 181 +++++++++++++++++++++++++++++++++++++++++++
 docs/pipeline.md | 195 -----------------------------------------------
 2 files changed, 181 insertions(+), 195 deletions(-)
 create mode 100644 docs/dataset.md
 delete mode 100644 docs/pipeline.md

diff --git a/docs/dataset.md b/docs/dataset.md
new file mode 100644
index 00000000..f56e0b37
--- /dev/null
+++ b/docs/dataset.md
@@ -0,0 +1,181 @@
# Dataset

Fondant helps you build datasets by providing a set of operations to load, transform,
and write data. With Fondant, you can use both reusable components and custom components,
and chain them to create datasets.

## Build a dataset

Start by creating a `dataset.py` file and adding the following code.

```python
import pyarrow as pa

from fondant.dataset import Dataset

dataset = Dataset.create(
    "load_from_parquet",
    arguments={
        "dataset_uri": "path/to/dataset",
        "n_rows_to_load": 100,
    },
    produces={
        "text": pa.string()
    },
    dataset_name="my_dataset"
)
```

This code initializes a `Dataset` instance with a load component. The load component reads data
from an external location into the dataset; in this case, it loads 100 rows of text from a
Parquet file.

??? "View a detailed reference of the `Dataset.create()` method"

    ::: fondant.dataset.Dataset.create
        handler: python
        options:
            show_source: false

The `create` method does not execute your component yet, but adds the component to the execution
graph. It returns a lazy `Dataset` instance which you can use to chain transform components.

### Adding transform components

```python
from fondant.dataset import Resources

dataset = dataset.apply(
    "embed_text",
    resources=Resources(
        accelerator_number=1,
        accelerator_name="GPU",
    )
)
```

The `apply` method also returns a lazy `Dataset` which you can use to chain additional components.

It also provides additional configuration options for how to execute the component. You can, for
instance, provide a `Resources` definition to specify the hardware it should run on. In this case,
we want to leverage a GPU to run our embedding model. Depending on the runner, you can choose the
type of GPU as well.

[//]: # (TODO: Add section on Resources or a general API section)

??? "View a detailed reference of the `Dataset.apply()` method"

    ::: fondant.dataset.dataset.Dataset.apply
        handler: python
        options:
            show_source: false

### Adding a write component

The final step is to write our data to its destination.

```python
dataset = dataset.write(
    "write_to_hf_hub",
    arguments={
        "username": "user",
        "dataset_name": "dataset",
        "hf_token": "xxx",
    }
)
```

??? "View a detailed reference of the `Dataset.write()` method"

    ::: fondant.dataset.dataset.Dataset.write
        handler: python
        options:
            show_source: false

[//]: # (TODO: Add info on mapping fields between components)
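Putting these steps together, a complete `dataset.py` might look like the following sketch. It
simply chains the snippets shown above, so the component names and arguments are the same
illustrative ones; adapt them to the components you actually use.

```python
import pyarrow as pa

from fondant.dataset import Dataset, Resources

# Load: read 100 rows of text from a Parquet file into the dataset.
dataset = Dataset.create(
    "load_from_parquet",
    arguments={
        "dataset_uri": "path/to/dataset",
        "n_rows_to_load": 100,
    },
    produces={
        "text": pa.string(),
    },
    dataset_name="my_dataset",
)

# Transform: embed the text, requesting a GPU for the embedding model.
dataset = dataset.apply(
    "embed_text",
    resources=Resources(
        accelerator_number=1,
        accelerator_name="GPU",
    ),
)

# Write: push the resulting dataset to the Hugging Face Hub.
dataset = dataset.write(
    "write_to_hf_hub",
    arguments={
        "username": "user",
        "dataset_name": "dataset",
        "hf_token": "xxx",
    },
)
```

Executing this file only builds the execution graph; the data itself is processed when you
materialize the dataset with one of the runners described below.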
note "IMPORTANT" + + When using other runners you will need to make sure that your new environment has access to: + + - The base path of your pipeline (as mentioned above) + - The images used in your pipeline (make sure you have access to the registries where the images are + stored) + +=== "Console" + + === "Local" + + ```bash + fondant run local --working_directory + ``` + === "Vertex" + + ```bash + fondant run vertex \ + --project-id $PROJECT_ID \ + --project-region $PROJECT_REGION \ + --service-account $SERVICE_ACCOUNT \ + --working_directory + ``` + === "SageMaker" + + ```bash + fondant run sagemaker \ + --role-arn \ + --working_directory + ``` + === "Kubeflow" + + ```bash + fondant run kubeflow --working_directory + ``` + +=== "Python" + + === "Local" + + ```python + from fondant.dataset.runner import DockerRunner + + runner = DockerRunner() + runner.run(input=, working_directory=) + ``` + === "Vertex" + + ```python + from fondant.dataset.runner import VertexRunner + + runner = VertexRunner() + runner.run(input=, working_directory=) + ``` + === "SageMaker" + + ```python + from fondant.dataset.runner import SageMakerRunner + + runner = SageMakerRunner() + runner.run(input=,role_arn=, + working_directory=) + ``` + === "KubeFlow" + + ```python + from fondant.dataset.runner import KubeFlowRunner + + runner = KubeFlowRunner(host=) + runner.run(input=) + ``` + + The dataset ref can be a reference to the file containing your dataset, a variable + containing your dataset, or a factory function that will create your dataset. + + The working directory can be: + - **A remote cloud location (S3, GCS, Azure Blob storage):** + For the local runner, make sure that your local credentials or service account have read/write + access to the designated working directory and that you provide them to the dataset. + For the Vertex, Sagemaker, and Kubeflow runners, make sure that the service account + attached to those runners has read/write access. + - **A local directory:** only valid for the local runner, points to a local directory. + This is useful for local development. diff --git a/docs/pipeline.md b/docs/pipeline.md deleted file mode 100644 index bf4b6c19..00000000 --- a/docs/pipeline.md +++ /dev/null @@ -1,195 +0,0 @@ -# Dataset - -A Fondant Dataset is a checkpoint in a Directed Acyclic Graph -(DAG) of one or more different components that need to be executed. With Fondant, you can use both reusable -components and custom components, and chain them together. - - -[//]: # (TODO update this section once we have the workspace) -## Composing a Pipeline - -Start by creating a `pipeline.py` file and adding the following code. -```python -from fondant.dataset import Dataset - -#dataset = Dataset.read( -# .. -#) - -``` - -We identify our pipeline with a name and provide a base path where the pipeline will store its -data and artifacts. - -The base path can be: - -* **A remote cloud location (S3, GCS, Azure Blob storage)**: - For the **local runner**, make sure that your local credentials or service account have read/write - access to the designated base path and that you provide them to the pipeline. - For the **Vertex**, **Sagemaker**, and **Kubeflow** runners, make sure that the service account - attached to those runners has read/write access. -* **A local directory**: only valid for the local runner, points to a local directory. This is - useful for local development. - - -### Adding a load component - -You can read data into your pipeline by using the `Dataset.read()` method with a load component. 
- -```python -dataset = Dataset.read( - "load_from_parquet", - arguments={ - "dataset_uri": "path/to/dataset", - "n_rows_to_load": 100, - }, -) -``` -[//]: # (TODO: Add example of init from manifest) - -??? "View a detailed reference of the `Dataset.read()` method" - - ::: fondant.dataset.Dataset.read - handler: python - options: - show_source: false - -The read method does not execute your component yet, but adds the component to the pipeline -graph. It returns a lazy `Dataset` instance which you can use to chain transform components. - -### Adding transform components - -```python -from fondant.dataset import Resources - -dataset = dataset.apply( - "embed_text", - resources=Resources( - accelerator_number=1, - accelerator_name="GPU", - ) -) -``` - -The `apply` method also returns a lazy `Dataset` which you can use to chain additional components. - -The `apply` method also provides additional configuration options on how to execute the component. -You can for instance provide a `Resources` definition to specify the hardware it should run on. -In this case, we want to leverage a GPU to run our embedding model. Depending on the runner, you -can choose the type of GPU as well. - -[//]: # (TODO: Add section on Resources or a general API section) - -??? "View a detailed reference of the `Dataset.apply()` method" - - ::: fondant.dataset.Dataset.apply - handler: python - options: - show_source: false - -### Adding a write component - -The final step is to write our data to its destination. - -```python -dataset = dataset.write( - "write_to_hf_hub", - arguments={ - "username": "user", - "dataset_name": "dataset", - "hf_token": "xxx", - } -) -``` - -??? "View a detailed reference of the `Dataset.write()` method" - - ::: fondant.dataset.Dataset.write - handler: python - options: - show_source: false - -!!! note "IMPORTANT" - - Currently Fondant supports linear DAGs with single dependencies. Support for non-linear DAGs - will be available in future releases. - -[//]: # (TODO: Add info on mapping fields between components) - -## Running a pipeline - -Once all your components are added to your pipeline you can use different runners to run your -pipeline. - -!!! 
note "IMPORTANT" - - When using other runners you will need to make sure that your new environment has access to: - - - The base path of your pipeline (as mentioned above) - - The images used in your pipeline (make sure you have access to the registries where the images are - stored) - -=== "Console" - - === "Local" - - ```bash - fondant run local - ``` - === "Vertex" - - ```bash - fondant run vertex \ - --project-id $PROJECT_ID \ - --project-region $PROJECT_REGION \ - --service-account $SERVICE_ACCOUNT - ``` - === "SageMaker" - - ```bash - fondant run sagemaker \ - --role-arn - ``` - === "Kubeflow" - - ```bash - fondant run kubeflow - ``` - -=== "Python" - - === "Local" - - ```python - from fondant.pipeline.runner import DockerRunner - - runner = DockerRunner() - runner.run(input=) - ``` - === "Vertex" - - ```python - from fondant.pipeline.runner import VertexRunner - - runner = VertexRunner() - runner.run(input=) - ``` - === "SageMaker" - - ```python - from fondant.pipeline.runner import SageMakerRunner - - runner = SageMakerRunner() - runner.run(input=, pipeline_name= role_arn=) - ``` - === "KubeFlow" - - ```python - from fondant.pipeline.runner import KubeFlowRunner - - runner = KubeFlowRunner(host=) - runner.run(input=) - ``` - - The pipeline ref can be a reference to the file containing your pipeline, a variable - containing your pipeline, or a factory function that will create your pipeline.