Create simple storage writer #826
Merged

Changes from all commits (19 commits)
- dfafd91 Init component (mrchtr)
- bb8702a Add write to file component (mrchtr)
- 870937b Update docstrings (mrchtr)
- 8eca440 Addressing comments (mrchtr)
- 5269fb5 Merge branch 'main' into feature/create-simple-storage-writer (mrchtr)
- 690c6ac Merge branch 'main' into feature/create-simple-storage-writer (mrchtr)
- c402d46 Addressing comments (mrchtr)
- c058164 Update components/write_to_file/fondant_component.yaml (mrchtr)
- 270e01f Update components/write_to_file/src/main.py (mrchtr)
- ac2ea83 Update components/write_to_file/fondant_component.yaml (mrchtr)
- c15c6b6 Addressing comments (mrchtr)
- c2caa3f Merge branch 'main' into feature/create-simple-storage-writer (mrchtr)
- f3dfb89 Addressing comments (mrchtr)
- db49966 Addressing comments (mrchtr)
- 848fb7f Merge branch 'main' into feature/create-simple-storage-writer (mrchtr)
- 2b2e196 Update readme (mrchtr)
- 5f2d741 Addressing comments (mrchtr)
- c4ee2ff Addressing comments (mrchtr)
- d4902b8 Addressing comments (mrchtr)
Dockerfile
@@ -0,0 +1,28 @@
FROM --platform=linux/amd64 python:3.10-slim as base

# System dependencies
RUN apt-get update && \
    apt-get upgrade -y && \
    apt-get install git -y

# Install requirements
COPY requirements.txt /
RUN pip3 install --no-cache-dir -r requirements.txt

# Install Fondant
# This is split from other requirements to leverage caching
ARG FONDANT_VERSION=main
RUN pip3 install fondant[component,aws,azure,gcp]@git+https://github.com/ml6team/fondant@${FONDANT_VERSION}

# Set the working directory to the component folder
WORKDIR /component
COPY src/ src/

FROM base as test
COPY tests/ tests/
RUN pip3 install --no-cache-dir -r tests/requirements.txt
RUN python -m pytest tests

FROM base
WORKDIR /component/src
ENTRYPOINT ["fondant", "execute", "main"]
README.md
@@ -0,0 +1,75 @@
# Write to file

<a id="write_to_file#description"></a>
## Description
A Fondant component to write a dataset to file on a local machine or to a cloud storage bucket. The dataset can be written as csv or parquet.

<a id="write_to_file#inputs_outputs"></a>
## Inputs / outputs

<a id="write_to_file#consumes"></a>
### Consumes

**This component can consume additional fields**
- <field_name>: <dataset_field_name>
This defines a mapping to update the fields consumed by the operation as defined in the component spec.
The keys are the names of the fields to be received by the component, while the values are
the name of the field to map from the input dataset

See the usage example below on how to define a field name for additional fields.

<a id="write_to_file#produces"></a>
### Produces

**This component does not produce data.**

<a id="write_to_file#arguments"></a>
## Arguments

The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| path | str | Path to store the dataset, whether it's a local path or a cloud storage bucket, must be specified. A separate filename will be generated for each partition. If you are using the local runner and export the data to a local directory, ensure that you mount the path to the directory using the `--extra-volumes` argument. | / |
| format | str | Format for storing the dataframe can be either `csv` or `parquet`. As default `parquet` is used. The CSV files contain the column as a header and use a comma as a delimiter. | parquet |

<a id="write_to_file#usage"></a>
## Usage

You can add this component to your pipeline using the following code:

```python
from fondant.pipeline import Pipeline


pipeline = Pipeline(...)

dataset = pipeline.read(...)

dataset = dataset.apply(...)

dataset.write(
    "write_to_file",
    arguments={
        # Add arguments
        # "path": ,
        # "format": "parquet",
    },
    consumes={
        <field_name>: <dataset_field_name>,
        ..., # Add fields
    },
)
```

<a id="write_to_file#testing"></a>
## Testing

You can run the tests using docker with BuildKit. From this directory, run:
```
docker build . --target test
```
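To make the placeholders in the usage snippet above concrete, here is a minimal sketch. Only the component name, the argument names, and the shape of the `consumes` mapping come from the README; the field name `text`, the output path, and the chosen format are hypothetical values picked for illustration.

```python
from fondant.pipeline import Pipeline

pipeline = Pipeline(...)  # pipeline setup elided, as in the README example

dataset = pipeline.read(...)   # assumed to yield a dataset with a "text" field
dataset = dataset.apply(...)

dataset.write(
    "write_to_file",
    arguments={
        "path": "/data/output",  # hypothetical target directory
        "format": "csv",         # or "parquet", the default
    },
    consumes={
        # component field name mapped from a dataset field name (hypothetical)
        "text": "text",
    },
)
```

As the `path` description notes, a local target directory like this one has to be mounted into the component with `--extra-volumes` when using the local runner.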
components/write_to_file/fondant_component.yaml
@@ -0,0 +1,26 @@
name: Write to file
description: >-
  A Fondant component to write a dataset to file on a local machine or to a cloud storage bucket.
  The dataset can be written as csv or parquet.
image: 'fndnt/write_to_file:dev'
tags:
  - Data writing

consumes:
  additionalProperties: true

args:
  path:
    description: >-
      Path to store the dataset, whether it's a local path or a cloud storage bucket,
      must be specified. A separate filename will be generated for each partition.
      If you are using the local runner and export the data to a local directory,
      ensure that you mount the path to the directory using the `--extra-volumes` argument.
    type: str
  format:
    description: >-
      Format for storing the dataframe can be either `csv` or `parquet`. As default
      `parquet` is used.
      The CSV files contain the column as a header and use a comma as a delimiter.
    type: str
    default: parquet
Empty file (a new, empty file added by this PR).
components/write_to_file/src/main.py
@@ -0,0 +1,29 @@
import dask.dataframe as dd
from fondant.component import DaskWriteComponent


class WriteToFile(DaskWriteComponent):
    def __init__(self, *, path: str, format: str):
        """Initialize the write to file component."""
        self.path = path
        self.format = format

    def write(self, dataframe: dd.DataFrame) -> None:
        """
        Writes the data from the given Dask DataFrame to a file either locally or
        to a remote storage bucket.

        Args:
            dataframe (dd.DataFrame): The Dask DataFrame containing the data to be written.
        """
        if self.format.lower() == "csv":
            self.path = self.path + "/export-*.csv"
            dataframe.to_csv(self.path)
        elif self.format.lower() == "parquet":
            dataframe.to_parquet(self.path)
        else:
            msg = (
                f"Not supported file format {self.format}. Writing to file is only "
                f"supported for `csv` and `parquet`."
            )
            raise ValueError(msg)
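A short aside on the CSV branch above: because `/export-*.csv` is appended to the configured path, Dask's `to_csv` replaces the `*` with the partition index and writes one file per partition (`export-0.csv`, `export-1.csv`, ...), which is what the tests below rely on. This is standard Dask behavior rather than anything specific to this component; a minimal sketch:

```python
import dask.dataframe as dd
import pandas as pd

# Two partitions produce two files; the "*" is replaced by the partition index.
ddf = dd.from_pandas(pd.DataFrame({"text": ["a", "b", "c", "d"]}), npartitions=2)
written = ddf.to_csv("/tmp/output/export-*.csv", index=False)
print(written)  # e.g. ['/tmp/output/export-0.csv', '/tmp/output/export-1.csv']
```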
pytest.ini
@@ -0,0 +1,2 @@
[pytest]
pythonpath = ../src
tests/requirements.txt
@@ -0,0 +1 @@
pytest==7.4.2
Component tests
@@ -0,0 +1,58 @@
import tempfile

import dask.dataframe as dd
import pandas as pd

from src.main import WriteToFile


def test_write_to_csv():
    """Test case for write to file component."""
    with tempfile.TemporaryDirectory() as tmpdir:
        entries = 10

        dask_dataframe = dd.DataFrame.from_dict(
            {
                "text": [
                    "Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo",
                ]
                * entries,
            },
            npartitions=1,
        )

        component = WriteToFile(
            path=tmpdir,
            format="csv",
        )

        component.write(dask_dataframe)

        df = pd.read_csv(tmpdir + "/export-0.csv")
        assert len(df) == entries


def test_write_to_parquet():
    """Test case for write to file component."""
    with tempfile.TemporaryDirectory() as tmpdir:
        entries = 10

        dask_dataframe = dd.DataFrame.from_dict(
            {
                "text": [
                    "Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo",
                ]
                * entries,
            },
            npartitions=1,
        )

        component = WriteToFile(
            path=tmpdir,
            format="parquet",
        )

        component.write(dask_dataframe)

        ddf = dd.read_parquet(tmpdir)
        assert len(ddf) == entries
(Two further changed files in this PR are not rendered here.)
Review comment: Correct, but it might confuse users. Maybe word it like: "This component does not produce a dataframe for downstream processing."