Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create simple storage writer #826

Merged
merged 19 commits into from
Feb 2, 2024
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
28 changes: 28 additions & 0 deletions components/write_to_file/Dockerfile
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
FROM --platform=linux/amd64 python:3.8-slim as base
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to start using newer python versions since 3.8 will reach end-of-life before you know it. (https://devguide.python.org/versions/)


# System dependencies
RUN apt-get update && \
apt-get upgrade -y && \
apt-get install git -y

# Install requirements
COPY requirements.txt /
RUN pip3 install --no-cache-dir -r requirements.txt

# Install Fondant
# This is split from other requirements to leverage caching
ARG FONDANT_VERSION=main
RUN pip3 install fondant[component,aws,azure,gcp]@git+https://github.com/ml6team/fondant@${FONDANT_VERSION}

# Set the working directory to the component folder
WORKDIR /component
COPY src/ src/

FROM base as test
COPY tests/ tests/
RUN pip3 install --no-cache-dir -r tests/requirements.txt
RUN python -m pytest tests

FROM base
WORKDIR /component/src
ENTRYPOINT ["fondant", "execute", "main"]
21 changes: 21 additions & 0 deletions components/write_to_file/fondant_component.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
name: Write to file
description: >-
A Fondant component to write a dataset to file on your local machine or to a cloud storage bucket.
mrchtr marked this conversation as resolved.
Show resolved Hide resolved
The dataset can be written as csv or parquet.
image: 'fndnt/write_to_file:dev'
tags:
- Data writing

consumes:
additionalProperties: true

args:
path:
description: >-
Path to store the dataset, whether it's a local path or a cloud storage bucket,
must be specified. A separate filename will be generated for each partition.
mrchtr marked this conversation as resolved.
Show resolved Hide resolved
type: str
format:
description: Format for storing the dataframe can be either csv or parquet.
mrchtr marked this conversation as resolved.
Show resolved Hide resolved
type: str
default: csv
mrchtr marked this conversation as resolved.
Show resolved Hide resolved
Empty file.
29 changes: 29 additions & 0 deletions components/write_to_file/src/main.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
import dask.dataframe as dd
from fondant.component import DaskWriteComponent


class WriteToFile(DaskWriteComponent):
def __init__(self, *, path: str, format: str):
"""Initialize the write to file component."""
self.path = path
self.format = format

def write(self, dataframe: dd.DataFrame) -> None:
"""
Writes the data from the given Dask DataFrame to a file either locally or
to a remote storage bucket.

Args:
dataframe (dd.DataFrame): The Dask DataFrame containing the data to be written.
"""
if self.format.lower() == "csv":
mrchtr marked this conversation as resolved.
Show resolved Hide resolved
self.path = self.path + "/export-*.csv"
dataframe.to_csv(self.path)
elif self.format.lower() == "parquet":
dataframe.to_parquet(self.path)
else:
msg = (
f"Not supported file format {self.format}. Writing to file is only "
f"supported for csv and parquet."
mrchtr marked this conversation as resolved.
Show resolved Hide resolved
)
raise ValueError(msg)
2 changes: 2 additions & 0 deletions components/write_to_file/tests/pytest.ini
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
[pytest]
pythonpath = ../src
1 change: 1 addition & 0 deletions components/write_to_file/tests/requirements.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
pytest==7.4.2
58 changes: 58 additions & 0 deletions components/write_to_file/tests/write_to_file_test.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,58 @@
import tempfile

import dask.dataframe as dd
import pandas as pd

from src.main import WriteToFile


def test_write_to_csv():
"""Test case for write to file component."""
with tempfile.TemporaryDirectory() as tmpdir:
entries = 10

dask_dataframe = dd.DataFrame.from_dict(
{
"text": [
"Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo",
]
* entries,
},
npartitions=1,
)

component = WriteToFile(
path=tmpdir,
format="csv",
)

component.write(dask_dataframe)

df = pd.read_csv(tmpdir + "/export-0.csv")
assert len(df) == entries


def test_write_to_parquet():
"""Test case for write to file component."""
with tempfile.TemporaryDirectory() as tmpdir:
entries = 10

dask_dataframe = dd.DataFrame.from_dict(
{
"text": [
"Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo",
]
* entries,
},
npartitions=1,
)

component = WriteToFile(
path=tmpdir,
format="parquet",
)

component.write(dask_dataframe)

ddf = dd.read_parquet(tmpdir)
assert len(ddf) == entries
Loading