Create simple storage writer #826

Merged on Feb 2, 2024 (19 commits)

Contributor

@mrchtr mrchtr commented Jan 30, 2024

Write component that writes a dask dataframe to a file, either csv or parquet.

Fix #824

Member

@RobbeSneyders RobbeSneyders left a comment

Thanks @mrchtr!

Could you add it to the integration test as well?

@RobbeSneyders
Member

The component is missing a README which makes the docs build fail. Not sure why this is not caught by the pre-commit hook?

Contributor

@PhilippeMoussalli PhilippeMoussalli left a comment

Thanks Matthias! Left a few comments

components/write_to_file/fondant_component.yaml (outdated, resolved)
components/write_to_file/fondant_component.yaml (outdated, resolved)
components/write_to_file/src/main.py (resolved)
components/write_to_file/src/main.py (outdated, resolved)
@mrchtr
Contributor Author

mrchtr commented Jan 31, 2024

> The component is missing a README which makes the docs build fail. Not sure why this is not caught by the pre-commit hook?

The README was generated, but I forgot to push it.

Collaborator

@GeorgesLorre GeorgesLorre left a comment

Looking good!

@@ -0,0 +1,28 @@
FROM --platform=linux/amd64 python:3.8-slim as base
Collaborator

We might want to start using newer Python versions, since 3.8 will reach end-of-life before you know it (https://devguide.python.org/versions/).

### Produces


**This component does not produce data.**
Collaborator

Correct, but it might confuse users.

Maybe word it like: This component does not produce a dataframe for downstream processing.

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| path | str | Path where the dataset is stored; required. Can be a local path or a cloud storage bucket. A separate file is generated for each partition. If you use the local runner and export the data to a local directory, make sure to mount that directory using the `--extra-volumes` argument. | / |
| format | str | Format in which to store the dataframe: either `csv` or `parquet`. | csv |
Collaborator

Does the CSV writer include headers? Does it use a comma as the delimiter? Should we allow users to specify this?

Contributor Author

The CSV is written with headers and is comma-separated. I'll add that to the description.
I would not make this configurable. I see this one more as an example component. For more custom export methods I would recommend implementing a Python component instead, no?
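
For anyone consuming the exported files, the header and delimiter behavior described above can be checked with a few lines of standard-library Python. This is only a sketch: `read_partition_header`, `demo`, and the sample file name `export-0.csv` are hypothetical and not part of the component.

```python
import csv
import tempfile
from pathlib import Path
from typing import List


def read_partition_header(path: Path) -> List[str]:
    """Return the first row of a comma-separated file, assumed to be the header."""
    with path.open(newline="") as handle:
        return next(csv.reader(handle, delimiter=","))


def demo() -> List[str]:
    """Write a small stand-in partition file and read its header back."""
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "export-0.csv"
        sample.write_text("id,text\n1,hello\n2,world\n")
        return read_partition_header(sample)


if __name__ == "__main__":
    print(demo())
```

If the returned list matches the expected column names, the file carries a header row and uses a comma as the delimiter.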

components/write_to_file/fondant_component.yaml (outdated, resolved)
examples/sample_pipeline/README.md (resolved)
@GeorgesLorre
Collaborator

Wondering: did you test it with a remote path?

@mrchtr
Contributor Author

mrchtr commented Feb 2, 2024

> Wondering: did you test it with a remote path?

I've tested it quickly and it is working :)

Collaborator

@GeorgesLorre GeorgesLorre left a comment

Thanks!

@mrchtr mrchtr merged commit c4cd3d3 into main Feb 2, 2024
11 checks passed
@mrchtr mrchtr deleted the feature/create-simple-storage-writer branch February 2, 2024 13:23
RobbeSneyders added a commit that referenced this pull request Feb 5, 2024
Added `write_to_file` to both getting started guides. #826 already added
the component to the example pipeline.

fix #825

---------

Co-authored-by: Robbe Sneyders <[email protected]>
Development

Successfully merging this pull request may close these issues.

Create simple storage writer
4 participants