[Docs] Document clarifying notes about the data lifecycle #5922

Merged
41 changes: 23 additions & 18 deletions docs/user_guide/concepts/main_concepts/data_management.rst
@@ -159,17 +159,6 @@ Between Tasks

.. image:: https://raw.githubusercontent.com/flyteorg/static-resources/9cb3d56d7f3b88622749b41ff7ad2d3ebce92726/flyte/concepts/data_movement/flyte_data_transfer.png


Bringing in Your Own Datastores for Raw Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Flytekit has a pluggable data persistence layer, selected by the protocol prefix of the storage path.
For example, it is theoretically possible to use S3 (``s3://``) for metadata and GCS (``gcs://``) for raw data. It is also possible to create your own protocol, such as ``my_fs://``, to change how data is stored and accessed.
Metadata, however, must be stored where the Flyte control plane can access it.

Data persistence is also pluggable. By default, it supports all major blob stores and uses an interface defined in Flytestdlib.

Practical Example
Contributor

Not sure how this example connects with the content immediately above. This is an example of data movement, which is good, so maybe it would be better placed at the beginning of the page?

Contributor Author

Hi @davidmirror-ops, I think you are right. The example section in this PR is indeed more relevant to data movement. To reduce ambiguity, I’ve moved the example section directly below the data movement section and relocated the Own Datastores section to the bottom of the page. Let me know if you have any further suggestions, thanks! doc-link

~~~~~~~~~~~~~~~~~

@@ -180,19 +169,18 @@ The first task reads a file from the object store, shuffles the data, saves to l
 .. code-block:: python

     @task()
-    def task_remove_column(input_file: FlyteFile, column_name: str) -> FlyteFile:
+    def task_read_and_shuffle_file(input_file: FlyteFile) -> FlyteFile:
         """
-        Reads the input file as a DataFrame, removes a specified column, and outputs it as a new file.
+        Reads the input file as a DataFrame, shuffles the rows, and writes the shuffled DataFrame to a new file.
         """
         input_file.download()
         df = pd.read_csv(input_file.path)

-        # remove column
-        if column_name in df.columns:
-            df = df.drop(columns=[column_name])
+        # Shuffle the DataFrame rows
+        shuffled_df = df.sample(frac=1).reset_index(drop=True)

-        output_file_path = "data_finished.csv"
-        df.to_csv(output_file_path, index=False)
+        output_file_path = "data_shuffle.csv"
+        shuffled_df.to_csv(output_file_path, index=False)

         return FlyteFile(output_file_path)
     ...
@@ -241,3 +229,20 @@ First task output metadata:
Second task input metadata:

.. image:: https://raw.githubusercontent.com/flyteorg/static-resources/9cb3d56d7f3b88622749b41ff7ad2d3ebce92726/flyte/concepts/data_movement/flyte_data_movement_example_input.png

Bringing in Your Own Datastores for Raw Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Flytekit has a pluggable data persistence layer, selected by the protocol prefix of the storage path.
For example, it is theoretically possible to use S3 (``s3://``) for metadata and GCS (``gcs://``) for raw data. It is also possible to create your own protocol, such as ``my_fs://``, to change how data is stored and accessed.
Metadata, however, must be stored where the Flyte control plane can access it.

Data persistence is also pluggable. By default, it supports all major blob stores and uses an interface defined in Flytestdlib.
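
As a rough illustration of what protocol-driven persistence means, the sketch below picks a handler based on the prefix before ``://``. It is not Flytekit's actual plugin API: ``register_protocol``, ``read``, and the reader functions are hypothetical names made up for this example, and the readers only return a descriptive string instead of touching any storage.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a protocol name to a reader function.
# This only mimics the idea of protocol-based dispatch; it is NOT Flytekit's API.
_READERS: Dict[str, Callable[[str], str]] = {}


def register_protocol(protocol: str, reader: Callable[[str], str]) -> None:
    """Associate a protocol prefix (for example "s3") with a reader."""
    _READERS[protocol] = reader


def read(uri: str) -> str:
    """Dispatch to the reader registered for the URI's protocol."""
    # Split on "://" by hand: urllib.parse rejects schemes that contain "_".
    protocol, separator, _ = uri.partition("://")
    if not separator:
        protocol = "file"  # assumption: bare paths are treated as local files
    if protocol not in _READERS:
        raise ValueError(f"no persistence plugin registered for {protocol!r}")
    return _READERS[protocol](uri)


# Register one built-in style protocol and one custom protocol.
register_protocol("s3", lambda uri: f"read {uri} via the S3 plugin")
register_protocol("my_fs", lambda uri: f"read {uri} via the custom plugin")

print(read("s3://bucket/metadata/outputs.pb"))
print(read("my_fs://volume/raw/data.csv"))
```

The point of the design is that calling code never names a storage backend; swapping ``s3://`` for ``my_fs://`` in a path is enough to change where the data lives.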

Deleting Raw Data in Your Own Datastores
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Flyte does not offer a direct function to delete raw data stored in external datastores like ``S3`` or ``GCS``. However, you can manage deletion by configuring a lifecycle policy within your datastore service.

If caching is enabled in your Flyte ``task``, ensure that ``max-cache-age`` is shorter than the expiration period of the lifecycle policy in your datastore. Otherwise, a cache hit may refer to raw data that the datastore has already deleted, leading to data inconsistency issues.
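
The relationship between the two settings can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: the dictionary follows the general shape of an S3 lifecycle configuration, but the rule ID, the ``raw-data/`` prefix, and the 30-day expiration are illustrative values, and ``shortest_expiration`` and ``cache_age_is_safe`` are hypothetical helpers, not part of Flyte.

```python
from datetime import timedelta

# Illustrative lifecycle configuration in the general shape used by S3.
# The rule ID, prefix, and 30-day expiration are assumptions for this example.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "expire-flyte-raw-data",
            "Filter": {"Prefix": "raw-data/"},
            "Status": "Enabled",
            "Expiration": {"Days": 30},
        }
    ]
}


def shortest_expiration(config: dict) -> timedelta:
    """Return the shortest day-based expiration among enabled rules."""
    days = [
        rule["Expiration"]["Days"]
        for rule in config["Rules"]
        if rule.get("Status") == "Enabled" and "Days" in rule.get("Expiration", {})
    ]
    if not days:
        raise ValueError("no enabled rule with a day-based expiration")
    return timedelta(days=min(days))


def cache_age_is_safe(max_cache_age: timedelta, config: dict) -> bool:
    """True if cached entries expire before the datastore deletes their data."""
    return max_cache_age < shortest_expiration(config)


print(cache_age_is_safe(timedelta(days=7), lifecycle_configuration))
print(cache_age_is_safe(timedelta(days=45), lifecycle_configuration))
```

With a 30-day expiration, a 7-day cache age passes the check and a 45-day cache age fails it, because cache entries could then outlive the raw data they point to.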