From ccf509b56f21a2329a22a86f093afcb54caba48d Mon Sep 17 00:00:00 2001
From: Alex Wu
Date: Sun, 27 Oct 2024 00:59:54 +0800
Subject: [PATCH 1/4] add information about deleting raw data in data_management.rst

Signed-off-by: Alex Wu
---
 docs/user_guide/concepts/main_concepts/data_management.rst | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/docs/user_guide/concepts/main_concepts/data_management.rst b/docs/user_guide/concepts/main_concepts/data_management.rst
index bc492a56f8..444dc10da2 100644
--- a/docs/user_guide/concepts/main_concepts/data_management.rst
+++ b/docs/user_guide/concepts/main_concepts/data_management.rst
@@ -170,6 +170,13 @@ But for Metadata, the data should be accessible to Flyte control plane.
 
 Data persistence is also pluggable. By default, it supports all major blob stores and uses an interface defined in Flytestdlib.
 
+Deleting Raw Data in Your Own Datastores
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Flyte does not offer a direct function to delete raw data stored in external datastores like ``S3`` or ``GCS``. However, you can manage deletion by configuring a lifecycle policy within your datastore service.
+
+If caching is enabled in your Flyte ``task``, ensure that the ``max-cache-age`` is set to be shorter than the lifecycle policy in your datastore to prevent potential data inconsistency issues.
+
 Practical Example
 ~~~~~~~~~~~~~~~~~
 

From f086562a5e59a612da815a1b26b7d84a8e4d0781 Mon Sep 17 00:00:00 2001
From: Alex Wu
Date: Sun, 27 Oct 2024 01:12:55 +0800
Subject: [PATCH 2/4] fix example code error

Signed-off-by: Alex Wu
---
 .../concepts/main_concepts/data_management.rst | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/docs/user_guide/concepts/main_concepts/data_management.rst b/docs/user_guide/concepts/main_concepts/data_management.rst
index 444dc10da2..7f3a423780 100644
--- a/docs/user_guide/concepts/main_concepts/data_management.rst
+++ b/docs/user_guide/concepts/main_concepts/data_management.rst
@@ -186,20 +186,19 @@ The first task reads a file from the object store, shuffles the data, saves to l
 
 .. code-block:: python
 
-    @task()
-    def task_remove_column(input_file: FlyteFile, column_name: str) -> FlyteFile:
+    @task(container_image=basic_image, cache=True, cache_version="1.0")
+    def task_read_and_shuffle_file(input_file: FlyteFile) -> FlyteFile:
         """
-        Reads the input file as a DataFrame, removes a specified column, and outputs it as a new file.
+        Reads the input file as a DataFrame, shuffles the rows, and writes the shuffled DataFrame to a new file.
         """
         input_file.download()
         df = pd.read_csv(input_file.path)
 
-        # remove column
-        if column_name in df.columns:
-            df = df.drop(columns=[column_name])
+        # Shuffle the DataFrame rows
+        shuffled_df = df.sample(frac=1).reset_index(drop=True)
 
-        output_file_path = "data_finished.csv"
-        df.to_csv(output_file_path, index=False)
+        output_file_path = "data_shuffle.csv"
+        shuffled_df.to_csv(output_file_path, index=False)
         return FlyteFile(output_file_path)
     ...
 

From 3d881879779fc88167d12b62fd126c9e67622b6e Mon Sep 17 00:00:00 2001
From: Alex Wu
Date: Sun, 27 Oct 2024 09:54:28 +0800
Subject: [PATCH 3/4] delete example code task decorator arguments

Signed-off-by: Alex Wu
---
 docs/user_guide/concepts/main_concepts/data_management.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/user_guide/concepts/main_concepts/data_management.rst b/docs/user_guide/concepts/main_concepts/data_management.rst
index 7f3a423780..81f86bb1c0 100644
--- a/docs/user_guide/concepts/main_concepts/data_management.rst
+++ b/docs/user_guide/concepts/main_concepts/data_management.rst
@@ -186,7 +186,7 @@ The first task reads a file from the object store, shuffles the data, saves to l
 
 .. code-block:: python
 
-    @task(container_image=basic_image, cache=True, cache_version="1.0")
+    @task()
     def task_read_and_shuffle_file(input_file: FlyteFile) -> FlyteFile:
         """
         Reads the input file as a DataFrame, shuffles the rows, and writes the shuffled DataFrame to a new file.

From 18719e215bbe9e89f7dabe4798d4a34cd9a8e0bd Mon Sep 17 00:00:00 2001
From: Alex Wu
Date: Thu, 31 Oct 2024 16:40:17 +0800
Subject: [PATCH 4/4] adjust the location of own datastores related information

Signed-off-by: Alex Wu
---
 .../main_concepts/data_management.rst | 35 +++++++++----------
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/docs/user_guide/concepts/main_concepts/data_management.rst b/docs/user_guide/concepts/main_concepts/data_management.rst
index 81f86bb1c0..6bb6eee730 100644
--- a/docs/user_guide/concepts/main_concepts/data_management.rst
+++ b/docs/user_guide/concepts/main_concepts/data_management.rst
@@ -159,24 +159,6 @@ Between Tasks
 
 .. image:: https://raw.githubusercontent.com/flyteorg/static-resources/9cb3d56d7f3b88622749b41ff7ad2d3ebce92726/flyte/concepts/data_movement/flyte_data_transfer.png
 
-
-Bringing in Your Own Datastores for Raw Data
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Flytekit has a pluggable data persistence layer.
-This is driven by PROTOCOL.
-For example, it is theoretically possible to use S3 ``s3://`` for metadata and GCS ``gcs://`` for raw data. It is also possible to create your own protocol ``my_fs://``, to change how data is stored and accessed.
-But for Metadata, the data should be accessible to Flyte control plane.
-
-Data persistence is also pluggable. By default, it supports all major blob stores and uses an interface defined in Flytestdlib.
-
-Deleting Raw Data in Your Own Datastores
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Flyte does not offer a direct function to delete raw data stored in external datastores like ``S3`` or ``GCS``. However, you can manage deletion by configuring a lifecycle policy within your datastore service.
-
-If caching is enabled in your Flyte ``task``, ensure that the ``max-cache-age`` is set to be shorter than the lifecycle policy in your datastore to prevent potential data inconsistency issues.
-
 Practical Example
 ~~~~~~~~~~~~~~~~~
 
@@ -247,3 +229,20 @@ First task output metadata:
 Second task input metadata:
 
 .. image:: https://raw.githubusercontent.com/flyteorg/static-resources/9cb3d56d7f3b88622749b41ff7ad2d3ebce92726/flyte/concepts/data_movement/flyte_data_movement_example_input.png
+
+Bringing in Your Own Datastores for Raw Data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Flytekit has a pluggable data persistence layer.
+This is driven by PROTOCOL.
+For example, it is theoretically possible to use S3 ``s3://`` for metadata and GCS ``gcs://`` for raw data. It is also possible to create your own protocol ``my_fs://``, to change how data is stored and accessed.
+But for Metadata, the data should be accessible to Flyte control plane.
+
+Data persistence is also pluggable. By default, it supports all major blob stores and uses an interface defined in Flytestdlib.
+
+Deleting Raw Data in Your Own Datastores
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Flyte does not offer a direct function to delete raw data stored in external datastores like ``S3`` or ``GCS``. However, you can manage deletion by configuring a lifecycle policy within your datastore service.
+
+If caching is enabled in your Flyte ``task``, ensure that the ``max-cache-age`` is set to be shorter than the lifecycle policy in your datastore to prevent potential data inconsistency issues.
\ No newline at end of file
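
Review note on the "Deleting Raw Data in Your Own Datastores" section added in PATCH 4/4: the constraint between the cache age and the lifecycle policy may be easier to follow with a small illustration. The following is only a sketch under stated assumptions, not part of the patch and not a Flyte API. The helper names `build_expiration_rule` and `cache_age_is_safe`, the `raw-data/` prefix, and the 30-day and 7-day values are all hypothetical. The dictionary is shaped like an S3 lifecycle configuration with a single expiration rule; other datastores use different formats.

```python
from datetime import timedelta


def build_expiration_rule(prefix: str, expire_after_days: int) -> dict:
    """Build an S3-style lifecycle configuration that expires objects under `prefix`.

    Hypothetical helper for illustration; it only constructs a dictionary
    and does not talk to any datastore.
    """
    if expire_after_days <= 0:
        raise ValueError("expire_after_days must be positive")
    return {
        "Rules": [
            {
                "ID": f"expire-{prefix.strip('/') or 'all'}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


def cache_age_is_safe(max_cache_age: timedelta, lifecycle_config: dict) -> bool:
    """Return True if the cache age is strictly shorter than the earliest expiration.

    Hypothetical check mirroring the advice in the section: cached results
    should stop being reused before the datastore deletes the raw data.
    """
    shortest_days = min(
        rule["Expiration"]["Days"] for rule in lifecycle_config["Rules"]
    )
    return max_cache_age < timedelta(days=shortest_days)


# Assumed values: raw data expires after 30 days.
config = build_expiration_rule("raw-data/", expire_after_days=30)
print(cache_age_is_safe(timedelta(days=7), config))   # cache age well below expiration
print(cache_age_is_safe(timedelta(days=45), config))  # cache age outlives the raw data
```

In the first call the cache entry would be dropped long before the raw data expires. In the second, a cache hit could point at an object the lifecycle policy has already deleted, which is the inconsistency the section warns about.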