Skip to content
This repository has been archived by the owner on Oct 23, 2023. It is now read-only.

Allow different modes of data download and upload in flyte co-pilot #65

Merged
merged 2 commits into from
May 26, 2020

Conversation

kumare3
Copy link
Contributor

@kumare3 kumare3 commented May 22, 2020

TL;DR

Blob Handling in CoPilot should have different options. We will implement only one in the beginning) but this provides ease of extensibility

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code documentation added
  • Any pending items have an associated Issue

Complete description

Eventually, we want to support Blob download as a stream (overlap with processing), eager (downloaded all at once) or no download (user will handle their own downloading)
We also want to support similar modes for uploads

Tracking Issue

flyteorg/flyte#297

Follow-up issue

NA

Comment on lines 200 to 207
enum BlobDownload {
// All data will be downloaded before the main container is executed
BEFORE_STARTUP = 0;
// Data will be downloaded as a stream and an End-Of-Stream marker will be written to indicate all data has been downloaded. Refer to protocol for details
STREAM = 1;
// Large objects (offloaded) will not be downloaded
DO_NOT_DOWNLOAD = 2;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These names will get all mangled up, right?
They will look something like DataLoadingConfig_BEFORE_STARTUP, DataLoadingConfig_ON_EXIT... at least in golang, you won't know which enum they belong to...
You might want to create a nested message, something like this:

Suggested change
enum BlobDownload {
// All data will be downloaded before the main container is executed
BEFORE_STARTUP = 0;
// Data will be downloaded as a stream and an End-Of-Stream marker will be written to indicate all data has been downloaded. Refer to protocol for details
STREAM = 1;
// Large objects (offloaded) will not be downloaded
DO_NOT_DOWNLOAD = 2;
}
message BlobDownload {
enum Strategy {
BEFORE_STARTUP = 0;
...
}
}
BlobDownload.Staregy blob_download_strategy = 5;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ya they get mangled up. But creating a message is an overhead? Maybe i can create a message called
message IOStrategy{
enum Download
enum Upload
}

// Data will not be uploaded, only references will be written
DO_NOT_UPLOAD = 2;
}
BlobDownload download_strategy = 5;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
BlobDownload download_strategy = 5;
BlobDownload blob_download_strategy = 5;

// Flag enables DataLoading Config. If this is not set, data loading will not be used!
bool enabled = 4;
// Strategy to use when dealing with Blob, Schema, or multipart blob data (large datasets)
enum BlobDownload {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we need the same for Schema type?
Should we have the setting say something like OffloadedTypeDownload.Staregy instead? and apply that to blobs and schemas (and potentially later to array outputs)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second, I'd much prefer to not have Blob in the name.

Copy link
Contributor

@wild-endeavor wild-endeavor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain what the use-case for this is in the PR description? We're trying to control whether blobs and schemas get downloaded before the container starts running?

If a Blob input is not necessary for the user container, why is it an input to the task? I'd like to better understand the situations where you want the input but don't want the actual input.

Like would a broader input mask/whitelist work better for these scenarios?

// Flag enables DataLoading Config. If this is not set, data loading will not be used!
bool enabled = 4;
// Strategy to use when dealing with Blob, Schema, or multipart blob data (large datasets)
enum BlobDownload {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second, I'd much prefer to not have Blob in the name.

@kumare3
Copy link
Contributor Author

kumare3 commented May 23, 2020

Can you explain what the use-case for this is in the PR description? We're trying to control whether blobs and schemas get downloaded before the container starts running?

If a Blob input is not necessary for the user container, why is it an input to the task? I'd like to better understand the situations where you want the input but don't want the actual input.

Like would a broader input mask/whitelist work better for these scenarios?

I am not sure if the description was updated when you looked at the PR, can you look again and let me know

Blob input is not necessary for the user container, why is it an input to the task download it is optional because the "Code" itself may want to handle that for various reasons

Copy link
Contributor

@EngHabu EngHabu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@wild-endeavor wild-endeavor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these names are infinitely better.

@kumare3
Copy link
Contributor Author

kumare3 commented May 26, 2020

Cool, merging!

@kumare3 kumare3 merged commit 8aff8a0 into master May 26, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants