RawDataOutput directory for every task execution #67

kumare3 · 2020-03-07T04:41:13Z

TL;DR

This PR enables FlytePropeller to create data sandboxes for the executors (containers or plugins) to use as data storage.

Type

Bug Fix
Feature
Plugin

Are all requirements met?

Complete description

Currently Flyte allows exactly-once processing semantics for processes that use flytekit data handling and do not produce any side effects. This is possible because flytekit creates a random data sandbox directory (in the blob store), which is then committed. Here "commit" means that a user can perceive a task as complete only when flytepropeller has successfully recorded the task's completion. This is just a trick: concurrent executions are disregarded, and only the first one to return wins (commits).

This commit promotes this sandbox creation from flytekit into flytepropeller. As a result, plugins that do not use flytekit or other SDKs can simply rely on the {{ .outputs.sandbox }} directory parameter that is passed as an input. Every retry gets a new random sandbox. It also makes it possible to mark some sandboxes as active, which in turn makes selective garbage collection possible.

Tracking Issue

flyteorg/flyte#195

Follow-up issue

flyteorg/flyte#211 - Per workflow data sandboxes

wild-endeavor · 2020-03-14T00:17:52Z

go/tasks/pluginmachinery/io/iface.go

+// FlytePluginMachinery so that it can be used more universally outside of Flytekit.
+type OutputDataSandbox interface {
+	// This is prefix (blob store prefix or directory) where all data produced can be stored.
+	GetOutputDataSandboxPath() storage.DataReference


Is sandbox the best name? I feel like that implies something to do with the sandbox deployment.

Hmm, true, it is really a sandbox. Do you have any suggestions?

I like the name sandbox for this, but let's really think of a good alternative, then I can change it. Otherwise let's stick with this. So I called it DataOutputSandbox.

AtRestStorage, UserStorageLocation, FlyteHardDrive, FlyteBlackBox, FlyteRecorder, UserOutputLocation, UserOutputStorage, FlyteDurableStore, etc. I don't really care that much, but I find some Flyte things already unnecessarily confusing and would rather err on the side of less confusing.

EngHabu · 2020-03-17T17:13:07Z

go/tasks/pluginmachinery/io/iface.go

+// FlytePluginMachinery so that it can be used more universally outside of Flytekit.
+type OutputDataSandbox interface {
+	// This is prefix (blob store prefix or directory) where all data produced can be stored.
+	GetOutputDataSandboxPath() storage.DataReference


Suggested change

-	GetOutputDataSandboxPath() storage.DataReference
+	GetOutputDataSandboxPrefix() storage.DataReference

Should we change the name sandbox to something else? Let me think.

EngHabu · 2020-03-17T17:15:06Z

go/tasks/pluginmachinery/io/iface.go

@@ -51,6 +61,8 @@ type OutputFilePaths interface {
 	// A Fully qualified path (URN) where the error information should be placed as a protobuf core.ErrorDocument. It is not directly
 	// used by the framework, but could be used in the future
 	GetErrorPath() storage.DataReference
+
+	OutputDataSandbox


nit: Can you move it to the top of the interface... I find it a bit more readable..

EngHabu · 2020-03-17T17:17:15Z

go/tasks/pluginmachinery/io/iface.go

+// of a task (across retries etc) and is constant for a specific execution.
+// As of 02/20/2020 Flytekit generates this path randomly for S3. This structure proposes migration of this logic to
+// FlytePluginMachinery so that it can be used more universally outside of Flytekit.
+type OutputDataSandbox interface {


Do we really need an interface here? I feel like we should just move the method into OutputFilePaths interface... that or we keep it as a separate interface but do not include it in the other interface, include it directly in the implementation struct.
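To make the two options in this comment concrete, here is a minimal sketch. It assumes a stand-in DataReference type instead of storage.DataReference, a hypothetical remotePaths struct, and a made-up error.pb file name. It shows embedding the small interface versus keeping the interfaces separate and letting the implementation struct satisfy both.

```go
package main

import "fmt"

// DataReference stands in for storage.DataReference in this sketch.
type DataReference string

// OutputDataSandbox is the small interface from the diff above.
type OutputDataSandbox interface {
	GetOutputDataSandboxPath() DataReference
}

// Option 1: embed the small interface, so every implementation of the
// larger interface must also provide the sandbox path.
type OutputFilePathsEmbedded interface {
	OutputDataSandbox
	GetErrorPath() DataReference
}

// Option 2: keep the interfaces separate ...
type OutputFilePaths interface {
	GetErrorPath() DataReference
}

// ... and let the implementation struct satisfy both of them.
// remotePaths and the "error.pb" file name are hypothetical.
type remotePaths struct {
	prefix  DataReference
	sandbox DataReference
}

func (p remotePaths) GetErrorPath() DataReference             { return p.prefix + "/error.pb" }
func (p remotePaths) GetOutputDataSandboxPath() DataReference { return p.sandbox }

func main() {
	p := remotePaths{prefix: "s3://meta/n0/0", sandbox: "s3://raw/ab/cd"}

	// The same struct can be handed to callers of either shape.
	var embedded OutputFilePathsEmbedded = p
	var separate OutputFilePaths = p
	var sandbox OutputDataSandbox = p

	fmt.Println(embedded.GetErrorPath(), separate.GetErrorPath(), sandbox.GetOutputDataSandboxPath())
}
```

With option 2, callers that only need the error path never see the sandbox method, at the cost of passing a second interface where both are needed.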

EngHabu · 2020-03-17T17:23:24Z

go/tasks/pluginmachinery/ioutils/output_sandbox.go

+// Determinism depends on the outputMetadataPath
+// Potential performance problem, as creating anew randomprefixShardedOutput Sandbox may be expensive as it hashes the outputMetadataPath
+// the final OutputSandbox is created in the shard selected by the sharder at the basePath and then appended by a hashed value of the outputMetadata
+func NewRandomPrefixShardedOutputSandbox(ctx context.Context, sharder ShardSelector, basePath, outputMetadataPath storage.DataReference, store storage.ReferenceConstructor) (io.OutputDataSandbox, error) {


This is not a random prefix, right? It depends on the ShardSelector. This guy is pretty deterministic otherwise... right?

It is very deterministic by design; I should rename this constructor. Let me think about the names.

wild-endeavor · 2020-03-17T18:25:41Z

go/tasks/pluginmachinery/io/iface.go

+// of a task (across retries etc) and is constant for a specific execution.
+// As of 02/20/2020 Flytekit generates this path randomly for S3. This structure proposes migration of this logic to
+// FlytePluginMachinery so that it can be used more universally outside of Flytekit.
+type OutputDataSandbox interface {


wild-endeavor · 2020-03-17T18:41:59Z

go/tasks/pluginmachinery/ioutils/precomputed_shardselector.go

+}
+
+// uses the given shards to select a shard
+func NewConstantShardSelector(shards []string) ShardSelector {


I still prefer to accept interfaces and return specific types.

You cannot; the linter won't let you for non-exported types.

But PrecomputedShardSelector is exported

Good point, I will unexport it.
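The trade-off in this thread can be sketched as below. The Pick method and both constructors are simplified assumptions for illustration only. Option A follows "accept interfaces, return concrete types" with an exported type. Option B hides the type and returns the interface, which avoids the golint warning about an exported function returning an unexported type.

```go
package main

import "fmt"

// ShardSelector is a simplified stand-in; Pick is a hypothetical method.
type ShardSelector interface {
	Pick(key uint32) string
}

// pick is shared by both options; an empty shard list yields "".
func pick(shards []string, key uint32) string {
	if len(shards) == 0 {
		return ""
	}
	return shards[key%uint32(len(shards))]
}

// Option A: exported concrete type, and the constructor returns it directly.
type PrecomputedShardSelector struct{ shards []string }

func (p PrecomputedShardSelector) Pick(key uint32) string { return pick(p.shards, key) }

func NewPrecomputedShardSelector(shards []string) PrecomputedShardSelector {
	return PrecomputedShardSelector{shards: shards}
}

// Option B: unexported type, and the exported constructor returns the interface.
type constantShardSelector struct{ shards []string }

func (c constantShardSelector) Pick(key uint32) string { return pick(c.shards, key) }

func NewConstantShardSelector(shards []string) ShardSelector {
	return constantShardSelector{shards: shards}
}

// describe accepts the interface, so it works with either option.
func describe(s ShardSelector) string { return s.Pick(4) }

func main() {
	shards := []string{"a", "b", "c"}
	fmt.Println(describe(NewPrecomputedShardSelector(shards)))
	fmt.Println(describe(NewConstantShardSelector(shards)))
}
```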

wild-endeavor · 2020-03-17T18:48:40Z

go/tasks/plugins/array/catalog.go

 	index int) (io.OutputReader, error) {
-	dataReference, err := dataStore.ConstructReference(ctx, outputPrefix, strconv.Itoa(index))
+	strIndex := strconv.Itoa(index)
+	dataReference, err := dataStore.ConstructReference(ctx, outputPrefix, strIndex)


I don't follow. Why are there two data references now? There's the dataReference that's in the original code, and the new outputSandbox. What's the difference between the two, and why do we need both?

So the first reference is where the metadata is stored -> OutputPath (it should have been called OutputMetadata). This is essentially where outputs.pb, futures.pb, etc. are written.
The new one is what flytekit generates today: the location where the container stores data for the execution. This is never read by flytepropeller, but it is better to generate it from propeller.
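A minimal sketch of the two references described in this reply, assuming a stand-in DataReference type and hypothetical struct and method names. The metadata prefix holds the files named above (outputs.pb, futures.pb), while the raw output prefix is only handed to the container and is never read back.

```go
package main

import "fmt"

// DataReference stands in for storage.DataReference in this sketch.
type DataReference string

// taskOutputPaths is a hypothetical holder for the two prefixes.
type taskOutputPaths struct {
	metadataPrefix  DataReference // where outputs.pb and futures.pb live
	rawOutputPrefix DataReference // where the container writes its own data
}

// Metadata files are resolved relative to the metadata prefix.
func (p taskOutputPaths) GetOutputPath() DataReference  { return p.metadataPrefix + "/outputs.pb" }
func (p taskOutputPaths) GetFuturesPath() DataReference { return p.metadataPrefix + "/futures.pb" }

// The raw prefix is exposed as-is; there is deliberately no reader for it.
func (p taskOutputPaths) GetRawOutputPrefix() DataReference { return p.rawOutputPrefix }

func main() {
	p := taskOutputPaths{
		metadataPrefix:  "s3://example-meta/exec1/n0/0",
		rawOutputPrefix: "s3://example-raw/ab/1f2e3d",
	}
	fmt.Println("metadata:", p.GetOutputPath(), p.GetFuturesPath())
	fmt.Println("raw data:", p.GetRawOutputPrefix())
}
```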

wild-endeavor · 2020-03-18T02:02:17Z

go/tasks/pluginmachinery/ioutils/remote_file_output_writer.go

@@ -71,10 +72,11 @@ func (w RemoteFileOutputWriter) Put(ctx context.Context, reader io.OutputReader)
 	return fmt.Errorf("no data found to write")
 }

-func NewRemoteFileOutputPaths(_ context.Context, store storage.ReferenceConstructor, outputPrefix storage.DataReference) RemoteFileOutputPaths {


Given that there's now a new data reference as part of io.RawOutputPaths, would it help to add some more functions to RemoteFileOutputReader?

Nope, we never want to read that data. Now that we call it RawOutput, that should be clearer.

EngHabu · 2020-03-18T16:52:26Z

go/tasks/pluginmachinery/ioutils/precomputed_shardselector.go

+}
+
+// uses the given shards to select a shard
+func NewConstantShardSelector(shards []string) ShardSelector {


But PrecomputedShardSelector is exported

Ketan Umare added 7 commits March 3, 2020 22:52

Work in progress

6ba859a

work in progress

43ab990

work in progress

dd6996d

Merge branch 'master' into adding-dataoutput-prefix

e141907

updated tests

4dd43b1

unit testing in progress

5a3cb44

Merge branch 'master' into adding-dataoutput-prefix

f6718bf

wild-endeavor reviewed Mar 14, 2020

Ketan Umare added 2 commits March 16, 2020 13:01

Merge branch 'master' into adding-dataoutput-prefix

96170d0

Unit tests added

52992e5

kumare3 changed the title ~~[WIP] Adding dataoutput prefix~~ DataOutput Sandbox directory for every task execution Mar 16, 2020

Unit test fixes

f6d8c91

kumare3 requested review from wild-endeavor, EngHabu and lu4nm3 March 17, 2020 00:14

Ketan Umare added 3 commits March 16, 2020 17:21

lint fixes

cce3a00

updated hasing algorithm

c9c9d9a

updated output sandbox constructor

833b625

EngHabu reviewed Mar 17, 2020

wild-endeavor reviewed Mar 17, 2020

Renamed Sandbox -> RawOutputPath

ea42807

wild-endeavor reviewed Mar 18, 2020

kumare3 changed the title ~~DataOutput Sandbox directory for every task execution~~ RawDataOutput directory for every task execution Mar 18, 2020

kumare3 requested a review from EngHabu March 18, 2020 03:55

EngHabu previously approved these changes Mar 18, 2020

lu4nm3 previously approved these changes Mar 19, 2020

Ketan Umare and others added 2 commits March 23, 2020 21:33

Merge branch 'master' into adding-dataoutput-prefix

d4df89d

Update go/tasks/pluginmachinery/ioutils/raw_output_path.go

4e16940

Co-Authored-By: Haytham AbuelFutuh <[email protected]>

kumare3 dismissed stale reviews from lu4nm3 and EngHabu via 4e16940 March 24, 2020 04:41

Ketan Umare and others added 5 commits March 23, 2020 21:42

Update go/tasks/pluginmachinery/ioutils/raw_output_path.go

dd03159

Co-Authored-By: Haytham AbuelFutuh <[email protected]>

Merge branch 'master' into adding-dataoutput-prefix

234ba97

Merge branch 'adding-dataoutput-prefix' of github.com:lyft/flyteplugi…

7208977

…ns into adding-dataoutput-prefix

rename issues

d84b749

rename fix

434bd71

EngHabu approved these changes Mar 25, 2020

kumare3 merged commit 000cec8 into master Mar 25, 2020

eapolinario pushed a commit that referenced this pull request Sep 6, 2023

RawDataOutput directory for every task execution (#67)

c97d1d5
