-
Notifications
You must be signed in to change notification settings - Fork 300
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bring your own Data plugins for flytekit #559
Conversation
Signed-off-by: Ketan Umare <[email protected]>
Signed-off-by: Ketan Umare <[email protected]>
Signed-off-by: Ketan Umare <[email protected]>
Signed-off-by: wild-endeavor <[email protected]>
Signed-off-by: wild-endeavor <[email protected]>
Codecov Report
@@ Coverage Diff @@
## master #559 +/- ##
==========================================
- Coverage 85.73% 85.59% -0.14%
==========================================
Files 384 392 +8
Lines 30257 30576 +319
Branches 2427 2456 +29
==========================================
+ Hits 25940 26173 +233
- Misses 3660 3731 +71
- Partials 657 672 +15
Continue to review full report at Codecov.
|
Signed-off-by: wild-endeavor <[email protected]> Signed-off-by: Ketan Umare <[email protected]>
Signed-off-by: Ketan Umare <[email protected]>
Signed-off-by: wild-endeavor <[email protected]> Signed-off-by: Ketan Umare <[email protected]>
Signed-off-by: Ketan Umare <[email protected]>
Signed-off-by: Ketan Umare <[email protected]>
Signed-off-by: Ketan Umare <[email protected]>
Signed-off-by: Ketan Umare <[email protected]>
Signed-off-by: Ketan Umare <[email protected]>
Signed-off-by: Ketan Umare <[email protected]>
cc @wild-endeavor this is ready for review |
Not sure how to fix this, but the error is We could add the fsspec entry into the we can add a separate command after this? |
@wild-endeavor This can create other problems, if we add another data plugin or the like. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM ❤️ just added some minor comments .... like @wild-endeavor I ran into some issues getting the plugins installed so I could run the tests.
flytekit/core/data_persistence.py
Outdated
def construct_path(self, _: bool, add_prefix: bool, *args) -> str: | ||
# Ignore add_protocol for now. Only complicates things | ||
if add_prefix: | ||
return os.path.join(self.default_prefix, *args) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.default_prefix
can be None
, which will throw a TypeError
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
flytekit/core/data_persistence.py
Outdated
pass | ||
|
||
@abstractmethod | ||
def construct_path(self, add_protocol: bool, add_prefix: bool, *paths) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def construct_path(self, add_protocol: bool, add_prefix: bool, *paths) -> str: | |
def construct_path(self, add_protocol: bool, add_prefix: bool, *paths: str) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
durable store. | ||
""" | ||
|
||
def __init__(self, local_sandbox_dir: Union[str, os.PathLike], raw_output_prefix: str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got confused by the concept sandbox here, because I thought it was referring to the Flyte sandbox.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ya we call it the output prefix or local data dir
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i guess i can update it, but would like to do that as a separate PR
flytekit/extras/persistence/http.py
Outdated
import os | ||
import pathlib | ||
|
||
import requests as _requests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just for my own curiosity, i was wondering why you rename imports?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
haha, this is an artifact of some previous work in flytekit. the reasoning was to avoid pollution of the import space.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in the new flytekit, we should just get rid of it, I will, good call out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done for all new files
return _update_cmd_config_and_execute(cmd) | ||
|
||
def construct_path(self, add_protocol: bool, add_prefix: bool, *paths) -> str: | ||
paths = list(paths) # make type check happy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mypy and pyright are still failing here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unfortunately we're not really strictly following mypy right now... for reasons I forget (?) @kumare3 @wild-endeavor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should - we just haven't time to go through the whole codebase and clean everything up.
version=__version__, | ||
author="flyteorg", | ||
author_email="[email protected]", | ||
description="This package holds Hive plugins for flytekit", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
description="This package holds Hive plugins for flytekit", | |
description="This package holds the fsspec data persistence plugin for flytekit", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
return fsspec.filesystem(protocol, **kwargs) | ||
|
||
@staticmethod | ||
def recursive_paths(f: str, t: str) -> (str, str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def recursive_paths(f: str, t: str) -> (str, str): | |
def recursive_paths(f: str, t: str) -> typing.Tuple[str, str]: |
# The s3api command returns an error if the object does not exist. The error message contains | ||
# the http status code: "An error occurred (404) when calling the HeadObject operation: Not Found" | ||
# This is a best effort for returning if the object does not exist by searching | ||
# for existence of (404) in the error message. This should not be needed when we get off the cli and use lib |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the plan to replace the cli with fsspec using s3fs? or a boto implementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we cannot replace, as there are existing users of this. but users can optionally install pip install flytekitplugins-data-fsspec
or other such solutions in the future
raise _FlyteUserException("AWS CLI not found! Please install it with `pip install awscli`.") | ||
|
||
@staticmethod | ||
def _split_s3_path_to_bucket_and_key(path: str) -> (str, str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To fix the syntax error in mypy/pyright:
def _split_s3_path_to_bucket_and_key(path: str) -> (str, str): | |
def _split_s3_path_to_bucket_and_key(path: str) -> Tuple[str, str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Ketan Umare <[email protected]>
Signed-off-by: Ketan Umare <[email protected]>
Signed-off-by: Samhita Alla <[email protected]>
35920c7
Signed-off-by: Samhita Alla <[email protected]>
@kumare3 / @eapolinario Tests are passing now! Modified the Great Expectations plugin to accept remote FlyteFiles only. If the user wants to give a local dataset file, it can only be given as a string. I've also removed the concerned local FlyteFile tests. See this commit: 35920c7. When given a local dataset file that has a relative path, FlyteFile is not working because of this: def is_remote(path: Union[str, os.PathLike]) -> bool:
"""
Deprecated. Lets find a replacement
"""
return not (path.startswith("/") or path.startswith("file://")) I've not added FlyteFile's |
TL;DR
Flytekit depends on some additional cloud-provided tooling for its typical operation. Specifically, when downloading or uploading task inputs or outputs, or when downloading or uploading blob-storage-based literals (files, schemas, folders), it makes use cloud provider CLI's like
aws cli
andgsutil
to handle the transfer. This is a bit of a legacy decision, which was made prior to python3 (in python2 days). Without asyncio, the performance of the downloader would not match the ones offered by the CLIs.Moreover, using the CLI's actually removes coupling with cloud provider SDKs and hence keep the flytekit core dependency free. This has really worked well and we will continue to keep it going forward. But, we would love for contributors to bring in new ideas on handling this data - for example use non blob stores (like bigtable) for the metadata, or write native clients for managing the data or use a mature project like fsspec
The PR introduces a new
Persistence
Interface and providesextras
foraws cli
andgsutil
. Users have to simply installpip install awscli
orpip install gsutil
to use these extras. Moreover, the Persistence layer allows plugins and overriding the default data extras that are bundled.This RFC provides the design intuition. An example data plugin will be added as a follow on, and we will use automatic plugin discovery to autoload the plugin.
Type
Are all requirements met?
Complete description
DataPersistence
base class. This encapsulates the behavior of a blob storage provider and has things like get/list/put, etc.DataPersistencePlugins
class - this is the keeper of plugins, similar to how theTypeEngine
keeps track of type transformersDiskPersistence
that inherits theDataPersistence
class. The local FS can be thought of as a storage provider - this object fakes that. Paths that begin with/
orfile://
by default will use this persistence plugin.FileAccessProvider
to under thecore/
folder. This object is the high level interface that the rest of the flytekit platform and its plugins interact with. Users/contributors should not need to work with the underlyingDataPersistence
classes (unless of course you're implementing one).Tracking Issue
fixes flyteorg/flyte#809