Support multipart blob download #5715
Signed-off-by: wayner0628 <[email protected]>
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5715      +/-   ##
==========================================
+ Coverage   36.21%   36.74%    +0.52%
==========================================
  Files        1303     1304        +1
  Lines      109644   130160    +20516
==========================================
+ Hits        39710    47829     +8119
- Misses      65810    78148    +12338
- Partials     4124     4183       +59
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Thanks @wayner0628 - I think this is good. I want to get @eapolinario or @EngHabu to take a quick look at this as well, though. This is a pretty core interface that's changing in this PR.
flytestdlib/storage/storage.go
Outdated
@@ -78,6 +78,9 @@ type RawStore interface {
	// Head gets metadata about the reference. This should generally be a light weight operation.
	Head(ctx context.Context, reference DataReference) (Metadata, error)

	// GetItems retrieves the paths of all items from the Blob store or an error
	GetItems(ctx context.Context, reference DataReference) ([]string, error)
Would this be more accurately named ListItems? Also, what is retrieved: the path relative to the input reference? Can we add a comment?
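To make the question concrete, here is a minimal Python sketch (not the Go code under review) of the two conventions a listing call could follow: returning full references, or returning paths relative to the input reference. The function name `list_items`, the `relative` flag, and the sample keys are hypothetical and exist only for illustration.

```python
def list_items(keys: list[str], reference: str, relative: bool) -> list[str]:
    """List keys stored under `reference`.

    With relative=False the full keys are returned; with relative=True the
    listing prefix is stripped so callers get paths relative to `reference`.
    """
    prefix = reference.rstrip("/") + "/"
    matches = sorted(k for k in keys if k.startswith(prefix))
    if relative:
        return [k[len(prefix):] for k in matches]
    return matches


# Hypothetical keys, for illustration only.
keys = [
    "s3://bucket/dir/a.txt",
    "s3://bucket/dir/sub/b.txt",
    "s3://bucket/other/c.txt",
]
print(list_items(keys, "s3://bucket/dir", relative=False))
print(list_items(keys, "s3://bucket/dir", relative=True))
```

Whichever convention the interface adopts, stating it in the doc comment lets callers know whether they need to re-join results with the input reference.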
flytestdlib/storage/mem_store.go
Outdated
@@ -54,6 +55,23 @@ func (s *InMemoryStore) Head(ctx context.Context, reference DataReference) (Meta
	}, nil
}

func (s *InMemoryStore) GetItems(ctx context.Context, reference DataReference) ([]string, error) {
	var items []string
	prefix := string(reference) + "/"
Will reference ever already have a trailing /?
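The concern can be shown with a small Python sketch (an illustration, not the Go implementation in this PR). `naive_prefix` mirrors the `string(reference) + "/"` expression from the diff, while `listing_prefix` is a hypothetical helper that normalizes the reference first so the prefix always ends in exactly one slash.

```python
def naive_prefix(reference: str) -> str:
    """Mirror of the diff's `string(reference) + "/"`."""
    return reference + "/"


def listing_prefix(reference: str) -> str:
    """Hypothetical helper: strip any trailing slashes, then add exactly one.

    A bare scheme such as "s3://" is not handled specially in this sketch.
    """
    return reference.rstrip("/") + "/"


print(naive_prefix("s3://bucket/dir/"))    # s3://bucket/dir//
print(listing_prefix("s3://bucket/dir/"))  # s3://bucket/dir/
print(listing_prefix("s3://bucket/dir"))   # s3://bucket/dir/
```

With the naive form, a reference that already ends in a slash produces a double-slash prefix that no stored key would match.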
Hi, @wayner0628
Can you test cases like the ones in this PR?
flyteorg/flytekit#2258
To be more specific, this case:
flyte_dir_io = ContainerTask(
    name="flyte_dir_io",
    input_data_dir="/var/inputs",
    output_data_dir="/var/outputs",
    inputs=kwtypes(inputs=FlyteDirectory),
    outputs=kwtypes(out=FlyteDirectory),
    image="futureoutlier/rawcontainer:0320",
    command=[
        "python",
        "write_flytedir.py",
        "{{.inputs.inputs}}",
        "/var/outputs/out",
    ],
)
If possible, please provide a screenshot. Thank you.
There is also this PR, https://github.com/flyteorg/flyte/pull/5674/files, which I think we should merge first. The change to the core API should probably be done separately.
@wayner0628 #5741 was just merged, adding a list API to the storage client. Mind using the new interface to do this?
@wild-endeavor No problem, I'll update this PR to align with the new interface.
Tips for developing copilot in single binary.
- config
plugins:
  logs:
    dynamic-log-links:
      - comet-ml-execution-id:
          displayName: Comet
          templateUris: "{{ .taskConfig.host }}/{{ .taskConfig.workspace }}/{{ .taskConfig.project_name }}/{{ .executionName }}{{ .nodeId }}{{ .taskRetryAttempt }}{{ .taskConfig.link_suffix }}"
      - comet-ml-custom-id:
          displayName: Comet
          templateUris: "{{ .taskConfig.host }}/{{ .taskConfig.workspace }}/{{ .taskConfig.project_name }}/{{ .taskConfig.experiment_key }}"
    kubernetes-enabled: true
    kubernetes-template-uri: http://localhost:30080/kubernetes-dashboard/#/log/{{.namespace }}/{{ .podName }}/pod?namespace={{ .namespace }}
    cloudwatch-enabled: false
    stackdriver-enabled: false
  k8s:
    default-env-vars:
      - FLYTE_AWS_ENDPOINT: "http://flyte-sandbox-minio.flyte:9000"
      - FLYTE_AWS_ACCESS_KEY_ID: minio
      - FLYTE_AWS_SECRET_ACCESS_KEY: miniostorage
      - MLFLOW_TRACKING_URI: postgresql+psycopg2://postgres:@postgres.flyte.svc.cluster.local:5432/flyteadmin
    co-pilot:
      image: "localhost:30000/copilot-flytefile:0603"
- How to build the copilot image?
Use Dockerfile.flytecopilot to build it.
…ltipart-blob Signed-off-by: wayner0628 <[email protected]>
Hi @Future-Outlier and @wild-endeavor, I’ve been encountering an issue while running a Flytekit test case. The error I'm seeing is as follows:
The error with flytectl demo start --dev
POD_NAMESPACE=flyte ./flyte start --config flyte-single-binary-local.yaml
pyflyte run --remote raw_container.py calculate_ellipse_area_shell --a 1.1 --b 1.2
Environment Details:
I built, tagged, and pushed the modified Docker image when testing this PR. Even when I ran the Flytesnacks example without the modified image, it still failed. This has been blocking me for a couple of weeks now. I’ll continue investigating, but any help or guidance you could provide would be greatly appreciated! Thank you in advance.
Can you show me your config file?
It's the original one, which I used to run Flytesnacks.
You have to add the co-pilot image.
k8s:
  default-env-vars:
    - FLYTE_AWS_ENDPOINT: "http://flyte-sandbox-minio.flyte:9000"
    - FLYTE_AWS_ACCESS_KEY_ID: minio
    - FLYTE_AWS_SECRET_ACCESS_KEY: miniostorage
    - MLFLOW_TRACKING_URI: postgresql+psycopg2://postgres:@postgres.flyte.svc.cluster.local:5432/flyteadmin
  co-pilot:
    image: "cr.flyte.org/flyteorg/flytecopilot:v1.13.1"
@Future-Outlier, I'll try it later, thank you.
@Future-Outlier, I added the copilot image
@wayner0628 Show me your Python code and your whole k8s config.
import logging

from flytekit import ContainerTask, kwtypes, task, workflow

logger = logging.getLogger(__file__)

# A `flytekit.ContainerTask` denotes an arbitrary container. In the following example, the name of the task
# is `calculate_ellipse_area_shell`. This name has to be unique in the entire project. Users can specify:
#
# - `input_data_dir` -> where inputs will be written to.
# - `output_data_dir` -> where Flyte will expect the outputs to exist.
# `inputs` and `outputs` specify the interface for the task; thus it should be an ordered dictionary of typed input and
# output variables.
calculate_ellipse_area_shell = ContainerTask(
    name="ellipse-area-metadata-shell",
    input_data_dir="/var/inputs",
    output_data_dir="/var/outputs",
    inputs=kwtypes(a=float, b=float),
    outputs=kwtypes(area=float, metadata=str),
    image="ghcr.io/flyteorg/rawcontainers-shell:v2",
    command=[
        "./calculate-ellipse-area.sh",
        "{{.inputs.a}}",
        "{{.inputs.b}}",
        "/var/outputs",
    ],
)

calculate_ellipse_area_python = ContainerTask(
    name="ellipse-area-metadata-python",
    input_data_dir="/var/inputs",
    output_data_dir="/var/outputs",
    inputs=kwtypes(a=float, b=float),
    outputs=kwtypes(area=float, metadata=str),
    image="ghcr.io/flyteorg/rawcontainers-python:v2",
    command=[
        "python",
        "calculate-ellipse-area.py",
        "{{.inputs.a}}",
        "{{.inputs.b}}",
        "/var/outputs",
    ],
)

calculate_ellipse_area_r = ContainerTask(
    name="ellipse-area-metadata-r",
    input_data_dir="/var/inputs",
    output_data_dir="/var/outputs",
    inputs=kwtypes(a=float, b=float),
    outputs=kwtypes(area=float, metadata=str),
    image="ghcr.io/flyteorg/rawcontainers-r:v2",
    command=[
        "Rscript",
        "--vanilla",
        "calculate-ellipse-area.R",
        "{{.inputs.a}}",
        "{{.inputs.b}}",
        "/var/outputs",
    ],
)

calculate_ellipse_area_haskell = ContainerTask(
    name="ellipse-area-metadata-haskell",
    input_data_dir="/var/inputs",
    output_data_dir="/var/outputs",
    inputs=kwtypes(a=float, b=float),
    outputs=kwtypes(area=float, metadata=str),
    image="ghcr.io/flyteorg/rawcontainers-haskell:v2",
    command=[
        "./calculate-ellipse-area",
        "{{.inputs.a}}",
        "{{.inputs.b}}",
        "/var/outputs",
    ],
)

calculate_ellipse_area_julia = ContainerTask(
    name="ellipse-area-metadata-julia",
    input_data_dir="/var/inputs",
    output_data_dir="/var/outputs",
    inputs=kwtypes(a=float, b=float),
    outputs=kwtypes(area=float, metadata=str),
    image="ghcr.io/flyteorg/rawcontainers-julia:v2",
    command=[
        "julia",
        "calculate-ellipse-area.jl",
        "{{.inputs.a}}",
        "{{.inputs.b}}",
        "/var/outputs",
    ],
)


@task
def report_all_calculated_areas(
    area_shell: float,
    metadata_shell: str,
    area_python: float,
    metadata_python: str,
    area_r: float,
    metadata_r: str,
    area_haskell: float,
    metadata_haskell: str,
    area_julia: float,
    metadata_julia: str,
):
    logger.info(f"shell: area={area_shell}, metadata={metadata_shell}")
    logger.info(f"python: area={area_python}, metadata={metadata_python}")
    logger.info(f"r: area={area_r}, metadata={metadata_r}")
    logger.info(f"haskell: area={area_haskell}, metadata={metadata_haskell}")
    logger.info(f"julia: area={area_julia}, metadata={metadata_julia}")


# If you’re using Flytekit version >= v1.11.1, you can execute it locally.
# For example, `pyflyte run raw_container.py calculate_ellipse_area_shell --a 1.1 --b 1.2`
#
# As can be seen in this example, `ContainerTask`s can be interacted with like normal Python functions, whose inputs
# correspond to the declared input variables. All data returned by the tasks are consumed and logged by a Flyte task.
@workflow
def wf(a: float, b: float):
    # Calculate area in all languages
    area_shell, metadata_shell = calculate_ellipse_area_shell(a=a, b=b)
    area_python, metadata_python = calculate_ellipse_area_python(a=a, b=b)
    area_r, metadata_r = calculate_ellipse_area_r(a=a, b=b)
    area_haskell, metadata_haskell = calculate_ellipse_area_haskell(a=a, b=b)
    area_julia, metadata_julia = calculate_ellipse_area_julia(a=a, b=b)
    # Report on all results in a single task to simplify comparison
    report_all_calculated_areas(
        area_shell=area_shell,
        metadata_shell=metadata_shell,
        area_python=area_python,
        metadata_python=metadata_python,
        area_r=area_r,
        metadata_r=metadata_r,
        area_haskell=area_haskell,
        metadata_haskell=metadata_haskell,
        area_julia=area_julia,
        metadata_julia=metadata_julia,
    )
@Future-Outlier, k8s config, you mean:
# This is a sample configuration file for running single-binary Flyte locally against
# a sandbox.
admin:
  # This endpoint is used by flytepropeller to talk to admin
  # and artifacts to talk to admin,
  # and _also_, admin to talk to artifacts
  endpoint: localhost:30080
  insecure: true
catalog-cache:
  endpoint: localhost:8081
  insecure: true
  type: datacatalog
cluster_resources:
  standaloneDeployment: false
  templatePath: $HOME/.flyte/sandbox/cluster-resource-templates
logger:
  show-source: true
  level: 5
propeller:
  create-flyteworkflow-crd: true
  kube-config: $HOME/.flyte/sandbox/kubeconfig
  rawoutput-prefix: s3://my-s3-bucket/data
server:
  kube-config: $HOME/.flyte/sandbox/kubeconfig
webhook:
  certDir: $HOME/.flyte/webhook-certs
  localCert: true
  secretName: flyte-sandbox-webhook-secret
  serviceName: flyte-sandbox-local
  servicePort: 9443
tasks:
  task-plugins:
    enabled-plugins:
      - container
      - sidecar
      - K8S-ARRAY
      - agent-service
      - echo
    default-for-task-types:
      - container: container
      - container_array: K8S-ARRAY
plugins:
  logs:
    kubernetes-enabled: true
    kubernetes-template-uri: http://localhost:30080/kubernetes-dashboard/#/log/{{.namespace }}/{{ .podName }}/pod?namespace={{ .namespace }}
    cloudwatch-enabled: false
    stackdriver-enabled: false
  k8s:
    default-env-vars:
      - FLYTE_AWS_ENDPOINT: "http://flyte-sandbox-minio.flyte:9000"
      - FLYTE_AWS_ACCESS_KEY_ID: minio
      - FLYTE_AWS_SECRET_ACCESS_KEY: miniostorage
      - MLFLOW_TRACKING_URI: postgresql+psycopg2://postgres:@postgres.flyte.svc.cluster.local:5432/flyteadmin
    co-pilot:
      image: "cr.flyte.org/flyteorg/flytecopilot:v1.13.1"
    image-pull-policy: Always # Helps in better iteration of flytekit changes
  k8s-array:
    logs:
      config:
        kubernetes-enabled: true
        kubernetes-template-uri: http://localhost:30080/kubernetes-dashboard/#/log/{{.namespace }}/{{ .podName }}/pod?namespace={{ .namespace }}
        cloudwatch-enabled: false
        stackdriver-enabled: false
database:
  postgres:
    username: postgres
    password: postgres
    host: 127.0.0.1
    port: 30001
    dbname: flyte
    options: "sslmode=disable"
storage:
  type: stow
  stow:
    kind: s3
    config:
      region: us-east-1
      disable_ssl: true
      v2_signing: true
      endpoint: http://localhost:30002
      auth_type: accesskey
      access_key_id: minio
      secret_key: miniostorage
  container: my-s3-bucket
task_resources:
  defaults:
    cpu: 500m
    memory: 500Mi
  limits:
    cpu: 4
    memory: 4Gi
Thank you. I just tested it and found that this was broken by other changes.
Thanks for the insights, @Future-Outlier. I was able to test my copilot image and noticed an error in
It seems like there’s a missing
No problem, reply here anytime. I'll be there.
Hi @Future-Outlier, I have addressed the technical aspects of the issues, but I need some clarification on a couple of conceptual points before I can finalize the PR:
Once I have clarity on these points, I believe I can deliver the final PR very quickly. Thank you!
Tracking issue
#3632
Why are the changes needed?
Supporting multipart blob downloads allows us to completely copy the specified directory into the input path.
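As a rough illustration of what copying a whole directory involves, the Python sketch below lists every blob under a reference and writes each one to a local directory while preserving its relative path. It is a sketch under stated assumptions, not the copilot implementation: the dict-backed store, the `download_directory` function, and the sample keys are all hypothetical.

```python
import tempfile
from pathlib import Path, PurePosixPath


def download_directory(store: dict[str, bytes], reference: str, dest: Path) -> list[Path]:
    """Copy every blob stored under `reference` into `dest`, keeping relative paths."""
    prefix = reference.rstrip("/") + "/"
    written = []
    for key in sorted(store):
        if not key.startswith(prefix):
            continue
        relative = PurePosixPath(key[len(prefix):])
        if not relative.parts:
            continue  # skip a directory marker equal to the prefix itself
        target = dest.joinpath(*relative.parts)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(store[key])
        written.append(target)
    return written


# Hypothetical in-memory blob layout, for illustration only.
blobs = {
    "s3://my-s3-bucket/data/dir/a.txt": b"a",
    "s3://my-s3-bucket/data/dir/sub/b.txt": b"b",
    "s3://my-s3-bucket/data/dir2/c.txt": b"c",
}

with tempfile.TemporaryDirectory() as tmp:
    files = download_directory(blobs, "s3://my-s3-bucket/data/dir", Path(tmp))
    copied = [p.relative_to(tmp).as_posix() for p in files]

print(copied)
```

Matching on the prefix with its trailing slash keeps the sibling `dir2` out of the copy, and nested items such as `sub/b.txt` land in matching subdirectories of the destination.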
What changes were proposed in this pull request?
- List API to collect items under the container before download
- List API for memory storage
How was this patch tested?
Unit tests, specifically in download_test.go
Setup process
Screenshots
Check all the applicable boxes
Related PRs
flyteorg/flytekit#2258
Docs link
NA