[Enhancement][Task Plugin] Allows file transfer between tasks #10738

Closed
2 of 3 tasks
jieguangzhou opened this issue Jul 2, 2022 · 4 comments
Labels
backend feature new feature good idea help wanted Extra attention is needed

Comments

@jieguangzhou
Member

jieguangzhou commented Jul 2, 2022

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

DolphinScheduler allows parameter transfer between tasks: https://dolphinscheduler.apache.org/en-us/docs/latest/user_doc/guide/parameter/context.html

However, it does not support file transfer between tasks. For example, I have two Python scripts that do some analysis work, and the second script processes the data produced by the first. Currently, I have to pass the file path as a parameter.

Parameter passing does not work as expected if the two tasks run on different workers, because the path passed from the first task is not valid on the second task's worker.

I think if DolphinScheduler supports this feature, it would be a handy boost for scenarios such as data analysis and machine learning.

Use case

I think we can use the resource center as a file transfer store if the user has enabled it. For example, in the task plugin, we could agree on a new path specification:

  1. Use $from_remote(remote_path, local_path) to download a file from remote_path to local_path before the task starts.
  2. Use $to_remote(remote_path, local_path) to upload a file from local_path to remote_path after the task is done.
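
As a rough illustration of the path specification above, the sketch below shows one way a task plugin might detect the two directives in a command and replace each with its local path. The names `Transfer` and `rewrite_command` are hypothetical and are not part of DolphinScheduler; the regular expression assumes both arguments are quoted string literals.

```python
import re
from typing import List, NamedTuple, Tuple


class Transfer(NamedTuple):
    """One file transfer requested by a directive (hypothetical type)."""
    direction: str    # "from_remote" or "to_remote"
    remote_path: str
    local_path: str


# Matches $from_remote('remote', 'local') or $to_remote("remote", "local").
# Assumption: both arguments are quoted literals without embedded quotes.
_DIRECTIVE = re.compile(
    r"""\$(from_remote|to_remote)\(\s*(['"])(.*?)\2\s*,\s*(['"])(.*?)\4\s*\)"""
)


def rewrite_command(command: str) -> Tuple[str, List[Transfer]]:
    """Replace each directive with its local path and collect the transfers."""
    transfers: List[Transfer] = []

    def _substitute(match: re.Match) -> str:
        transfers.append(
            Transfer(match.group(1), match.group(3), match.group(5))
        )
        # The task itself only ever sees the local path.
        return match.group(5)

    return _DIRECTIVE.sub(_substitute, command), transfers
```

The caller would then download every `from_remote` entry before starting the task and upload every `to_remote` entry after it finishes.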

This proposal was inspired by AWS SageMaker:

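import sagemaker
from sagemaker.processing import ProcessingInput
from sagemaker.workflow.parameters import ParameterString

# default_bucket and local_path are assumed to be defined earlier.
# Upload the local file to S3 and keep the resulting URI.
base_uri = f"s3://{default_bucket}/abalone"
input_data_uri = sagemaker.s3.S3Uploader.upload(
    local_path=local_path,
    desired_s3_uri=base_uri,
)
# Expose the S3 URI as a pipeline parameter.
input_data = ParameterString(
    name="InputData",
    default_value=input_data_uri,
)

# This is the path to use directly
ProcessingInput(source=input_data, destination="/opt/ml/processing/input")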

The above is the SageMaker example. If DolphinScheduler supported this feature, it could be even easier to use, for example:

# Process the data, save the output to the local path output/demo.csv, and upload it to bucket1/demo.csv in the resource center after the task is done.
python process_data.py --output=$to_remote('bucket1/demo.csv', 'output/demo.csv')
# Download the data from bucket1/demo.csv in the resource center and save it to the local path data/demo.csv,
# and then the following command is what actually executes:
# python analysis.py --input=data/demo.csv
python analysis.py --input=$from_remote('bucket1/demo.csv', 'data/demo.csv')
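
To make the two-task flow above concrete, here is a minimal sketch of the lifecycle a worker might follow once the directives have been resolved into (remote_path, local_path) pairs: download inputs, run the command, then upload outputs. A plain local directory stands in for the resource center, which is purely an assumption for illustration. `run_with_transfers`, `store_root`, `workdir`, and `runner` are hypothetical names, not DolphinScheduler APIs.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Callable, Iterable, Tuple

PathPair = Tuple[str, str]  # (remote_path, local_path)


def _copy(src: Path, dst: Path) -> None:
    """Copy one file, creating the destination directory if needed."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src, dst)


def _run_in_shell(command: str, workdir: Path) -> None:
    """Default runner: execute the command in the task's working directory."""
    subprocess.run(command, shell=True, cwd=workdir, check=True)


def run_with_transfers(
    command: str,
    downloads: Iterable[PathPair],
    uploads: Iterable[PathPair],
    store_root: Path,
    workdir: Path,
    runner: Callable[[str, Path], None] = _run_in_shell,
) -> None:
    # 1. Fetch inputs from the store before the task starts.
    for remote_path, local_path in downloads:
        _copy(store_root / remote_path, workdir / local_path)
    # 2. Run the command, which by now references local paths only.
    runner(command, workdir)
    # 3. Publish outputs; this is reached only if the runner did not raise.
    for remote_path, local_path in uploads:
        _copy(workdir / local_path, store_root / remote_path)
```

Because each task reads from and writes to the shared store rather than to another worker's disk, the two scripts no longer need to run on the same worker.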

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@jieguangzhou added the feature and Waiting for reply labels Jul 2, 2022
@github-actions

github-actions bot commented Jul 2, 2022

Thank you for your feedback. We have received your issue; please wait patiently for a reply.

  • So that we can understand your request as soon as possible, please provide detailed information, the version, or pictures.
  • If you haven't received a reply for a long time, you can join our Slack and send your question to the #troubleshooting channel.

@jieguangzhou
Member Author

I'm not sure if I'll be able to implement it anytime soon. If anyone is interested in implementing it, thank you very much

@SbloodyS added the help wanted and backend labels and removed the Waiting for reply label Jul 2, 2022
@SbloodyS
Member

SbloodyS commented Jul 2, 2022

I think this feature depends on configuration center #10283. Otherwise, it is impossible to determine which object to use to store the configuration during uploading and downloading.

@zhongjiajie
Member

Closed by #12552.
