
Bug: Deadline Config File Issue on Bundle Submission #386

Closed
pta200 opened this issue Jul 2, 2024 · 3 comments · Fixed by #444
Labels
bug Something isn't working

Comments

pta200 commented Jul 2, 2024

Expected Behaviour

Execute Deadline Cloud job bundle submissions in parallel to speed up the job submission process without generating any CLI errors.

Current Behaviour

When submitting twenty job bundles in parallel batches of five, the Deadline CLI starts throwing errors. It appears that after each job submission the Deadline CLI writes the job ID to the .deadline/config file, so when submitting jobs in parallel there is contention over that file, resulting in a state where all the values for farm_id, queue_id, and storage_profile_id are missing. The next job submission then fails. A workaround is to use the submit command parameters, e.g. "--farm-id", but storage_profile_id is not available as a parameter, so any job that needs to upload a file can't be automated since it triggers a prompt.

Reproduction Steps

Ensure the .deadline/config file is correctly set up with a profile, farm_id, queue_id, and storage_profile_id. Then use openjd.model to generate a job bundle and a ProcessPoolExecutor to submit those jobs in parallel in batches of five, calling the Deadline CLI from a Python subprocess, e.g. "deadline bundle submit --yes -p InFile=/tmp/test_script.py /tmp/tmpy2f5jdu8". Here the CLI uses the config file to know which farm/queue to submit the bundle to.

Sample .deadline/config file:

[telemetry]
identifier = 6b09b2cf-d296-4355-a125-d73a4233067c

[deadline-cloud-monitor]
path = /opt/DeadlineCloudMonitor/deadline-cloud-monitor_1.1.2_amd64.AppImage

[defaults]
aws_profile_name = test-us-east-1

[profile-test-us-east-1 defaults]
farm_id = farm-XXXXX

[profile-test-us-east-1 farm-XXX defaults]
queue_id = queue-XXXX

[profile-test-us-east-1 farm-XXXXX settings]
storage_profile_id = sp-XXXXXXX

[profile-test-us-east-1 farm-XXXX queue-XXXXXX defaults]
job_id = job-d9093dc0ece34453a69e73246c9d8e43

Eventually you'll get some version of a CalledProcessError when the Deadline CLI fails to submit a job, e.g.:
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'deadline bundle submit --yes -p InFile=/tmp/test_script.py /tmp/tmpy2f5jdu8' returned non-zero exit status 1.

When you look at the config file, it now reads as follows, with all the other configuration values missing and only the ID of the last successfully submitted job remaining. No further job submission works unless you pass the options on the CLI or fix the config file.

[profile-(default)   defaults]
job_id = job-3d75277939134f4e82fff8669398196d

Code Snippet

from concurrent.futures import ALL_COMPLETED, ProcessPoolExecutor, wait

with ProcessPoolExecutor(max_workers=5) as executor:
    futures = set()
    for x in range(20):
        futures.add(executor.submit(submit_job))
        # After every fifth submission, wait for the whole batch to
        # finish before starting the next one.
        if (x + 1) % 5 == 0:
            done, futures = wait(futures, return_when=ALL_COMPLETED)
            logger.info("next batch....")
            futures.clear()
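
For reference, a minimal sketch of what the submit_job function could look like under the setup described above; the function body is illustrative (the report does not show it), though the command line matches the reproduction steps:

import logging
import subprocess

logger = logging.getLogger(__name__)

def submit_job() -> None:
    # Hypothetical helper: shells out to the Deadline CLI exactly as in the
    # reproduction steps. check=True raises the CalledProcessError shown
    # above on a non-zero exit status.
    subprocess.run(
        ["deadline", "bundle", "submit", "--yes",
         "-p", "InFile=/tmp/test_script.py", "/tmp/tmpy2f5jdu8"],
        check=True,
    )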

@pta200 pta200 added the bug Something isn't working label Jul 2, 2024
Contributor

epmog commented Jul 3, 2024

Hey thanks for the bug report!

For a little more context, which code path is your submit_job function using? Perhaps it's strictly an example, but I'll assume it's the CLI based on the other examples/context.

There are a few spots where this can pop up, and some allow you to bypass the setting:

  • deadline bundle submit CLI:

        # Check Whether the CLI options are modifying any of the default settings that affect
        # the job id. If not, we'll save the job id submitted as the default job id.
        # If the submission is canceled by the user job_id will be None, so ignore this case as well.
        if (
            job_id is not None
            and args.get("profile") is None
            and args.get("farm_id") is None
            and args.get("queue_id") is None
        ):
            set_setting("defaults.job_id", job_id)

    • If no overrides are provided, it'll default to the config file and update it afterwards.
  • deadline.client.api.create_job_from_job_bundle:

        # If using the default config, set the default job id so it holds the
        # most-recently submitted job.
        if config is None:
            set_setting("defaults.job_id", job_id)

    • If you pass in a config object it won't set it after submitting (see the sketch below).
  • Least likely, the submit progress dialog:

        # Set the default job id so it holds the most-recently submitted job.
        set_setting("defaults.job_id", job_id)

My hunch here is that if it's an interactive submission with defaults, then we should set the value to make it easier for users to inspect their job submissions. Otherwise, if we're doing batch/background operations, we should skip updating it.
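
A hedged sketch of the Python-API path from the second bullet: passing a non-None config to create_job_from_job_bundle skips the set_setting("defaults.job_id", ...) write-back quoted above. The read_config helper and exact keyword arguments are assumptions about the library's API, not confirmed in this thread:

from deadline.client import api
from deadline.client.config import config_file  # assumed module path

# Snapshot the current config once; per the quoted code, a submission that
# receives an explicit config object does not write defaults.job_id back
# to ~/.deadline/config.
config = config_file.read_config()

job_id = api.create_job_from_job_bundle(
    job_bundle_dir="/tmp/tmpy2f5jdu8",  # bundle path from the repro steps
    config=config,
)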

Author

pta200 commented Jul 9, 2024

Yes, it's a submission using the CLI with a generated job bundle directory, e.g. /tmp/xyz123abc, using the defaults in the config file, because there is no parameter for a storage profile ID, which is now required. Otherwise I would just pass config overrides. Is there an example of batch/background job submission?

ddneilson added a commit to ddneilson/deadline-cloud that referenced this issue Sep 10, 2024
Fixes: aws-deadline#386

Problem:

The customer reports that the config file can get clobbered when running
many bundle submit commands in parallel. When clobbered, the config file
will only contain the job-id for the last submitted job; all of the farm, queue, etc.
information will be gone.

Solution:

A standard pattern for concurrent file modification is to write changes to a temp file,
and then move that temp file over top of the config file via a filesystem rename operation.
The rename is atomic, which prevents the file content from being clobbered.

Signed-off-by: Daniel Neilson <[email protected]>
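
For illustration, a minimal sketch of the write-then-rename pattern the commit describes; the helper name and error handling are hypothetical, and the actual change in the linked commit may differ:

import os
import tempfile

def atomic_write(path: str, contents: str) -> None:
    # Hypothetical helper illustrating the pattern. Write to a temp file in
    # the same directory so the final rename stays on one filesystem
    # (cross-device renames are not atomic).
    dir_name = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(contents)
        # os.replace is an atomic rename: concurrent readers see either the
        # old file or the new one, never a partially written mix.
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise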
@ddneilson
Contributor

Thanks for the bug report. The fix has been merged and will go out with the next release.

We're also adding --storage-profile-id as an option to the bundle submit subcommand in #442

That should go out at the same time.
