Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support in AWS Batch Operator for multinode jobs #28321

Conversation

camilleanne
Copy link

@camilleanne camilleanne commented Dec 12, 2022

  • Adds support for AWS Batch multinode jobs by allowing a node_overrides json object to be passed through to the boto3 submit_job method.
  • Adds support for multinode jobs by properly parsing the output of describe_jobs (which is different for container vs multinode) to extract the log stream name.

closes: #25522

I had a hard time running tests locally, so I'm opening as a draft PR initially although I don't anticipate any changes beyond syncing with main, but I'd like to confirm test success before making available for review.


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@boring-cyborg boring-cyborg bot added area:providers provider:amazon AWS/Amazon - related issues labels Dec 12, 2022
@boring-cyborg
Copy link

boring-cyborg bot commented Dec 12, 2022

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Here are some useful points:

  • Pay attention to the quality of your code (flake8, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: [email protected]
    Slack: https://s.apache.org/airflow-slack

Comment on lines 158 to 159
self.container_overrides = overrides
self.node_overrides = node_overrides
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a question. Did you check are this arguments mutually exclusive?
AWS API doesn't mention it however everything might possible because even new eksPropertiesOverride not marked as exclusive.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes they are, specifying both returns an input validation error.
BTW @Taragolis , could you approve running the workflow to help get this PR ready-to-review ? 🙏

@camilleanne camilleanne marked this pull request as ready for review December 16, 2022 01:26
@camilleanne camilleanne requested a review from eladkal as a code owner December 16, 2022 01:26
@camilleanne
Copy link
Author

Ok I got a handle on all tests finally :) ready for review now.

job_id,
)
log_configuration = (
job_node_range_properties[0].get("container", {}).get("logConfiguration", {})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to have zero element in the array ? i.e. should we add a check on len == 0 and a user-friendly error message ?

Comment on lines 452 to 453
self.log.warning(
"AWS Batch job (%s) is neither a container nor multinode job. Log info not found."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this could be an error log, considering the user-provided input is invalid for this kind of request ?

The other warning logs in this method are mostly informative (there are several node groups, which is important info for the user to know, but doesn't require any action), this one I think requires user action, and thus more attention.

Comment on lines +349 to +357
},
}
},
}
],
},
}
]
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

beautiful 😄

tests/providers/amazon/aws/operators/test_batch.py Outdated Show resolved Hide resolved
@@ -108,12 +110,13 @@ class BatchOperator(BaseOperator):
"job_queue",
"overrides",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL the names here need to match not the parameter name in the ctor but the name of the attribute in the class, so:

Suggested change
"overrides",
"container_overrides",

Thinking about it, I wonder if this is a breaking change... I don't know too well how this works 😬

Copy link
Contributor

@Taragolis Taragolis Dec 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@camilleanne @vandonr-amz I recommend deprecate old parameter but keep it for a while, so users would have a time for change their code.
It is much easier achieve in subclass of BaseOperator (this PR case), because all arguments are keyword.

Some examples

Use old attribute value only if new attribute not set

http_conn_id = kwargs.pop("http_conn_id", None)
if http_conn_id:
warnings.warn(
"Parameter `http_conn_id` is deprecated. Please use `slack_webhook_conn_id` instead.",
DeprecationWarning,
stacklevel=2,
)
if slack_webhook_conn_id:
raise AirflowException("You cannot provide both `slack_webhook_conn_id` and `http_conn_id`.")
slack_webhook_conn_id = http_conn_id

Use old attribute value only if it equal new attribute value or not new attribute not set

if max_tries:
warnings.warn(
f"Parameter `{self.__class__.__name__}.max_tries` is deprecated and will be removed "
"in a future release. Please use method `max_polling_attempts` instead.",
DeprecationWarning,
stacklevel=2,
)
if max_polling_attempts and max_polling_attempts != max_tries:
raise Exception("max_polling_attempts must be the same value as max_tries")
else:
self.max_polling_attempts = max_tries

Copy link
Contributor

@vandonr-amz vandonr-amz Dec 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given the fact that it was just a rename for readability, I wonder if it wouldn't be simpler to just keep the old name ?
but then it could be confusing to users...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it fine if rename this argument because one day we might also add eks_properties_overrides, so override might confuse users more rather than change attribute name.
And deprecation warning give time to change arguments in end users DAG code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL the names here need to match not the parameter name in the ctor but the name of the attribute in the class

To translate that into ELI5: if you have self.foo = bar in your operator, the template field would be named "foo" to match "self.foo", not "bar". :P

@o-nikolas
Copy link
Contributor

Hey @camilleanne,

Any plans to pick this one up again?

@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Mar 13, 2023
@o-nikolas
Copy link
Contributor

@vandonr-amz is taking on this work on in #29522

Closing this PR with that context.

@o-nikolas o-nikolas closed this Mar 13, 2023
dimberman pushed a commit that referenced this pull request Apr 12, 2023
picking up #28321 after it's been somewhat abandoned by the original author.
Addressed my own comment about empty array, and it should be good to go I think.

Initial description from @camilleanne:

Adds support for AWS Batch multinode jobs by allowing a node_overrides json object to be passed through to the boto3 submit_job method.

Adds support for multinode jobs by properly parsing the output of describe_jobs (which is different for container vs multinode) to extract the log stream name.
closes: #25522
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:providers provider:amazon AWS/Amazon - related issues stale Stale PRs per the .github/workflows/stale.yml policy file
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support AWS Batch multinode job types
5 participants