Fix: emr_conn_id should be optional in EmrCreateJobFlowOperator #24306
Hey @pankajastro - I am about to release RC2 for providers for May and would like to include this one - would it be possible to fix the failing tests quickly? |
43ea8bd to acc8aa5
Fixed it; the tests pass locally. Let's see how it goes in CI. |
@potiuk the static check is failing in this PR, as well as in others, with an error
|
Ah indeed - it looks like some cross-merged commits without rebase .... |
Fix: #24322 |
Is it actually an error? Possibly it comes from the lack of documentation for this connection type. As far as I remember, since Airflow 1.10 (and probably earlier) the Amazon Elastic MapReduce connection stored only kwargs for run_job_flow and |
I believe there have been quite a lot of refactorings of AwsBaseHook since then, and I guess this is a result of them. It actually looks legit to me. |
@pankajastro - if you could comment on @Taragolis's comment too - anyway I am happy to merge it soon, and the RC2 tests might be good for it.
The PR is likely OK to be merged with just subset of tests for default Python and Database versions without running the full matrix of tests, because it does not modify the core of Airflow. If the committers decide that the full tests matrix is needed, they will add the label 'full tests needed'. Then you should rebase to the latest main or amend the last commit of the PR, and push it with --force-with-lease. |
emr_conn_id is used only in EmrCreateJobFlowOperator, and I feel it expects the emr_conn_id extra field to contain the JSON request body for the run_job_flow API. But keeping the boto3 API request body in the DAG looks more convenient to me than storing it in a connection, and if I'm passing the job_flow_overrides param in the task, then emr_conn_id should not be mandatory. Here, I'm doing
|
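As a rough illustration of the point above (the request body living in the DAG, with the connection extra being optional), here is a minimal, self-contained sketch. The helper name `build_run_job_flow_kwargs` is hypothetical and is not part of the Amazon provider; it only assumes that the connection extra and `job_flow_overrides` are both plain dictionaries of `run_job_flow` keyword arguments.

```python
from typing import Any, Dict, Optional


def build_run_job_flow_kwargs(
    emr_conn_extra: Optional[Dict[str, Any]],
    job_flow_overrides: Dict[str, Any],
) -> Dict[str, Any]:
    """Hypothetical helper: merge an optional EMR connection extra with task overrides."""
    # Start from the connection's extra JSON if one was configured, else from nothing.
    config = dict(emr_conn_extra or {})
    # Top-level keys passed in the task win over values stored in the connection.
    config.update(job_flow_overrides)
    return config


# Request body kept entirely in the DAG, with no EMR connection configured.
kwargs = build_run_job_flow_kwargs(None, {"Name": "demo", "ReleaseLabel": "emr-6.5.0"})
```

With this shape, a task that supplies the whole body through `job_flow_overrides` never needs an EMR connection, while existing users who keep defaults in the connection extra still get them merged in.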
I've just checked 1.10.4 (I worked with this version for about a year) and it looks like it has the same behaviour: https://github.com/apache/airflow/blob/1.10.4/airflow/contrib/hooks/emr_hook.py If the user stored some parameters in the EMR Connection which are not accepted by However, after these changes:
I've never been happy with the behaviour that I need to provide an EMR connection in cases when I do not require it. |
The hook states: airflow/airflow/providers/amazon/aws/hooks/emr.py Lines 27 to 33 in e54ca47
|
If the user stores aws_access_key_id in the emr_conn_id extra, then I'm assuming the intention is to use this connection for authentication rather than for building the request body. If you want to throw an error as in the earlier case, I can do that, but in the current case you can store the AWS JSON in the emr_conn_id extra and then you do not need to pass the aws_conn_id param in the task |
In this case the behaviour will depend on AWS API like earlier |
If I am judging well - I think @eladkal also suggested that - I think it's not a "regression" - rather improvement so I am ok with merging it. |
If only case case make
    config = {}
    if self.emr_conn_id:
        emr_conn = self.get_connection(self.emr_conn_id)
        config = emr_conn.extra_dejson.copy()
To assume that the user provides an AWS Connection rather than an EMR Connection is, for me personally, not a good idea.
If a user deploys Airflow in AWS Cloud and follows AWS best practices, he/she won't use Examples:
For me it is not clear why we need to check only |
I'm checking this because I do not want a user to use emr_conn_id to authenticate and |
I'm just worried that the docstring says this argument is for an EMR Connection, but we assume that it could be an AWS Connection airflow/airflow/providers/amazon/aws/operators/emr.py Lines 288 to 289 in 01a52cc
I think if we want to keep it simpler, allow user to skip set
    config = {}
    if self.emr_conn_id:
        emr_conn = self.get_connection(self.emr_conn_id)
        if emr_conn.conn_type != 'emr':
            raise AirflowException(
                f"Expected 'emr' connection type for `emr_conn_id`={self.emr_conn_id} but got {emr_conn.conn_type!r}"
            )
        config = emr_conn.extra_dejson.copy()
WDYT @potiuk @eladkal @pankajastro With this suggestion
|
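The suggestion above can be tried outside Airflow with a small stand-alone sketch. Everything here is a stand-in: `FakeConnection` is a hypothetical substitute for an Airflow connection object, `load_emr_config` is an illustrative name, the `connections` dict replaces the real connection lookup, and `ValueError` is used instead of `AirflowException` so the snippet has no Airflow dependency.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class FakeConnection:
    """Hypothetical stand-in for an Airflow connection (not the real class)."""

    conn_type: str
    extra_dejson: Dict[str, Any] = field(default_factory=dict)


def load_emr_config(
    emr_conn_id: Optional[str],
    connections: Dict[str, FakeConnection],
) -> Dict[str, Any]:
    """Return the EMR connection extra, or an empty config when no id is given."""
    config: Dict[str, Any] = {}
    if emr_conn_id:
        emr_conn = connections[emr_conn_id]
        # Reject connections of any other type, e.g. an AWS connection.
        if emr_conn.conn_type != "emr":
            raise ValueError(
                f"Expected 'emr' connection type for `emr_conn_id`={emr_conn_id} "
                f"but got {emr_conn.conn_type!r}"
            )
        config = emr_conn.extra_dejson.copy()
    return config


conns = {
    "emr_default": FakeConnection("emr", {"Name": "default-job-flow"}),
    "aws_default": FakeConnection("aws", {"region_name": "eu-west-1"}),
}
```

Passing `None` skips the lookup entirely, passing `"emr_default"` returns a copy of its extra, and passing `"aws_default"` raises because the connection type is not `emr`.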
If you want to keep this behaviour, then some users will have to create an EMR connection that holds no data, which I don't want. Basically, emr_conn_id should not be mandatory. |
I mean if user set I think it is also required to make changes in the operator |
@vincbeck @ferruzzi @o-nikolas any thoughts on this one? |
OK, so the behaviour should be like
We can't keep the aws_conn_id and emr_conn_id default values as None, because if the user does not pass aws_conn_id then we assume that the value is |
9083a03 to 5b07184
BTW, airflow/airflow/providers/amazon/aws/hooks/base_aws.py Lines 418 to 422 in 01a52cc
So it doesn't raise any error now - only static checks warning in IDE
|
Yes, we can also set it here, but that too will be a breaking change for someone who does not pass aws_conn_id in this operator's params and has created a connection with the name |
Quite a lot of operators and sensors in the amazon provider use different annotations and default values for
Due to the fact that all of them (or most) use hooks based on You could use
My suggestion is still the same
|
Then the default value won't be None it will be aws_default, right? |
yep |
So, if I change the init method of the operator to
Then from the code execution point of view, this is effectively no change
Then this will be a breaking change if someone has created a connection named emr_default and is not passing the emr_conn_id param in the operator, won't it? |
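One way to reconcile the two concerns in this exchange (keep the `emr_default` default so existing setups continue to work, yet not force anyone to create an empty connection) is to treat a missing connection as an empty config. The sketch below only illustrates that idea under stated assumptions: `resolve_emr_extra` and `DEFAULT_EMR_CONN_ID` are hypothetical names, and a plain dict stands in for the Airflow connection store.

```python
from typing import Any, Dict, Optional

# Assumed default name, taken from the discussion above.
DEFAULT_EMR_CONN_ID = "emr_default"


def resolve_emr_extra(
    connections: Dict[str, Dict[str, Any]],
    emr_conn_id: Optional[str] = DEFAULT_EMR_CONN_ID,
) -> Dict[str, Any]:
    """Return the stored extra for emr_conn_id, or {} if it is unset or missing."""
    if not emr_conn_id:
        # The user explicitly opted out of an EMR connection.
        return {}
    extra = connections.get(emr_conn_id)
    if extra is None:
        # No such connection exists: fall back to an empty config instead of failing.
        return {}
    return dict(extra)


# An existing deployment that relies on the default name keeps working unchanged.
existing = {"emr_default": {"Name": "legacy-job-flow"}}
legacy_extra = resolve_emr_extra(existing)
```

A user who never created `emr_default` gets an empty config from the same call, so the default value does not have to change to `None`.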
Yeah, you're right; for example, the EMR example DAG doesn't set airflow/airflow/providers/amazon/aws/example_dags/example_emr.py Lines 77 to 80 in 047a616
So it would be nice to update the docstring about the behaviour of airflow/airflow/providers/amazon/aws/hooks/base_aws.py Lines 378 to 383 in 047a616
|
f137427 to 140a26d
Updated docstring here 140a26d |
I unfortunately don't have the time to catch up on this issue today, but dropping in to CC @dacort (who is the AWS Developer Advocate for EMR) |
Hey guys, just wanted to check if I need to address anything more here? |
I was also OOO, but looks like this is already taken care of. Feel free to ping me in the future for any EMR-related questions. |
#Closes: #24318