[AIRFLOW-1520] Boto3 S3Hook, S3Log #2532
Conversation
@NielsZeilemaker, thanks for your PR! By analyzing the history of the files in this pull request, we identified @artwr, @criccomini and @skudriashev to be potential reviewers.

I didn't replace the boto2 S3Hook, as it returns boto2 datatypes which aren't compatible with boto3.

@NeckBeardPrince, please fix tests.
airflow/utils/logging.py (outdated)

```diff
@@ -60,7 +60,8 @@ class S3Log(object):
     def __init__(self):
         remote_conn_id = configuration.get('core', 'REMOTE_LOG_CONN_ID')
         try:
-            from airflow.hooks.S3_hook import S3Hook
+            from airflow.contrib.hooks.s3_hook import S3Hook
```
Mmm, it doesn't seem to be good practice to start relying on contrib in core.
@aoen should we upgrade the s3 hook in core?
Agreed about not relying on contrib in core. It will take a while, but we can move any dependencies on contrib into abstractions, like you and Allison did for logging. This change seems a bit risky to me; I'm not sure the two hooks are 100% compatible (though I'm not blocking it). We might want the abstraction to be created first.
Force-pushed from bbd6397 to 4a014c3.
I've updated the pull request to fix the tests and to move the S3Hook from contrib into core. The tests should be fixed as well.
Force-pushed from 4a014c3 to 6e19146.
Codecov Report

```
@@            Coverage Diff             @@
##           master    #2532      +/-   ##
==========================================
+ Coverage    71.8%   72.33%     +0.52%
==========================================
  Files         154      154
  Lines       11895    11786      -109
==========================================
- Hits         8541     8525       -16
+ Misses       3354     3261       -93
```

Continue to review full report at Codecov.
Can you rebase please, so we can have a look at merging?

Perhaps the aws_hook (which uses boto3) should be promoted to core as well, given it should be the superclass for all AWS hooks. Looking forward to this one; I am trying to create a generic DB-to-S3 operator and having some issues with the boto2 S3 hook.

I've also had issues with the boto(2) version of the S3 hook and am using boto3 everywhere else. I understand there is some backwards incompatibility from boto to boto3, but boto is barely maintained these days (vs boto3). It would be great to see this promoted to a core component.

I'd like to get it in first, but it needs to be rebased. We can promote it afterwards.

@NielsZeilemaker Is this something that you're still working on? We're interested in the same feature, so happy to push this over the finish line if needed.

@tedmiston I would pick it up; the window to include it in 1.9.0 is closing this weekend.

Sorry guys, I was on holiday. I'll pick it up, but I can't check out master at the moment, as it contains a file with a filename which cannot be created on Windows. (I seem to be the only guy on Windows, I guess.)

Happy to help with this; would love to add additional features over hooks with boto3.

I've created a pull request for the filename fix, #2673. If that is merged, I can go ahead and rebase this one.
airflow/contrib/hooks/aws_hook.py (outdated)

```diff
@@ -26,8 +26,8 @@ class AwsHook(BaseHook):
     """
     def __init__(self, aws_conn_id='aws_default'):
         self.aws_conn_id = aws_conn_id

     def get_client_type(self, client_type, region_name=None):
```
Remove this trailing whitespace.
The old S3 hook allowed specifying the key and secret in the extra field. Should we fall back to that, so as to not break things? (I personally prefer login and password, as you have done.) Would it make sense to hijack another field to store the region in? Host, maybe? Also, if we are going to use login and password as the key and secret, we should label them as such.
Is there a generic Flask template for showing connection info fields? Perhaps it's easy to relabel some fields on the form. Personally I think using the username & password fields is a good idea, because at least then the secret can use the Fernet encryption in the meta DB and not be visible in the UI.

Generally with boto/boto3 and other AWS SDKs there is a hierarchy of places to retrieve credentials: first check if they are provided as arguments, then check environment variables, then check the config file, and then fall back to the IAM role. I personally use an IAM role here because it's the easiest to provision and maintain, and it is transparent to the application. The Airflow documentation for the S3 connection should recommend IAM, assuming the server is in an AWS VPC. IAM is magic. For this solution, I would recommend following a similar fallback: use the username & password fields, otherwise fall back to the "extra params" field, otherwise let boto do the remaining fallback options and bubble up any exceptions.

I've personally been hacking at the S3Hook and saved it as a plugin in my environment, because I am running server-side encryption (KMS) on my S3 buckets with a policy enforcing use of encryption, and the existing S3Hook doesn't cater to that. I have it working well, but given that the current hook uses the original boto and this PR is pending, there's no point in me making a PR for this yet. I've exposed the "headers" parameter of the load_files method, which accepts a dict of extra API headers. My approach is to let de_json pop all the expected extra params (region, key, etc.) and then throw whatever is left into the "headers". Specifically I am looking at the three that start with "x-amz-server-side-encryption", but there's a whole bunch of possible headers you can send with S3 requests, e.g. custom metadata. Hopefully this PR will accommodate that; if not, I will wait until it gets merged and look at submitting a change after that.
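[Editorial sketch, not part of the PR: in boto3 the extra request headers mentioned above map onto the ExtraArgs dict accepted by the managed transfer methods. Bucket name, key path, and KMS alias below are placeholders.]

```python
import boto3

s3 = boto3.client('s3')
s3.upload_file(
    Filename='/tmp/report.csv',        # local file (example path)
    Bucket='my-encrypted-bucket',      # placeholder bucket name
    Key='reports/report.csv',
    ExtraArgs={
        # boto3 equivalent of the x-amz-server-side-encryption* headers:
        'ServerSideEncryption': 'aws:kms',
        'SSEKMSKeyId': 'alias/my-key',          # placeholder KMS key alias
        'Metadata': {'source': 'airflow'},      # example custom metadata
    },
)
```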
@davoscollective - Personally, the auto-magic auth fallbacks in boto have bitten me several times, where it ends up pulling personal or work creds from my home directory when I want the other one. I like the idea of falling back from args to environment variables, as long as the fallbacks are explicit and documented. The fallback to AWS creds in the home directory has been confusing to me, though. We currently use a forked version of S3Hook that pulls the access key and secret from login and password, as discussed above. That said, I modified it to fall back to login/password and still use extras first to maintain backwards compatibility. Personally I think we should strive for the same here, but I could see the argument for best practices replacing the backwards-compatible version.
Whether you take extras first or not, as long as it's in the chain, backwards compatibility is maintained. The reason I liked login/password first is that when you look at the UI it is immediately obvious that those fields are populated (at least the username is shown), whereas extras takes a bit longer to grep, especially when it's full of other values, not just key/secret. In my case I am running dev & prod Airflow servers; they run as the airflow user with minimal privileges and have no creds in the home folder and no environment variables, so they can't be bitten the way you describe, although I can see how that might happen on your local dev environment if you run the Airflow service under your own username. There may be people using Airflow for personal work on their own machines too. Agreed the fallback should be documented; happy to contribute to that.
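[Editorial sketch of the lookup order discussed in the last few comments, assuming hypothetical helper names; conn stands for an Airflow Connection object.]

```python
import boto3

def resolve_credentials(conn):
    """Sketch: login/password first, then the connection's extra field,
    then boto3's own default chain (env vars, config file, IAM role)."""
    if conn.login and conn.password:
        return conn.login, conn.password
    extra = conn.extra_dejson
    if 'aws_access_key_id' in extra:
        return extra['aws_access_key_id'], extra['aws_secret_access_key']
    return None, None  # let boto3 resolve via its default chain

def build_session(conn, region_name=None):
    key, secret = resolve_credentials(conn)
    # Passing None for both lets boto3 fall back to environment
    # variables, shared config files, and finally the IAM role.
    return boto3.session.Session(
        aws_access_key_id=key,
        aws_secret_access_key=secret,
        region_name=region_name,
    )
```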
When you get a chance, can you rebase? I'm working on PRs branched from here, and I believe there are some failing tests that will be fixed when this PR is updated with the latest master!

@ahh2131, unfortunately I can't, as I can't pull master due to the Windows colon thingy. It hasn't been merged.

Ah, gotcha gotcha!

Can you rebase / squash please, @NielsZeilemaker? Then we can have another look.
Force-pushed from 4dd60e4 to a6a9919.

Force-pushed from a6a9919 to bab26a4.
@bolkedebruin I've rebased and fixed the tests. One possible change I would suggest is to move the aws_hook from contrib to core, so the S3Hook doesn't import from contrib. If you agree, I'll add a placeholder with a deprecation warning in contrib so existing code doesn't break.
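[Editorial sketch of such a contrib placeholder; the post-move import path airflow.hooks.aws_hook is an assumption, not the PR's actual layout.]

```python
# airflow/contrib/hooks/aws_hook.py -- hypothetical deprecation shim.
# Re-export the hook from its assumed new core location and warn callers.
import warnings

from airflow.hooks.aws_hook import AwsHook  # assumed post-move location

warnings.warn(
    "airflow.contrib.hooks.aws_hook is deprecated; "
    "import AwsHook from airflow.hooks.aws_hook instead.",
    DeprecationWarning,
    stacklevel=2,
)
```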
@aoen @saguziel what do you guys think? I think this makes sense, and there seems to be demand for it. @NielsZeilemaker, for me one condition is important: you babysit this through to the 1.9 release. Furthermore, make sure that your code is covered with tests; it is only 21% now. Coverage should increase (or stay the same) when moving to core.
```python
aws_secret_access_key = connection_object.extra_dejson['aws_secret_access_key']

elif 's3_config_file' in connection_object.extra_dejson:
    aws_access_key_id, aws_secret_access_key = \
```
Please use () for multi-line statements.
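[For illustration, the parenthesized form the reviewer is asking for; the right-hand side is assumed from the _parse_s3_config signature shown further down.]

```python
# Backslash continuation, as in the excerpt above:
aws_access_key_id, aws_secret_access_key = \
    _parse_s3_config(s3_config_file, s3_config_format)

# Parenthesized continuation, as requested:
aws_access_key_id, aws_secret_access_key = (
    _parse_s3_config(s3_config_file, s3_config_format)
)
```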
```python
aws_access_key_id = None
aws_secret_access_key = None

def get_client_type(self, client_type, region_name=None):
    aws_access_key_id, aws_secret_access_key, region_name, endpoint_url = \
```
Idem - please use () here as well.
```python
from airflow.exceptions import AirflowException
from airflow.hooks.base_hook import BaseHook


def _parse_s3_config(config_file_name, config_format='boto', profile=None):
```
Where is the test for this function?
There is none; it's legacy code I moved over from the S3Hook.
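[Editorial sketch of a test along the lines the reviewer is asking for; the import path and the [Credentials] layout of a boto-format config file are assumptions.]

```python
import tempfile
import textwrap
import unittest

from airflow.hooks.S3_hook import _parse_s3_config  # assumed import path


class ParseS3ConfigTest(unittest.TestCase):
    def test_parses_boto_format(self):
        # Minimal boto-style config file with a [Credentials] section.
        config = textwrap.dedent("""\
            [Credentials]
            aws_access_key_id = AKIA_TEST
            aws_secret_access_key = SECRET_TEST
        """)
        with tempfile.NamedTemporaryFile('w', suffix='.cfg',
                                         delete=False) as f:
            f.write(config)
        key_id, secret = _parse_s3_config(f.name, config_format='boto')
        self.assertEqual(key_id, 'AKIA_TEST')
        self.assertEqual(secret, 'SECRET_TEST')
```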
Regarding the babysitting: I moved to a different client who doesn't use AWS. We have been using this code to log to S3, and it works, but currently I have no way to test it.
I'm working on multiple PRs off of this one and will be using this boto3 update in prod once it's ready - I can babysit it to 1.9 if that's helpful.
@davoscollective The edge case that breaks backwards compatibility in the setup you described is if both login + password and extras are provided (and are different). We can babysit this as well; we have a good amount of S3 activity and I'd like to get it rolling in our prod environment. @bolkedebruin Does test coverage need to be increased pre-merge, or would post-merge with follow-up PRs be okay?
@tedmiston Follow-up is fine. OK, merging, assuming babysitting and follow-up PRs. @tedmiston cc @aoen @criccomini
Closes #2532 from NielsZeilemaker/AIRFLOW-1520
(cherry picked from commit cd3ad3f)
Signed-off-by: Bolke de Bruin <[email protected]>
This PR removed two quite important things:

Are there any plans/issues to address/reintroduce this?
@simonvanderveldt Have you tested that it doesn't? My understanding is that the upgrade from boto to boto3 means it falls back to the environment variables and, failing that, the host IAM role implicitly. I will double-check that when I am at a computer.
@davoscollective Thanks for the quick response. And yeah, I have tested it, and assume-role-based credentials don't work; that's why I started looking into the code. What might be the case regarding the fallback to boto3 credential handling is that it does actually work, but since it doesn't do any assume-role call, I have to assume the role myself and set the credentials that that call returns. I'll give that a try and report back.
Are you running on an EC2 instance with an IAM role attached that has a policy that allows writing to the S3 bucket?
I reread those docs about boto3 credentials, and I think the only valid use case for assume role or temporary sessions is where you need Airflow to access a different AWS account from where it is hosted, perhaps an account belonging to another organization or a division within the same organization. If that's not the case, IAM roles are the recommended best practice and a lot simpler to manage. That being said, there will be some who have that use case, so removing this ability is not ideal.
No, I'm running Airflow locally in a Docker container and use assume-role to access resources within an AWS account. I'll provide some context, sorry for the wall of text.

The fact that I'm running Airflow in a container makes the regular way of using assume-role (adding a profile to

Anyway, the fallback to boto3 credentials is working. I did the assume-role call myself using the AWS CLI and set the
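[Editorial sketch of doing the assume-role call in code rather than via the AWS CLI; the role ARN and session name are placeholders.]

```python
import boto3

# Assume the role explicitly, then build a session from the temporary
# credentials that STS returns.
sts = boto3.client('sts')
response = sts.assume_role(
    RoleArn='arn:aws:iam::123456789012:role/airflow-s3-access',  # placeholder
    RoleSessionName='airflow-logging',                           # placeholder
)
creds = response['Credentials']

session = boto3.session.Session(
    aws_access_key_id=creds['AccessKeyId'],
    aws_secret_access_key=creds['SecretAccessKey'],
    aws_session_token=creds['SessionToken'],
)
s3 = session.client('s3')  # this client now acts under the assumed role
```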
No, there are no details for the errors, mainly because of this nice line :x |
Closes apache#2532 from NielsZeilemaker/AIRFLOW-1520
Dear Airflow maintainers,
Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
JIRA
Description
I've implemented a contrib S3Hook which uses boto3, and changed logging.utils.S3Log to use this hook.
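[Editorial sketch of what a boto3-backed hook along these lines could look like; the method bodies and the credential handling are assumptions based on the review thread, not the PR's actual code.]

```python
import boto3

from airflow.hooks.base_hook import BaseHook


class S3Hook(BaseHook):
    """Sketch of a boto3-backed S3 hook; not the PR's actual code."""

    def __init__(self, aws_conn_id='aws_default'):
        self.aws_conn_id = aws_conn_id

    def get_conn(self):
        conn = self.get_connection(self.aws_conn_id)
        # login/password are assumed to carry the key id and secret,
        # as discussed in the review thread above.
        return boto3.client(
            's3',
            aws_access_key_id=conn.login,
            aws_secret_access_key=conn.password,
        )

    def load_file(self, filename, key, bucket_name):
        # Managed upload via boto3's transfer machinery.
        self.get_conn().upload_file(filename, bucket_name, key)
```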
Tests
Commits