
[AIRFLOW-1520] Boto3 S3Hook, S3Log #2532

Closed
wants to merge 1 commit into master from NielsZeilemaker:AIRFLOW-1520

Conversation

NielsZeilemaker
Contributor

Dear Airflow maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • Here are some details about my PR, including screenshots of any UI changes:
    I've implemented a contrib.S3Hook which uses boto3, and changed logging.utils.S3Log to use this hook.
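For context, a minimal sketch of the general shape of such a boto3-backed hook built on AwsHook.get_client_type; this is illustrative only, not the exact code in this PR, and the import path, class name and method signatures are assumptions:

# Illustrative sketch only; not the code in this PR.
from airflow.contrib.hooks.aws_hook import AwsHook
import botocore.exceptions


class S3HookSketch(AwsHook):
    """Interact with AWS S3 through boto3 instead of boto2."""

    def get_conn(self):
        # AwsHook resolves credentials from the Airflow connection and
        # otherwise falls back to boto3's own credential chain.
        return self.get_client_type('s3')

    def check_for_key(self, key, bucket_name):
        try:
            self.get_conn().head_object(Bucket=bucket_name, Key=key)
            return True
        except botocore.exceptions.ClientError:
            return False

    def load_string(self, string_data, key, bucket_name, replace=False):
        if not replace and self.check_for_key(key, bucket_name):
            raise ValueError("key {} already exists in {}".format(key, bucket_name))
        self.get_conn().put_object(Bucket=bucket_name, Key=key,
                                   Body=string_data.encode('utf-8'))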

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

@mention-bot

@NielsZeilemaker, thanks for your PR! By analyzing the history of the files in this pull request, we identified @artwr, @criccomini and @skudriashev to be potential reviewers.

@NielsZeilemaker
Contributor Author

I didn't replace the boto2 S3Hook, as it returns boto2 datatypes which aren't compatible with boto3.
I think the best option would be to add deprecation warnings

@skudriashev
Contributor

@NeckBeardPrince, please fix tests.

@@ -60,7 +60,8 @@ class S3Log(object):
     def __init__(self):
         remote_conn_id = configuration.get('core', 'REMOTE_LOG_CONN_ID')
         try:
-            from airflow.hooks.S3_hook import S3Hook
+            from airflow.contrib.hooks.s3_hook import S3Hook
Contributor

Mmm, it doesn't seem to be good practice to start relying on contrib in core.

@aoen should we upgrade the s3 hook in core?

Contributor

Agreed about not relying on contrib in core. It will take a while, but we can move any dependencies on contrib into abstractions like you and Allison did for logging. This change seems a bit risky to me; I'm not sure if they are 100% compatible (though I'm not blocking it). We might want the abstraction to be created first.

@NielsZeilemaker
Contributor Author

I've updated the pull request to fix the tests and moved the S3Hook from contrib to core.
It does, however, still depend on the aws_hook from contrib.

Tests should be fixed as well

@codecov-io

codecov-io commented Aug 21, 2017

Codecov Report

Merging #2532 into master will increase coverage by 0.52%.
The diff coverage is 21.42%.


@@            Coverage Diff             @@
##           master    #2532      +/-   ##
==========================================
+ Coverage    71.8%   72.33%   +0.52%     
==========================================
  Files         154      154              
  Lines       11895    11786     -109     
==========================================
- Hits         8541     8525      -16     
+ Misses       3354     3261      -93
Impacted Files Coverage Δ
airflow/hooks/S3_hook.py 30.95% <21.42%> (+9.19%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b464d23...bab26a4.

@bolkedebruin
Contributor

Can you rebase please, so we can have a look at merging?

@ghost

ghost commented Sep 23, 2017

Perhaps the aws_hook (which uses boto3) should be promoted to core as well, given it should be the superclass for all AWS hooks. Looking forward to this one; I am trying to create a generic DB-to-S3 operator and having some issues with the boto2 S3 hook.

@tedmiston
Contributor

tedmiston commented Sep 23, 2017

I've also had issues with the boto(2) version of the S3 hook and am using boto3 otherwise. I understand there is some backwards incompatibility from boto to boto3 but boto is barely maintained these days (vs boto3). It would be great to see this promoted to a core component.

@bolkedebruin
Contributor

I'd like to get it in first, but it needs to be rebased. We can promote it afterwards.

@tedmiston
Contributor

@NielsZeilemaker Is this something that you're still working on? We're interested in the same feature so happy to push this over the finish line if needed.

@bolkedebruin
Contributor

@tedmiston I would pick it up; the window to include it in 1.9.0 is closing this weekend.

@NielsZeilemaker
Contributor Author

Sorry guys, I was on holiday. I'll pick it up, but I can't check out master at the moment as it contains a file with a filename which cannot be created on Windows (I seem to be the only one on Windows, I guess).
I'll create a pull request to change the name of the file.

@andyxhadji
Contributor

Happy to help with this; would love to add additional features on top of the boto3 hooks.

@NielsZeilemaker
Contributor Author

I've created a pull request for the filename fix, #2673. If that is merged, I can go ahead and rebase this one.

@@ -26,8 +26,8 @@ class AwsHook(BaseHook):
    """
    def __init__(self, aws_conn_id='aws_default'):
        self.aws_conn_id = aws_conn_id

    def get_client_type(self, client_type, region_name=None):

Contributor

Remove these whitespaces.

@dalupus
Contributor

dalupus commented Oct 9, 2017

The old S3 hook allowed for specifying the key and secret in the extra field.

Should we fall back to that so as to not break things? (I personally prefer the login and password, as you have done.)

Would it make sense to hijack another field to save the region in? Host, maybe?

Also, if we are going to use login and password as the key and secret we should label them as such.

@ghost

ghost commented Oct 10, 2017

Is there a generic Flask template for showing connection info fields? Perhaps it's easy to relabel some fields on the form. Personally I think using the username & password is a good idea, because at least then the secret can use the Fernet encryption in the meta DB and not be visible in the UI.

Generally with boto/boto3 and other AWS SDKs there is a hierarchy of places to retrieve credentials: first check if they are provided as arguments, then check environment variables, then check the config file, and then fall back to the IAM role. I personally use an IAM role here because it's the easiest to provision and maintain, and is transparent to the application. The Airflow documentation for the S3 connection should recommend IAM, assuming the server is in an AWS VPC. IAM is magic.

For this solution, I would recommend following a similar fallback, i.e. use the username & password field, otherwise fall back to the "extra params" field, otherwise let boto do the remaining fallback options and bubble up any exceptions.
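A rough sketch of that resolution order, assuming Airflow's Connection model (login, password, extra_dejson); the helper name is made up:

def _resolve_s3_credentials(connection):
    """Return (aws_access_key_id, aws_secret_access_key), or (None, None)
    to let boto3 run its own chain (env vars, config files, IAM role)."""
    # 1. Explicit login/password on the Airflow connection.
    if connection.login:
        return connection.login, connection.password
    # 2. Key/secret stored in the "extra" JSON blob, as the boto2 hook allowed.
    extra = connection.extra_dejson
    if 'aws_access_key_id' in extra:
        return extra['aws_access_key_id'], extra.get('aws_secret_access_key')
    # 3. Nothing configured in Airflow: defer to boto3 and let exceptions bubble up.
    return None, None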

I've personally been hacking at the S3Hook and saved it as a plugin in my environment, because I am running server-side encryption (KMS) on my S3 buckets with a policy enforcing use of encryption, and the existing S3Hook doesn't cater to that. I have it working well, but given the current hook uses the original boto and this PR is pending, there's no point in me making a PR for this yet.

I've exposed the "headers" parameter of the load_files method, which will accept a dict of extra API headers. My approach to this is to let de_json pop all the expected extra params (region, key, etc.) and then throw whatever is left into the "headers". Specifically I am looking at the 3 that start with "x-amz-server-side-encryption", but there's a whole bunch of possible headers you can send with S3 requests, e.g. custom metadata. Hopefully this PR will accommodate that; if not, I will wait till it gets merged and look at submitting a change after that.
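With boto3 the boto2-era x-amz-server-side-encryption headers map onto the ExtraArgs of the upload calls rather than raw request headers; a hedged sketch, with bucket, key and KMS key id as placeholders:

import boto3

def upload_with_sse_kms(filename, bucket, key, kms_key_id=None):
    # Server-side encryption options are passed via ExtraArgs in boto3.
    extra_args = {'ServerSideEncryption': 'aws:kms'}
    if kms_key_id:
        extra_args['SSEKMSKeyId'] = kms_key_id
    boto3.client('s3').upload_file(filename, bucket, key, ExtraArgs=extra_args)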

@tedmiston
Contributor

tedmiston commented Oct 10, 2017

@davoscollective Personally, the automagic auth fallbacks in boto have bitten me several times, where it ends up pulling personal or work creds from my home directory when I want the other one. I like the idea of falling back from args to environment variables as long as the fallbacks are explicit and documented. The fallback to AWS creds in the home directory has been confusing to me, though.

We currently use a forked version of S3Hook that pulls the access key and secret from login and password, as discussed above. That said, I modified it to fall back to login/password and still use extras first to maintain backwards compatibility. Personally I think we should strive for the same here, but I could see the argument for best practices replacing the backwards-compatible version.

@ghost

ghost commented Oct 11, 2017

Whether you take extras first or not, as long as it's in the chain then backwards compatibility is maintained. The reason I liked user/login first is because when you look at the UI it is immediately obvious that those fields are populated (at least the username is shown), whereas extras takes a bit longer to grep, especially when it's full of other values, not just key/secret.

In my case I am running dev & prod Airflow servers, and they run as the airflow user with minimal privileges and have no creds in the home folder and no environment variables, so they can't be bitten the way you describe, although I can see how that might happen in your local dev environment if you run the airflow service with your own username. There may be people using Airflow for personal work on their own machines too. Agreed the fallback should be documented; happy to contribute to that.

@andyxhadji
Contributor

When you get a chance, can you rebase? I'm working on PRs branched from here, and I believe there are some failing tests that will be fixed when this PR is updated with the latest master!

@NielsZeilemaker
Contributor Author

@ahh2131, unfortunately I can't, as I can't pull master due to the Windows colon thingy. It hasn't been merged.

@andyxhadji
Contributor

Ah gotcha gotcha!

@bolkedebruin
Contributor

Can you rebase/squash please, @NielsZeilemaker? Then we can have another look.

@NielsZeilemaker force-pushed the AIRFLOW-1520 branch 2 times, most recently from 4dd60e4 to a6a9919 on October 19, 2017 07:37
@NielsZeilemaker
Contributor Author

@bolkedebruin I've rebased and fixed the tests. One possible change I would suggest is to move the aws_hook from contrib to core, so the S3Hook doesn't import from contrib. If you agree, I'll add a placeholder with a deprecation warning in contrib to not break existing code.
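A hedged sketch of what such a contrib placeholder could look like; the target core module path is an assumption, since the move hasn't happened yet:

# airflow/contrib/hooks/aws_hook.py -- hypothetical placeholder after a move to core
import warnings

from airflow.hooks.aws_hook import AwsHook  # noqa: F401  (assumed new core location)

warnings.warn(
    "airflow.contrib.hooks.aws_hook has moved to airflow.hooks.aws_hook; "
    "please update your imports.",
    DeprecationWarning,
    stacklevel=2,
)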

@bolkedebruin
Contributor

bolkedebruin commented Oct 19, 2017

@aoen @saguziel what do you guys think? I think this makes sense and there seems to be demand for it.

@NielsZeilemaker for me one condition is important: you babysit this through to the 1.9 release. Furthermore, make sure that your code is covered with tests; it's only 21% now. Coverage should increase (or stay the same) when moving to core.

            aws_secret_access_key = connection_object.extra_dejson['aws_secret_access_key']

        elif 's3_config_file' in connection_object.extra_dejson:
            aws_access_key_id, aws_secret_access_key = \
Contributor

please use () for multilines
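For illustration, the parenthesized continuation the reviewer is asking for instead of a trailing backslash (argument names are assumed from the surrounding code):

aws_access_key_id, aws_secret_access_key = (
    _parse_s3_config(s3_config_file, s3_config_format, profile)
)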

        aws_access_key_id = None
        aws_secret_access_key = None
    def get_client_type(self, client_type, region_name=None):
        aws_access_key_id, aws_secret_access_key, region_name, endpoint_url = \
Contributor

idem


from airflow.exceptions import AirflowException
from airflow.hooks.base_hook import BaseHook


def _parse_s3_config(config_file_name, config_format='boto', profile=None):
Contributor

Where is the test for this function?

Contributor Author

There is none; it's legacy code I moved from the S3Hook.
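A hedged sketch of a unit test that could cover it in a follow-up, assuming a boto-format config file keeps credentials under a [Credentials] section and that the function returns an (access key, secret key) tuple; the import path is an assumption:

import tempfile
import unittest

from airflow.hooks.S3_hook import _parse_s3_config  # assumed location


class TestParseS3Config(unittest.TestCase):
    def test_boto_format(self):
        # Write a minimal boto-style config file and parse it back.
        with tempfile.NamedTemporaryFile('w', suffix='.cfg', delete=False) as f:
            f.write('[Credentials]\n'
                    'aws_access_key_id = AKIAEXAMPLE\n'
                    'aws_secret_access_key = SECRETEXAMPLE\n')
        key, secret = _parse_s3_config(f.name, config_format='boto')
        self.assertEqual(key, 'AKIAEXAMPLE')
        self.assertEqual(secret, 'SECRETEXAMPLE')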

@NielsZeilemaker
Contributor Author

Regarding the babysitting: I moved to a different client who doesn't use AWS. We have been using this code to log to S3 and it works, but currently I have no way to test it.

@andyxhadji
Contributor

I'm working on multiple PRs off of this one, and will be using this boto3 update in prod once it's ready. I can babysit it to 1.9 if that's helpful.

@tedmiston
Contributor

tedmiston commented Oct 19, 2017

@davoscollective The edge case that breaks backwards compatibility in the setup you described is if both login + password and extras are provided (and are different).

We can babysit this as well. We have a good amount of S3 activity and I'd like to get it rolling in our prod environment.

@bolkedebruin Does test coverage need to be increased pre-merge or would post-merge with follow up PRs be okay?

@bolkedebruin
Contributor

@tedmiston follow-up is fine.

OK, merging assuming babysitting and follow-up PRs. @tedmiston cc @aoen @criccomini

@simonvanderveldt

This PR removed two quite important things:

Are there any plans/issues to address/reintroduce these?

@ghost

ghost commented Jan 16, 2018

@simonvanderveldt have you tested that it doesn't? My understanding is that the upgrade from boto to the boto3 library means it falls back to the environment variables and, failing that, the host IAM role implicitly. I will double check that when I am at a computer.

@simonvanderveldt

simonvanderveldt commented Jan 16, 2018

> @simonvanderveldt have you tested that it doesn't? My understanding is that the upgrade from boto to the boto3 library means it falls back to the environment variables and, failing that, the host IAM role implicitly. I will double check that when I am at a computer.

@davoscollective Thanks for the quick response. And yeah, I have tested it and assume-role-based credentials don't work; that's why I started looking into the code.
That one is definitely not working because it has been removed in its entirety :)

What might be the case regarding the fallback to boto3 credential handling is that it does actually work, but since it doesn't do any assume-role call I have to do the assume role myself and set the credentials that that call returns. I'll give that a try and report back.

@ghost

ghost commented Jan 16, 2018

Are you running on an EC2 with an IAM role attached that has a policy that allows writing to the S3 bucket?
http://boto3.readthedocs.io/en/latest/guide/configuration.html
In the docs there's a list of 8 places boto3 looks for credentials. #6 is an explicit assume role provider, number 8 is EC2 instance IAM metadata.
I wonder if you don't have some other thing preventing S3 access. Do you see any specific errors?
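One quick way to see which of those sources boto3 is actually picking up (purely a debugging sketch):

import boto3

session = boto3.Session()
creds = session.get_credentials()
# .method reports the provider, e.g. 'env', 'shared-credentials-file', 'iam-role'
print('credential source:', creds.method if creds else None)
print('caller identity:', boto3.client('sts').get_caller_identity()['Arn'])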

@ghost

ghost commented Jan 16, 2018

I reread those docs about boto3 credentials and I think the only valid use case for assume role or temporary sessions is where you need Airflow to access a different AWS account from where it is hosted, perhaps an account belonging to another organization or a division within the same organization. If that's not the case, IAM roles are the recommended best practice, and a lot simpler to manage. That being said, there will be some who have that use case, so removing this ability is not ideal.

@simonvanderveldt

simonvanderveldt commented Jan 16, 2018

> Are you running on an EC2 with an IAM role attached that has a policy that allows writing to the S3 bucket?
> http://boto3.readthedocs.io/en/latest/guide/configuration.html
> In the docs there's a list of 8 places boto3 looks for credentials. #6 is an explicit assume role provider, number 8 is EC2 instance IAM metadata.

No, I'm running Airflow locally in a Docker container and use assume role to access resources within an AWS account. I'll provide some context; sorry for the wall of text.

The fact that I'm running Airflow in a container makes the regular way of using assume role (adding a profile to ~/.aws/config and passing that profile's name to boto3.client(), or setting the AWS_PROFILE env var) pretty impractical/impossible, because I can't properly volume-mount my ~/.aws/config. So Airflow's option to use assume role by setting the role ARN was very welcome.
For people who want to use assume role when running Airflow locally it would be enough to set AWS_PROFILE (or the option to set the profile name in the extra config should be added again, with Airflow doing the assume-role call).

Anyway, the fallback to boto3 credentials is working: I did the assume-role call myself using the AWS CLI and set the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN env vars based on that. After that, logging to S3 worked using these credentials.
Doing this is not ideal because I have to do the assume-role call myself and convert the response from AWS to env vars, and the credentials time out, whereas if you leave the assume-role action to boto3 it gets renewed automatically.
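For reference, a sketch of that manual workaround done with boto3's STS client instead of the AWS CLI; the role ARN and session name are placeholders, and the temporary credentials still expire without being renewed:

import boto3

def s3_client_via_assume_role(role_arn, session_name='airflow-s3-logging'):
    # Call AssumeRole manually and hand the temporary credentials to an S3 client.
    creds = boto3.client('sts').assume_role(
        RoleArn=role_arn, RoleSessionName=session_name)['Credentials']
    return boto3.client(
        's3',
        aws_access_key_id=creds['AccessKeyId'],
        aws_secret_access_key=creds['SecretAccessKey'],
        aws_session_token=creds['SessionToken'],
    )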

> I wonder if you don't have some other thing preventing S3 access. Do you see any specific errors?

No, there are no details for the errors, mainly because of this nice line :x
https://github.com/apache/incubator-airflow/blob/4ce4faaeae7a76d97defcf9a9d3304ac9d78b9bd/airflow/utils/log/s3_task_handler.py#L166

stlava pushed a commit to Nextdoor/airflow that referenced this pull request Sep 4, 2019
Closes apache#2532 from NielsZeilemaker/AIRFLOW-1520