Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Credentials present in logs when using Shared Key auth to Azure Blob Storage #19883

Closed
2 tasks done
josh-fell opened this issue Nov 30, 2021 · 7 comments
Closed
2 tasks done
Assignees
Labels

Comments

@josh-fell
Copy link
Contributor

josh-fell commented Nov 30, 2021

Apache Airflow Provider(s)

microsoft-azure

Versions of Apache Airflow Providers

Latest of all providers available on Airflow main.

Apache Airflow version

main (development)

Operating System

Debian GNU/Linux 10 (buster)

Deployment

Other

Deployment details

Using Breeze on main branch.

What happened

When authenticating to Azure Blob Storage using the "Shared Key" method (which only includes the shared access key and storage account URL), both the key value and URL are printed in plain text to the task logs.

Example log entry:

[2021-11-29, 21:57:53 EST] {base.py:79} INFO - Using connection to: id: wasb_default. Host: https://myAccountBlahBlah.blob.core.windows.net/, Port: None, Schema: , Login: , Password: None, extra: {'extra__wasb__connection_string': '', 'extra__wasb__sas_token': '***', 'extra__wasb__shared_access_key': 'KEa1QvMjxMNLuzTgVZFvS6PpQv087Ls0Oq+7Ic/fa9Lu3RQwunHi61yZTCJSCIo1gZBNLuzTgVZFvS6PpQv087Ls0Oq+7Ic/fa9Lu3RQwunHi61yZTCJSCIo1gZBH1cc/KZb3EAKXrqWXXXXXXXXXXXXXXXXXXXXXX==', 'extra__wasb__tenant_id': ''}

What you expected to happen

The entirety of the credentials used to authenticate to Azure Blob Storage should not be visible in plain text within the task logs.

How to reproduce

  • Create an Azure Blob Storage connection populating the "Shared Access Key" and "Account Name" fields in the UI or creating a connection using the shared_access_key/extra__wasb__shared_access_key extra and host.
  • Use a simplified version of the existing example_local_to_wasb DAG in the Azure provider:
import os
from datetime import datetime

from airflow.models import DAG
from airflow.providers.microsoft.azure.transfers.local_to_wasb import LocalFilesystemToWasbOperator

PATH_TO_UPLOAD_FILE = os.environ.get('AZURE_PATH_TO_UPLOAD_FILE', 'example-text.txt')

with DAG(
    "example_local_to_wasb",
    schedule_interval="@once",
    start_date=datetime(2021, 1, 1),
    catchup=False,
    default_args={"container_name": "mycontainer", "blob_name": "myblob"},
) as dag:
    upload = LocalFilesystemToWasbOperator(
        task_id="upload_file", file_path=PATH_TO_UPLOAD_FILE, load_options={"overwrite": True}
    )

Anything else

While a fix is relatively straightforward, feedback on the best approach would be appreciated. There seem to be a few options:

  • Update the WasbHook to use the password connection field instead of the 'shared_access_keyorextra__wasb__shared_access_key` extras.
    • Suggested by @ashb in a discussion on Add "access_key" to DEFAULT_SENSITIVE_FIELDS #19497
    • Obviously some tradeoff between custom connection fields/extras making connection creation (presumably) more straightforward/targeted vs reusing core connection fields such that they take on multiple meanings and perhaps more "burdensome" to create said connection.
    • What does backwards compat look like in this case? Or is it valid to call this out as a "breaking change"?
  • Add "shared_access_key" to DEFAULT_SENSITIVE_FIELDS to ensure the shared key value is masked in the connection extras when logged
    • Perhaps a bit heavy-handed of a change
  • Update the logging logic in airflow.hooks.base.get_connection() to only log what connection ID is being used rather than all of the connection details.
    • Unclear what the user impact would be if folks use these particular details in log analysis or simply find it handy to have in the logs.
    • (Slightly-related side question: why is the logging only written when host is provided in a connection?)

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@josh-fell
Copy link
Contributor Author

josh-fell commented Dec 7, 2021

@eladkal Any feedback on any of the fix options listed above? I'm inclined to go with Option 1 but this would be a breaking change. I'm not sure when those become justified. Also Option 3 is intriguing but it does have some questions that come along with and is more widespread in its effect.

@eladkal
Copy link
Contributor

eladkal commented Dec 7, 2021

@josh-fell why don't we have the same problem with Security Token field in Salesforce connection?

@josh-fell
Copy link
Contributor Author

@josh-fell why don't we have the same problem with Security Token field in Salesforce connection?

For two reasons:

  1. The string "token" is already included in DEFAULT_SENSITIVE_FIELDS in the secrets masker logic. Any extra with that string in its key would be masked.
  2. The logging line that is creating this issue is only written when host is provided in a connection. This brings me to the "intriguing" part: Why is this only logged when host is provided? And is it necessary to log all of the details of the connection rather than just what connection ID is used?

@potiuk
Copy link
Member

potiuk commented Dec 7, 2021

Actually I'd do all three options together:

  • deprecate (not remove) shared_access_key and use password if it is not present
  • add shared_acess_key to sensitive fields
    AND
  • change the logic of logging get_connection() to only use ID (but with the caveat that full connection details could be printed at DEBUG level).

@eladkal
Copy link
Contributor

eladkal commented May 29, 2022

@josh-fell is this issue resolved or is there a further task?

@josh-fell
Copy link
Contributor Author

Yeah, I'd still need to implement the other 2 solution options for full coverage to mask shared access key from the logs. But at least the shared access key will only be logged if debug logging so it's probably fine now for the vast majority of users (which I imagine is a much smaller subset using this particular auth anyway).

@josh-fell
Copy link
Contributor Author

Thinking more about this. The update in #21162 really addresses this. If users do feel they need to mask other connection attributes when using debug logging for this particular type of authentication (which by no means is vastly popular), they can still mask those with existing Airflow configs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants