Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable Connection creation from Vault parameters #15013

Conversation

davlum
Copy link
Contributor

@davlum davlum commented Mar 25, 2021

Currently using the Vault secrets backends requires that users store
the secrets in connection URI format:
https://airflow.apache.org/docs/apache-airflow/stable/howto/connection.html#connection-uri-format

Unfortunately the connection URI format is not capable of expressing
all values of the Connection class. In particular the Connection
class allows for arbitrary string values for the extra parameter,
while the URI format requires that this parameter be unnested JSON
so that it can serialize into query parameters.

>>> Connection(conn_id='id', conn_type='http', extra='foobar').get_uri()
[2021-03-25 13:31:07,535] {connection.py:337} ERROR - Expecting value: line 1 column 1 (char 0)
Traceback (most recent call last):
  File "/Users/da.lum/code/python/airflow/airflow/models/connection.py", line 335, in extra_dejson
    obj = json.loads(self.extra)
  File "/nix/store/8kzdflq0v06fq0mh9m2fd73gnyqp57xr-python3-3.7.3/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/nix/store/8kzdflq0v06fq0mh9m2fd73gnyqp57xr-python3-3.7.3/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/nix/store/8kzdflq0v06fq0mh9m2fd73gnyqp57xr-python3-3.7.3/lib/python3.7/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
[2021-03-25 13:31:07,535] {connection.py:338} ERROR - Failed parsing the json for conn_id id
'http://'

As shown, the extra data is missing from the return value http://.
Although there is an error logged, this does not help users who were
previously able to store other data.


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.

@davlum davlum force-pushed the feature/add-normal-connection-creation-for-vault-management branch 2 times, most recently from 22defff to 020d26f Compare March 28, 2021 19:38
@dstandish
Copy link
Contributor

dstandish commented Mar 30, 2021

Unfortunately the connection URI format is not capable of expressing
all values of the Connection class. In particular the Connection
class allows for arbitrary string values for the extra parameter,
while the URI format requires that this parameter be unnested JSON
so that it can serialize into query parameters.

That's not true. Or at least I don't think that's true.

We have a convenience method get_uri you can use to produce the URI.

Try creating your connection as desired with init params including your json extra. Then call get_uri on the connection.

Use this URI in vault.

If you have a json value for extra that can not be serialized to airflow URI please post it here.

edit

it is true. currently uri can only store key-value pairs. PR to add support for this: #15100

@nathadfield
Copy link
Collaborator

Unfortunately the connection URI format is not capable of expressing
all values of the Connection class. In particular the Connection
class allows for arbitrary string values for the extra parameter,
while the URI format requires that this parameter be unnested JSON
so that it can serialize into query parameters.

That's not true. Or at least I don't think that's true.

We have a convenience method get_uri you can use to produce the URI.

Try creating your connection as desired with init params including your json extra. Then call get_uri on the connection.

Use this URI in vault.

If you have a json value for extra that can not be serialized to airflow URI please post it here.

@dstandish Even so, might not this be a good feature anyway?

A conn_uri is fine but they can be a bit unwieldy and it is not that straightforward to create one correctly. I would personally also like the ability to store the individual elements of a connection as a secret in Vault/GSM and for Airflow to build the connection at runtime.

@dstandish
Copy link
Contributor

dstandish commented Mar 30, 2021

There is a benefit inherent in having only one perfectly good way to do things.

Sometimes when you want to do something another way it's best kept in a private repo for your company or posted on a blog or your own github.

I would be in favor of only using the airflow conn uri format. But let's see what others think -- I'm just one voice here so don't be too discouraged.

In the meantime I encourage you to try the uri generation function we have. There is documentation on generating the uri in the managing connections doc.

@dstandish
Copy link
Contributor

dstandish commented Mar 30, 2021

I see you added an example where extra is not valid json. Why do you want to do that? Why not just use JSON? E.g. if you put fubar in quotes probably it works

Then when you call extra_dejson it should return just the string fubar

update

tried this and doesn't work. has to be urlencodable json

c = Connection(extra='"hello"')
c.extra_dejson  # works 
c.get_uri()  # does not

@nathadfield
Copy link
Collaborator

There is a benefit inherent in having only one perfectly good way to do things.

Sometimes when you want to do something another way it's best kept in a private repo for your company or posted on a blog.

I would be in favor of only using the airflow conn uri format. But let's see what others think -- I'm just one voice here so don't be too discouraged.

In the meantime I encourage you to try the uri generation function we have. There is documentation on generating the uri in the managing connections doc.

@dstandish Sure, I take your point but it is also nice to have options.

For the record, I have tried the conn_uri method. Its fine but it does require a certain level of knowledge in order to create one. There are a certain category of users who may only ever interact with the Airflow UI but may also need to update a secret used by a connection.

@davlum
Copy link
Contributor Author

davlum commented Mar 30, 2021

@dstandish If you read the description of PR you can see an example of get_uri failing. The fact is that connection URI is a lossy format and is required by Vault integration. As somebody who manages several Airflow instances and had to migrate all the tenants connections into Vault this was a breaking change as users who did not have unnested JSON had their connection broken when calling .get_uri() on it.

It was then required to go in and manually change the application code for tenants that used their extra parameters to use the appropriate format. This is not a good workflow.

There is a benefit inherent in having only one perfectly good way to do things.

There does not currently exist one perfectly good way to do things. There are two incompatible ways, and the main interface users use is the Connection object. All values of Connection cannot be serialized to connection URI.

Just to clarify again, the connection URI format doesn't even handle all valid JSON. It only handles unnested JSON.

If you want get_uri to be a legitimate format you need to do validation on the Connection object to verify that it serializes properly. This would be a breaking change.

@dstandish
Copy link
Contributor

Yeah you make a good point that it can't store arbitrary json (I do imagine we could add support for this within the URI format).

Based on that assumption, we do have fairly comprehensive tests that verify that URI parsing and URI generation produce consistent results.

But it's true we don't test the case where you store an arbitrary value in extra such as hello or fubar.

Currently it is assumed dict, so like {'hello': 'fubar'}

Anyway I'll let others chime in at this point.

@avieth
Copy link

avieth commented Mar 30, 2021

In the documentation of the Connection class, it is stated that extra should be encoded JSON.

:param extra: Extra metadata. Non-standard data such as private/SSH keys can be saved here. JSON
encoded object.
:type extra: str

However, AFAICT, the constructor will not check this, and will not fail when extra is not JSON. But get_uri does check this (implicitly, by trying to decode then urlencode), and so we have the very weird case in which we can construct a Connection object for which get_uri fails, as @davlum has shown. I'd say that's fairly counterintuitive and surprising.

Is there a good reason why extra must be encoded JSON? Seems like a nice solution would be to allow for any string there.

@dstandish
Copy link
Contributor

dstandish commented Mar 30, 2021

I was inspired by this conversation to make a PR that adds support for arbitrary json in conn uri format: #15100.

We could also add support of the case of an unquoted arbitrary string value by modifying extra_dejson to return the string rather than empty dict in this case (see note at end of the PR description).

It's orthogonal to this PR but obviously a closely related topic.

Is there a good reason why extra must be encoded JSON? Seems like a nice solution would be to allow for any string there.

I think that the reason is to provide a reliable interface so that you know that extra params are accessible in a dictionary at property extra_dejson.

However, AFAICT, the constructor will not check this, and will not fail when extra is not JSON. But get_uri does check this (implicitly, by trying to decode then urlencode), and so we have the very weird case in which we can construct a Connection object for which get_uri fails, as @davlum has shown. I'd say that's fairly counterintuitive and surprising.

Yeah probably it makes sense to enforce that extra is json in init but this would be breaking. I'll think about supporting the arbitrary and non-json-encoded string in #15100

Copy link
Member

@kaxil kaxil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dstandish
Copy link
Contributor

@avieth

However, AFAICT, the constructor will not check this, and will not fail when extra is not JSON. But get_uri does check this (implicitly, by trying to decode then urlencode), and so we have the very weird case in which we can construct a Connection object for which get_uri fails, as @davlum has shown. I'd say that's fairly counterintuitive and surprising.
Is there a good reason why extra must be encoded JSON? Seems like a nice solution would be to allow for any string there

OK i was able to add support in get_uri for arbitrary i.e. non-json string in #15100. behavior is still to return {} when you call extra_dejson in this case. However, it can now be serialized to URI and back without loss.

@mik-laj
Copy link
Member

mik-laj commented Mar 31, 2021

Is there a good reason why extra must be encoded JSON? Seems like a nice solution would be to allow for any string there.

We expect the extra field to be a dictionary in Web UI. The use of primitive types also causes a problem with extending a given connection ie. it is not future-proff. We also have a design misunderstood here. This field should contain extra fields, not additional data.

@davlum
Copy link
Contributor Author

davlum commented Mar 31, 2021

@mik-laj

We expect the extra field to be a dictionary in Web UI.

This must be as of Airflow 2.0, because this isn't the case in 1.10.x. The additional validation should have been added on the Connection object because it is still perfectly valid to add any string data to the extra column. Furthermore anyone migrating from 1.10.x -> 2.0+ will have to manually migrate their data.

@mik-laj
Copy link
Member

mik-laj commented Mar 31, 2021

I would be in favor of only using the airflow conn uri format. But let's see what others think -- I'm just one voice here so don't be too discouraged.

I like the idea of supporting other formats. LocalFilesystemBackend supports already object representation also:

def _create_connection(conn_id: str, value: Any):
"""Creates a connection based on a URL or JSON object."""
from airflow.models.connection import Connection
if isinstance(value, str):
return Connection(conn_id=conn_id, uri=value)
if isinstance(value, dict):
connection_parameter_names = get_connection_parameter_names() | {"extra_dejson"}
current_keys = set(value.keys())
if not current_keys.issubset(connection_parameter_names):
illegal_keys = current_keys - connection_parameter_names
illegal_keys_list = ", ".join(illegal_keys)
raise AirflowException(
f"The object have illegal keys: {illegal_keys_list}. "
f"The dictionary can only contain the following keys: {connection_parameter_names}"
)
if "extra" in value and "extra_dejson" in value:
raise AirflowException(
"The extra and extra_dejson parameters are mutually exclusive. "
"Please provide only one parameter."
)
if "extra_dejson" in value:
value["extra"] = json.dumps(value["extra_dejson"])
del value["extra_dejson"]
if "conn_id" in current_keys and conn_id != value["conn_id"]:
raise AirflowException(
f"Mismatch conn_id. "
f"The dictionary key has the value: {value['conn_id']}. "
f"The item has the value: {conn_id}."
)
value["conn_id"] = conn_id
return Connection(**value)
raise AirflowException(
f"Unexpected value type: {type(value)}. The connection can only be defined using a string or object."
)

@mik-laj
Copy link
Member

mik-laj commented Mar 31, 2021

The additional validation should have been added on the Connection object

Yes. I think we should add it. Can you help with it?

@davlum
Copy link
Contributor Author

davlum commented Mar 31, 2021

Can you help with it?

Sure thing, I'll open another PR for that. Curious what the release target will be because it will be a breaking change.

So just to summarize where we're at:

  • Support object backend on Vault in this PR?
  • Add validation on extra column in another PR

@mik-laj
Copy link
Member

mik-laj commented Mar 31, 2021

Curious what the release target will be because it will be a breaking change.

This will be the release manager's decision, but I would consider that this is not a breaking change as the design assumption of this feature was to store objects that have a name and a value. Now we only improve this feature by adding validation. So I think, we should release it in Airflow 2.1.

. Support object backend on Vault in this PR?

Yes. Exactly. It would be fantastic if we could separate the common part with LocalFilesystemBackend.

@mik-laj
Copy link
Member

mik-laj commented Mar 31, 2021

Seems like a nice solution would be to allow for any string there.

@avieth. Do you have any use cases that you can't solve with JSON? Object representation is more future-proof because you can always add a new key and tag the old one as deprecaated. It allows for smooth updates.

@dstandish
Copy link
Contributor

dstandish commented Mar 31, 2021

@mik-laj curious what do you think about deprecating the conn uri format and replacing with JSON, or perhaps allowing both globally? We could implement backward compat with try json parse except conn uri parse

@mik-laj
Copy link
Member

mik-laj commented Mar 31, 2021

I think we support both. The specific choice depends on the specific use case. To be able to use the same value as an environment variable and to copy easily to the clipboard, you can use a URI. for more convenient editing, you can use object representation.

@dstandish
Copy link
Contributor

dstandish commented Mar 31, 2021

I think we support both. The specific choice depends on the specific use case. To be able to use the same value as an environment variable and to copy easily to the clipboard, you can use a URI. for more convenient editing, you can use object representation.

No I mean deprecate across airflow including with env vars and cli ( or support both simultaneously )

There's not really anything special about the uri format. We could store json in env vars

My support of using conn uri is like you said the convenience of switching between backend and env and cli and the virtue of uniformity but if we switch to json globally that could be a good thing

@mik-laj
Copy link
Member

mik-laj commented Mar 31, 2021

I see no reason to do this. Too many users already use URIs. Besides, for simple connections it is a very good representation.

@davlum davlum force-pushed the feature/add-normal-connection-creation-for-vault-management branch from 020d26f to 2fdf132 Compare April 1, 2021 20:03
@davlum davlum requested a review from mik-laj as a code owner April 1, 2021 20:03
@davlum davlum force-pushed the feature/add-normal-connection-creation-for-vault-management branch from 665c855 to c2af3d3 Compare April 2, 2021 23:17
from airflow.providers.hashicorp._internal_client.vault_client import _VaultClient # noqa
from airflow.secrets import BaseSecretsBackend
from airflow.secrets.local_filesystem import _create_connection
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should extract the common part to the new package. This is problematic now, so I started discussions on the mailing list. See: https://lists.apache.org/thread.html/r713f180120d0a39b53567812eb5db34f992ec81979818d2175598b71%40%3Cdev.airflow.apache.org%3E

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this particular case I can see this function existing in airflow.models.connections as from_dict() -> Connection. Transforming a dict to the object is the sort of function that usually exists with the class. I agree with the general case that something needs to be done about common functionality between providers.

Copy link
Member

@mik-laj mik-laj Apr 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't add a new feature to the core because we want to be backwards compatible with Airflow 2.0.0. I opened the discussions to loosen this restriction, because they do not conform to reality. I think we should maintain backward compatibility with the MINOR release, not the MAJOR.

I invite you to the discussion on the mailing list if you would like to share your thoughts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's an additional function on the Connection class so that retains backwards compatibility no?

Copy link
Member

@mik-laj mik-laj Apr 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Secret Backend won't work with Airflow 2.0.0, so this is a breaking change to providers packages. This doesn't make breaking changes to Airflow 2.0.0, but to providers packages it does.

Copy link
Contributor

@dstandish dstandish Apr 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you are saying that if this vault provider expects a from_dict method on Connection (say, to be added in 2.1) the provider won't work with 2.0, but we've promised that it will.

should we perhaps then just merge this as is? or perhaps duplicate this private function into a vault provider utils module with a note of some kind to replace with Connection.from_dict when it is available?

Copy link
Member

@mik-laj mik-laj Apr 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Exactly.

We should discuss the approach on the mailing list and choose the best solution that everyone accepts. I personally think that we should mark this package as only supported by Airflow 2.1, because trying to maintain backward compatibility will limit our development possibilities. It should be normal for users that new packages/library versions may require a core version if we add new features. See: https://lists.apache.org/thread.html/r713f180120d0a39b53567812eb5db34f992ec81979818d2175598b71%40%3Cdev.airflow.apache.org%3E

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started discussions on this subject on the mailing list, but I haven't checked it recently. I encourage you to discuss it yourself because then we will be able to work out a solution to this problem faster.

https://lists.apache.org/thread.html/r713f180120d0a39b53567812eb5db34f992ec81979818d2175598b71%40%3Cdev.airflow.apache.org%3E

Copy link
Member

@mik-laj mik-laj Apr 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the impression that we have managed to reach a consensus. This change will be in the next provider package after Airflow 2.1 is released. There is a slight problem with the provider packages releases process. We always release providers packages from the main branch, so we cannot merge this PR until we release Airflow 2.1 and we will start releasing providers packages that will be compatible with this version.

In the meantime, we can create a new public method that will create connections from the dictionary. This method will be released in Airflow 2.1. PR: #15425

I added this PR to the Airflow 2.1 milestone.

davlum added a commit to davlum/incubator-airflow that referenced this pull request Apr 8, 2021
Before this commit there was a documented but unenforced
limitation that the extra parameter be encoded JSON.
In apache#15013 this issue garnered attention and motivated
this PR.
@mik-laj mik-laj added this to the Airflow 2.1 milestone Apr 19, 2021
@mik-laj
Copy link
Member

mik-laj commented Apr 19, 2021

We cannot merge this PR until we release Airflow 2.1 and we will start releasing providers packages that will be compatible with this version.

More info: #15013 (comment)

@ashb ashb modified the milestones: Airflow 2.1, Airflow 2.1.1 May 18, 2021
@jhtimmins
Copy link
Contributor

@mik-laj are we able to merge this now? Or is there something else we're waiting on?

@ashb
Copy link
Member

ashb commented May 27, 2021

@jhtimmins Unfortunately not, as #15425 didn't get finished (or possibly reviewed! Sorry) in time to make it for 2.1

@kaxil kaxil modified the milestones: Airflow 2.1.1, Airflow 2.2 Jun 22, 2021
@kaxil
Copy link
Member

kaxil commented Jun 22, 2021

marked it as 2.2 since it is not a bug fix

Copy link
Member

@ashb ashb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@natanweinberger Could you take a look and review this PR/tell us what changes are needed now your other change has landed?

@davlum davlum force-pushed the feature/add-normal-connection-creation-for-vault-management branch 4 times, most recently from 8c4aaa3 to faa5008 Compare August 19, 2021 15:45
@ashb
Copy link
Member

ashb commented Aug 19, 2021

(Code on this looks good now! A nice simple change too!)

@davlum davlum force-pushed the feature/add-normal-connection-creation-for-vault-management branch from faa5008 to 1d39db1 Compare August 20, 2021 16:10
Currently using the Vault secrets backends requires that users store
the secrets in connection URI format:
https://airflow.apache.org/docs/apache-airflow/stable/howto/connection.html#connection-uri-format

Unfortunately the connection URI format is not capable of expressing
all values of the Connection class. In particular the Connection
class allows for arbitrary string values for the  `extra` parameter,
while the URI format requires that this parameter be unnested JSON
so that it can serialize into query parameters.

```
>>> Connection(conn_id='id', conn_type='http', extra='foobar').get_uri()
[2021-03-25 13:31:07,535] {connection.py:337} ERROR - Expecting value: line 1 column 1 (char 0)
Traceback (most recent call last):
  File "/Users/da.lum/code/python/airflow/airflow/models/connection.py", line 335, in extra_dejson
    obj = json.loads(self.extra)
  File "/nix/store/8kzdflq0v06fq0mh9m2fd73gnyqp57xr-python3-3.7.3/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/nix/store/8kzdflq0v06fq0mh9m2fd73gnyqp57xr-python3-3.7.3/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/nix/store/8kzdflq0v06fq0mh9m2fd73gnyqp57xr-python3-3.7.3/lib/python3.7/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
[2021-03-25 13:31:07,535] {connection.py:338} ERROR - Failed parsing the json for conn_id id
'http://'
```

As shown, the `extra` data is missing from the return value `http://`.
Although there is an error logged, this does not help users who were
previously able to store other data.
@davlum davlum force-pushed the feature/add-normal-connection-creation-for-vault-management branch from 1d39db1 to fc0dcff Compare August 20, 2021 17:36
@ashb ashb merged commit 6301faa into apache:main Aug 23, 2021
@davlum davlum deleted the feature/add-normal-connection-creation-for-vault-management branch September 24, 2021 18:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants