Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make schema in DBApiHook private #17423

Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
19 changes: 15 additions & 4 deletions airflow/hooks/dbapi.py
Original file line number Diff line number Diff line change
Expand Up @@ -51,7 +51,14 @@ def connect(self, host: str, port: int, username: str, schema: str) -> Any:
# #
#########################################################################################
class DbApiHook(BaseHook):
"""Abstract base class for sql hooks."""
"""
Abstract base class for sql hooks.

:param schema: Optional DB schema that overrides the schema specified in the connection. Make sure that
if you change the schema parameter value in the constructor of the derived Hook, such change
should be done before calling the ``DBApiHook.__init__()``.
:type schema: Optional[str]
"""

# Override to provide the connection name.
conn_name_attr = None # type: str
Expand All @@ -62,7 +69,7 @@ class DbApiHook(BaseHook):
# Override with the object that exposes the connect method
connector = None # type: Optional[ConnectorProtocol]

def __init__(self, *args, **kwargs):
def __init__(self, *args, schema: Optional[str] = None, **kwargs):
super().__init__()
if not self.conn_name_attr:
raise AirflowException("conn_name_attr is not defined")
Expand All @@ -72,7 +79,11 @@ def __init__(self, *args, **kwargs):
setattr(self, self.conn_name_attr, self.default_conn_name)
else:
setattr(self, self.conn_name_attr, kwargs[self.conn_name_attr])
self.schema: Optional[str] = kwargs.pop("schema", None)
# We should not make schema available in deriving hooks for backwards compatibility
# If a hook deriving from DBApiHook has a need to access schema, then it should retrieve it
# from kwargs and store it on its own. We do not run "pop" here as we want to give the
# Hook deriving from the DBApiHook to still have access to the field in it's constructor
self.__schema = schema

def get_conn(self):
"""Returns a connection object"""
Expand All @@ -92,7 +103,7 @@ def get_uri(self) -> str:
host = conn.host
if conn.port is not None:
host += f':{conn.port}'
schema = self.schema or conn.schema or ''
schema = self.__schema or conn.schema or ''
return urlunsplit((conn.conn_type, f'{login}{host}', schema, '', ''))

def get_sqlalchemy_engine(self, engine_kwargs=None):
Expand Down
1 change: 1 addition & 0 deletions airflow/providers/postgres/hooks/postgres.py
Original file line number Diff line number Diff line change
Expand Up @@ -70,6 +70,7 @@ def __init__(self, *args, **kwargs) -> None:
super().__init__(*args, **kwargs)
self.connection: Optional[Connection] = kwargs.pop("connection", None)
self.conn: connection = None
self.schema: Optional[str] = kwargs.pop("schema", None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead, maybe this:

Suggested change
self.schema: Optional[str] = kwargs.pop("schema", None)
if not hasattr(self, "schema"):
self.schema: Optional[str] = kwargs.pop("schema", None)

That should work for core both pre and post 2.2.0 and wouldn't break behavior of get_uri either like doing __schema does.

Copy link
Member Author

@potiuk potiuk Aug 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that having schema as attribute in DBApiHook will encourage people to use it in their other Hooks deriving from DBApiHook. It will be rather difficult to make sure that everyone is using it in "hasattr" way. Even if we two remember it now, we will forget about it and other committers who are not involved will not even know about this. We could potentially automate a pre-commit check if the "DBApiHook's schema" is accessed with hasattr, but I think we will never be able to do it with 100% accuracy and I think it's simply not worth it for that single field, that can be simply replaced with single line for each operator that wants to use it (in init):

        self.schema: Optional[str] = kwargs.pop("schema", None)

I think we should think about DBApiHook as "Public" API of Airflow and any change to it should be very, very carefully considered.

Another option would be to add >=Airflow 2.2 limitation to Postgres operator (and any other operator that uses it), but again I think sacrificing backwards compatibility in this case is simply not worth it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah actually I see an error .in my solution.. I should not do "pop" in the args :)

Copy link
Member Author

@potiuk potiuk Aug 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed "pop" from DBApi to leave it for other hooks.

Actually I just realized the way it was implemented before - with pop() had undesired side effect for Hooks that already used kwargs.pop() .The side effect was that kwargs.pop() in DBApiHook would remove the schema from kwargs in derived classes and their self.schema = kwargs.pop("schema", None) would override the schema to None (!)

Example: mysql:

https://github.com/apache/airflow/blob/main/airflow/providers/mysql/hooks/mysql.py#L61

So in fact, the original change (not yet released luckily) in DBApiHook was even more disruptive than the failed PostgresHook. I am actually glad it was uncovered now, as it would be far more disruptive it was released in Airflow.

That's why we should be extremely careful with changing DBApiHook (and BaseHook and similar).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😱 🙀

Copy link
Member Author

@potiuk potiuk Aug 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is unlikely edge case (much less likely than use of schema when it will be defined in DBApiHook as public field). I hardly see the use for that. Why somone would like to define a DBApi -derived hook and pass a 'schema' parameter to it while also changing it in it's own __init__? Seems extremely unlikely, also It feels natural that in such case the change should be done before super.__init__() rather than after.

We can assume that "schema" is our convention for kwarg para that we should use for all DB Hooks. We can standardise it (it's already commonly used) and release in Airflow 3 DBApiHook I think.

For now we might want to add some more docs/comments explaining it and some Airflow 3 notice about it. WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I should have included an example. This is what I meant:

h = SomeHook(schema="foo")
h.get_uri() # uses foo
h.schema = "bar"
h.get_uri() # still uses foo because of __schema

That's why I'm thinking letting both DbApiHook and any derived hooks both set schema might be the best of both worlds here? Then the derived hooks could stop setting schema in Airflow 3 (or the next time they get a min core version bump really).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right - I think i fiigured out better way of doing it. I've added the 'schema' argument explicitly as optional kwarg to DBApiHook and made a comment about the "change schema value" in the description of the parameter.

I think it's much better - it makes schema an explicit part of the DBHookAPI hook's API, it is fully backwards compatible and it makes it very easy for Airflow 3 change - we will simply make the self.schema public and convert all the operators that use it in the "old" way. See the latest fixup.

Copy link
Member Author

@potiuk potiuk Aug 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's why I'm thinking letting both DbApiHook and any derived hooks both set schema might be the best of both worlds here? Then the derived hooks could stop setting schema in Airflow 3 (or the next time they get a min core version bump really).

The problem with this is - that it has already happened that we overlooked that the DBApiHook.schema object has been accessed directly by a provider, which made it backwards incompatible with Airflow 2.1. We do not have any mechanism to prevents this is in the future again, if we have it as a public field in DBApiHook. This is a bit of a problem that DBAPiHook is "public API" part and by making a public field in this class, we change the API.

The way I proposed - in the last fixup, schema becomes part of the DBApiHook 'initialization' API. If any other hook stores the schema field as self.schema - so be it, it is "its own responsibiliity" - we make it clear now in the constructor of the DBApiHook that passing "schema" as kwargs is THE way how to override the schema, and by not having a public field, we clearly say that the deriving hook cannot expect that it can change it and expect "DBApiHook" getUri() method will use it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any other comments? I think We will not release an out-of-band Postgres operator, so this is something we will have to solve mid-August, but would be good to get some opinions :)


def _get_cursor(self, raw_cursor: str) -> CursorType:
_cursor = raw_cursor.lower()
Expand Down
6 changes: 3 additions & 3 deletions tests/hooks/test_dbapi.py
Original file line number Diff line number Diff line change
Expand Up @@ -43,6 +43,7 @@ def get_conn(self):
return conn

self.db_hook = UnitTestDbApiHook()
self.db_hook_schema_override = UnitTestDbApiHook(schema='schema-override')

def test_get_records(self):
statement = "SQL"
Expand Down Expand Up @@ -160,7 +161,7 @@ def test_get_uri_schema_not_none(self):
assert "conn_type://login:password@host:1/schema" == self.db_hook.get_uri()

def test_get_uri_schema_override(self):
self.db_hook.get_connection = mock.MagicMock(
self.db_hook_schema_override.get_connection = mock.MagicMock(
return_value=Connection(
conn_type="conn_type",
host="host",
Expand All @@ -170,8 +171,7 @@ def test_get_uri_schema_override(self):
port=1,
)
)
self.db_hook.schema = 'schema-override'
assert "conn_type://login:password@host:1/schema-override" == self.db_hook.get_uri()
assert "conn_type://login:password@host:1/schema-override" == self.db_hook_schema_override.get_uri()

def test_get_uri_schema_none(self):
self.db_hook.get_connection = mock.MagicMock(
Expand Down