Remove db call from `DatabricksHook.__init__()` #20180

josh-fell · 2021-12-09T20:27:53Z

This PR aligns the original PR, which moved the call to the metadata database out of the `DatabricksHook.__init__()` method (#18339), with a refactoring PR for the same hook (#19835).
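For readers unfamiliar with the pattern, here is a minimal sketch of deferring a connection lookup until first use. It is not the code from this PR: `LazyHook`, `FakeConnection`, the `lookups` counter, and the `example.com` host are hypothetical stand-ins so the snippet runs without Airflow installed.

```python
from functools import cached_property


class FakeConnection:
    """Stand-in for an Airflow Connection; not the real class."""

    def __init__(self, host: str) -> None:
        self.host = host


class LazyHook:
    """Hypothetical hook that postpones the metadata database lookup."""

    lookups = 0  # counts simulated database calls, for illustration only

    def __init__(self, conn_id: str = "databricks_default") -> None:
        # Only store the identifier; no database access at construction time.
        self.conn_id = conn_id

    def get_connection(self, conn_id: str) -> FakeConnection:
        # Simulates the query against the metadata database.
        LazyHook.lookups += 1
        return FakeConnection(host=f"{conn_id}.example.com")

    @cached_property
    def conn(self) -> FakeConnection:
        # The first access triggers the lookup; later accesses reuse the result.
        return self.get_connection(self.conn_id)
```

With this shape, constructing the hook (for example while a DAG file is parsed) costs no database round trip; the lookup happens only when `hook.conn` is first read.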

FYI @potiuk I did not add Mypy fixes to this file. The changes needed to handle the Mypy errors outnumbered the changes in scope here, so I'll put those Mypy updates in a separate PR.

josh-fell · 2021-12-09T20:28:09Z

CC @kaxil

potiuk

Nice. I wonder how many of those we will find in other operators.

BTW, for the multi-tenancy work we are planning to add tests for DB method calls, and we are planning to run all the tests we have for all the hooks and operators through a test harness that will catch basically any DB operation and flag those that are "not expected".

I think that might be a great opportunity to catch and fix all similar cases too. It would be as easy as checking whether there is a db call in any of the `__init__` methods of any of the operators instantiated during tests. Sounds pretty doable.

github-actions · 2021-12-09T20:59:43Z

The PR is likely OK to be merged with just subset of tests for default Python and Database versions without running the full matrix of tests, because it does not modify the core of Airflow. If the committers decide that the full tests matrix is needed, they will add the label 'full tests needed'. Then you should rebase to the latest main or amend the last commit of the PR, and push it with --force-with-lease.

airflow/providers/databricks/hooks/databricks.py

josh-fell · 2021-12-09T21:16:35Z

> Nice. I wonder how many of those we will find in other operators.
>
> BTW, for the multi-tenancy work we are planning to add tests for DB method calls, and we are planning to run all the tests we have for all the hooks and operators through a test harness that will catch basically any DB operation and flag those that are "not expected".
>
> I think that might be a great opportunity to catch and fix all similar cases too. It would be as easy as checking whether there is a db call in any of the `__init__` methods of any of the operators instantiated during tests. Sounds pretty doable.

After seeing this I poked around. There are a good number of instances where, at the very least, get_connection() is called in an operator's constructor as the "db call". Not sure if there are other types of db calls though. I can compile a list and create an issue for folks to tackle. WDYT?

potiuk · 2021-12-09T21:55:12Z

> After seeing this I poked around. There are a good number of instances where, at the very least, get_connection() is called in an operator's constructor as the "db call". Not sure if there are other types of db calls though. I can compile a list and create an issue for folks to tackle. WDYT?

Yep. I suspected just that. Until we have auto-detection of such cases this is a bit of an "uphill battle" (if it slipped through before, it will slip through in the future as well), but yeah, I think we could make a one-time effort to clean it up long before we automate it.

Or maybe you could automate it, not by running the code but by doing what you have essentially just done: static analysis (either our own or a plugin for one of the existing checkers). That would be even better.

We could have done that with a pylint plugin, but since we all (including myself) hate it, we should figure out another way.

kaxil · 2021-12-09T22:33:20Z

> > After seeing this I poked around. There are a good number of instances where, at the very least, get_connection() is called in an operator's constructor as the "db call". Not sure if there are other types of db calls though. I can compile a list and create an issue for folks to tackle. WDYT?
>
> Yep. I suspected just that. Until we have auto-detection of such cases this is a bit of an "uphill battle" (if it slipped through before, it will slip through in the future as well), but yeah, I think we could make a one-time effort to clean it up long before we automate it.
>
> Or maybe you could automate it, not by running the code but by doing what you have essentially just done: static analysis (either our own or a plugin for one of the existing checkers). That would be even better.
>
> We could have done that with a pylint plugin, but since we all (including myself) hate it, we should figure out another way.

Yeah, agreed. We do instantiate all provider classes somewhere in our tests, no? I can't remember where, but that might be a good place to check this.

potiuk · 2021-12-10T03:15:31Z

> Yeah, agreed. We do instantiate all provider classes somewhere in our tests, no? I can't remember where, but that might be a good place to check this.

Unfortunately, we only import all the classes :(. We do not really instantiate them (that would be impossible, as we do not know which parameters to pass to the constructors).

We (hopefully) instantiate all operators in the unit tests - so there we could (and we plan to for multi-tenancy/db isolation) tap into the db session created and warn if an operator performs an unexpected DB operation.

But that is a bit involved, because we also have to pass all the "expected" calls through to the real db; otherwise many tests will fail. For example, if a test performs get_connection() in execute(), that call should be passed through during unit tests.

A description of the test harness, and of what it is going to give us, is coming in AIP-44, as the harness is part of the solution. We will be able to verify that community-managed operators are all compliant with the DB isolation mode (and fix those that aren't). And we can also (I believe) include that check for db operations in `__init__()` for the operators as part of the test harness. We can basically check the stack trace and see whether any class derived from BaseOperator is calling any db operation in its `__init__`.
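To make the stack-trace idea concrete, here is a rough sketch under stated assumptions. It is not the AIP-44 harness: `BaseOperator` below is a local stand-in rather than Airflow's class, and `guarded_db_call`, `UnexpectedDbCall`, `BadOperator`, and `GoodOperator` are hypothetical names. The guard walks the call stack and raises if any frame is an `__init__` whose `self` is an operator.

```python
import inspect


class BaseOperator:
    """Minimal stand-in for Airflow's BaseOperator."""


class UnexpectedDbCall(RuntimeError):
    """Raised when a db call is reached from an operator constructor."""


def guarded_db_call(conn_id: str) -> str:
    """Hypothetical db entry point that a test harness could wrap."""
    for frame_info in inspect.stack():
        caller = frame_info.frame.f_locals.get("self")
        # Any __init__ frame bound to an operator counts, so transitive
        # calls through helper methods are caught as well.
        if frame_info.function == "__init__" and isinstance(caller, BaseOperator):
            raise UnexpectedDbCall(
                f"{type(caller).__name__}.__init__ triggered a db call for {conn_id!r}"
            )
    # An "expected" call is passed through (here, to a fake result).
    return f"connection:{conn_id}"


class BadOperator(BaseOperator):
    def __init__(self) -> None:
        self.conn = guarded_db_call("my_conn")  # db call at construction


class GoodOperator(BaseOperator):
    def __init__(self) -> None:
        self.conn_id = "my_conn"  # only stores the identifier

    def execute(self) -> str:
        return guarded_db_call(self.conn_id)  # db call deferred to execute()
```

In this sketch `GoodOperator().execute()` passes through, while constructing `BadOperator()` raises `UnexpectedDbCall`.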

Static checks might also be helpful, but they are limited: they can only reliably go as far as checking whether certain methods (like get_connection) are called directly in `__init__`; transitive calls cannot be tracked reliably. Only actually instantiating the operators (which can only be done in unit tests, IMHO) can give us a complete answer.
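As an illustration of both the static check and its limit, here is a small sketch built on Python's standard `ast` module. The `DB_METHODS` set, the function name, and the sample classes are assumptions made for the example; it only flags attribute-style calls such as `self.get_connection(...)` made directly inside `__init__`.

```python
import ast
from typing import List, Tuple

DB_METHODS = {"get_connection"}  # assumed set of method names to flag


def find_db_calls_in_init(source: str) -> List[Tuple[str, int]]:
    """Return (class name, line number) for flagged calls written directly in __init__."""
    hits = []
    for cls in ast.walk(ast.parse(source)):
        if not isinstance(cls, ast.ClassDef):
            continue
        for item in cls.body:
            if isinstance(item, ast.FunctionDef) and item.name == "__init__":
                for node in ast.walk(item):
                    if (
                        isinstance(node, ast.Call)
                        and isinstance(node.func, ast.Attribute)
                        and node.func.attr in DB_METHODS
                    ):
                        hits.append((cls.name, node.lineno))
    return hits


SAMPLE = '''
class DirectHook:
    def __init__(self):
        self.conn = self.get_connection("x")

class TransitiveHook:
    def __init__(self):
        self._setup()

    def _setup(self):
        self.conn = self.get_connection("x")
'''

print(find_db_calls_in_init(SAMPLE))
```

Only `DirectHook` is reported. `TransitiveHook` reaches the same call through `_setup()` and goes undetected, which is the limitation described above.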

Tests are also not 100% complete, as some combinations of constructor parameters might select different, untested paths, but hopefully the unit tests are comprehensive enough to cover those in most cases. Also, once we have the harness in place, it will be easy to reproduce and fix any problems with db isolation by adding more unit tests that cover those paths.

I am super excited about this part especially, as I have wanted to do that check on `__init__()` for quite a while, and with multi-tenancy we finally have a good reason to do it. Before, it was not justified enough to extend our test harness.

Remove db call from DatabricksHook.__init__

34beb4a

boring-cyborg bot added the area:providers label Dec 9, 2021

potiuk approved these changes Dec 9, 2021

github-actions bot added the okay to merge label (It's ok to merge this PR as it does not require more tests) Dec 9, 2021

ashb reviewed Dec 9, 2021

Add check before calling get_connection()

cdb54db

Co-authored-by: Ash Berlin-Taylor <[email protected]>

potiuk approved these changes Dec 9, 2021

kaxil approved these changes Dec 9, 2021

ashb approved these changes Dec 10, 2021

ashb merged commit 66f94f9 into apache:main Dec 10, 2021

josh-fell deleted the databrickshook-remove-db-call-from-init branch December 10, 2021 16:11

josh-fell mentioned this pull request Dec 10, 2021

Status of testing Providers that were prepared on December 07, 2021 #20097

Closed

9 tasks

potiuk mentioned this pull request Dec 11, 2021

Status of testing Providers that were prepared on December 11, 2021 #20220

Closed

11 tasks

potiuk added the mypy label (Fixing MyPy problems after bumping MyPy to 0.990) Dec 13, 2021

josh-fell mentioned this pull request Dec 28, 2021

Remove host as an instance attr in DatabricksHook #20540

Merged

kaxil added the changelog:skip label (Changes that should be skipped from the changelog (CI, tests, etc.)) Jan 13, 2022
