Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove db call from DatabricksHook.__init__() #20180

Merged
merged 2 commits into from
Dec 10, 2021

Conversation

josh-fell
Copy link
Contributor

This PR aligns both the original PR to move the db call to the metadata database out of the DatabricksHook.__init__() method (#18339) and a refactoring PR for the same hook (#19835).

FYI @potiuk I did not add Mypy fixes to this file. The number of changes to handle the Mypy errors out-numbered the changes in scope here. I'll have those Mypy updates in a separate PR.


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.

@josh-fell
Copy link
Contributor Author

CC @kaxil

Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. I wonder how many of those we will find in other operators..

BTW. for the multi-tenancy work we are planning to add testing the DB method calls and we are planning to run all the tests we have for all the hooks and operators through a test harness tthat will be catching basically any DB operations and flagging those that are "not expected".

I think that might be great opportunity to catch and fix all similar cases too. Would be as easy as to check if there is a db call in any of the initts of any of the operators instantiated during tests. Sounds pretty doable.

@github-actions github-actions bot added the okay to merge It's ok to merge this PR as it does not require more tests label Dec 9, 2021
@github-actions
Copy link

github-actions bot commented Dec 9, 2021

The PR is likely OK to be merged with just subset of tests for default Python and Database versions without running the full matrix of tests, because it does not modify the core of Airflow. If the committers decide that the full tests matrix is needed, they will add the label 'full tests needed'. Then you should rebase to the latest main or amend the last commit of the PR, and push it with --force-with-lease.

@josh-fell
Copy link
Contributor Author

josh-fell commented Dec 9, 2021

Nice. I wonder how many of those we will find in other operators..

BTW. for the multi-tenancy work we are planning to add testing the DB method calls and we are planning to run all the tests we have for all the hooks and operators through a test harness tthat will be catching basically any DB operations and flagging those that are "not expected".

I think that might be great opportunity to catch and fix all similar cases too. Would be as easy as to check if there is a db call in any of the initts of any of the operators instantiated during tests. Sounds pretty doable.

After seeing this I poked around. There are a good number of instances where, at the very least, get_connection() is called in an operator's constructor as the "db call". Not sure if there are other types of db calls though. I can compile a list and create an issue for folks to tackle. WDYT?

@potiuk
Copy link
Member

potiuk commented Dec 9, 2021

After seeing this I poked around. There are a good number of instances where, at the very least, get_connection() is called in an operator's constructor as the "db call". Not sure if there are other types of db calls though. I can compile a list and create an issue for folks to tackle. WDYT?

Yep. I suspected just that. Until we have an auto-detection of such cases this is a bit of an "uphill battle" (if it slipped through before, it will slip through in the future as well), but yeah i think we could do a one time effort to cleanup it long before we automate it.

Or maybe you could automate that not by running but by what you've done via - essentially - static analysis (either our own or plugin to one of the existing checkers). That would be even better.

We could have done that with pylint plugin but since we all (including myself) hate it, we should figure another way.

@kaxil
Copy link
Member

kaxil commented Dec 9, 2021

After seeing this I poked around. There are a good number of instances where, at the very least, get_connection() is called in an operator's constructor as the "db call". Not sure if there are other types of db calls though. I can compile a list and create an issue for folks to tackle. WDYT?

Yep. I suspected just that. Until we have an auto-detection of such cases this is a bit of an "uphill battle" (if it slipped through before, it will slip through in the future as well), but yeah i think we could do a one time effort to cleanup it long before we automate it.

Or maybe you could automate that not by running but by what you've done via - essentially - static analysis (either our own or plugin to one of the existing checkers). That would be even better.

We could have done that with pylint plugin but since we all (including myself) hate it, we should figure another way.

Yeah agree, we do instantiate all provider classes somewhere in our test no?? I can’t remember where but that might be a good place to check this

@potiuk
Copy link
Member

potiuk commented Dec 10, 2021

Yeah agree, we do instantiate all provider classes somewhere in our test no?? I can’t remember where but that might be a good place to check this

Unfortunately, we only import all.the classes :(. Not really instantiate them (that would be impossible as we do not know which parameters to pass to constructors).

We (hopefully) instantiate all operators in the unit tests - so there we could (and we plan to for multi-tenancy/db isolation) tap into the db session created and warn if an operator performs an unexpected DB operation.

But that is a bit involved because we have to pass through all the 'expected' calls as well to the real db because otherwise many tests will fail. For example if a test performs get_connection() in the execute() - it should be passed through during unit tests.

We have the description coming in AIP-44 describing the test harness and what it is going to give us as this is a part of the solution. We will be be able to verify that community managed operators are all compliant with the DB isolation mode (and fix those that aren't). And we can also (i believe) include that check for db operations in init() for the operators as part of the test harness. We can basically check the stack trace and see if any BaseOperator's derived class is calling any db operation in their init.

Static checks might also be helpful but they are limited - they can only (reliable) go as far as to check if certain methods (like get_connectiion) are called directly in the _init but any transitive calls will not be possible to track reliably. Only actually instantiating the operators (which can only be done in unit tests IMHO) can give us complete answer.

It's also not 100% complete with tests as there might be some combination of parameters in constructors that will choose different,.untested paths, but hopefully unit tests are comprehensive enough to cover those in most cases. Also when we have the harness in place, it will be easy to reproduce and fix any problems with db isolation by adding more unit tests with those paths covered.

I am super excited for this part especially - as i wanted to do that check on init'() for quite a while and with multi-tenancy we have finally a good reason to do it. Before that was not justified enough to extend our test harness.

@ashb ashb merged commit 66f94f9 into apache:main Dec 10, 2021
@josh-fell josh-fell deleted the databrickshook-remove-db-call-from-init branch December 10, 2021 16:11
@potiuk potiuk added the mypy Fixing MyPy problems after bumpin MyPy to 0.990 label Dec 13, 2021
@kaxil kaxil added the changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) label Jan 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:providers changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) mypy Fixing MyPy problems after bumpin MyPy to 0.990 okay to merge It's ok to merge this PR as it does not require more tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants