-
Notifications
You must be signed in to change notification settings - Fork 300
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sqlalchemy Task #445
Sqlalchemy Task #445
Conversation
flytekit/extras/sqlalchemy/task.py
Outdated
if task_config.secret_connect_args is not None: | ||
for key, secret in task_config.secret_connect_args.items(): | ||
if "name" in secret and "group" in secret: | ||
value = current_context().secrets.get(secret["group"], secret["name"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can only access the secrets in the execute
command or logically. They will be available here, but you should be using them only in the execute. Reason is init will be invoked at convert / compile time, but execute only at runtime.
Also you will need to pass the secret requests to thparent
https://github.com/flyteorg/flytekit/blob/master/flytekit/core/python_auto_container.py#L47
super().init(..., secret_requests=[...])
and access them in execute
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it. the secret_requests should pass through in kwargs
from flytekit import current_context, kwtypes, Secret, task, workflow | ||
from flytekit.extras.sqlalchemy.task import SQLAlchemyConfig, SQLAlchemyTask | ||
|
||
# https://www.sqlitetutorial.net/sqlite-sample-database/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
delete?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep
try: | ||
d = tempfile.TemporaryDirectory() | ||
db_path = os.path.join(d.name, "tracks") | ||
db = dolt.Dolt.init(db_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interesting that dolt is only a test dependencyy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this integration will work with any sqlalchemy dialect. Dolt is probably not the best way to test this, honestly. Setting up docker-compose resources for mysql, postgres, and a few others to test the integration is a lot of code, though https://haseebmajid.dev/blog/gitlab-ci-pytest-and-docker-compose. Other alternatives work, that's how i'd do it though.
Codecov Report
@@ Coverage Diff @@
## master #445 +/- ##
==========================================
+ Coverage 84.36% 84.39% +0.03%
==========================================
Files 340 344 +4
Lines 25743 25835 +92
Branches 2110 2114 +4
==========================================
+ Hits 21717 21804 +87
- Misses 3455 3458 +3
- Partials 571 573 +2
Continue to review full report at Codecov.
|
ec182a1
to
79c6e83
Compare
79c6e83
to
18a3c2d
Compare
@max-hoffman so this is a Linux Foundation guideline, you have to signoff every commit you have using commit |
@kumare3 I rebased with a signoff according to the DCO guidelines last night, is there additional steps? |
flytekit/extras/sqlalchemy/task.py
Outdated
|
||
uri: str | ||
connect_args: typing.Optional[typing.Dict[str, typing.Any]] = None | ||
secret_connect_args: typing.Optional[typing.Dict[str, typing.Dict[str, typing.Any]]] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not another dataclass? Its always 2 values at max right? Actual you can just use the secrets class here -
flytekit/flytekit/models/security.py
Line 11 in 8528268
class Secret(_common.FlyteIdlEntity): |
flytekit/extras/sqlalchemy/task.py
Outdated
outputs = kwtypes(results=output_schema_type if output_schema_type else FlyteSchema) | ||
self._uri = task_config.uri | ||
self._connect_args = task_config.connect_args or {} | ||
self._secret_connect_args = task_config.secret_connect_args |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should patch the secrets from here into the parent constructor and then we can easily drop **kwargs.
Let me know if you want to chat about this, and maybe the model we have needs a little bit of an update to clean it up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we're using secret_requests
for config overrides we'll still need the key mapping somehow, unless we assume one of the Secret fields is the key
7005af8
to
3ed5b09
Compare
@max-hoffman Sorry I did not realize this but for all new plugins we actually use the Microlib model. Can you please move the SQLAlchemy plugin to - https://github.com/flyteorg/flytekit/tree/master/plugins |
@kumare3 Yeah no worries, I was wondering why the SQLite3 task was separated from other plugins. |
ya just because it does not need extra libraries. We wanted a canonical SQL example to ship with Flytekit |
fdb975c
to
5b1b48b
Compare
.github/workflows/pythonbuild.yml
Outdated
pip freeze | ||
- name: Install Dolt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove this? Can we make this part of the plugin test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the "which database to test with" trade-off. Testing with Dolt was easy but I need this executable. I think this should be replaced by more comprehensive tests with Postgres and Mysql, but those would have a bigger testing footprint and a longer dev-timeline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe the plugin could be tested with SQLite, which wouldn't have dependency issues, but that might take me a bit to figure out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right that we should run plugin unit tests as part of CI but that gets kicked off from the Makefile which depends on this setup line which then runs this pip install so the sqlalchemy setup.py is still run. Any setup steps you put there will be executed. It sounds like the implication is that pip installing this plugin isn't quite enough to get started, there's a few more commands to run after the pip install step before the user can start using it? The constructs around post-install hooks seem scarce though. Can we replace with default behavior by any chance?
I don't know much about sqlalchemy... is this plugin for all sqlalchemy? Or is it only for dolt? If there are hard dependencies on dolt for this plugin, should we rename to DoltSqlAlchemy?
Sorry for all the back and forth! I'm out most of tomorrow (tuesday) but can do a google meet in the morning or the day after if maybe that's faster? And thank you for doing this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no worries! this doc is my summary for what sqlalchemy does in the context of flyte, as a short summary. feel free to ping me in slack or discord to meet whenever. will follow up with specific QA below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sqlalchemy is generic, not just for dolt. you put any SQL server string in there, specify the dialect (mysql, postgres, etc) and it will issue the proper select syntax. if you look at the code in PR summary above, it's quite concise -- a connect string and a pandas read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Setting up test database servers for every example data source is much more difficult. Each database server requires a different package except sqlite, maybe. That's why I added dolt as a test dependency -- to make a database server test fixture. I could have used postgres, mysql, oracle or sqlite instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm quickest with dolt -- in retrospect sqlite might have been a better choice. Ketan said you guys would have a way to inject server dependencies soon, so I assumed my tests would be phased out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Post-install hooks would include the test dependencies in the package, which I think you want to avoid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking so much time looking at this! I got the initial version working in 20 minutes but I really should have used sqlite -- saved us both from the back and forth
dev-requirements.in
Outdated
@@ -2,6 +2,7 @@ | |||
|
|||
black | |||
coverage[toml] | |||
doltcli |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would a test-dependency in the plugin pull it in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I'm a little confused... you're saying that users of the sqlalchemy plugin will need to install dolt dependencies as well to use it? If that's the case, can we just move this line into the sqlalchemy setup.py?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sqlalchemy is database-independent, but needs at least one for testing. Moving to the sqlalchemy setup is OK but it's a test dependency -- if it's in the sqlalchemy setup.py test deps will it actually be installed for tests? (I think no? but I don't completely understand the micro-package organization.) I probably should have done this all with sqlite, it will take me awhile to rewrite the tests though
plugins/sqlalchemy/setup.py
Outdated
|
||
microlib_name = f"flytekitplugins-{PLUGIN_NAME}" | ||
|
||
plugin_requires = ["flytekit>=0.16.0b0,<1.0.0", "sqlalchemy>=1.4.7", "pymysql>=1.0.2"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
plugin_requires = ["flytekit>=0.16.0b0,<1.0.0", "sqlalchemy>=1.4.7", "pymysql>=1.0.2"] | |
plugin_requires = ["flytekit>=0.17.0,<1.0.0", "sqlalchemy>=1.4.7", "pymysql>=1.0.2"] |
setup.py
Outdated
@@ -19,8 +19,9 @@ | |||
hive_sensor = ["hmsclient>=0.0.1,<1.0.0"] | |||
notebook = ["papermill>=1.2.0", "nbconvert>=6.0.7", "ipykernel>=5.0.0"] | |||
sagemaker = ["sagemaker-training>=3.6.2,<4.0.0"] | |||
mysql = ["sqlalchemy>=1.4.7", "pymysql>=1.0.2"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's nix this line? These are all for the old style of plugins (the ones that live in flytekit/plugins rather than the new top level plugins folder). We're trying to deprecate all the old-api related things.
setup.py
Outdated
|
||
all_but_spark = sidecar + schema + hive_sensor + notebook + sagemaker | ||
all_but_spark = sidecar + schema + hive_sensor + notebook + sagemaker + mysql |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
revert here as well.
try: | ||
d = tempfile.TemporaryDirectory() | ||
db_path = os.path.join(d.name, "tracks") | ||
db = dolt.Dolt.init(db_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we just remove this test, and then remove all dolt-specific dependencies? Then we should merge. I'm totally happy having a dolt-specific example in the flytesnacks repository under cookbook/plugins that uses this task. I think that would be cleanest.
Signed-off-by: Max Hoffman <[email protected]>
@wild-endeavor I removed dolt tests, added an sqlite test because that doesn't require dependencies, but coverage is going to be missing lines because the secret-parsing paths aren't checked. I can add code that hits the secret lines with sqlite, but without a way to check whether they were set correctly. |
04f869f
to
da8ed8b
Compare
plugins/sqlalchemy/setup.py
Outdated
"Topic :: Software Development :: Libraries", | ||
"Topic :: Software Development :: Libraries :: Python Modules", | ||
], | ||
scripts=["scripts/flytekit_install_sqlalchemy.sh"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay one last comment sorry. this script is missing. do we need this? Can we move the dolt install script to the example in the future as well and just remove it, or is there a new script that's needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no script needed, good catch i'll fix in a couple
Signed-off-by: Max Hoffman <[email protected]>
da8ed8b
to
15b64b8
Compare
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
15b64b8
to
fe59fdc
Compare
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
TL;DR
This PR adds an
SQLAlchemyTask
modeled afterSQLite3Task
. This task loads a DataFrame from a database using an connection dialect URL.Note:
Optional:
connect_args
is provided to override the URL-specified values.{"password": {"group": SECRET_GROUP, "name": SECRET_NAME}}
will override the password value with the result fetched from the Security Context.Type
Are all requirements met?
Complete description
This is the entirety of the connection point. SQLAlchemy has lots of gotchas for different dialects, but those have to be shaken out in testing:
I've tested this with MySQL. Choosing a multi-database testing setup is messy and different between librariese, and I have not added this to the PR yet. I usually see libraries matrix testing code like this, for example with docker compose setups that swap image based on database name, or using SQLAlchemy's ORM model to insert test data arbitrary to the database endpoint.
If I was going to add these, I would try to use pytest database fixtures to get as close to unit-tests, before adding smoke tests for Task-runs in the orchestrator.
This is the only other non-boilerplate:
Tracking Issue
https://github.com/lyft/flyte/issues/
Follow-up issue
NA
OR
https://github.com/lyft/flyte/issues/