Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sqlalchemy Task #445

Merged
merged 27 commits into from
Apr 22, 2021
Merged

Sqlalchemy Task #445

merged 27 commits into from
Apr 22, 2021

Conversation

max-hoffman
Copy link
Contributor

@max-hoffman max-hoffman commented Apr 8, 2021

TL;DR

This PR adds an SQLAlchemyTask modeled after SQLite3Task. This task loads a DataFrame from a database using an connection dialect URL.

Note:

  • This is not Dolt (or any other database) specific, I tested with Dolt because that was easiest for me. This should be tested with MySQL, Postgres and SQLite at some point.

Optional:

  • SQLAlchemy connection arguments can differ between dialects. A generic connect_args is provided to override the URL-specified values.
  • I've added a naive secret loading option -- a dictionary of key->(name, group) pairs that will be merged with the connection args. Ex: {"password": {"group": SECRET_GROUP, "name": SECRET_NAME}} will override the password value with the result fetched from the Security Context.

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

This is the entirety of the connection point. SQLAlchemy has lots of gotchas for different dialects, but those have to be shaken out in testing:

def execute(self, **kwargs) -> typing.Any:
    engine = create_engine(self._uri, connect_args=self._connect_args, echo=False)
    with engine.begin() as connection:
    df = pd.read_sql_query(self.get_query(**kwargs), connection)

I've tested this with MySQL. Choosing a multi-database testing setup is messy and different between librariese, and I have not added this to the PR yet. I usually see libraries matrix testing code like this, for example with docker compose setups that swap image based on database name, or using SQLAlchemy's ORM model to insert test data arbitrary to the database endpoint.

If I was going to add these, I would try to use pytest database fixtures to get as close to unit-tests, before adding smoke tests for Task-runs in the orchestrator.

This is the only other non-boilerplate:

        for key, secret in task_config.secret_connect_args.items():
            if "name" in secret and "group" in secret:
                value = flytekit.current_context().secrets.get(secret["group"], secret["name"])
                self._connect_args[key] = value

Tracking Issue

https://github.com/lyft/flyte/issues/

Follow-up issue

NA
OR
https://github.com/lyft/flyte/issues/

if task_config.secret_connect_args is not None:
for key, secret in task_config.secret_connect_args.items():
if "name" in secret and "group" in secret:
value = current_context().secrets.get(secret["group"], secret["name"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can only access the secrets in the execute command or logically. They will be available here, but you should be using them only in the execute. Reason is init will be invoked at convert / compile time, but execute only at runtime.

Also you will need to pass the secret requests to thparent
https://github.com/flyteorg/flytekit/blob/master/flytekit/core/python_auto_container.py#L47

super().init(..., secret_requests=[...])

and access them in execute

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it. the secret_requests should pass through in kwargs

from flytekit import current_context, kwtypes, Secret, task, workflow
from flytekit.extras.sqlalchemy.task import SQLAlchemyConfig, SQLAlchemyTask

# https://www.sqlitetutorial.net/sqlite-sample-database/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep

try:
d = tempfile.TemporaryDirectory()
db_path = os.path.join(d.name, "tracks")
db = dolt.Dolt.init(db_path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting that dolt is only a test dependencyy

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this integration will work with any sqlalchemy dialect. Dolt is probably not the best way to test this, honestly. Setting up docker-compose resources for mysql, postgres, and a few others to test the integration is a lot of code, though https://haseebmajid.dev/blog/gitlab-ci-pytest-and-docker-compose. Other alternatives work, that's how i'd do it though.

@codecov
Copy link

codecov bot commented Apr 15, 2021

Codecov Report

Merging #445 (fe59fdc) into master (8528268) will increase coverage by 0.03%.
The diff coverage is 94.56%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #445      +/-   ##
==========================================
+ Coverage   84.36%   84.39%   +0.03%     
==========================================
  Files         340      344       +4     
  Lines       25743    25835      +92     
  Branches     2110     2114       +4     
==========================================
+ Hits        21717    21804      +87     
- Misses       3455     3458       +3     
- Partials      571      573       +2     
Impacted Files Coverage Δ
...gins/sqlalchemy/flytekitplugins/sqlalchemy/task.py 89.18% <89.18%> (ø)
plugins/tests/sqlalchemy/test_task.py 97.61% <97.61%> (ø)
.../sqlalchemy/flytekitplugins/sqlalchemy/__init__.py 100.00% <100.00%> (ø)
plugins/tests/sqlalchemy/test_sql_tracker.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8528268...fe59fdc. Read the comment docs.

@max-hoffman max-hoffman marked this pull request as ready for review April 15, 2021 05:46
@max-hoffman max-hoffman changed the title Add pseudo-code for sqlalchemy Sqlalchemy Task Apr 15, 2021
@kumare3
Copy link
Contributor

kumare3 commented Apr 15, 2021

@max-hoffman so this is a Linux Foundation guideline, you have to signoff every commit you have using commit -s. One option could be to delete this branch and recreate a new branch that is a squash of all the commit with the signoff or use force signing every commit as explained in the DCO tab

@max-hoffman
Copy link
Contributor Author

@kumare3 I rebased with a signoff according to the DCO guidelines last night, is there additional steps?


uri: str
connect_args: typing.Optional[typing.Dict[str, typing.Any]] = None
secret_connect_args: typing.Optional[typing.Dict[str, typing.Dict[str, typing.Any]]] = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not another dataclass? Its always 2 values at max right? Actual you can just use the secrets class here -

class Secret(_common.FlyteIdlEntity):
right?

outputs = kwtypes(results=output_schema_type if output_schema_type else FlyteSchema)
self._uri = task_config.uri
self._connect_args = task_config.connect_args or {}
self._secret_connect_args = task_config.secret_connect_args
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should patch the secrets from here into the parent constructor and then we can easily drop **kwargs.
Let me know if you want to chat about this, and maybe the model we have needs a little bit of an update to clean it up

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we're using secret_requests for config overrides we'll still need the key mapping somehow, unless we assume one of the Secret fields is the key

@kumare3
Copy link
Contributor

kumare3 commented Apr 16, 2021

@max-hoffman Sorry I did not realize this but for all new plugins we actually use the Microlib model. Can you please move the SQLAlchemy plugin to - https://github.com/flyteorg/flytekit/tree/master/plugins
This way all the dependencies and requirements are separately maintained. Also flytekitplugins-sql will be a separate package in pypi

@max-hoffman
Copy link
Contributor Author

@kumare3 Yeah no worries, I was wondering why the SQLite3 task was separated from other plugins.

@kumare3
Copy link
Contributor

kumare3 commented Apr 16, 2021

@kumare3 Yeah no worries, I was wondering why the SQLite3 task was separated from other plugins.

ya just because it does not need extra libraries. We wanted a canonical SQL example to ship with Flytekit

pip freeze
- name: Install Dolt
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove this? Can we make this part of the plugin test?

Copy link
Contributor Author

@max-hoffman max-hoffman Apr 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the "which database to test with" trade-off. Testing with Dolt was easy but I need this executable. I think this should be replaced by more comprehensive tests with Postgres and Mysql, but those would have a bigger testing footprint and a longer dev-timeline.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe the plugin could be tested with SQLite, which wouldn't have dependency issues, but that might take me a bit to figure out

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right that we should run plugin unit tests as part of CI but that gets kicked off from the Makefile which depends on this setup line which then runs this pip install so the sqlalchemy setup.py is still run. Any setup steps you put there will be executed. It sounds like the implication is that pip installing this plugin isn't quite enough to get started, there's a few more commands to run after the pip install step before the user can start using it? The constructs around post-install hooks seem scarce though. Can we replace with default behavior by any chance?

I don't know much about sqlalchemy... is this plugin for all sqlalchemy? Or is it only for dolt? If there are hard dependencies on dolt for this plugin, should we rename to DoltSqlAlchemy?

Sorry for all the back and forth! I'm out most of tomorrow (tuesday) but can do a google meet in the morning or the day after if maybe that's faster? And thank you for doing this!

Copy link
Contributor Author

@max-hoffman max-hoffman Apr 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no worries! this doc is my summary for what sqlalchemy does in the context of flyte, as a short summary. feel free to ping me in slack or discord to meet whenever. will follow up with specific QA below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sqlalchemy is generic, not just for dolt. you put any SQL server string in there, specify the dialect (mysql, postgres, etc) and it will issue the proper select syntax. if you look at the code in PR summary above, it's quite concise -- a connect string and a pandas read.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting up test database servers for every example data source is much more difficult. Each database server requires a different package except sqlite, maybe. That's why I added dolt as a test dependency -- to make a database server test fixture. I could have used postgres, mysql, oracle or sqlite instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm quickest with dolt -- in retrospect sqlite might have been a better choice. Ketan said you guys would have a way to inject server dependencies soon, so I assumed my tests would be phased out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Post-install hooks would include the test dependencies in the package, which I think you want to avoid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking so much time looking at this! I got the initial version working in 20 minutes but I really should have used sqlite -- saved us both from the back and forth

@@ -2,6 +2,7 @@

black
coverage[toml]
doltcli
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would a test-dependency in the plugin pull it in?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I'm a little confused... you're saying that users of the sqlalchemy plugin will need to install dolt dependencies as well to use it? If that's the case, can we just move this line into the sqlalchemy setup.py?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sqlalchemy is database-independent, but needs at least one for testing. Moving to the sqlalchemy setup is OK but it's a test dependency -- if it's in the sqlalchemy setup.py test deps will it actually be installed for tests? (I think no? but I don't completely understand the micro-package organization.) I probably should have done this all with sqlite, it will take me awhile to rewrite the tests though


microlib_name = f"flytekitplugins-{PLUGIN_NAME}"

plugin_requires = ["flytekit>=0.16.0b0,<1.0.0", "sqlalchemy>=1.4.7", "pymysql>=1.0.2"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
plugin_requires = ["flytekit>=0.16.0b0,<1.0.0", "sqlalchemy>=1.4.7", "pymysql>=1.0.2"]
plugin_requires = ["flytekit>=0.17.0,<1.0.0", "sqlalchemy>=1.4.7", "pymysql>=1.0.2"]

setup.py Outdated
@@ -19,8 +19,9 @@
hive_sensor = ["hmsclient>=0.0.1,<1.0.0"]
notebook = ["papermill>=1.2.0", "nbconvert>=6.0.7", "ipykernel>=5.0.0"]
sagemaker = ["sagemaker-training>=3.6.2,<4.0.0"]
mysql = ["sqlalchemy>=1.4.7", "pymysql>=1.0.2"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's nix this line? These are all for the old style of plugins (the ones that live in flytekit/plugins rather than the new top level plugins folder). We're trying to deprecate all the old-api related things.

setup.py Outdated

all_but_spark = sidecar + schema + hive_sensor + notebook + sagemaker
all_but_spark = sidecar + schema + hive_sensor + notebook + sagemaker + mysql
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert here as well.

try:
d = tempfile.TemporaryDirectory()
db_path = os.path.join(d.name, "tracks")
db = dolt.Dolt.init(db_path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we just remove this test, and then remove all dolt-specific dependencies? Then we should merge. I'm totally happy having a dolt-specific example in the flytesnacks repository under cookbook/plugins that uses this task. I think that would be cleanest.

@max-hoffman
Copy link
Contributor Author

@wild-endeavor I removed dolt tests, added an sqlite test because that doesn't require dependencies, but coverage is going to be missing lines because the secret-parsing paths aren't checked. I can add code that hits the secret lines with sqlite, but without a way to check whether they were set correctly.

"Topic :: Software Development :: Libraries",
"Topic :: Software Development :: Libraries :: Python Modules",
],
scripts=["scripts/flytekit_install_sqlalchemy.sh"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay one last comment sorry. this script is missing. do we need this? Can we move the dolt install script to the example in the future as well and just remove it, or is there a new script that's needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no script needed, good catch i'll fix in a couple

Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
Signed-off-by: Max Hoffman <[email protected]>
@wild-endeavor wild-endeavor merged commit d77a773 into flyteorg:master Apr 22, 2021
max-hoffman added a commit to dolthub/flytekit that referenced this pull request Apr 29, 2021
max-hoffman added a commit to dolthub/flytekit that referenced this pull request Apr 29, 2021
max-hoffman added a commit to dolthub/flytekit that referenced this pull request Apr 29, 2021
max-hoffman added a commit to dolthub/flytekit that referenced this pull request Apr 29, 2021
max-hoffman added a commit to dolthub/flytekit that referenced this pull request May 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants