Implement --export-slice-airflow #320

Conversation
lineapy/instrumentation/tracer.py (Outdated)

@@ -156,6 +161,30 @@ def sliced_func(self, slice_name: str, func_name: str) -> str:
        full_code = format_str(full_code, mode=black_mode)
        return full_code

    def sliced_aiflow_dag(self, slice_name: str, func_name: str) -> str:
What do you think about moving the airflow-specific code to a different folder, like `linea/airflow`?
I've created `plugins/airflow.py`, wdyt?
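For illustration, a minimal sketch of what such a plugin module could expose is below. The function body, the `AIRFLOW_IMPORTS_TEMPLATE` contents, and the indentation scheme are assumptions for this sketch, not the actual lineapy implementation:

```
# plugins/airflow.py -- hypothetical sketch, not the actual lineapy code.
# Wraps already-sliced code in Airflow boilerplate and returns it as source text.

AIRFLOW_IMPORTS_TEMPLATE = """
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
"""


def sliced_airflow_dag(sliced_code: str, func_name: str) -> str:
    """Return the source text of an Airflow DAG file built around the sliced code."""
    # Indent the sliced code so it becomes the body of a task callable.
    body = "\n".join("    " + line for line in sliced_code.splitlines())
    return AIRFLOW_IMPORTS_TEMPLATE + f"\ndef {func_name}():\n" + body + "\n"
```

Keeping this assembly in a separate module keeps `tracer.py` free of Airflow-specific strings, which is the point of the reviewer's suggestion.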
@saulshanabrook @yifanwu This produces a valid DAG, but I still need to add tests, hence Draft.
Nit: please add the example to the README. Please also add instructions in the README about how we can run the Airflow jobs.
Also, for future reference, please include a brief summary of your changes in the PR description.
…e blocking airflow work. Will sync up with Saul to see if this is acceptable or if we have a better solution.
    + AIRFLOW_IMPORTS_TEMPLATE
    + "\n\n"
    + code_block
    + f"\n\tprint({artifact_var})"  # TODO What to do with artifact_var in a DAG?
I think we need to discuss and design this in more detail before we can commit this PR. @marov you are better equipped than us to help design what to do with this artifact in a deployed airflow process. Let's look at some reference cases together at our next meeting (Friday).
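Purely as one possible direction for that discussion, and not what this PR implements: if the sliced code runs inside a `PythonOperator` callable, returning the artifact value instead of printing it would push it to XCom, where downstream tasks can read it. A minimal sketch, with the DAG id, task id, and artifact variable all invented for illustration:

```
# Hypothetical sketch: expose the sliced artifact via XCom instead of print().
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago


def compute_p_value():
    p_value = 0.042  # stand-in for the real sliced computation
    return p_value  # values returned from the callable are pushed to XCom


with DAG(
    dag_id="sliced_housing_dag",
    schedule_interval=None,
    start_date=days_ago(1),
) as dag:
    PythonOperator(task_id="compute_p_value", python_callable=compute_p_value)
```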
@@ -84,6 +84,19 @@ jupyter nbconvert --to notebook --execute tests/test_notebook.ipynb --inplace --
Or you can open it in a notebook UI (JupyterLab, JupyterNotebook, VS Code, etc.)
and re-run it manually

### Airflow
This should really be in the README.md, but since it's not ready for "release", maybe we can create another doc called EXPERIMENTAL.md or something and move this there.
I was under the impression that CONTRIBUTING.md serves this function.
```
lineapy tests/housing.py --slice "p value" --airflow sliced_housing_dag
```
This creates a `sliced_housing_dag.py` file in the current dir. It can be executed with:
Could we move this into a separate folder (when generating)? Cleaner organization.
Sure, what would you call it? `output` or `export`?
Or something like `airflow_jobs` / `airflow_dags`?
With this we are saving to the same dir where the original file is, which is `tests` for `housing.py`. This seems OK to me.
I can change it to any other place of course :)
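Wherever the file ends up, one quick way to sanity-check that Airflow can load it is to point a `DagBag` at that directory. A minimal sketch, assuming the file was written to `tests/` as in the example above:

```
# Hypothetical check that the generated file parses as a valid DAG.
from airflow.models import DagBag

dag_bag = DagBag(dag_folder="tests", include_examples=False)
assert not dag_bag.import_errors, dag_bag.import_errors
print(dag_bag.dag_ids)  # should include the generated dag_id
```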
dag = DAG(
    dag_id="DAG_NAME_dag",
    schedule_interval="*/15 * * * *",  # Every 15 minutes
We need to figure out how to parametrize this for API design. I'm not sure about the workflow where the user goes in and manually modifies things.
Yes, but we need to think about the entire UX. We can't just keep adding options to the CLI, right?
I think a config file or a set of configs is in order. I'd vote for YAML, but I'm open to anything.
Yes let's talk about it this afternoon.
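To make the idea concrete, a rough sketch of what a YAML-driven configuration could look like follows. The file name, keys, and defaults are all invented for illustration and are not a proposed API:

```
# Hypothetical: read DAG settings from a YAML file instead of adding more CLI flags.
import yaml

DEFAULTS = {"owner": "airflow", "retries": 2, "schedule_interval": "*/15 * * * *"}


def load_dag_config(path: str = "linea_airflow.yaml") -> dict:
    """Merge user-provided settings over the defaults, if the config file exists."""
    try:
        with open(path) as f:
            user_config = yaml.safe_load(f) or {}
    except FileNotFoundError:
        user_config = {}
    return {**DEFAULTS, **user_config}
```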
default_dag_args = {"owner": "airflow", "retries": 2, "start_date": days_ago(1)}

dag = DAG(
    dag_id="DAG_NAME_dag",
What happens if we have two artifacts? Will this logic break?
Currently we can only slice one artifact through `--slice`; it produces code which we save as this DAG. If there are other artifacts in the DB, it won't be a problem.
Sorry, I'm not following: do the `dag_id`s not need to be unique?
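For context, the `DAG_NAME_dag` string in the template above is presumably substituted with the name passed on the command line, so two slices exported under different names would get different `dag_id`s. A tiny sketch of that substitution, where the template constant and the simple replace-based approach are assumptions:

```
# Hypothetical sketch of how the dag_id placeholder could be filled in.
AIRFLOW_DAG_TEMPLATE = 'dag = DAG(dag_id="DAG_NAME_dag", schedule_interval="*/15 * * * *")'


def render_dag_definition(func_name: str) -> str:
    # e.g. func_name="sliced_housing" -> dag_id="sliced_housing_dag"
    return AIRFLOW_DAG_TEMPLATE.replace("DAG_NAME", func_name)
```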
I was trying to debug the mypy error. I re-ran an earlier commit here on CI and found this error during install:
I wonder why flask is being pulled in?
I think Airflow uses flask for its UI, see here: https://github.com/apache/airflow/blob/f5ad26dcdd7bcb724992528dce71056965b94d26/setup.py#L422
@marov thanks for checking that out. I am trying to debug the versioning problems with mypy. It looks like it's currently installing an older version of sqlalchemy, because of a transitive dependency of airflow, and that version doesn't include the mypy plugin, which is breaking the tests. I am looking into whether there is a solution that lets us keep a newer version of sqlalchemy.
It seems the issue is that flask-appbuilder depends on an old version of sqlalchemy (dpgaspar/Flask-AppBuilder#1710), which in turn is waiting on this PR to be merged into flask-sqlalchemy (pallets-eco/flask-sqlalchemy#1001).
@@ -20,7 +20,7 @@ relative_files = true

# https://docs.sqlalchemy.org/en/14/orm/extensions/mypy.html
# https://pydantic-docs.helpmanual.io/mypy_plugin/#enabling-the-plugin
plugins = ["sqlalchemy.ext.mypy.plugin", "pydantic.mypy"]
Could we try just removing the sqlalchemy plugin but keeping the pydantic one? And commenting that it's disabled until we can add sqlalchemy >= 1.4, which adds mypy support, but is blocked on flask-appbuilder, which is a dep of airflow?
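Concretely, that suggestion would amount to something like the following edit to the mypy plugin setting shown in the diff above; this is only an illustration of the proposal, not the final change:

```
# https://docs.sqlalchemy.org/en/14/orm/extensions/mypy.html
# https://pydantic-docs.helpmanual.io/mypy_plugin/#enabling-the-plugin
# TODO: re-enable "sqlalchemy.ext.mypy.plugin" once we can use sqlalchemy >= 1.4;
# currently blocked by flask-appbuilder (a dependency of airflow).
plugins = ["pydantic.mypy"]
```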
We could do that. However, as I've mentioned to @yifanwu today, Airflow is but one of the data tools that linea can have "plugins" for. New tools will come with new dependencies and more issues like this. Hence I propose:

- Merge this PR (removing the sqlalchemy plugin and leaving a TODO in the comments).
- Next, create in a different repo (or just a folder?) a place for external plugins with examples and tests.
- Each of the "plugins" will have its own `setup.py`, decoupling it from the rest (see the packaging sketch below).

Does Jupyter also fall in this category? One can argue both ways: we can say that Jupyter support is the core feature of lineapy, but I'd argue that lineapy can work without Jupyter. And Jupyter is not the only notebook server either (Zeppelin etc.).
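To illustrate that last bullet, a hypothetical packaging stub for a separately distributed Airflow plugin might look like this; the package name, version, and dependency pins are invented for illustration only:

```
# setup.py -- hypothetical stub for a separately-packaged lineapy Airflow plugin.
from setuptools import find_packages, setup

setup(
    name="lineapy-airflow",
    version="0.0.1",
    packages=find_packages(),
    install_requires=[
        "lineapy",
        # Airflow (and its flask/sqlalchemy pins) only for users who need DAG export.
        "apache-airflow>=2.0",
    ],
)
```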
> Next create in a different repo (or just a folder?) a place for external plugins with examples and tests

I am +1 on monorepos when the code is tightly coupled!
Ok, @yifanwu @saulshanabrook how shall we proceed?
We talked about this at standup today and it sounds good to merge now.
Closes #275

Implements `--export-slice-airflow` (or simply `--airflow`), which generates `sliced_housing_dag.py` for the sliced `p value` artifact from `lineapy tests/housing.py --slice "p value" --airflow sliced_housing_dag`.