Add Snowpark Operator that allows execution of Snowflake Snowpark code #24456
Comments
Since you are from Snowflake @sfc-gh-madkins - would you like to contribute it? We are generally moving in a direction where stakeholders get more responsibility for adding and maintaining their providers (when there are strong stakeholders behind an operator), so there will be more and more of an expectation that stakeholders in a provider will submit and maintain more changes to their providers: https://lists.apache.org/thread/6ngq79df7op541gfwntspdtsvzlv1cr6 (anyhow - otherwise it will have to wait until someone picks it up - if someone from Snowflake won't). |
I would like to work on this. Disclaimer: I work for Snowflake (@sfc-gh-kbregula), but I contribute to Airflow during non-business hours. |
@turbaszek @mik-laj The vision I had for this was to smash together the Python operator with our Snowflake hook -- so in theory people get the same experience of authenticating as they do with the Snowflake operator, and the same experience as with the Python operator, by providing a "Snowpark" callable -- that passes in the authenticated session variable. |
I thought to just add Snowpark session creation support to the SnowflakeHook class. If the user uses the new Task Flow API (https://airflow.apache.org/docs/apache-airflow/stable/tutorial_taskflow_api.html), we will have very simple and readable code:

from airflow.decorators import task
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook


@task()
def get_count():
    session = SnowflakeHook(
        snowflake_conn_id="conn-id"
    ).get_snowpark_session()
    return session.sql("SELECT count(*) FROM sample_product_data").collect() |
What if they want to run a series of Snowpark code?
|
How about creating a @snowpark_task decorator, as we do for virtualenv and docker?

https://github.com/apache/airflow/blob/c3c1f7ea2851377ba913075844a8dde9bfe6376e/airflow/providers/docker/decorators/docker.py#L135-L139
https://github.com/apache/airflow/blob/c3c1f7ea2851377ba913075844a8dde9bfe6376e/airflow/decorators/python_virtualenv.py#L71-L75

Of course, such an operator would have to provide the session:

@snowpark_task
def count_user(session: Session):  # session is required parameter
    session.sql(...)

Personally I find this more friendly than instantiating hooks or using operators. Apart from that, should we consider implementing a custom XCom backend based on reading query results using query-id? This can be a really nice addition to the Snowpark + Airflow combo. |
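For illustration only, a rough sketch of how such a decorator could be built on the same task_decorator_factory pattern as the docker and virtualenv decorators linked above; the class name, the snowflake_conn_id handling, and the get_snowpark_session() hook method are assumptions taken from this thread rather than existing provider code:

from airflow.decorators.base import DecoratedOperator, task_decorator_factory
from airflow.operators.python import PythonOperator
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook


class _SnowparkDecoratedOperator(DecoratedOperator, PythonOperator):
    """Runs the decorated callable with a Snowpark `session` kwarg injected."""

    def __init__(self, *, snowflake_conn_id="snowflake_default", **kwargs):
        self.snowflake_conn_id = snowflake_conn_id
        super().__init__(**kwargs)

    def execute(self, context):
        # `get_snowpark_session` is the hook method proposed earlier in this
        # thread, not an existing SnowflakeHook API.
        session = SnowflakeHook(
            snowflake_conn_id=self.snowflake_conn_id
        ).get_snowpark_session()
        # Make the session available to the decorated callable as `session`.
        self.op_kwargs = {**self.op_kwargs, "session": session}
        try:
            return super().execute(context)
        finally:
            session.close()


def snowpark_task(python_callable=None, multiple_outputs=None, **kwargs):
    """Decorator factory mirroring the docker/virtualenv task decorators."""
    return task_decorator_factory(
        python_callable=python_callable,
        multiple_outputs=multiple_outputs,
        decorated_operator_class=_SnowparkDecoratedOperator,
        **kwargs,
    )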
I like how this looks. Is there a way we can make it easy to grab an Airflow Connection to instantiate a Snowpark connection?
|
We can add support for an optional snowflake_conn_id parameter:

@snowpark_task(snowflake_conn_id='my-favourite-conn')
def count_user(session: Session):  # session is required parameter
    session.sql(...)

The official documentation recommends that custom decorators should be called task.CUSTOM, so we should introduce one change to @turbaszek's proposal:

-@snowpark_task
+@task.snowpark |
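Purely as an illustration of how the two suggestions above would combine, DAG-level usage could then look like this (assuming the decorator is registered by the provider so that it is reachable as task.snowpark, and with snowflake_conn_id spelled correctly):

import pendulum

from airflow.decorators import dag, task


@dag(schedule_interval=None, start_date=pendulum.datetime(2022, 6, 1), catchup=False)
def snowpark_example():

    @task.snowpark(snowflake_conn_id="my-favourite-conn")
    def count_user(session):  # session would be injected by the decorator
        return session.sql("SELECT count(*) FROM sample_product_data").collect()

    count_user()


snowpark_example()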
Looks good to me
|
Good catch @mik-laj with the @task.snowpark naming! |
I prepared a draft PR: #24652. The implementation is complete, but I will want to add some documentation to improve the adoption of this feature and system tests to avoid regression. |
I pulled the PR -- any advice on how I can test it locally? It doesn't appear to be as easy as pip install -e
|
I would recommend using breeze - Airflow development environment: |
@sfc-gh-madkins I wonder if I should set QUERY_TAG (https://docs.snowflake.com/en/sql-reference/parameters.html#query-tag) and pass the following values:

- Instance name (https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#instance-name)
- DAG ID
- Task ID
- Run ID

Does Snowflake have any recommendations for ISVs on how to use the query_tag that we can use here? Unfortunately, the public documentation is quite poor on the description of this parameter, and I do not know the best practices on how it should be used.

I also wonder what to do to make Snowflake customers aware of this integration. Currently, support for Snowflake in Airflow is not mentioned in the end-customer Snowflake documentation. For example, we could add Apache Airflow to the ecosystem section (https://docs.snowflake.com/en/user-guide/ecosystem.html). There is also an integration for Apache Beam added, but under its commercial name, Google Dataflow, so adding Apache Airflow under a commercial name, Cloud Composer (https://cloud.google.com/composer), is one solution.

(Screenshot: https://user-images.githubusercontent.com/12058428/175813909-ca28f942-1832-4bb2-8c42-b281bc88c3c0.png)

But Apache Airflow has more service providers, e.g. Amazon Managed Workflows for Apache Airflow (MWAA) (https://docs.aws.amazon.com/mwaa/latest/userguide/what-is-mwaa.html) or Astronomer (https://www.astronomer.io/). In addition, Apache Beam can also be run on-premises when using the Apache Spark runner instead of the Google Dataflow runner, which also allows us to use SnowflakeIO for Apache Beam without Google Dataflow.

For comparison, dbt has explicitly described Apache Airflow, and they even published a guide on how to use Apache Airflow with dbt Cloud:
https://docs.getdbt.com/guides/orchestration/airflow-and-dbt-cloud/1-airflow-and-dbt-cloud
https://docs.getdbt.com/docs/running-a-dbt-project/running-dbt-in-production

CC: @sfc-gh-pgancz, @sfc-gh-sleslie |
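As a sketch of the QUERY_TAG idea, assuming a Snowpark session object is available inside the task (for example via the hook method proposed earlier); the helper name and the tag layout are only suggestions, not an agreed convention:

import json

from airflow.operators.python import get_current_context


def apply_airflow_query_tag(session, instance_name="airflow"):
    """Tag all queries issued through this Snowpark session with Airflow run metadata."""
    ti = get_current_context()["ti"]  # current TaskInstance
    session.query_tag = json.dumps(
        {
            "instance_name": instance_name,  # e.g. taken from the Airflow config
            "dag_id": ti.dag_id,
            "task_id": ti.task_id,
            "run_id": ti.run_id,
        }
    )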
Tomek, Kamil -- Are we ready to merge code to main?

ISV doesn't have any best practices when it comes to using "query_tag". You should see in the Airflow Snowflake connection instantiation code that an "application" parameter gets set for us to track consumption.

I am working with the docs team to get an ecosystem section added for Snowpark. I can also coordinate other GTM activities that we might want to do with one or more Airflow managed providers.
|
@sfc-gh-madkins I still have to add documentation and system tests. |
I was hoping to push a blog post out by the end of July -- I also don't know what Airflow's release cycle is. Does that seem doable?
Miles
|
For providers - roughly monthly. I am planning to release new providers this week, so if it is not going to be merged tomorrow or maybe the day after, it will have to wait for next month. |
Kamil -- Any update on this from your side?
|
We have a customer interested in this :)
|
Yesterday I was working on system tests. Now only the documentation remains, which I also hope to prepare within a week. Would a customer be willing to test this PR before it is released by Apache? I can prepare a whl package that the user will have to install explicitly. |
Yeah they would definitely be open to testing it out
|
Question for an interested Snowflake customer: Will the release of the Snowflake Airflow provider package that includes the Snowpark operator be compatible with Airflow 2.2.2?
|
I still need to check it out, but for now I don't see any reason why it shouldn't work on older versions of Apache Airflow. |
Just to explain the process, @sfc-gh-madkins (@mik-laj already knows most of it) - I think it would be good for you to understand this and to arrange your process with your customers accordingly, so that there are no surprises and you can adapt to the rules we agreed on in the community. The rule we have for Airflow providers is that they should be 2.2+ compatible until the 11th of October.

We attempt to test 2.2 compatibility with basically every PR - at least to the level of whether the provider "imports" correctly on Airflow 2.2: https://github.com/apache/airflow/runs/7872249712?check_suite_focus=true. The API between the providers and Airflow (not yet formalized, but it soon will be when we split providers out to separate repos) is rather stable, and we review very carefully any parts that might potentially break. But accidental breakages happen (which we fix when we find them).

Ideally, those errors are detected during manual testing of RC candidates, and for that we kindly ask people employed by the stakeholders, and those who are involved in preparing changes, to test them when we release RC candidates. There are at least 72 hours to test the releases, and you are (including yourself, @sfc-gh-madkins) heavily encouraged to run those tests when we release the provider. Involving your customer is also encouraged. We announce it at the devlist (roughly monthly), with some early announcements that we are about to prepare a new wave of providers, and we create an issue like this one when RC candidates are ready, where you can comment and raise issues or confirm that things work as expected (the latter is very encouraged :)): #25640. The last one is a good example, where Databricks employees tested the Databricks provider and found a slight Airflow 2.2 compatibility issue (so we now have an RC4 candidate of Databricks with it fixed). You can see the discussion to see how it can play out.

So - to summarise - the best way to be "sure" that a provider works on Airflow 2.2 is to help the community by performing the tests of RC candidates on 2.2 - and possibly even engage your customers to run such testing with the release candidates. This is also a good answer to those customers of yours - if they ask whether it works with an earlier Airflow version, the best answer is "yes, it is intended to, but the best way to make sure of that is if you can help with testing the RC candidate" - and inform them when the RC is out :). The RCs are installable from PyPI - same as any other versions - but they need to be installed deliberately, specifying the version to install.

One important note. As mentioned above, the last 2.2-compatible release of providers will be at the beginning of October, so the November wave of providers will have 2.3+ compatibility only. Then all providers will have their major version bumped. |
I prepared a test package: https://drive.google.com/drive/folders/1UQcXJuhiIhLR39y3qSlHII2LeCwyI-yh |
Kamil -- happy to let you put hands on the keyboard. There are multiple ways this can be done. I'll put time on the calendar for us to connect.
|
Any update on this? I am happy to write something and raise the PR this weekend! |
@dhruvk12 please do :) |
hi all. looks like the PR mentioned above has gone stale because of the cloudpickle stuff in snowpark. seems related to this snowpark PR that has also gone stale. there also appears to be an initiative from astronomer to do a similar thing but the demo repo appears to depend on an archived underlying provider. as someone interested in this, what are the next steps here? has someone already made a community fork of this to get around the cloudpickle+python3.8 issues in the original PR? is there interest in that community fork? |
maybe @kaxil is able to answer the question above about the astronomer packages. |
@ryaminal I have started a PR at https://github.com/mpgreg/airflow/tree/main/airflow/providers/snowflake @mik-laj @sfc-gh-madkins I also ran into the same cloudpickle dependency issues. Anything you can do to help move snowflakedb/snowpark-python#975 along?
|
Does the problem still occur when you use the RC version of the Snowflake connector? That version doesn't depend on apache-arrow, so it's less risky. |
@mik-laj, the RC version depends on nanoarrow, right? I don't think that changes the dependency on cloudpickle, does it? |
This changes the situation a bit, because Apache Beam requires a specific version of Apache Arrow and Cloudpickle, so there is a chance that if the Snowflake connector does not depend on apache-arrow, we will be able to find a matching version of Apache Beam. However, this is a theory that we need to test in practice. |
I set |
There are two separate PRs here. Actually, I haven't made a PR yet, so technically only one. But @mik-laj it would be great if you could take a look at my decorators and operators to see if it makes sense to merge with your PR.

At a high level, I build from the Python virtualenv operator. I built this when Snowpark only supported Python 3.8 and there were other dependency issues with Airflow. Snowpark now supports 3.8-3.11, but in my experience there will likely always be challenges with major versions. Inheriting from the virtualenv operator will simplify dependency management.

I also created a SnowparkTable dataclass in order to serialize/deserialize Snowpark DataFrames passed to/from a task. This is needed because the DataFrames are associated with the Snowpark Session, which doesn't (or didn't at the time) serialize. So the user can't create a session in the DAG and pass it between tasks. Instead, a new session is created for each task, any SnowparkTable arguments passed to the task are instantiated as Snowpark DataFrames in the session, and any Snowpark DataFrames returned from the task are serialized to Snowflake as tables or stage objects (user's choice). The creation of the Snowpark session and the ser/des of SnowparkTable objects is accomplished in a new virtualenv Jinja template.

Lastly, and somewhat unrelated, I created a new custom XCom backend for Snowflake. Some users (specifically in regulated industries) wanted an option to keep all data in Snowflake, so all task in/output is saved to Snowflake objects (tables or stages) and only a URI is passed between tasks via XCom.

Operators
Decorators
|
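To make the SnowparkTable idea above concrete, here is a minimal, hypothetical sketch of the serialize/deserialize step; the class name, helper names, and save strategy are illustrative assumptions, not the code from the branch linked above:

from dataclasses import dataclass

from snowflake.snowpark import DataFrame, Session


@dataclass
class SnowparkTable:
    """XCom-friendly reference to a Snowflake table backing a Snowpark DataFrame."""

    name: str


def deserialize_arg(session: Session, value):
    """Turn SnowparkTable references passed to a task back into live DataFrames."""
    if isinstance(value, SnowparkTable):
        return session.table(value.name)
    return value


def serialize_result(value, table_name: str):
    """Persist a returned DataFrame as a table and hand only its reference to XCom."""
    if isinstance(value, DataFrame):
        value.write.save_as_table(table_name, mode="overwrite")
        return SnowparkTable(name=table_name)
    return value

A per-task wrapper (for example, in the virtualenv Jinja template mentioned above) would create the session, map deserialize_arg over the incoming arguments, call the user's callable, and run the return value through serialize_result.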
Looks good @mpgreg |
I am just curious how the testing of the operator will take place. Have any tests been written for this? |
@mpgreg let's push this one over the finish line! |
I think it's very unlikely that the community will push it. It has been open for almost 2 years. I think we had better close this issue, as it's probably a better fit for the Snowflake backlog rather than the community one. If someone thinks differently, we can reopen. |
Hi there, I'm Jianzhun from Snowflake (the Snowpark engineering team), and I would like to pick up this issue and get Snowpark supported officially in Airflow! May we re-open this issue? |
@sfc-gh-jdu Great to hear that -- it's a long-standing issue -- eagerly waiting to see it through to the finish line |
Hi folks, FYI the first PR is ready for review in #42457. Any comment is appreciated! |
Description
A Snowpark operator (similar to the Snowflake Operator) that executes Snowpark for Python code. Credentials are handled the same way as for the Snowflake Operator. Code is executed the same way as with the Python Operator.
Use case/motivation
I would like to be able to schedule snowflake-snowpark-python code that I have developed. Instead of having to write lengthy SQL statements, I would prefer to develop in my language of choice.
Related issues
No response
Are you willing to submit a PR?
Code of Conduct