Feature/python model v1 #377
Conversation
Update version to 1.3.0a1. Teensy other changes
Feature/python model v1 incremental
@ChenyuLInx I ported the changes here that I think should be in
We're planning to merge all the code in this PR, and include it in a beta release of
The only test that fails here actually failed on main. I would consider it not a blocker for this PR. Maybe we should create a separate ticket to resolve it? @nathaniel-may
The failing test is strange. I saw it when I made the incremental change (one line), but it was working when I merged and hasn't shown up in the alerts since. It's fine to look at separately; it's definitely unrelated.
Looks good. After the core branch is merged you'll have to update dev-requirements.txt...
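For reference, a dev-requirements.txt pin for an unreleased dbt-core branch usually looks roughly like the following; the branch name here is a placeholder, not the actual branch:

```
git+https://github.com/dbt-labs/dbt-core.git@<core-branch>#egg=dbt-core&subdirectory=core
git+https://github.com/dbt-labs/dbt-core.git@<core-branch>#egg=dbt-tests-adapter&subdirectory=tests/adapter
```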
@gshank I looked at the code I rebased and it doesn't look like it should cause any problems. But given the fact that it fails on the
Hi @ChenyuLInx, I am a little late to the show, but I still have some questions for you.
```python
json={
    "run_name": "debug task",
    "existing_cluster_id": self.connections.profile.credentials.cluster,
    "notebook_task": {
```
Why not use the spark_python_task? IMHO it is cleaner than notebooks; also, I expect you do not require the user to be stated when using the spark_python_task.
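For comparison, a hypothetical alternative payload for the same runs/submit call using spark_python_task; the DBFS path is a placeholder, and this is not what the PR currently does:

```python
# Hypothetical alternative: point the one-time run at a Python file on DBFS
# instead of a workspace notebook, so no workspace user/folder is involved.
payload = {
    "run_name": "debug task",
    "existing_cluster_id": "<cluster-id>",
    "spark_python_task": {
        "python_file": "dbfs:/dbt/<project_name>/<schema_name>/<model_name>.py",
    },
}
```

The trade-off discussed below is that the script has to be uploaded to S3 or DBFS first, whereas the notebook approach only needs a workspace user/folder.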
Hey @JCZuurmond, thanks for pointing me to another method here!
This method is being used because it will leave a notebook after the run that you can play with and iterate on your python model code there. But I do agree that this is more suitable during the development phase.
I looked up spark_python_task; it seems like if we want to do it that way, we will still need to upload the python file somewhere (S3 or DBFS) and pass in the path here. In that case we will need extra setup to put that python file there, versus currently we only require the extra user you also mentioned in the next comment to create a folder and put the notebooks in.
Happy to hear more thoughts on this and pivot to the other approach for production runs!
> This method is being used because it will leave a notebook after the run that you can play with and iterate on your python model code there. But I do agree that this is more suitable during the development phase.

I understand this is useful during development; still, it is unexpected behavior to me. This does not happen for the SQL models (we could also upload the SQL in a notebook and run the notebook as a job). And it requires a user for the production system, which was not required before.
> I looked up spark_python_task; it seems like if we want to do it that way, we will still need to upload the python file somewhere (S3 or DBFS) and pass in the path here. In that case we will need extra setup to put that python file there, versus currently we only require the extra user you also mentioned in the next comment to create a folder and put the notebooks in.
I would use a certain convention, for example that we upload the scripts to dbfs:/dbt/<project name>/<database name>/<model name>.py. This would be similar to the database field in the profile that is used as a prefix for your schema name. It eliminates the need for the user fields and mimics existing dbt behavior, like the location in external tables.
And maybe the create-dirs step is not needed; I don't think it is for the spark_python_task.
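A minimal sketch of what an upload under such a convention could look like, using the DBFS put endpoint; the host, token, and exact path convention are assumptions for illustration, not part of this PR:

```python
import base64
import requests

# Hypothetical sketch: upload a compiled Python model to DBFS under a
# convention like /dbt/<project>/<database>/<model>.py, so no workspace user
# or folder is required. Host and token are placeholders.
host = "https://<workspace>.cloud.databricks.com"
token = "<personal-access-token>"

def upload_model_script(project: str, database: str, model: str, source: str) -> str:
    path = f"/dbt/{project}/{database}/{model}.py"
    response = requests.post(
        f"{host}/api/2.0/dbfs/put",
        headers={"Authorization": f"Bearer {token}"},
        json={
            "path": path,
            "contents": base64.b64encode(source.encode("utf-8")).decode("ascii"),
            "overwrite": True,
        },
    )
    response.raise_for_status()
    # A spark_python_task would then reference the file as f"dbfs:{path}".
    return path
```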
Created #424 for this, feel free to update that issue! Thank you so much for the feedback!!
```python
# create new dir
if not self.connections.profile.credentials.user:
    raise ValueError("Need to supply user in profile to submit python job")
```
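For context, a sketch of how the user from the profile might be used: it mainly serves to build a per-user workspace folder for the generated notebooks, roughly like the following (folder layout, host, and token are assumptions for illustration):

```python
import requests

# Hypothetical sketch: use the profile's user to build a workspace path such as
# /Users/<user>/dbt_python_model and create it via the Workspace API before
# importing/running notebooks there. Host, token, and user are placeholders.
host = "https://<workspace>.cloud.databricks.com"
token = "<personal-access-token>"
user = "someone@example.com"

response = requests.post(
    f"{host}/api/2.0/workspace/mkdirs",
    headers={"Authorization": f"Bearer {token}"},
    json={"path": f"/Users/{user}/dbt_python_model"},
)
response.raise_for_status()
```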
What user is expected to be used in the automated/scheduled dbt jobs for the production system? I think this implies a user should be created for that system.
The initial thought is that the user would be the Databricks user who created the token used for the production environment. But following the discussion in the thread above, if we pivot to spark_python_task, then this could be a different setup in production (configs needed for S3 or DBFS).
Let's continue the discussion in the other thread. I would be in favor of not requiring a user to be stated.
```python
{{ compiled_code }}

# --- Autogenerated dbt code below this line. Do not modify. --- #
dbt = dbtObj(spark.table)
```
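For readers skimming the template, a purely illustrative sketch of what a dbtObj-style helper could provide; this is a hypothetical stand-in, not the actual generated dbt code:

```python
# Hypothetical stand-in for the generated helper: ref()/source() hand back
# DataFrames by loading already-materialized relations through the function
# passed in (spark.table in the notebook template). Not dbt's implementation.
class dbtObj:
    def __init__(self, load_df_function):
        self._load_df = load_df_function

    def ref(self, relation_name):
        # Real generated code resolves the relation name at compile time.
        return self._load_df(relation_name)

    def source(self, source_name, table_name):
        return self._load_df(f"{source_name}.{table_name}")
```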
I think we can make the `spark` session more explicit, and thus not expect the notebook to magically insert this global variable, by adding the following above `{{ compiled_code }}`:

```python
from pyspark.sql import SparkSession

session = SparkSession.builder.getOrCreate()
```
```
N.B. Python models _can_ write to temp views HOWEVER they use a different session
and have already expired by the time they need to be used (i.e. in merges for incremental models)

TODO: Deep dive into spark sessions to see if we can reuse a single session for an entire
```
Isn't this a result of using jobs? I think each job always has a different Spark session.
This is the comment from @iknox-fa.
The main issue here is that the python part of the model build gets its session from the jobs API, but the rest of the logic for the model has another session. So we have to delete the python tmp table ourselves after the merge logic (using existing SQL), whereas if it were all SQL we could just make a true temp table and it would be gone after the current dbt model finishes.
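To illustrate why this matters, a small local PySpark sketch showing that an ordinary temp view is only visible to the session that created it, which is the situation dbt ends up in when the python job runs in one session and the follow-up SQL runs in another:

```python
from pyspark.sql import SparkSession

# Temp views are session-scoped: a view registered in one session is not
# visible from a second session sharing the same SparkContext.
spark = SparkSession.builder.master("local[1]").appName("tmp-view-demo").getOrCreate()
other = spark.newSession()

spark.range(3).createOrReplaceTempView("python_tmp")

def has_view(session, name):
    return any(t.name == name for t in session.catalog.listTables())

print(has_view(spark, "python_tmp"))  # True
print(has_view(other, "python_tmp"))  # False: different session, view not visible
```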
### Description
Ports the changes for python model v1 from [`dbt-spark`](dbt-labs/dbt-spark#377) but uses the APIs below instead.
- [Create an execution context](https://docs.databricks.com/dev-tools/api/1.2/index.html#create-an-execution-context)
- [Run a command](https://docs.databricks.com/dev-tools/api/1.2/index.html#run-a-command)
- [Get information about a command](https://docs.databricks.com/dev-tools/api/1.2/index.html#get-information-about-a-command), looping until the command ends
- [Delete an execution context](https://docs.databricks.com/dev-tools/api/1.2/index.html#delete-an-execution-context)
This change currently includes table materialization and incremental materialization.
Not all changes should live in this repo; certain parts will be moved to dbt-databricks.
Also super happy to hear any feedback and anything you think we missed.
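For anyone following along, a hedged sketch of the Command Execution flow described above, using the endpoints from the linked 1.2 API docs; the host, token, cluster id, and command string are placeholders:

```python
import time
import requests

# Hypothetical sketch of the 1.2 Command Execution flow: create a context,
# run a command, poll its status until it ends, then destroy the context.
host = "https://<workspace>.cloud.databricks.com"
token = "<personal-access-token>"
cluster_id = "<cluster-id>"
auth = {"Authorization": f"Bearer {token}"}

# 1. Create an execution context
context_id = requests.post(
    f"{host}/api/1.2/contexts/create",
    headers=auth,
    json={"clusterId": cluster_id, "language": "python"},
).json()["id"]

# 2. Run a command (the compiled python model code would go in "command")
command_id = requests.post(
    f"{host}/api/1.2/commands/execute",
    headers=auth,
    json={
        "clusterId": cluster_id,
        "contextId": context_id,
        "language": "python",
        "command": "print('hello from a python model')",
    },
).json()["id"]

# 3. Get information about the command, looping until it ends
while True:
    status = requests.get(
        f"{host}/api/1.2/commands/status",
        headers=auth,
        params={"clusterId": cluster_id, "contextId": context_id, "commandId": command_id},
    ).json()
    if status.get("status") in ("Finished", "Cancelled", "Error"):
        break
    time.sleep(5)

# 4. Delete the execution context
requests.post(
    f"{host}/api/1.2/contexts/destroy",
    headers=auth,
    json={"clusterId": cluster_id, "contextId": context_id},
)
```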