Feature/python model v1 #377
Conversation
Update version to 1.3.0a1. Teensy other changes
Feature/python model v1 incremental
@ChenyuLInx I ported the changes here that I think should be in
We're planning to merge all the code in this PR, and include it in a beta release of
The only test that fails here actually failed on main. I would consider it not a blocker for this PR. Maybe we should create a separate ticket to resolve it? @nathaniel-may
The failing test is strange. I saw it when I made the incremental change (one line), but it was working when I merged and hasn't shown up in the alerts since. It's fine to look at separately; it's definitely unrelated.
Looks good. After the core branch is merged you'll have to update dev-requirements.txt...
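For reference, a dev-requirements.txt pin for an unreleased dbt-core branch usually looks roughly like the following; the branch name here is a placeholder, not the actual branch:

```
git+https://github.com/dbt-labs/dbt-core.git@<core-branch>#egg=dbt-core&subdirectory=core
git+https://github.com/dbt-labs/dbt-core.git@<core-branch>#egg=dbt-tests-adapter&subdirectory=tests/adapter
```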
@gshank I looked at the code I rebased and it doesn't look like it should cause any problems. But given the fact that it fails on the
Hi @ChenyuLInx, I am a little late to the show, but I still have some questions for you.
```python
json={
    "run_name": "debug task",
    "existing_cluster_id": self.connections.profile.credentials.cluster,
    "notebook_task": {
```
Why not use the spark_python_task? IMHO it is cleaner than notebooks; also, I expect you do not require the user to be stated when using the spark_python_task.
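For comparison, a hypothetical alternative payload for the same runs/submit call using spark_python_task; the DBFS path is a placeholder, and this is not what the PR currently does:

```python
# Hypothetical alternative: point the one-time run at a Python file on DBFS
# instead of a workspace notebook, so no workspace user/folder is involved.
payload = {
    "run_name": "debug task",
    "existing_cluster_id": "<cluster-id>",
    "spark_python_task": {
        "python_file": "dbfs:/dbt/<project_name>/<schema_name>/<model_name>.py",
    },
}
```

The trade-off discussed below is that the script has to be uploaded to S3 or DBFS first, whereas the notebook approach only needs a workspace user/folder.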
Hey @JCZuurmond, thanks for pointing me to another method here!
This method is being used because it will leave a notebook after the run that you can play with and iterate on your python model code there. But I do agree that this is more suitable during the development phase.
I looked up spark_python_task; it seems like if we want to do it that way, we will still need to upload the python file somewhere (S3 or DBFS) and pass in the path here. In that case we will need extra setup to put that python file there, versus currently we only require the extra user you also mentioned in the next comment to create a folder and put the notebooks in.
Happy to hear more thoughts on this and pivot to the other approach for production runs!
> This method is being used because it will leave a notebook after the run that you can play with and iterate on your python model code there. But I do agree that this is more suitable during the development phase.

I understand this is useful during development; still, it is unexpected behavior to me. This does not happen for the SQL models (we could also upload the SQL in a notebook and run the notebook as a job). And it requires a user for the production system, which was not required before.
> I looked up spark_python_task; it seems like if we want to do it that way, we will still need to upload the python file somewhere (S3 or DBFS) and pass in the path here. In that case we will need extra setup to put that python file there, versus currently we only require the extra user you also mentioned in the next comment to create a folder and put the notebooks in.
I would use a certain convention, for example that we upload the scripts to dbfs:/dbt/<project name>/<database name>/<model name>.py. This would be similar to the database field in the profile that is used as a prefix for your schema name. It eliminates the need for the user fields and mimics existing dbt behavior, like the location in external tables.
And maybe the create-dirs step is not needed; I don't think it is for the spark_python_task.
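A minimal sketch of what an upload under such a convention could look like, using the DBFS put endpoint; the host, token, and exact path convention are assumptions for illustration, not part of this PR:

```python
import base64
import requests

# Hypothetical sketch: upload a compiled Python model to DBFS under a
# convention like /dbt/<project>/<database>/<model>.py, so no workspace user
# or folder is required. Host and token are placeholders.
host = "https://<workspace>.cloud.databricks.com"
token = "<personal-access-token>"

def upload_model_script(project: str, database: str, model: str, source: str) -> str:
    path = f"/dbt/{project}/{database}/{model}.py"
    response = requests.post(
        f"{host}/api/2.0/dbfs/put",
        headers={"Authorization": f"Bearer {token}"},
        json={
            "path": path,
            "contents": base64.b64encode(source.encode("utf-8")).decode("ascii"),
            "overwrite": True,
        },
    )
    response.raise_for_status()
    # A spark_python_task would then reference the file as f"dbfs:{path}".
    return path
```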
Created #424 for this, feel free to update that issue! Thank you so much for the feedback!!
```python
# create new dir
if not self.connections.profile.credentials.user:
    raise ValueError("Need to supply user in profile to submit python job")
```
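For context, a sketch of how the user from the profile might be used: it mainly serves to build a per-user workspace folder for the generated notebooks, roughly like the following (folder layout, host, and token are assumptions for illustration):

```python
import requests

# Hypothetical sketch: use the profile's user to build a workspace path such as
# /Users/<user>/dbt_python_model and create it via the Workspace API before
# importing/running notebooks there. Host, token, and user are placeholders.
host = "https://<workspace>.cloud.databricks.com"
token = "<personal-access-token>"
user = "someone@example.com"

response = requests.post(
    f"{host}/api/2.0/workspace/mkdirs",
    headers={"Authorization": f"Bearer {token}"},
    json={"path": f"/Users/{user}/dbt_python_model"},
)
response.raise_for_status()
```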
What user is expected to be used in the automated/scheduled dbt jobs for the production system? I think this implies a user should be created for that system.
The initial thought is that the user would be the Databricks user who created the token used for the production environment. But following the discussion in the thread above, if we pivot to spark_python_task, then this could be a different setup in production (configs needed for S3 or DBFS).
Let's continue the discussion in the other thread. I would be in favor of not requiring a user to be stated.
```python
{{ compiled_code }}

# --- Autogenerated dbt code below this line. Do not modify. --- #
dbt = dbtObj(spark.table)
```
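For readers skimming the template, a purely illustrative sketch of what a dbtObj-style helper could provide; this is a hypothetical stand-in, not the actual generated dbt code:

```python
# Hypothetical stand-in for the generated helper: ref()/source() hand back
# DataFrames by loading already-materialized relations through the function
# passed in (spark.table in the notebook template). Not dbt's implementation.
class dbtObj:
    def __init__(self, load_df_function):
        self._load_df = load_df_function

    def ref(self, relation_name):
        # Real generated code resolves the relation name at compile time.
        return self._load_df(relation_name)

    def source(self, source_name, table_name):
        return self._load_df(f"{source_name}.{table_name}")
```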
I think we can make the `spark` session more explicit, and thus not expect the notebook to magically insert this global variable, by adding the following above `{{ compiled_code }}`:

```python
from pyspark.sql import SparkSession

session = SparkSession.builder.getOrCreate()
```
```
N.B. Python models _can_ write to temp views HOWEVER they use a different session
and have already expired by the time they need to be used (i.e. in merges for incremental models)

TODO: Deep dive into spark sessions to see if we can reuse a single session for an entire
```
Isn't this a result of using jobs? I think each job always has a different Spark session.
This is the comment from @iknox-fa.
The main issue here is that the python part of the model build gets its session from the jobs API, but the rest of the logic for the model has another session. So we have to delete the python tmp table ourselves after the merge logic (using existing SQL), whereas if it were all SQL we could just make a true temp table and it would be gone after the current dbt model finishes.
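To illustrate why this matters, a small local PySpark sketch showing that an ordinary temp view is only visible to the session that created it, which is the situation dbt ends up in when the python job runs in one session and the follow-up SQL runs in another:

```python
from pyspark.sql import SparkSession

# Temp views are session-scoped: a view registered in one session is not
# visible from a second session sharing the same SparkContext.
spark = SparkSession.builder.master("local[1]").appName("tmp-view-demo").getOrCreate()
other = spark.newSession()

spark.range(3).createOrReplaceTempView("python_tmp")

def has_view(session, name):
    return any(t.name == name for t in session.catalog.listTables())

print(has_view(spark, "python_tmp"))  # True
print(has_view(other, "python_tmp"))  # False: different session, view not visible
```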
### Description
Ports the changes for python model v1 from [`dbt-spark`](dbt-labs/dbt-spark#377) but uses the APIs below instead.
- [Create an execution context](https://docs.databricks.com/dev-tools/api/1.2/index.html#create-an-execution-context)
- [Run a command](https://docs.databricks.com/dev-tools/api/1.2/index.html#run-a-command)
- [Get information about a command](https://docs.databricks.com/dev-tools/api/1.2/index.html#get-information-about-a-command), looping until the command ends
- [Delete an execution context](https://docs.databricks.com/dev-tools/api/1.2/index.html#delete-an-execution-context)
This change currently includes table materialization and incremental materialization.
Not all changes should live in this repo; certain parts will be moved to dbt-databricks.
Also super happy to hear any feedback and anything you think we missed.
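For anyone following along, a hedged sketch of the Command Execution flow described above, using the endpoints from the linked 1.2 API docs; the host, token, cluster id, and command string are placeholders:

```python
import time
import requests

# Hypothetical sketch of the 1.2 Command Execution flow: create a context,
# run a command, poll its status until it ends, then destroy the context.
host = "https://<workspace>.cloud.databricks.com"
token = "<personal-access-token>"
cluster_id = "<cluster-id>"
auth = {"Authorization": f"Bearer {token}"}

# 1. Create an execution context
context_id = requests.post(
    f"{host}/api/1.2/contexts/create",
    headers=auth,
    json={"clusterId": cluster_id, "language": "python"},
).json()["id"]

# 2. Run a command (the compiled python model code would go in "command")
command_id = requests.post(
    f"{host}/api/1.2/commands/execute",
    headers=auth,
    json={
        "clusterId": cluster_id,
        "contextId": context_id,
        "language": "python",
        "command": "print('hello from a python model')",
    },
).json()["id"]

# 3. Get information about the command, looping until it ends
while True:
    status = requests.get(
        f"{host}/api/1.2/commands/status",
        headers=auth,
        params={"clusterId": cluster_id, "contextId": context_id, "commandId": command_id},
    ).json()
    if status.get("status") in ("Finished", "Cancelled", "Error"):
        break
    time.sleep(5)

# 4. Delete the execution context
requests.post(
    f"{host}/api/1.2/contexts/destroy",
    headers=auth,
    json={"clusterId": cluster_id, "contextId": context_id},
)
```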