
Feature/python model v1 #377

Merged: ChenyuLInx merged 40 commits into main from feature/python-model-v1 on Jul 28, 2022

Conversation

@ChenyuLInx (Contributor) commented Jun 29, 2022

This change currently includes table materialization and incremental materialization.
Not all changes should live in this repo; certain parts will be moved to dbt-databricks.

Also, super happy to hear any feedback and what you think we missed.

@ueshin (Contributor) commented Jul 8, 2022

@ChenyuLInx I ported the changes that I think should be in dbt-databricks here: databricks/dbt-databricks#126

@ChenyuLInx marked this pull request as ready for review July 26, 2022 04:34
@jtcohen6 (Contributor) commented

We're planning to merge all the code in this PR, and include it in a beta release of dbt-spark v1.3. More on our rationale: #407

@ChenyuLInx (Contributor, Author) commented Jul 27, 2022

The only test that fails here actually failed on main too. I would consider it not a blocker for this PR. Maybe we should create a separate ticket to resolve it? @nathaniel-may
EDIT: issue created. #410

@gshank (Contributor) commented Jul 27, 2022

The failing test is strange. I saw it when I did the incremental change (one line), but it was working when I merged and it hasn't shown up in the alerts. It's fine to look at separately, though; it's definitely unrelated.

@gshank (Contributor) left a comment

Looks good. After the core branch is merged you'll have to update dev-requirements.txt...
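
As a hedged illustration (not the actual file contents, which are elided above), a dev-requirements.txt pin for an unreleased dbt-core branch usually looks something like this, with the branch name as a placeholder:

```
git+https://github.com/dbt-labs/dbt-core.git@<core-branch>#egg=dbt-core&subdirectory=core
git+https://github.com/dbt-labs/dbt-core.git@<core-branch>#egg=dbt-tests-adapter&subdirectory=tests/adapter
```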

@ChenyuLInx (Contributor, Author) commented Jul 27, 2022

> The failing test is strange. I saw it when I did the incremental change (one line), but it was working when I merged and it hasn't shown up in the alerts. It's fine to look at separately, though; it's definitely unrelated.

@gshank I looked at the code I rebased and it doesn't look like it should cause any problems. But given that it fails on the main branch and we have an issue tracking it, do you think we should skip that test for now so other folks don't also get blocked by it?
EDIT: I skipped it with the latest commit.
EDIT: some tests that didn't fail before started to fail after a totally unrelated comment change.
EDIT: fixed by Jeremy enabling the settings in the cluster; moved the tests back.

@ChenyuLInx merged commit f58fc23 into main Jul 28, 2022
@ChenyuLInx deleted the feature/python-model-v1 branch July 28, 2022 20:52
@JCZuurmond (Collaborator) left a comment

Hi @ChenyuLInx, I am a little late to the show, but I still have some questions for you.

json={
    "run_name": "debug task",
    "existing_cluster_id": self.connections.profile.credentials.cluster,
    "notebook_task": {
@JCZuurmond (Collaborator) commented Aug 8, 2022

Why not use the spark_python_task? IMHO it is cleaner than notebooks; also, I expect you do not need the user to be specified when using the spark_python_task.

@ChenyuLInx (Contributor, Author) replied

Hey @JCZuurmond, thanks for pointing me to another method here!

This method is being used because it leaves a notebook after the run that you can play with, and you can iterate on your python model code there. But I do agree that this is more suitable for the development phase.

I looked up spark_python_task; it seems like if we want to do it that way, we will still need to upload the python file somewhere (S3 or DBFS) and pass in the path here. In that case we would need extra setup to put that python file there, vs. currently we require the extra user you also mentioned in the next comment to create a folder and put the notebooks in.

Happy to hear more thoughts on this and pivot to the other approach for production runs!
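
For readers following along, here is a rough editorial sketch of the notebook-based flow being described: create a workspace folder under the configured user, import the compiled model as a Python notebook, then submit a one-time run pointing at it. Host, token, user, and cluster id are placeholders, and this approximates the approach rather than reproducing the merged code.

```python
import base64
import requests

HOST = "https://<workspace>.cloud.databricks.com"  # placeholder
HEADERS = {"Authorization": "Bearer <token>"}       # placeholder
USER = "<databricks-user>"                          # the extra `user` profile field discussed here

folder = f"/Users/{USER}/dbt_python_model"
notebook_path = f"{folder}/my_model"

# 1. Create the folder in the user's workspace.
requests.post(f"{HOST}/api/2.0/workspace/mkdirs", headers=HEADERS,
              json={"path": folder}).raise_for_status()

# 2. Upload the compiled model code as a Python notebook (SOURCE format, base64-encoded).
compiled_code = "print('hello from a dbt python model')"
requests.post(f"{HOST}/api/2.0/workspace/import", headers=HEADERS,
              json={"path": notebook_path,
                    "format": "SOURCE",
                    "language": "PYTHON",
                    "overwrite": True,
                    "content": base64.b64encode(compiled_code.encode()).decode()}).raise_for_status()

# 3. Submit a one-time run on the existing cluster that executes the notebook.
run = requests.post(f"{HOST}/api/2.0/jobs/runs/submit", headers=HEADERS,
                    json={"run_name": "debug task",
                          "existing_cluster_id": "<cluster-id>",
                          "notebook_task": {"notebook_path": notebook_path}}).json()
print(run["run_id"])
```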

@JCZuurmond (Collaborator) commented Aug 9, 2022

> This method is being used because it leaves a notebook after the run that you can play with, and you can iterate on your python model code there. But I do agree that this is more suitable for the development phase.

I understand this is useful during development; still, it is unexpected behavior to me. This does not happen for the SQL models (we could also upload the SQL in a notebook and run the notebook as a job). And it requires a user for the production system, which was not required before.

> I looked up spark_python_task; it seems like if we want to do it that way, we will still need to upload the python file somewhere (S3 or DBFS) and pass in the path here. In that case we would need extra setup to put that python file there, vs. currently we require the extra user you also mentioned in the next comment to create a folder and put the notebooks in.

I would use a convention, for example uploading the scripts to dbfs:/dbt/<project name>/<database name>/<model name>.py. This would be similar to the database field in the profile being used as a prefix for your schema name. It eliminates the need for the user field and mimics existing dbt behavior, like the location in external tables.

And maybe creating the dirs is not needed; I don't think it is for the spark_python_task.
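
Purely as an editorial sketch of the alternative being suggested here, a runs/submit payload using spark_python_task with the proposed dbfs:/dbt/<project name>/<database name>/<model name>.py convention might look roughly like this (the script would have to be uploaded to DBFS first; all names and paths are placeholders, not the merged implementation):

```python
import requests

HOST = "https://<workspace>.cloud.databricks.com"  # placeholder
HEADERS = {"Authorization": "Bearer <token>"}       # placeholder

# Submit a one-time run that executes a Python file from DBFS, so no workspace
# folder and no user field are needed.
response = requests.post(
    f"{HOST}/api/2.0/jobs/runs/submit",
    headers=HEADERS,
    json={
        "run_name": "debug task",
        "existing_cluster_id": "<cluster-id>",
        "spark_python_task": {
            # proposed convention: dbfs:/dbt/<project name>/<database name>/<model name>.py
            "python_file": "dbfs:/dbt/my_project/my_database/my_model.py",
        },
    },
)
response.raise_for_status()
run_id = response.json()["run_id"]
```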

@ChenyuLInx (Contributor, Author) commented Aug 10, 2022

Created #424 for this, feel free to update that issue! Thank you so much for the feedback!!


# create new dir
if not self.connections.profile.credentials.user:
    raise ValueError("Need to supply user in profile to submit python job")
@JCZuurmond (Collaborator) commented

What user is expected to be used in the automated/scheduled dbt jobs for the production system? I think this implies a user should be created for that system.

@ChenyuLInx (Contributor, Author) replied

The initial thought is that the user would be the Databricks user who created the token used for the production environment. But following the discussion in the thread above, if we pivot to spark_python_task, this could be a different setup in production (configs needed for S3 or DBFS).

@JCZuurmond (Collaborator) replied

Let's continue the discussion in the other thread. I would be in favor of not requiring a user to be stated.

{{ compiled_code }}

# --- Autogenerated dbt code below this line. Do not modify. --- #
dbt = dbtObj(spark.table)
@JCZuurmond (Collaborator) commented Aug 8, 2022

I think we can make spark more explicit, and thus not expect the notebook to magically insert this global variable, by adding the following above {{ compiled_code }}:

from pyspark.sql import SparkSession
session = SparkSession.builder.getOrCreate()
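
As an editorial sketch of how the rendered wrapper might then read with this suggestion applied (the model body is a placeholder, and the dbt helper line is shown commented out because dbtObj is defined by the real template):

```python
from pyspark.sql import SparkSession

# Build (or reuse) the session explicitly instead of relying on the
# notebook-injected `spark` global.
session = SparkSession.builder.getOrCreate()
spark = session  # keep the name that the template and model code already use

# {{ compiled_code }} would render here; placeholder model code:
df = spark.range(10)

# --- Autogenerated dbt code below this line. Do not modify. --- #
# dbt = dbtObj(spark.table)  # provided by the real autogenerated section, omitted here
```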

N.B. Python models _can_ write to temp views HOWEVER they use a different session
and have already expired by the time they need to be used (I.E. in merges for incremental models)

TODO: Deep dive into spark sessions to see if we can reuse a single session for an entire
@JCZuurmond (Collaborator) commented

Isn't this a result of using jobs? I think each job always has a different Spark session.

@ChenyuLInx (Contributor, Author) replied

This is the comment from @iknox-fa.

The main issue here is that the python part of the model build gets its session from the job, but the rest of the logic for the model has another session. We have to delete the python tmp table after the merge logic (using existing SQL), whereas if it is all SQL we can just make a true temp table and it will be gone after the current dbt model finishes.
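
As an editorial aside illustrating the session scoping described above: a temp view registered in one SparkSession is not visible from another session on the same SparkContext, which is why the python "tmp" table has to be a real table that is dropped afterwards. A minimal sketch, assuming a local pyspark install and placeholder names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("tmp-view-scope").getOrCreate()

# Register a temp view in the current session.
spark.range(3).createOrReplaceTempView("python_model_tmp")
print([t.name for t in spark.catalog.listTables()])   # ['python_model_tmp']

# A second session shares the SparkContext but has its own temp-view registry,
# analogous to the job session vs. the session that runs the merge SQL.
other = spark.newSession()
print([t.name for t in other.catalog.listTables()])   # [] -- the view is not visible here
```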

ueshin added a commit to databricks/dbt-databricks that referenced this pull request Sep 13, 2022
### Description

Ports the changes for python model v1 from [`dbt-spark`](dbt-labs/dbt-spark#377) but uses the APIs below instead.

- [Create an execution context](https://docs.databricks.com/dev-tools/api/1.2/index.html#create-an-execution-context)
- [Run a command](https://docs.databricks.com/dev-tools/api/1.2/index.html#run-a-command)
- [Get information about a command](https://docs.databricks.com/dev-tools/api/1.2/index.html#get-information-about-a-command)
  - loop until the command ends
- [Delete an execution context](https://docs.databricks.com/dev-tools/api/1.2/index.html#delete-an-execution-context)
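
A hedged editorial sketch of that flow, with host, token, and cluster id as placeholders and simplified status handling (this is not the dbt-databricks implementation itself):

```python
import time
import requests

HOST = "https://<workspace>.cloud.databricks.com"  # placeholder
HEADERS = {"Authorization": "Bearer <token>"}       # placeholder
CLUSTER_ID = "<cluster-id>"                         # placeholder

# 1. Create an execution context on the cluster.
context_id = requests.post(f"{HOST}/api/1.2/contexts/create", headers=HEADERS,
                           json={"clusterId": CLUSTER_ID, "language": "python"}).json()["id"]

# 2. Run a command in that context.
command_id = requests.post(f"{HOST}/api/1.2/commands/execute", headers=HEADERS,
                           json={"clusterId": CLUSTER_ID, "contextId": context_id,
                                 "language": "python",
                                 "command": "print('hello from a dbt python model')"}).json()["id"]

# 3. Poll "get information about a command" until the command leaves its in-progress states.
while True:
    info = requests.get(f"{HOST}/api/1.2/commands/status", headers=HEADERS,
                        params={"clusterId": CLUSTER_ID, "contextId": context_id,
                                "commandId": command_id}).json()
    if info["status"] not in ("Queued", "Running", "Cancelling"):
        break
    time.sleep(5)

# 4. Delete the execution context now that the command has finished.
requests.post(f"{HOST}/api/1.2/contexts/destroy", headers=HEADERS,
              json={"clusterId": CLUSTER_ID, "contextId": context_id})
```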