Feature/python model v1 #377
File 1 of 3: adapter implementation (Python)
@@ -1,4 +1,7 @@
 import re
+import requests
+import time
+import base64
 from concurrent.futures import Future
 from dataclasses import dataclass
 from typing import Any, Dict, Iterable, List, Optional, Union
@@ -11,7 +14,8 @@
 import dbt.exceptions

 from dbt.adapters.base import AdapterConfig
-from dbt.adapters.base.impl import catch_as_completed
+from dbt.adapters.base.impl import catch_as_completed, log_code_execution
+from dbt.adapters.base.meta import available
 from dbt.adapters.sql import SQLAdapter
 from dbt.adapters.spark import SparkConnectionManager
 from dbt.adapters.spark import SparkRelation
@@ -159,11 +163,9 @@ def list_relations_without_caching(

         return relations

-    def get_relation(
-        self, database: Optional[str], schema: str, identifier: str
-    ) -> Optional[BaseRelation]:
+    def get_relation(self, database: str, schema: str, identifier: str) -> Optional[BaseRelation]:
         if not self.Relation.include_policy.database:
-            database = None
+            database = None  # type: ignore

         return super().get_relation(database, schema, identifier)

@@ -296,7 +298,12 @@ def get_catalog(self, manifest):
                 for schema in schemas:
                     futures.append(
                         tpe.submit_connected(
-                            self, schema, self._get_one_catalog, info, [schema], manifest
+                            self,
+                            schema,
+                            self._get_one_catalog,
+                            info,
+                            [schema],
+                            manifest,
                         )
                     )
             catalogs, exceptions = catch_as_completed(futures)
@@ -380,6 +387,114 @@ def run_sql_for_tests(self, sql, fetch, conn):
         finally:
             conn.transaction_open = False

+    @available.parse_none
+    @log_code_execution
+    def submit_python_job(self, parsed_model: dict, compiled_code: str, timeout=None):
+        # TODO improve the typing here. N.B. Jinja returns a `jinja2.runtime.Undefined` instead
+        # of `None` which evaluates to True!
+
+        # TODO limit this function to run only when doing the materialization of python nodes
+
+        # assuming that for a python job running over 1 day the user would manually overwrite this
+        schema = getattr(parsed_model, "schema", self.config.credentials.schema)
+        identifier = parsed_model["alias"]
+        if not timeout:
+            timeout = 60 * 60 * 24
+        if timeout <= 0:
+            raise ValueError("Timeout must be larger than 0")
+
+        auth_header = {"Authorization": f"Bearer {self.connections.profile.credentials.token}"}
+
+        # create new dir
+        if not self.connections.profile.credentials.user:
+            raise ValueError("Need to supply user in profile to submit python job")
+        # it is safe to call mkdirs even if the dir already exists and has content inside
+        work_dir = f"/Users/{self.connections.profile.credentials.user}/{schema}"
+        response = requests.post(
+            f"https://{self.connections.profile.credentials.host}/api/2.0/workspace/mkdirs",
+            headers=auth_header,
+            json={
+                "path": work_dir,
+            },
+        )
+        if response.status_code != 200:
+            raise dbt.exceptions.RuntimeException(
+                f"Error creating work_dir for python notebooks\n {response.content!r}"
+            )
+
+        # add notebook
+        b64_encoded_content = base64.b64encode(compiled_code.encode()).decode()
+        response = requests.post(
+            f"https://{self.connections.profile.credentials.host}/api/2.0/workspace/import",
+            headers=auth_header,
+            json={
+                "path": f"{work_dir}/{identifier}",
+                "content": b64_encoded_content,
+                "language": "PYTHON",
+                "overwrite": True,
+                "format": "SOURCE",
+            },
+        )
+        if response.status_code != 200:
+            raise dbt.exceptions.RuntimeException(
+                f"Error creating python notebook.\n {response.content!r}"
+            )
+
+        # submit job
+        submit_response = requests.post(
+            f"https://{self.connections.profile.credentials.host}/api/2.1/jobs/runs/submit",
+            headers=auth_header,
+            json={
+                "run_name": "debug task",
+                "existing_cluster_id": self.connections.profile.credentials.cluster,
+                "notebook_task": {

    [Inline review thread on the notebook_task block]
    Reviewer (@JCZuurmond): Why not use the spark_python_task?
    Author: Hey @JCZuurmond, thanks for pointing me to another method here! This method is being used because it will leave a notebook after the run that you can play with and iterate on your python model code there. But I do agree that case is more suitable during the development phase. I looked up … Happy to hear more thoughts on this and pivot to the other approach for …
    Reviewer: I understand this is useful during development, though it is unexpected behavior to me. This does not happen for the SQL models (we could also upload the SQL in a notebook and run the notebook as a job). And it requires a user for the production system, which was not required before. I would use a certain convention, for example that we upload the scripts in … And maybe the create dirs is not needed; I don't think it is for the …
    Author: Created #424 for this, feel free to update that issue! Thank you so much for the feedback!!

"notebook_path": f"{work_dir}/{identifier}", | ||
}, | ||
}, | ||
) | ||
if submit_response.status_code != 200: | ||
raise dbt.exceptions.RuntimeException( | ||
f"Error creating python run.\n {response.content!r}" | ||
) | ||
|
||
# poll until job finish | ||
state = None | ||
start = time.time() | ||
run_id = submit_response.json()["run_id"] | ||
terminal_states = ["TERMINATED", "SKIPPED", "INTERNAL_ERROR"] | ||
while state not in terminal_states and time.time() - start < timeout: | ||
time.sleep(1) | ||
resp = requests.get( | ||
f"https://{self.connections.profile.credentials.host}" | ||
f"/api/2.1/jobs/runs/get?run_id={run_id}", | ||
headers=auth_header, | ||
) | ||
json_resp = resp.json() | ||
state = json_resp["state"]["life_cycle_state"] | ||
# logger.debug(f"Polling.... in state: {state}") | ||
if state != "TERMINATED": | ||
raise dbt.exceptions.RuntimeException( | ||
"python model run ended in state" | ||
f"{state} with state_message\n{json_resp['state']['state_message']}" | ||
) | ||
|
||
# get end state to return to user | ||
run_output = requests.get( | ||
f"https://{self.connections.profile.credentials.host}" | ||
f"/api/2.1/jobs/runs/get-output?run_id={run_id}", | ||
headers=auth_header, | ||
) | ||
json_run_output = run_output.json() | ||
result_state = json_run_output["metadata"]["state"]["result_state"] | ||
if result_state != "SUCCESS": | ||
raise dbt.exceptions.RuntimeException( | ||
"Python model failed with traceback as:\n" | ||
"(Note that the line number here does not " | ||
"match the line number in your code due to dbt templating)\n" | ||
f"{json_run_output['error_trace']}" | ||
) | ||
return self.connections.get_response(None) | ||
|
||
def standardize_grants_dict(self, grants_table: agate.Table) -> dict: | ||
grants_dict: Dict[str, List[str]] = {} | ||
for row in grants_table: | ||
|
File 2 of 3: adapter macros (Jinja / SQL)
@@ -117,35 +117,46 @@
 {%- endmacro %}


-{% macro create_temporary_view(relation, sql) -%}
-  {{ return(adapter.dispatch('create_temporary_view', 'dbt')(relation, sql)) }}
+{% macro create_temporary_view(relation, compiled_code) -%}
+  {{ return(adapter.dispatch('create_temporary_view', 'dbt')(relation, compiled_code)) }}
 {%- endmacro -%}

-{#-- We can't use temporary tables with `create ... as ()` syntax #}
-{% macro spark__create_temporary_view(relation, sql) -%}
-  create temporary view {{ relation.include(schema=false) }} as
-    {{ sql }}
-{% endmacro %}
+{#-- We can't use temporary tables with `create ... as ()` syntax --#}
+{% macro spark__create_temporary_view(relation, compiled_code) -%}
+    create temporary view {{ relation.include(schema=false) }} as
+      {{ compiled_code }}
+{%- endmacro -%}


-{% macro spark__create_table_as(temporary, relation, sql) -%}
-  {% if temporary -%}
-    {{ create_temporary_view(relation, sql) }}
-  {%- else -%}
-    {% if config.get('file_format', validator=validation.any[basestring]) == 'delta' %}
-      create or replace table {{ relation }}
-    {% else %}
-      create table {{ relation }}
-    {% endif %}
-    {{ file_format_clause() }}
-    {{ options_clause() }}
-    {{ partition_cols(label="partitioned by") }}
-    {{ clustered_cols(label="clustered by") }}
-    {{ location_clause() }}
-    {{ comment_clause() }}
-    as
-      {{ sql }}
-  {%- endif %}
+{%- macro spark__create_table_as(temporary, relation, compiled_code, language='sql') -%}

    [Inline review thread on the new macro signature]
    Reviewer: The changes in this macro seem to be in …
    Reply: Maybe …

+  {%- if language == 'sql' -%}
+    {%- if temporary -%}
+      {{ create_temporary_view(relation, compiled_code) }}
+    {%- else -%}
+      {% if config.get('file_format', validator=validation.any[basestring]) == 'delta' %}
+        create or replace table {{ relation }}
+      {% else %}
+        create table {{ relation }}
+      {% endif %}
+      {{ file_format_clause() }}
+      {{ options_clause() }}
+      {{ partition_cols(label="partitioned by") }}
+      {{ clustered_cols(label="clustered by") }}
+      {{ location_clause() }}
+      {{ comment_clause() }}
+      as
+      {{ compiled_code }}
+    {%- endif -%}
+  {%- elif language == 'python' -%}
+    {#--
+    N.B. Python models _can_ write to temp views HOWEVER they use a different session
+    and have already expired by the time they need to be used (I.E. in merges for incremental models)
+
+    TODO: Deep dive into spark sessions to see if we can reuse a single session for an entire
+    dbt invocation.
+    --#}
+    {{ py_write_table(compiled_code=compiled_code, target_relation=relation) }}
+  {%- endif -%}
 {%- endmacro -%}

    [Inline review thread on the Spark-session note in the python branch]
    Reviewer: Isn't this a result of using jobs? I think each job always has a different Spark session.
    Author: This is the comment from @iknox-fa. The main issue here is that the Python part of the model build gets its session from the job, but the rest of the logic for the model has another session, so we have to delete the Python tmp table after the merge logic (using existing SQL); if it is all SQL we can just make a true tmp table and it will be gone after the current dbt model finishes.
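A side note on the session discussion above: Spark temporary views are scoped to the session that created them, which is why the `python` branch materializes a real table via `py_write_table` instead of a temp view. A minimal local PySpark sketch of that behavior (illustration only, not part of the PR):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("temp-view-scope").getOrCreate()

# Register a temp view in the first session.
spark.range(3).createOrReplaceTempView("tmp_python_model")

# A second, independent session (analogous to the SQL session that later runs the
# incremental merge) does not see the view, because temp views are session-scoped.
other = spark.newSession()


def has_view(session: SparkSession, name: str) -> bool:
    return any(t.name == name for t in session.catalog.listTables())


print(has_view(spark, "tmp_python_model"))  # True
print(has_view(other, "tmp_python_model"))  # False
```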
File 3 of 3: table materialization (Jinja / SQL)
@@ -1,5 +1,5 @@
 {% materialization table, adapter = 'spark' %}
-
+  {%- set language = model['language'] -%}
   {%- set identifier = model['alias'] -%}
   {%- set grant_config = config.get('grants') -%}

@@ -19,9 +19,10 @@
   {%- endif %}

   -- build model
-  {% call statement('main') -%}
-    {{ create_table_as(False, target_relation, sql) }}
-  {%- endcall %}
+
+  {%- call statement('main', language=language) -%}
+    {{ create_table_as(False, target_relation, compiled_code, language) }}
+  {%- endcall -%}

   {% set should_revoke = should_revoke(old_relation, full_refresh_mode=True) %}
   {% do apply_grants(target_relation, grant_config, should_revoke) %}
@@ -33,3 +34,18 @@
   {{ return({'relations': [target_relation]})}}

 {% endmaterialization %}
+
+
+{% macro py_write_table(compiled_code, target_relation) %}
+{{ compiled_code }}
+# --- Autogenerated dbt materialization code. --- #
+dbt = dbtObj(spark.table)

    [Inline review thread on dbtObj]
    Reviewer: Where does …
    Author: This is just a python function later used as

        def ref(*args, dbt_load_df_function):
            refs = {{ ref_dict | tojson }}
            key = ".".join(args)
            return dbt_load_df_function(refs[key])

    This works with the current notebook submit solution. I haven't tested …
    Reviewer: Ah, got it, …
    Follow-up: I think we can make the …

+df = model(dbt, spark)
+df.write.mode("overwrite").format("delta").saveAsTable("{{ target_relation }}")
+{%- endmacro -%}
+
+{%macro py_script_comment()%}
+# how to execute python model in notebook
+# dbt = dbtObj(spark.table)
+# df = model(dbt, spark)
+{%endmacro%}
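To make the macros above concrete, here is an illustrative sketch of roughly what the submitted notebook source contains after `py_write_table` wraps a user's model: the user-defined `model(dbt, spark)` function plus the autogenerated tail. The `ref`/`dbtObj` definitions are simplified stand-ins based on the snippet in the review thread above, the relation names (`analytics.my_upstream_model`, `analytics.my_python_model`) are made up for the example, and `spark` itself is provided by the Databricks notebook runtime.

```python
# --- compiled user code and simplified dbt helpers (stand-ins, not dbt's exact codegen) ---
def ref(*args, dbt_load_df_function):
    refs = {"my_upstream_model": "analytics.my_upstream_model"}  # rendered from {{ ref_dict | tojson }}
    key = ".".join(args)
    return dbt_load_df_function(refs[key])


class dbtObj:
    def __init__(self, load_df_function):
        # load_df_function is spark.table, so dbt.ref(...) returns a Spark DataFrame
        self.ref = lambda *args: ref(*args, dbt_load_df_function=load_df_function)


def model(dbt, spark):
    # user model code: transform an upstream relation and return a DataFrame
    df = dbt.ref("my_upstream_model")
    return df.where("amount > 0")


# --- Autogenerated dbt materialization code (rendered by py_write_table) --- #
dbt = dbtObj(spark.table)  # `spark` is predefined in the Databricks notebook/job runtime
df = model(dbt, spark)
df.write.mode("overwrite").format("delta").saveAsTable("analytics.my_python_model")
```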
Review thread on requiring a user in the profile:
Reviewer: What user is expected to be used in the automated/scheduled dbt jobs for the production system? I think this implies a user should be created for that system.
Author: The initial thought is that the user would be the Databricks user who created the token used for the production environment. But following the discussion in the above thread, if we pivot to spark_python_task, then this could be a different setup in production (configs needed for S3 or DBFS).
Reviewer: Let's continue the discussion in the other thread. I would be in favor of not requiring a user to be stated.