Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Databricks SQL syntax error in marketo__lead_history #31

Closed
2 of 4 tasks
cdey-a16z opened this issue Feb 23, 2024 · 9 comments
Closed
2 of 4 tasks

[Bug] Databricks SQL syntax error in marketo__lead_history #31

cdey-a16z opened this issue Feb 23, 2024 · 9 comments
Assignees
Labels
error:unforced priority:p3 Affects many users; can wait status:in_progress Currently being worked on type:bug Something is broken or incorrect update_type:models Primary focus requires model updates

Comments

@cdey-a16z
Copy link

Is there an existing issue for this?

  • I have searched the existing issues

Describe the issue

I am working on implementing the Marketo package in a dbt instance using Databricks. The marketo__lead_history model is failing, and so far I have been able to pinpoint two issues.

  1. fivetran_utils.dummy_coalesce_value macro is not working properly when applied to 'lead_status' which is a string field.

    • This macro is called in the backfill cte , and is used to coalesce a dummy value, which in this case would be "DUMMY STRING", but is instead returning nothing, causing the coalesce to fail.
  2. When correcting for the failing macro, the SQL still fails because it appears that the IGNORE NULLS statement in the window function is not recognized in Databricks.

Relevant error log or model output

dummy_coalesce_value macro (abbreviated for relevance):

{% macro dummy_coalesce_value(column) %}

{% set coalesce_value = {
 'STRING': "'DUMMY_STRING'",
 'BOOLEAN': 'null',
 'INT': 999999999,
 'FLOAT': 999999999.99,
 'TIMESTAMP': 'cast("2099-12-31" as timestamp)',
 'DATE': 'cast("2099-12-31" as date)',
} %}

{% if column.is_float() %}
{{ return(coalesce_value['FLOAT']) }}

{% elif column.is_numeric() %}
{{ return(coalesce_value['INT']) }}

{% elif column.is_string() %} -- this is the problematic statement
{{ return(coalesce_value['STRING']) }}

{% endif %}

{% endmacro %}

model output:

, backfill as (


   select
       date_day,
       lead_id   
       -- For each lead on each day, find the state of each column from the next record where a change occurred,
       -- identified by the presence of a record from the SCD table on that day
       
       , nullif(
           first_value(case when new_values_present then coalesce(lead_status, ) end ignore nulls) over (
               partition by lead_id
               order by date_day asc
               rows between current row and unbounded following),  ) -- nothing in the coalesce, causing failure
       as lead_status

syntax error:

16:12:08 Databricks adapter: <class 'databricks.sql.exc.ServerOperationError'>: 
[PARSE_SYNTAX_ERROR] Syntax error at or near 'ignore'.(line 60, pos 100)

, nullif(
            first_value(case when new_values_present then coalesce(lead_status, 'DUMMY_STRING') end ignore nulls) over (
----------------------------------------------------------------------------------------------------^^^
                partition by lead_id 
                order by date_day asc 
                rows between current row and unbounded following),  
            'DUMMY_STRING')
        as lead_status

Expected behavior

  1. dummy_coalesce_value macro recognizes the 'lead_status' string, and inserts "DUMMY STRING" into the coalesce.

Expected Output:

nullif(
first_value(case when new_values_present then coalesce(lead_status, 'DUMMY_STRING') end ignore nulls) over (
partition by lead_id
order by date_day asc
rows between current row and unbounded following),
'DUMMY_STRING')
as lead_status

  • When I edit the macro from this:
    {% elif column.is_string() %} -- this is the problematic statement
    {{ return(coalesce_value['STRING']) }}

to this:

{% elif column.data_type|lower == 'string' %}
{{ return(coalesce_value['STRING']) }}

The dummy string is inserted properly into the coalesce. This indicates there is an issue with the column.is_string()

note: This fix only works when I edit the dummy_coalesce_value macro found in the fivetran_utils package, even though the macro also exists in the marketo package. Also, this isn't a permanent fix, because the file exists within a package, so any changes will be wiped out whenever dbt deps is run.

  1. Once the above 'fix' is implemented, we run into a syntax error:
  • looks like ignore nulls is not allowed within a window function in Databricks, as far as I can tell. The following SQL may be a solution:

, COALESCE(
first_value(CASE WHEN new_values_present THEN lead_status ELSE NULL END) OVER (
partition by lead_id
order by date_day asc
rows between current row and unbounded following
),
'default_value' -- or however you wish to handle cases that would have relied on ignoring nulls
) AS lead_status

  • I think that it's possible that there could be a config that I may be missing here, but as you can see below, I have followed the dispatch config instructions provided with the package.

dbt Project configurations

vars:
fivetran_log:
fivetran_log_using_transformations: false
fivetran_log_using_triggers: false

fivetran_platform_database: raw # default is your target.database
fivetran_platform_schema: fivetran_log # default is fivetran_log

fivetran_platform_using_destination_membership: false # this will disable only the destination membership logic
fivetran_platform_using_user: false # this will disable only the user logic

marketo__enable_campaigns: true
marketo__enable_programs: true

marketo_database: raw
marketo_schema: marketo

dispatch:

  • macro_namespace: fivetran_utils
    search_order: ['spark_utils', 'fivetran_utils']
  • macro_namespace: dbt_utils
    search_order: ['spark_utils', 'fivetran_utils', 'dbt_utils']

Package versions

packages:

  • package: fivetran/salesforce_formula_utils
    version: 0.9.3

  • package: dbt-labs/dbt_utils
    version: 1.1.1

  • package: calogica/dbt_expectations
    version: 0.10.3

  • package: dbt-labs/codegen
    version: 0.12.1

  • package: data-mie/dbt_profiler
    version: 0.8.1

  • package: dbt-labs/audit_helper
    version: 0.10.0

  • package: fivetran/marketo
    version: 0.10.0

  • package: dbt-labs/spark_utils
    version: 0.3.0

  • package: fivetran/fivetran_log
    version: 1.5.0

What database are you using dbt with?

databricks

dbt Version

dbt cloud version 1.7

Additional Context

No response

Are you willing to open a PR to help address this issue?

  • Yes.
  • Yes, but I will need assistance and will schedule time during our office hours for guidance
  • No.
@cdey-a16z cdey-a16z added the bug Something isn't working label Feb 23, 2024
@fivetran-catfritz
Copy link
Contributor

Hi @cdey-a16z thanks for opening this issue! What Databricks runtime are you using? Our packages are tested on Databricks all-purpose clusters, but we are aware that there are some incompatibilities with the SQL Warehouses. If you're on a SQL Warehouse, we'll need to do a little more investigation into testing any updates.

@cdey-a16z
Copy link
Author

Hi @fivetran-catfritz, thanks for getting back to me. Looks like our runtime is 13.3 LTS (see the screenshot below).

We experimented with pointing dbt to an all-purpose cluster yesterday, and the issue persisted.

Let me know if there is more helpful information I can provide.

image

@fivetran-catfritz
Copy link
Contributor

fivetran-catfritz commented Feb 29, 2024

Hi @cdey-a16z thank you for the info and trying that out on the all-purpose cluster. I was able to recreate the issue on our all-purpose cluster as well. While we do not have plans to address general Databricks SQL Warehouse compatibility, any issues that persist on a Databricks all-purpose cluster will definitely be addressed. We'll prioritize addressing the coalesce/ignore nulls issue in our upcoming sprint cycle, which would be approximately mid/end-March.

Thank you again for calling this to our attention!

@fivetran-catfritz fivetran-catfritz added type:bug Something is broken or incorrect priority:p3 Affects many users; can wait status:accepted Scoped and accepted into queue update_type:models Primary focus requires model updates error:unforced and removed bug Something isn't working labels Feb 29, 2024
@cdey-a16z
Copy link
Author

@fivetran-catfritz sounds great I appreciate your help with this. Will I be updated here when this is fixed?

@fivetran-catfritz
Copy link
Contributor

Yes we'll post here when the update is released!

@fivetran-catfritz fivetran-catfritz added status:in_progress Currently being worked on and removed status:accepted Scoped and accepted into queue labels Mar 15, 2024
@fivetran-catfritz fivetran-catfritz self-assigned this Mar 20, 2024
@fivetran-catfritz
Copy link
Contributor

Hi @cdey-a16z to update you, I have found a work-around for the issue you mention. I added this cte that removes the need for an ignore nulls statement and also makes the dummy string issue you found to be moot. If you would like to try out my updates, you can install my test branch using the below code in place of the marketo lines in your packages.yml file.

- git: https://github.com/fivetran/dbt_marketo.git
  revision: bug/databricks-syntax
  warn-unpinned: false

If you try it out, I'd love to hear how it works for you! If not, no worries, we'll proceed with release in the coming week or so and update you when it's released.

@cdey-a16z
Copy link
Author

Hi @fivetran-catfritz, thanks for getting back to me. I just tested your test version, and it worked! The only thing I noticed was that I had to update field from marketo__leads called unsubscribed to is_unsubscribed. I'm not sure if that's related at all, but just wanted to call that out in case it is.

Looking forward to implementing the completed package!

@fivetran-catfritz
Copy link
Contributor

@cdey-a16z Thank you for trying that out and confirming! Glad that it worked. We'll proceed with release in the coming days and post back here once it's live.

Regarding the is_subscribed column, I confirm we updated the source package to rename boolean columns to the format is_*. If you're interested, there are more details in the source changelog we plan to release alongside this update.

@fivetran-catfritz
Copy link
Contributor

@cdey-a16z To update you, the new version has been released! You can install it with the below range:

packages:
  - package: fivetran/marketo
    version: [">=0.11.0", "<0.12.0"]

I am closing out this issue, but feel free to ping us here if you have any additional questions or feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
error:unforced priority:p3 Affects many users; can wait status:in_progress Currently being worked on type:bug Something is broken or incorrect update_type:models Primary focus requires model updates
Projects
None yet
Development

No branches or pull requests

2 participants