
Update star macro to use adapter.quote #205

Merged: 4 commits into dbt-labs:master, Apr 23, 2020

Conversation

@bjgbeelen

The default identifier using the double quotes is not working when using Spark as backend.
We should use backticks in this case.

Also, there was a typo in the file name `identifer.sql`, so I renamed it to `identifier.sql`.

Regarding testing, I have tested this in a local setup (using the docker-compose from the dbt-spark package). Adding Spark to the integration tests is probably a separate PR on its own, which I'm happy to investigate when I find the time.

@clrcrl requested a review from @jtcohen6 on April 7, 2020
@bjgbeelen
Author

I have started working on adding Spark integration here: https://github.com/bjgbeelen/dbt-utils/tree/add-spark-integration

I've got a local docker-compose setup against which I can run the tests. There's still some work to do :-)

dbt seeds: PASS=37 WARN=0 ERROR=9 SKIP=0 TOTAL=46
dbt run: PASS=20 WARN=0 ERROR=15 SKIP=0 TOTAL=35
dbt test: PASS=41 WARN=0 ERROR=16 SKIP=0 TOTAL=57

@drewbanin
Contributor

Hey @bjgbeelen - this looks really cool!

@jtcohen6 and I spoke about this PR the other day. We think the best answer here is not to include Spark-specific logic in this package. Instead, we want to make dbt better able to share macros across packages. In that world, we could have a dbt-utils-spark package which implements macros like spark__identifier, and dbt-utils would be able to find and delegate calls to the Spark-specific macro across packages.

We need to make some code changes in dbt to support this, and it's not something that's going to happen very soon, but I would like to prioritize it for a future release.

Until then, @jtcohen6, what do you think? Should we just add some spark-specific logic in here and remove it when dbt's capabilities improve? Or do you think we'd be better off keeping spark out of dbt-utils for the time being?

Also curious to hear what you think @bjgbeelen and @Fokko :)

@jtcohen6
Contributor

I agree with all of the above @drewbanin. I'm in favor of prioritizing the requisite changes to dbt-core to support package "descendants," e.g. dbt-utils → dbt-utils-spark (or spark-utils). I believe that to be the only way we'll be able to scale out support for dbt-utils-sqlserver, dbt-utils-athena, and all the other community-supported plugins yet to come.

Along those lines, I played around with an attempt to implement dbt_utils.date_spine for Spark a few months ago: dbt-labs/spark-utils#1

I ended up having to copy-paste a lot of code, and a much more elegant implementation would require the changes to adapter macros which you discuss here: dbt-labs/dbt-core#2301

Until then, @jtcohen6, what do you think? Should we just add some spark-specific logic in here and remove it when dbt's capabilities improve? Or do you think we'd be better off keeping spark out of dbt-utils for the time being?

Personally, I'm in favor of lowering the drawbridge to non-core adapters—especially Spark, which is our best-supported and most widely used. The code should all look the same, here and now vs. later and elsewhere. It'll just be a matter of copy-pasting all spark__ macros from this package into a different one.

Of course, anything Spark-specific or Spark-exclusive should go in one of dbt-spark or spark-utils.

@clrcrl: I recognize the inclusion of non-core adapters may add some maintenance burden to dbt-utils. I respect your input as the person who is ultimately maintaining the package and responding to incoming contributions.

@drewbanin
Contributor

Personally, I'm in favor of lowering the drawbridge to non-core adapters—especially Spark, which is our best-supported and most widely used. The code should all look the same, here and now vs. later and elsewhere. It'll just be a matter of copy-pasting all spark__ macros from this package into a different one.

Thanks @jtcohen6 - I'm supportive of that change, but also want to hear what @clrcrl thinks too :)

@bjgbeelen
Author

Thanks all for the discussion so far! First, some context on my PR:

At a client we are using dbt together with Spark. I needed a way to exclude some columns from my selection, and that's how I came to try the star macro in this package. However, I ran into the issue that the wrong identifier quoting is currently in place for Spark, so I ended up having the column names as my values. I solved it by manually selecting all the columns that I need (in several places, so no generic solution using star). So in that light, there is no immediate use for this PR right away (for me). That being said, looking toward a future of spending more time with dbt in combination with Spark (which is awesome, by the way!), having access to the dbt-utils package, working and tested with Spark, would be a great addition for creating elegant solutions.

I like the idea of a dbt-utils(-core) package plus adapter-specific packages with macro overrides. It would be great, though, if in the meantime the community (as in my example at the client) could already benefit from dbt-utils with other adapters when needed. Especially if it is eventually a case of just copying the spark__-prefixed macros to a new package, which seems quite trivial to me.

I'm not sure what is considered core, but I'd think you'd also want a dbt-utils-bigquery in that case, since there is also a BigQuery-specific override in identifier.sql. Similarly, I see overrides for Snowflake and Redshift too. As a newbie to dbt, having used it for just 3 months now, it feels odd that these adapters would be considered core and others not. Just my observation, which you can happily ignore ;-)

More specifically for identifier.sql: it seems similar logic also has to be defined in the adapters themselves. I haven't looked into the details of how they relate, but ideally dbt-utils could 'just' use the implementation of the adapter (column.quoted or something similar).

So, regarding (eventually) adding integration tests against Spark (either in this package or in the dbt-utils-spark package): I could maybe first propose a PR to change the testing structure, as I have done here:

That would open up possibilities for local testing support (against only Postgres as a start) and for reusing the same steps in both local testing and the CircleCI integration. Then I can build further upon that to run against a Spark container setup (in a separate branch still)?

@jtcohen6
Contributor

jtcohen6 commented Apr 15, 2020

I'm not sure what is considered core, but I'd think you'd also want a dbt-utils-bigquery in that case, since there is also a BigQuery-specific override in identifier.sql. Similarly, I see overrides for Snowflake and Redshift too. As a newbie to dbt, having used it for just 3 months now, it feels odd that these adapters would be considered core and others not. Just my observation, which you can happily ignore ;-)

@bjgbeelen You raise a good point! You're right—BigQuery is often just as idiosyncratic as Spark, and it requires its own implementations of macros in this package.

The question of "what is considered core" is in some sense a historical construct, and in another sense it reflects our degree of ongoing support. Postgres, Redshift, Snowflake, and BigQuery are all "core" adapters; we will always ship new versions of dbt-postgres, dbt-redshift, dbt-snowflake, and dbt-bigquery alongside new releases of dbt.

While we also maintain dbt-spark and dbt-presto, their release cadence lags behind the "core" adapters; they don't implement all dbt features; and there are caveats to some of their usage. There are a handful of community-supported plugins, too, for which these caveats also apply. That said, we're invested in making the dbt-spark plugin as good as it can be—I'm glad you've been successful in using dbt + Spark together!

More specifically for identifier.sql: it seems similar logic also has to be defined in the adapters themselves. I haven't looked into the details of how they relate, but ideally dbt-utils could 'just' use the implementation of the adapter (column.quoted or something similar).

Again, you're right: this macro, dbt_utils.identifier, exactly duplicates the logic defined in each plugin as an adapter method, adapter.quote (e.g. in dbt-bigquery and in dbt-spark). In this specific case, dbt_utils.star would probably be better off using adapter.quote instead of dbt_utils.identifier.

In the general case, though, we'll need a new version of adapter macros that enables "child" packages to override "parent" macros. E.g. dbt_utils.date_spine depends on dbt_utils.dateadd and dbt_utils.datediff; dbt-utils-spark should be able to override dbt_utils.dateadd and dbt_utils.datediff without needing to fully reimplement date_spine.

@bjgbeelen changed the title from Add identifier for Spark to Update star macro to use adapter.quote on Apr 16, 2020
@bjgbeelen
Author

It seems that the whole identifier macro was only being used by the star macro, and replacing that with adapter.quote indeed seems to give appropriate results. So I went in that direction.

No need for Spark specific overrides yet :-)

@drewbanin
Contributor

This LGTM! Going to ship it when the integration tests pass :)

@bjgbeelen
Author

Hmm, I'm not sure if/how my change relates to the Snowflake safe_cast test that failed?

@drewbanin
Contributor

Oops, I actually spoke too soon in my last comment. Can you revert the deletion of the identifier macros? Some folks out there might be using those. In a future PR, we should either add a deprecation warning indicating that identifier is deprecated and will be removed in a future release, or just swap out the implementation with adapter.quote. I don't feel super strongly either way, but it will be easier for us to ship this change if there are no breaking changes.

@bjgbeelen
Author

@drewbanin they passed now after you re-triggered them?

@drewbanin
Contributor

@bjgbeelen just wanted to make sure you saw my comment above #205 (comment)

@bjgbeelen
Author

Hi @drewbanin,

I indeed overlooked your comments, as I was distracted by the (snow)flaky test that failed ;-) But I agree about not making a breaking change. It hadn't yet come to mind that the identifier macro itself could be used by users of the library; I only checked that it wasn't used by any other macros.

I reverted the deletion (I did still correct the spelling error) and also added a deprecation warning. I think it is cleaner to get rid of this one eventually.

@drewbanin
Contributor

Thanks for sticking with it @bjgbeelen! This LGTM - going to merge it now :)

@drewbanin merged commit 7abce81 into dbt-labs:master on Apr 23, 2020