Update star macro to use adapter.quote #205

bjgbeelen · 2020-04-07T20:19:24Z

The default identifier using the double quotes is not working when using Spark as backend.
We should use backticks in this case.

Also, there was a typo in the name identifer.sql so I renamed it to identifier.sql

Regarding testing, i have tested this in a local setup (using the docker-compose from the spark-dbt package). Adding Spark in the integration tests is probably a separate PR on its own, which I'm happy to investigate when I find the time

The default identifier using the double quotes is not working when using Spark as backend. We should use backticks in this case. Also, there was a typo in the name `identifer.sql` so I renamed it to `identifier.sql`

bjgbeelen · 2020-04-10T10:15:15Z

I have started working on adding spark integration here: https://github.com/bjgbeelen/dbt-utils/tree/add-spark-integration

I've got a local docker-compose setup against which I can run the tests. There's still some work to do :-)

dbt seeds: PASS=37 WARN=0 ERROR=9 SKIP=0 TOTAL=46
dbt run: PASS=20 WARN=0 ERROR=15 SKIP=0 TOTAL=35
dbt test: PASS=41 WARN=0 ERROR=16 SKIP=0 TOTAL=57

drewbanin · 2020-04-10T13:17:03Z

Hey @bjgbeelen - this looks really cool!

@jtcohen6 spoke about this PR the other day. We think that the best answer here would be to not include spark-specific logic in this package. Instead, we want to make dbt able to better share macros across packages. In that world, we could have a dbt-utils-spark package which implements macros like spark__identifier, and dbt-utils would be able to find and delegate calls to the spark-specific macro across packages.

We need to make some code changes in dbt to support this, and it's not something that's going to happen very soon, but I would like to prioritize it for a future release.

Until then, @jtcohen6, what do you think? Should we just add some spark-specific logic in here and remove it when dbt's capabilities improve? Or do you think we'd be better off keeping spark out of dbt-utils for the time being?

Also curious to hear what you think @bjgbeelen and @Fokko :)

jtcohen6 · 2020-04-13T16:21:27Z

I agree with all of the above @drewbanin. I'm in favor of prioritizing the requisite changes to dbt-core to support package "descendants," e.g. dbt-utils → dbt-utils-spark (or spark-utils). I believe that to be the only way we'll be able to scale out support for dbt-utils-sqlserver, dbt-utils-athena, and all the other community-supported plugins yet to come.

Along those lies, I played around with an attempt to implement dbt_utils.date_spine for Spark a few months ago: dbt-labs/spark-utils#1

I ended up having to copy-paste a lot of code, and a much more elegant implementation would require the changes to adapter macros which you discuss here: dbt-labs/dbt-core#2301

Until then, @jtcohen6, what do you think? Should we just add some spark-specific logic in here and remove it when dbt's capabilities improve? Or do you think we'd be better off keeping spark out of dbt-utils for the time being?

Personally, I'm in favor of lowering the drawbridge to non-core adapters—especially Spark, which is our best-supported and most widely used. The code should all look the same, here and now vs. later and elsewhere. It'll just be a matter of copy-pasting all spark__ macros from this package into a different one.

Of course, anything Spark-specific or Spark-exclusive should go in one of dbt-spark or spark-utils.

@clrcrl: I recognize the inclusion of non-core adapters may add some maintenance burden to dbt-utils. I respect your input as the person who is ultimately maintaining the package and responding to incoming contributions.

drewbanin · 2020-04-14T02:28:12Z

Personally, I'm in favor of lowering the drawbridge to non-core adapters—especially Spark, which is our best-supported and most widely used. The code should all look the same, here and now vs. later and elsewhere. It'll just be a matter of copy-pasting all spark__ macros from this package into a different one.

Thanks @jtcohen6 - I'm supportive of that change, but also want to hear what @clrcrl thinks too :)

bjgbeelen · 2020-04-14T08:58:14Z

Thanks all for the discussion so far! First, some context on my PR:

At a client we are using dbt together with Spark. I was in need for a solution to exclude some columns from my selection and that's how I got to try the star macro in this package. However, there I ran into the issue that the wrong identifier is currently in place for Spark, so I ended up having the column names as my values. I solved it by manual selecting all the columns that I need (at several places, so nu generic solution using star). So in that light, there is no immediate use for this PR right away (for me). That being said, looking towards the future spending more time with dbt in combination with spark (which is awesome btw!) having access to dbt-utils package - working and tested with Spark - would be a great addition in creating elegant solutions.

I like the idea about a dbt-utils(-core) and adapter specific packages with macro overrides. It would be great though if in the mean time the community, like my example at the client, can already benefit from dbt-utils with other adapters when needed. Especially if it is eventually a case of just copying the spark_ prefixed methods to a new package, which seems quite trivial to me.

I'm not sure what is considered core, but I'd feel you'd also want a dbt-utils-bigquery in this case then, since there also is a specific bigquery override in the identifier.sql. And similarly, i see method overrides for snowflake and redshift too. As a newby to dbt, having used it for just 3 months now, the distinction feels odd why these would be considered core and other adapters not. Just my observation which you can happily ignore ;-)

More specifically for the identifier.sql, it seems similar logic also has to be defined in the adapters themselves. I haven't looked into the details on how they relate, but ideally dbt-utils could 'just' use the implementation of the adapter (column.quoted or something similar) .

So regarding (eventually) adding integration tests against Spark (either in this package or the dbt-utils-spark package). I could maybe first propose a PR to change the structure regarding the testing as I have done here:

That would open up possibilities to also have local testing support (against only Postgres as a start) and reusing the same steps for both local testing as well as circleci integration. Then I can build further upon that to run against a Spark container setup (in a separate branch still)?

jtcohen6 · 2020-04-15T17:31:08Z

I'm not sure what is considered core, but I'd feel you'd also want a dbt-utils-bigquery in this case then, since there also is a specific bigquery override in the identifier.sql. And similarly, i see method overrides for snowflake and redshift too. As a newby to dbt, having used it for just 3 months now, the distinction feels odd why these would be considered core and other adapters not. Just my observation which you can happily ignore ;-)

@bjgbeelen You raise a good point! You're right—BigQuery is often just as idiosyncratic as Spark, and it requires its own implementations of macros in this package.

The question of "what is considered core" is in some sense a historical construct, and in another sense it reflects our degree of ongoing support. Postgres, Redshift, Snowflake, and BigQuery are all "core" adapters; we will always ship new versions of dbt-postgres, dbt-redshift, dbt-snowflake, and dbt-bigquery alongside new releases of dbt.

While we also maintain dbt-spark and dbt-presto, their release cadence lags behind the "core" adapters; they don't implement all dbt features; and there are caveats to some of their usage. There are a handful of community-supported plugins, too, for which these caveats also apply. That said, we're invested in making the dbt-spark plugin as good as it can be—I'm glad you've been successful in using dbt + Spark together!

More specifically for the identifier.sql, it seems similar logic also has to be defined in the adapters themselves. I haven't looked into the details on how they relate, but ideally dbt-utils could 'just' use the implementation of the adapter (column.quoted or something similar) .

Again, you're right — this macro, dbt_utils.identifier, is exactly duplicative of the logic defined in each plugin as an adapter method, adapter.quote. E.g. in dbt-bigquery, in dbt-spark. In this specific case, dbt_utils.star would probably be better off using adapter.quote instead of dbt_utils.identifier.

In the general case, though, we'll need a new version of adapter macros that enables "child" packages to override "parent" macros. E.g. dbt_utils.date_spine depends on dbt_utils.dateadd and dbt_utils.datediff; dbt-utils-spark should be able to override dbt_utils.dateadd and dbt_utils.datediff without needing to fully reimplement date_spine.

bjgbeelen · 2020-04-16T13:33:42Z

It seems that that whole identifier macro was only being used by the star macro and that replacing that with adapter.quote indeed seems to give appropriate results. So i went that direction.

No need for Spark specific overrides yet :-)

drewbanin

This LGTM! Going to ship it when the integration tests pass :)

bjgbeelen · 2020-04-21T17:36:21Z

Hmm I'm not sure if/how my change relates to the snowflake safe_cast test that fails?

drewbanin · 2020-04-21T18:25:02Z

Oops- I actually spoke to soon in my last comment. Can you revert the deletion of the identifier macros? Some folks out there might be using those. We should, in a future PR, add a deprecation warning indicating that identifier is deprecated and will be removed in a future release, OR we should just swap out the implementation with adapter.quote. I don't feel super strongly either way, but it will be easier for us to ship this change if there are no breaking changes

bjgbeelen · 2020-04-21T18:34:41Z

@drewbanin they passed now after you re-triggered them?

drewbanin · 2020-04-21T20:43:43Z

@bjgbeelen just wanted to make sure you saw my comment above #205 (comment)

bjgbeelen · 2020-04-22T11:19:48Z

Hi @drewbanin ,

i indeed overlooked your comments as I was distracted by the (snow)flaky test that failed ;-) But I agree about not making a breaking change. It didn't come to mind yet that the identifier macro itself could be used by users of the library. I only checked it wasn't used by any other macros.

I reverted the deletion (I did still correct the spelling error). I also added a deprecation warning. I think it is cleaner to get rid of this one eventually.

drewbanin · 2020-04-23T00:20:49Z

Thanks for sticking with it @bjgbeelen! This LGTM - going to merge it now :)

Add identifier for Spark

4e6e8ea

The default identifier using the double quotes is not working when using Spark as backend. We should use backticks in this case. Also, there was a typo in the name `identifer.sql` so I renamed it to `identifier.sql`

clrcrl requested a review from jtcohen6 April 7, 2020 20:28

Fokko approved these changes Apr 7, 2020

View reviewed changes

bjgbeelen mentioned this pull request Apr 16, 2020

Simplify running the integration tests #211

Merged

Use adapter.quote instead of adding spark__identifier

f04debb

bjgbeelen changed the title ~~Add identifier for Spark~~ Update star macro to use adapter.quote Apr 16, 2020

bjgbeelen changed the title ~~Update star macro to use adapter.quote~~ Update star macro to use adapter.quote Apr 16, 2020

Remove identifier.sql because it is not used elsewhere

ba87605

clrcrl assigned drewbanin Apr 20, 2020

drewbanin approved these changes Apr 21, 2020

View reviewed changes

Revert deletion of identifier macro. Add deprecation warning

c95ed6b

drewbanin merged commit 7abce81 into dbt-labs:master Apr 23, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update star macro to use adapter.quote #205

Update star macro to use adapter.quote #205

bjgbeelen commented Apr 7, 2020

bjgbeelen commented Apr 10, 2020

drewbanin commented Apr 10, 2020

jtcohen6 commented Apr 13, 2020

drewbanin commented Apr 14, 2020

bjgbeelen commented Apr 14, 2020

jtcohen6 commented Apr 15, 2020 •

edited

Loading

bjgbeelen commented Apr 16, 2020

drewbanin left a comment

bjgbeelen commented Apr 21, 2020

drewbanin commented Apr 21, 2020

bjgbeelen commented Apr 21, 2020

drewbanin commented Apr 21, 2020

bjgbeelen commented Apr 22, 2020

drewbanin commented Apr 23, 2020

Update star macro to use adapter.quote #205

Update star macro to use adapter.quote #205

Conversation

bjgbeelen commented Apr 7, 2020

bjgbeelen commented Apr 10, 2020

drewbanin commented Apr 10, 2020

jtcohen6 commented Apr 13, 2020

drewbanin commented Apr 14, 2020

bjgbeelen commented Apr 14, 2020

jtcohen6 commented Apr 15, 2020 • edited Loading

bjgbeelen commented Apr 16, 2020

drewbanin left a comment

Choose a reason for hiding this comment

bjgbeelen commented Apr 21, 2020

drewbanin commented Apr 21, 2020

bjgbeelen commented Apr 21, 2020

drewbanin commented Apr 21, 2020

bjgbeelen commented Apr 22, 2020

drewbanin commented Apr 23, 2020

jtcohen6 commented Apr 15, 2020 •

edited

Loading