
Specify encodings for materialised tables on Redshift #1016

Closed
jdub opened this issue Sep 20, 2018 · 6 comments

Comments

@jdub

jdub commented Sep 20, 2018

Feature description

My materialised tables are created with pretty good defaults -- no encoding for sort keys, lzo for everything else -- but I'd like to switch to zstd, and perhaps provide my own per-column settings.

Basic approach: Specify a single encoding for unsorted columns.

Advanced approach: Specify an encoding for individual columns, perhaps falling back to a default.
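For a concrete picture, the end state I'm after looks something like this (table, column names, and encodings are made up, purely illustrative):

```sql
-- Illustrative only: the kind of DDL I'd like dbt to produce for a materialised table
create table analytics.page_views (
    viewed_at  timestamp      encode raw,   -- sort key column left unencoded
    user_id    bigint         encode zstd,
    page_path  varchar(1024)  encode zstd,
    referrer   varchar(1024)  encode zstd
)
sortkey (viewed_at);
```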

Who will this benefit?

Redshift users who want to customise their data encoding. I'd hope that'd be a lot of dbt users!

@drewbanin
Contributor

hey @jdub! Thanks for creating this issue. I want to document my thinking around 1) why this doesn't exist already and 2) how we could make it happen in the future.

  1. Why this doesn't exist already

dbt uses create table as (...) statements to build tables in Redshift. These statements don't support the specification of column encodings, so we'd need to create an empty table with the proper encodings, then insert into it inside of dbt.

Creating this empty table from DDL has always been problematic in dbt, as we don't really have a good place to put this DDL code. Moreover, I'd argue that dbt exists to help mitigate the need for actually writing DDL! These days, I consider "needing to care about column encoding" a limitation of Redshift, rather than a feature.
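To make that concrete, here's a rough sketch of the difference (table names are illustrative, not dbt's actual generated SQL):

```sql
-- Today: dbt builds tables with a CTAS, so Redshift picks the column encodings
create table analytics.page_views as (
    select * from analytics.stg_page_views
);

-- Per-column encodings would instead require explicit DDL, then an insert
create table analytics.page_views (
    viewed_at  timestamp  encode raw,
    user_id    bigint     encode zstd
)
sortkey (viewed_at);

insert into analytics.page_views
select viewed_at, user_id from analytics.stg_page_views;
```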

  2. How we could build this

The new v2 schema.yml files support column specifications. I think there might be an opportunity to augment these schema specs with things like column types and encodings. dbt could then use the schema spec for a model, coupled with model configs (like sort/dist keys) to generate the DDL for a table. It could run this DDL, then use the model SQL to insert into the table.

I like that this approach doesn't force users to write DDL directly, but I fear it might be a little bit constraining. There are heaps of other table DDL options, like foreign keys / unique keys / etc, that I don't have a ton of interest in supporting in dbt. It's just not the right tool for that particular job.

So, it could be that the better option is to just allow users to write their own DDL.


Regarding some specific things you mentioned in the issue:

Basic approach: Specify a single encoding for unsorted columns.

You have to set encodings for all columns at once, when the table is created, so I don't think there's a way to make this happen.

Advanced approach: Specify an encoding for individual columns, perhaps falling back to a default.

We could probably do this with the schema.yml approach.

Definitely keen to hear your thoughts here!

@jdub
Author

jdub commented Sep 21, 2018

Aha! The schema.yml approach would suit this very nicely, deferring to #582 for more advanced requirements.

@drewbanin
Contributor

I'm going to close this: until Redshift provides a mechanism for configuring column encodings in a create table as statement, we don't have any real options here. An enumeration of all the columns, types, and encodings for a table is antithetical to how dbt fundamentally operates.

@drewbanin
Contributor

closing (again) in favor of #1438

@dataxpress

In my experience, the workflow in Redshift to add compression to an existing table is to rename that table, make a new table with the desired encodings, insert the data from the previous table into the new table, then drop the previous table.
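In SQL terms, roughly this (a sketch with made-up names and encodings):

```sql
-- Sketch of the rename / rebuild / copy / drop workflow
alter table analytics.events rename to events_old;

create table analytics.events (
    event_at  timestamp       encode raw,
    user_id   bigint          encode zstd,
    payload   varchar(65535)  encode zstd
)
sortkey (event_at);

insert into analytics.events select * from analytics.events_old;

drop table analytics.events_old;
```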

In my dbt setup my biggest models are also incremental models.

In theory, I can sneak into Redshift, run the above operation on my already-existing incremental tables, and then as long as the table isn't dropped or a --full-refresh is run, the table keeps my compression. (Right?)

Is it possible to integrate the above steps into the dbt job as some sort of post-hook that only runs after --full-refresh on an incremental model? I believe this would cover most of the use cases for compression, as the models most needing compression are generally the models that also most need incremental updates.

Apologies if this is already possible, as I am far from a dbt expert!

@drewbanin
Contributor

hey @dataxpress - check out this macro. It implements something like what you're describing, using analyze compression to find optimal encodings. To be very honest, I haven't run that code in a long time, but it definitely worked in the past :)

I'd either invoke this macro from a post-hook (checking the flags.FULL_REFRESH flag) or use an "operation" to invoke this macro for particular models on an as-needed basis. Hope this helps!
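A minimal sketch of the post-hook wiring, assuming the linked macro is installed and exposes something like compress_table(schema, table) -- the name and signature here are stand-ins, not necessarily the macro's real interface:

```sql
-- models/my_big_model.sql (sketch; compress_table is a stand-in for the linked macro)
{{ config(
    materialized = 'incremental',
    post_hook = "{% if flags.FULL_REFRESH %}{{ compress_table(this.schema, this.table) }}{% endif %}"
) }}

select event_id, event_at, user_id
from {{ ref('stg_events') }}
{% if is_incremental() %}
where event_at > (select max(event_at) from {{ this }})
{% endif %}
```

If the hook rendering to an empty string on non-full-refresh runs is a concern, the flag check can live inside the macro instead. The operation route works from the command line, e.g. dbt run-operation compress_table --args '{schema: analytics, table: my_big_model}' (again assuming that macro name and those arguments).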
