
Specify encodings for materialised tables on Redshift #1016

Closed
jdub opened this issue Sep 20, 2018 · 6 comments

Comments

@jdub

jdub commented Sep 20, 2018

Feature description

My materialised tables are created with pretty good defaults -- no encoding for sort keys, lzo for everything else -- but I'd like to switch to zstd, and perhaps provide my own per-column settings.

Basic approach: Specify a single encoding for unsorted columns.

Advanced approach: Specify an encoding for individual columns, perhaps falling back to a default.
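For a concrete picture, the end state I'm after looks something like this (table, column names, and encodings are made up, purely illustrative):

```sql
-- Illustrative only: the kind of DDL I'd like dbt to produce for a materialised table
create table analytics.page_views (
    viewed_at  timestamp      encode raw,   -- sort key column left unencoded
    user_id    bigint         encode zstd,
    page_path  varchar(1024)  encode zstd,
    referrer   varchar(1024)  encode zstd
)
sortkey (viewed_at);
```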

Who will this benefit?

Redshift users who want to customise their data encoding. I'd hope that'd be a lot of dbt users!

@drewbanin
Contributor

hey @jdub! Thanks for creating this issue. I want to document my thinking around 1) why this doesn't exist already and 2) how we could make it happen in the future.

  1. Why this doesn't exist already

dbt uses create table as (...) statements to build tables in Redshift. These statements don't support the specification of column encodings, so we'd need to create an empty table with the proper encodings, then insert into it inside of dbt.

Creating this empty table from DDL has always been problematic in dbt, as we don't really have a good place to put this DDL code. Moreover, I'd argue that dbt exists to help mitigate the need for actually writing DDL! These days, I consider "needing to care about column encoding" a limitation of Redshift, rather than a feature.
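To make that concrete, here's a rough sketch of the difference (table names are illustrative, not dbt's actual generated SQL):

```sql
-- Today: dbt builds tables with a CTAS, so Redshift picks the column encodings
create table analytics.page_views as (
    select * from analytics.stg_page_views
);

-- Per-column encodings would instead require explicit DDL, then an insert
create table analytics.page_views (
    viewed_at  timestamp  encode raw,
    user_id    bigint     encode zstd
)
sortkey (viewed_at);

insert into analytics.page_views
select viewed_at, user_id from analytics.stg_page_views;
```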

  2. How we could build this

The new v2 schema.yml files support column specifications. I think there might be an opportunity to augment these schema specs with things like column types and encodings. dbt could then use the schema spec for a model, coupled with model configs (like sort/dist keys) to generate the DDL for a table. It could run this DDL, then use the model SQL to insert into the table.

I like that this approach doesn't force users to write DDL directly, but I fear it might be a little bit constraining. There are heaps of other table DDL options, like foreign keys / unique keys / etc, that I don't have a ton of interest in supporting in dbt. It's just not the right tool for that particular job.

So, it could be that the better option is to just allow users to write their own DDL.


Regarding some specific things you mentioned in the issue:

Basic approach: Specify a single encoding for unsorted columns.

You have to set encodings for all columns at once, when the table is created, so I don't think there's a way to make this happen.

Advanced approach: Specify an encoding for individual columns, perhaps falling back to a default.

We could probably do this with the schema.yml approach.

Definitely keen to hear your thoughts here!

@jdub
Author

jdub commented Sep 21, 2018

Aha! The schema.yml approach would suit this very nicely, deferring to #582 for more advanced requirements.

@drewbanin
Contributor

I'm going to close this: until Redshift provides a mechanism for configuring column encodings in a create table as statement, we don't have any real options here. An enumeration of all the columns, types, and encodings for a table is antithetical to how dbt fundamentally operates.

@drewbanin
Contributor

closing (again) in favor of #1438

@dataxpress

In my experience, the workflow in Redshift to add compression to an existing table is to rename that table, make a new table with the desired encodings, insert the data from the previous table into the new table, then drop the previous table.
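In SQL terms, roughly this (a sketch with made-up names and encodings):

```sql
-- Sketch of the rename / rebuild / copy / drop workflow
alter table analytics.events rename to events_old;

create table analytics.events (
    event_at  timestamp       encode raw,
    user_id   bigint          encode zstd,
    payload   varchar(65535)  encode zstd
)
sortkey (event_at);

insert into analytics.events select * from analytics.events_old;

drop table analytics.events_old;
```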

In my dbt setup my biggest models are also incremental models.

In theory, I can sneak into Redshift, run the above operation on my already-existing incremental tables, and then as long as the table isn't dropped or a --full-refresh is run, the table keeps my compression. (Right?)

Is it possible to integrate the above steps into the dbt job as some sort of post-hook that only runs after --full-refresh on an incremental model? I believe this would cover most of the use cases for compression, as the models most needing compression are generally the models that also most need incremental updates.

Apologies if this is already possible, as I am far from a dbt expert!

@drewbanin
Contributor

hey @dataxpress - check out this macro. It implements something like what you're describing, using analyze compression to find optimal encodings. To be very honest, I haven't run that code in a long time, but it definitely worked in the past :)

I'd either invoke this macro from a post-hook (checking the flags.FULL_REFRESH flag) or use an "operation" to invoke this macro for particular models on an as-needed basis. Hope this helps!
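A minimal sketch of the post-hook wiring, assuming the linked macro is installed and exposes something like compress_table(schema, table) -- the name and signature here are stand-ins, not necessarily the macro's real interface:

```sql
-- models/my_big_model.sql (sketch; compress_table is a stand-in for the linked macro)
{{ config(
    materialized = 'incremental',
    post_hook = "{% if flags.FULL_REFRESH %}{{ compress_table(this.schema, this.table) }}{% endif %}"
) }}

select event_id, event_at, user_id
from {{ ref('stg_events') }}
{% if is_incremental() %}
where event_at > (select max(event_at) from {{ this }})
{% endif %}
```

If the hook rendering to an empty string on non-full-refresh runs is a concern, the flag check can live inside the macro instead. The operation route works from the command line, e.g. dbt run-operation compress_table --args '{schema: analytics, table: my_big_model}' (again assuming that macro name and those arguments).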
