Support for Distributed table engine #14

Volodin-DD · 2021-08-27T09:09:42Z

Hi. We have a pressing need to use the Distributed engine for our ELT process. I have some work done (very unstable, but somewhat tested) in my fork of your repo. There is still a lot of work to do, but dbt run already creates a distributed table on the cluster and inserts the values from the select into it. In a very small test everything works as expected.

If you are already working on this feature or have plans (it seems good to have this feature in my opinion), maybe I can help.

silentsokolov · 2021-08-27T09:35:36Z

I am intrigued. Distributed tables are a useful feature.

The community and I would be grateful if you could release this feature.

Tell me if you need help.

Volodin-DD · 2021-08-27T10:27:26Z

Great, so I am making a pull request. You can always PM me on Telegram (https://t.me/volodin_dd) for a less formal discussion in Russian. Thanks a lot.

Volodin-DD · 2021-08-31T05:30:48Z

Link to the pull request: #15

kjdeluna · 2022-01-27T16:42:24Z

@silentsokolov Any updates here?

Volodin-DD · 2022-02-01T14:00:33Z

Very interested too.

silentsokolov · 2022-02-01T17:56:01Z

I think more specifics are needed to implement this, because distributed tables are part of a cluster, and "cluster management" is not part of dbt. Of course, we can use distributed tables, but creation and management ... I don't know.

Volodin-DD · 2022-02-01T19:45:43Z

> I think more specifics are needed to implement this, because distributed tables are part of a cluster, and "cluster management" is not part of dbt. Of course, we can use distributed tables, but creation and management ... I don't know.

Maybe you are right. I first used my own solution and then switched to running the models on each shard with a distributed-like view on top (using the cluster table function), and the second approach works well. Maybe there is no reason to add a distributed materialization to the adapter. Ephemeral would be more interesting :)

Volodin-DD · 2022-02-01T19:46:20Z

In other words, maybe this issue should be closed :)

kjdeluna · 2022-02-02T02:37:15Z

Running models on each shard is really a good solution but on my end, this is not possible. One of the solutions that I've thought of is creating the replicated tables manually and then letting DBT take care of the Distributed table.

I think this would work, but when I tried it, it generated an error: a Distributed table doesn't have an ORDER BY, but ORDER BY is required when the table is created.

silentsokolov · 2022-02-02T15:44:25Z

> I think this would work, but when I tried it, it generated an error: a Distributed table doesn't have an ORDER BY, but ORDER BY is required when the table is created.

This sounds like a suggestion, I'll fix it soon

Volodin-DD · 2022-02-03T07:57:05Z

> Running models on each shard is really a good solution but on my end, this is not possible. One of the solutions that I've thought of is creating the replicated tables manually and then letting DBT take care of the Distributed table.
>
> I think this would work, but when I tried it, it generated an error: a Distributed table doesn't have an ORDER BY, but ORDER BY is required when the table is created.

Maybe this is a solution: you can use table functions in dbt models and still keep all the "dbt stuff". For example: SELECT * FROM cluster('cluster_name', {{ source('schema', 'table_name') }}). The main problem is that you cannot insert into such tables, you can only select data from the cluster.
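A minimal sketch of such a model, assuming a hypothetical cluster named my_cluster and a hypothetical source raw.events_local declared in the project's sources file. The view materialization follows the distributed-like view idea mentioned earlier in the thread; it is one possible choice, not something the adapter prescribes.

```sql
-- models/events_all_shards.sql (hypothetical file name)
-- Assumption: `my_cluster` exists in the server's remote_servers config and
-- the source table is present on every shard.
{{ config(materialized='view') }}

SELECT *
FROM cluster('my_cluster', {{ source('raw', 'events_local') }})
```

Downstream models can then ref() this view to read from all shards, while writes still have to target the per-shard tables directly.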

kjdeluna · 2022-02-04T01:49:12Z

> Maybe this is a solution: you can use table functions in dbt models and still keep all the "dbt stuff". For example: SELECT * FROM cluster('cluster_name', {{ source('schema', 'table_name') }}). The main problem is that you cannot insert into such tables, you can only select data from the cluster.

Can't we just use the Distributed table for both selects and inserts?

gfunc · 2022-03-28T10:22:09Z

I'm using a fork with support for the Distributed engine as well, repo here.

My solution was to create the distributed table ON CLUSTER under the model name and, alongside it, a "real" table named {{model_name}}_local, also with the ON CLUSTER clause.

Each type of materialization (whether table or incremental) then requires the model SQL to run on only one node (without the ON CLUSTER clause), followed by an INSERT INTO the distributed table.

Would love to contribute once I reformat the code.
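A minimal sketch of the layout described above, under stated assumptions: the cluster my_cluster, the database analytics, the model name events, the staging table stg_events, the columns, and the sharding key are all hypothetical and only illustrate the pattern, not the fork's actual code.

```sql
-- 1. The "real" table that stores the data on every shard.
CREATE TABLE analytics.events_local ON CLUSTER my_cluster
(
    event_date Date,
    user_id    UInt64,
    value      Float64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id);

-- 2. The Distributed table under the model name. It stores no data itself,
--    so it takes no ORDER BY; it only routes reads and writes to the shards.
CREATE TABLE analytics.events ON CLUSTER my_cluster
AS analytics.events_local
ENGINE = Distributed(my_cluster, analytics, events_local, cityHash64(user_id));

-- 3. The model SQL runs on a single node (no ON CLUSTER) and the rows are
--    inserted through the Distributed table, which spreads them by sharding key.
INSERT INTO analytics.events
SELECT event_date, user_id, value
FROM analytics.stg_events;
```

Keeping the model name on the Distributed table means downstream models can ref() it without knowing about the _local tables.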

gfunc · 2022-09-29T10:55:21Z

I have started merging my approach to the Distributed table engine, and I want to start an early discussion about the handling of unique_keys.

My production environment has huge tables (200GB+ per shard), so my approach was to use the OPTIMIZE syntax combined with the partition_by config (the ReplacingMergeTree engine's asynchronous optimize does not suit this context). The logic is as follows:

SELECT DISTINCT {{partition_by}} FROM {{intermediate_relation}};
OPTIMIZE TABLE {{model_name}}_local ON CLUSTER {{cluster_name}} PARTITION {{partition_name}} [FINAL] DEDUPLICATE BY {{unique_keys}}, {{partition_column}};

But to be honest, even setting aside the occasional lag of asynchronous distributed inserts (which can be resolved with the SYSTEM FLUSH DISTRIBUTED command) and the incompleteness of OPTIMIZE (behavior controlled by the replication_alter_partitions_sync setting), as of version 21.10 OPTIMIZE is not exactly lightning fast.

Any suggestions or other solutions? Or has the performance of OPTIMIZE improved in recent releases?
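For concreteness, the sequence above might look roughly like this for a single partition. This is a sketch only: the table analytics.events (Distributed) over analytics.events_local, the cluster my_cluster, the partition value, and the key columns are hypothetical, and whether a session-level SET reaches the OPTIMIZE depends on how the client manages sessions.

```sql
-- Assumed layout: analytics.events_local is partitioned by toYYYYMM(event_date)
-- and ordered by (event_date, user_id).

-- Send rows still queued by asynchronous distributed inserts to the shards.
-- This flushes the queue of the node the statement runs on.
SYSTEM FLUSH DISTRIBUTED analytics.events;

-- Relevant for Replicated*MergeTree tables: wait for all replicas
-- instead of returning early.
SET replication_alter_partitions_sync = 2;

-- Deduplicate only the affected partition.
OPTIMIZE TABLE analytics.events_local ON CLUSTER my_cluster
    PARTITION 202209
    FINAL
    DEDUPLICATE BY event_date, user_id;
```

As far as I know, ClickHouse expects the DEDUPLICATE BY list to cover the sorting and partitioning key columns, which is presumably why the template above appends the partition column to the unique keys.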

genzgd · 2022-09-29T12:19:38Z

Optimize is by nature an expensive operation, since in most cases it rewrites all of the data in the table. (You can also OPTIMIZE a ReplacingMergeTree table, which is essentially the same operation as your DEDUPLICATE query).

Unfortunately there's no good answer for unique keys in ClickHouse -- this is a natural result of its sparse index/column oriented architecture. If you can't deduplicate before insert the available options are all less than ideal. The difficulties associated with deduplication/unique keys are one of the main tradeoffs for performance.
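To illustrate the options mentioned here, a small sketch with a hypothetical table analytics.users_local; the column names and the choice of updated_at as the version column are assumptions made for the example.

```sql
-- ReplacingMergeTree keeps the row with the highest updated_at per sorting key,
-- but only once parts are merged, which happens at an unspecified time.
CREATE TABLE analytics.users_local
(
    user_id    UInt64,
    name       String,
    updated_at DateTime
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY user_id;

INSERT INTO analytics.users_local VALUES (1, 'old name', '2022-09-01 00:00:00');
INSERT INTO analytics.users_local VALUES (1, 'new name', '2022-09-02 00:00:00');

-- May still return both rows, because the two parts may not be merged yet.
SELECT * FROM analytics.users_local;

-- Option 1: deduplicate at query time (extra work on every read).
SELECT * FROM analytics.users_local FINAL;

-- Option 2: force the merge (rewrites the affected data, hence expensive).
OPTIMIZE TABLE analytics.users_local FINAL;
```

Either way the deduplication cost is paid somewhere: on every read with FINAL, or up front with OPTIMIZE.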

gfunc · 2022-09-30T03:07:03Z

> Optimize is by nature an expensive operation, since in most cases it rewrites all of the data in the table. (You can also OPTIMIZE a ReplacingMergeTree table, which is essentially the same operation as your DEDUPLICATE query).
>
> Unfortunately there's no good answer for unique keys in ClickHouse -- this is a natural result of its sparse index/column oriented architecture. If you can't deduplicate before insert the available options are all less than ideal. The difficulties associated with deduplication/unique keys are one of the main tradeoffs for performance.

I agree. I tried using select * from {{model_name}} final in my DBT ETLs for ReplacingMergeTree tables, but the final keyword carries a drastic performance penalty. I would also need to align how deduplication is handled once people start using window functions all over the place. So I decided that it is important to handle unique_keys within the DBT logic for the sake of downstream ETLs.

My use cases for unique_keys are mainly tables with incremental materialization. The master branch currently seems to handle this with a full insert into a new table followed by a swap, which is much more expensive than running OPTIMIZE on partitions for huge tables and basically the same for small tables.

I will go ahead and implement my logic and give it a try.

genzgd · 2022-09-30T03:46:18Z

I agree it's worth a try. The current approach is "okay" for smaller tables, but I'm not at all surprised that incremental materializations get unusably slow when you get to "ClickHouse" scale data. If you could limit the incremental materialization to only the partitions actually affected it might help, but that assumes a level of intelligence in DBT that is going to be a lot of work and a lot of fine tuning for particular tables.

simpl1g · 2023-01-24T15:43:14Z

@genzgd in today's dbt+ClickHouse webinar you answered that you don't need Distributed tables when using a Replicated database, but I don't understand how that is possible. According to the documentation https://clickhouse.com/docs/en/engines/database-engines/replicated/ we still need them. And I just checked in my cluster: I can't query data from all shards without creating Distributed tables. Am I missing something?

In the dbt community this is the first question I see when somebody joins: "How can I work with a cluster in dbt?" And the standard answer is: "You need to rewrite the materializations to correctly support Distributed tables". It would be great to resolve this issue.

genzgd · 2023-01-24T15:55:34Z

Hi @simpl1g -- I believe you are correct. I've been thinking too much in ClickHouse Cloud terms which doesn't require sharding, but even with a Replicated Database it looks like you need a distributed table for multiple shards. We'll take another look at this issue and see where it fits on our roadmap.

Keyeoh · 2023-03-29T07:38:46Z

Hi there,

have there been any updates on this issue? In the Clickhouse integrations' documentation, I have seen the following:

Is this the recommended way to go? Does dbt create the distributed table by itself? How would it know the name of the cluster that needs to be passed as a parameter to the Distributed() engine definition?

We currently have an on-premises Clickhouse with four shards and no replicas, and we would like to port our current ELT pipelines to dbt, but this is still raising a lot of doubts...

Any hint or help would be much appreciated.

gladkikhtutu · 2023-04-17T15:47:26Z

I also support @simpl1g's argument. Our company has 6 nodes, and we calculated that if we run dbt 6 times, once on every node, we get 6*6 distributed select queries to read the full data and 6 on cluster inserts instead of one. That is too expensive for our infrastructure.

vrfn · 2023-05-12T08:28:19Z

Hi @genzgd, you commented that you'd take a look and see where this issue fits on the roadmap. Do you have any update on this? We'd love to be able to use distributed tables.

genzgd · 2023-05-12T12:53:29Z

@vrfn @simpl1g My apologies for not following up on this. The reality is that our team has had several other priorities over the past several months and the question of distributed tables and dbt is not currently on our development roadmap. We also don't have a good test framework or infrastructure for tests of distributed tables.

We would consider an updated version of the original PR here https://github.com/ClickHouse/dbt-clickhouse/pull/15/files that has been tested in a real world situation and with disclaimers that support is purely experimental. Ideally the feature would only be enabled with an explicit setting and be well separated from "core" dbt functionality so as to not break existing use cases.

gladkikhtutu · 2023-06-19T16:43:08Z

I opened PR #163 with a distributed table materialization adapted to the current code structure; the main idea is the same as in the PR above.
Looking forward to your feedback.

gladkikhtutu · 2023-07-24T09:16:47Z

I also added a distributed incremental materialization in PR #172.

genzgd · 2023-11-30T01:12:12Z

Experimental Distributed table materializations added in #172

silentsokolov assigned silentsokolov and unassigned silentsokolov Aug 27, 2021

silentsokolov added the help wanted Extra attention is needed label Aug 27, 2021

silentsokolov added a commit that referenced this issue Feb 6, 2022

Skip order columns if engine Distributed #14

d1d09e7

mharrisb1 mentioned this issue Sep 8, 2022

Standard table materialization does not work with ReplicateMergeTree engine #95

Closed

genzgd closed this as completed Nov 30, 2023
