-
Notifications
You must be signed in to change notification settings - Fork 113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for Distributed table engine #14
Comments
I am intrigued. Distributed table is a useful feature. Community and I would be grateful if you could release this feature. Tell me if you need help. |
Great, so I am making a pull request. You can always pm me via telegram (https://t.me/volodin_dd) for less formal discussion in russian). Thanks a lot. |
#15 link to pull request |
@silentsokolov Any updates here? |
Very interested too. |
I think more specifics are needed to implement this issue. Because the distributed tables are part of a cluster. And "cluster management" is not a part of the dbt. Of course, we can use distributed tables but creation and management ... I dont know. |
Maybe you are right. I was using my solution and then using solution with running models on each shard with distributed-like view in the end (table function cluster). And second one is good. May be there is no reason for adding distributed materialization in adapter. Ephemeral would be more interesting) |
Above means, maybe this issue should be closed) |
Running models on each shard is really a good solution but on my end, this is not possible. One of the solutions that I've thought of is creating the replicated tables manually and then letting DBT take care of the Distributed table. I think this would work but when I tried it, it's generating an error because Distributed Table doesn't have ORDER BY but ORDER BY is required in table creation |
This sounds like a suggestion, I'll fix it soon |
Maybe this is a solution. You can use table functions in dbt models. And keep all "dbt stuff". For example: SELECT * FROM cluster('cluster_name', {{ source('schema', 'table_name') }}). The main problem is that you cannot insert in such tables, just select data from cluster. |
Can't we just use the Distributed table for both selects and inserts? |
using a fork with support for distributed engine as well, repo here my solution to distributed tables was to create the on cluster distributed table with the model name, in the meanwhile create a "real" table with the name And each type of materialization (whether table or incremental) would require the SQL to run on only one node ( without Would love to contribute once I reformat the code. |
I started to merge my approach toward distributed table engine. And I want to start a discussion early on which is about the handling of My production env has huge tables (200GB+ each shard), my approach was to use the SELECT DISTINCT {{partition_by}} FROM {{intermediate_relation}};
OPTIMIZE TABLE {{model_name}}_local ON CLUSTER {{cluster_name}} PARTITION {{partition_name}} [FINAL] DEDUPLICATE BY {{unique_keys}}, {{partition_column}}; But to be honest, despite the occasional lag of async distributed insert (which could be resolved by Any suggestions, any other solutions ? or is there any performance bump of |
Optimize is by nature an expensive operation, since it most cases it rewrites all of the data in the table. (You can also OPTIMIZE a ReplacingMergeTree table, which is essentially the same operation as your DEDUPLICATE query). Unfortunately there's no good answer for unique keys in ClickHouse -- this is a natural result of its sparse index/column oriented architecture. If you can't deduplicate before insert the available options are all less than ideal. The difficulties associated with deduplication/unique keys are one of main tradeoffs for performance. |
I agree. I tried to use My use cases for I would go ahead the implement my logic and give it a try. |
I agree it's worth a try. The current approach is "okay" for smaller tables, but I'm not at all surprised that incremental materializations get unusably slow when you get to "ClickHouse" scale data. If you could limit the incremental materialization to only the partitions actually affected it might help, but that assumes a level of intelligence in DBT that is going to be a lot of work and a lot of fine tuning for particular tables. |
@genzgd in today's dbt+ClickHouse webinar you answered that you don't need Distributed tables when using Replicated database, but I don't understand how is that possible? According to documentation https://clickhouse.com/docs/en/engines/database-engines/replicated/ we still need them. And I just checked in my cluster, I can't query data from all shards without creating Distributed tables. Am I missing something? In dbt community this is the first question I see when somebody joins. "How can I work with cluster in dbt?" And standart answer - "You need to rewrite materializations to correctly support Distributed tables". It would be great to resolve this issue |
Hi @simpl1g -- I believe you are correct. I've been thinking too much in ClickHouse Cloud terms which doesn't require sharding, but even with a Replicated Database it looks like you need a distributed table for multiple shards. We'll take another look at this issue and see where it fits on our roadmap. |
Hi there, have there been any updates on this issue? In the Clickhouse integrations' documentation, I have seen the following: Is this the recommended way to go? Does dbt create the distributed table by itself? How would it know the name of the cluster that needs to be passed as a parameter to the Distributed() engine definition? We currently have an on-premises Clickhouse with four shards and no replicas, and we would like to port our current ELT pipelines to dbt, but this is still raising a lot of doubts... Any hint or help would be much appreciated. |
I also support @simple argument. In our company we have 6 nodes, and we counted that if we run dbt 6 times in every node, we will get 6*6 distributed select queries to get full data and 6 on cluster inserts instead of one. It is too expensive for our infrastructure. |
Hi @genzgd, you commented that you'd take a look and see where this issue fits on the roadmap. Do you have any update on this? We'd love to be able to use distributed tables. |
@vrfn @simpl1g My apologies for not following up on this. The reality is that our team has had several other priorities over the past several months and the question of distributed tables and dbt is not currently on our development roadmap. We also don't have a good test framework or infrastructure for tests of distributed tables. We would consider an updated version of the original PR here https://github.com/ClickHouse/dbt-clickhouse/pull/15/files that has been tested in a real world situation and with disclaimers that support is purely experimental. Ideally the feature would only be enabled with an explicit setting and be well separated from "core" dbt functionality so as to not break existing use case. |
I added PR with distributed table materialization with actual structure, the main idea is the same as in above PR. #163 |
Added also distributed incremental materialization PR #172 |
Experimental Distributed table materializations added in #172 |
Hi. We have urge to use distributed engine for our ELT process. I have some work done (very unstable, but somewhat tested) in my fork of your repo Still have many work to do, but dbt run creates distributed table on cluster and inserts values from select there. In very small test everything works as expected.
If you are already working on this feature or have plans (it seems good to have this feature in my opinion), maybe I can help.
The text was updated successfully, but these errors were encountered: