[Feature Request] Liquid Clustering #1874

tdas · 2023-06-28T18:39:16Z

Overview

We propose to introduce Liquid Clustering, a new effort to revamp how clustering works in Delta, which addresses the shortcomings of Hive-style partitioning and current ZORDER clustering.

Motivation

Partitioning/clustering is a common technique to manage datasets to reduce unnecessary data processing. Hive-style partitioning and ZORDER (multi-dimensional) clustering are existing solutions in Delta, but both have limitations.

Hive-style partitioning

Hive-style partitioning clusters data such that every file contains exactly one distinct combination of partition column values. Although Hive-style partitioning works well if tuned correctly, there are limitations:

Since partition values are physical boundaries for files, Hive-style partitioning by high cardinality columns will create many small files that cannot be combined, resulting in poor scan performance.
In Spark/Delta, once a table is partitioned, its partition strategy cannot be changed, thus being unable to adapt to new use cases such as query pattern changes, etc.

ZORDER clustering

ZORDER is a muti-dimensional clustering technique used in Delta. The OPTIMIZE ZORDER BY command applies ZORDER clustering and improves the performance of queries that utilize ZORDER BY columns in their predicates. However, it has the following limitations:

Any new data ingested after the OPTIMIZE ZORDER BY run is not automatically clustered, and the user needs to rerun the command to cluster the new data.
OPTIMIZE ZORDER BY reclusters already well-clustered data, resulting in high write amplification.
ZORDER BY columns are not persisted and the user is required to remember the previous ZORDER BY columns, often causing user errors.

Detailed Design

Please refer to the document here.

The text was updated successfully, but these errors were encountered:

akshaysinghas · 2023-06-29T08:11:34Z

Hi guys, do we have any white paper or link that I can refer to understand MDC, or liquid clustering? I know requirement doc is on the way but any other resource that can throw more light on this feature?

felipepessoto · 2023-07-10T22:50:13Z

This feature in on Delta Lake 3.0.0 Preview release notes, is it expected to have an 3.0.0 RC2 with the feature?

Usually, the Delta final version is released few weeks after RC1. It seems this won't make to the 3.0.0 final?

kangkaisen · 2023-07-12T03:44:36Z

@tdas Hi，does Liquid Clustering use the DBSCAN algorithm? Thanks.

nachiketrajput-upl · 2023-07-18T09:18:47Z

@akshaysinghUdaan: Kindly use the link, https://www.databricks.com/blog/announcing-delta-lake-30-new-universal-format-and-liquid-clustering

loquisgon · 2023-08-16T17:55:25Z

This needs a LOT more detail: "Liquid clustering uses a continuous fractal space filling curve as a multi-dimensional clustering technique, which significantly improves data skipping"

loquisgon · 2023-08-16T17:59:28Z

"For multidimensional databases, Hilbert order has been proposed to be used instead of Z order because it has better locality-preserving behavior. For example, Hilbert curves have been used to compress and accelerate R-tree indexes[9] (see Hilbert R-tree). They have also been used to help compress data warehouses.[10][11]"

https://en.wikipedia.org/wiki/Hilbert_curve#:~:text=The%20Hilbert%20curve%20(also%20known,by%20Giuseppe%20Peano%20in%201890.

imback82 · 2023-08-17T03:21:22Z

This needs a LOT more detail: "Liquid clustering uses a continuous fractal space filling curve as a multi-dimensional clustering technique, which significantly improves data skipping"

@loquisgon we are working on the design doc and we will have more details on clustering algorithm. Stay tuned!

KamilKandzia · 2023-09-19T17:37:25Z

@imback82 Any progress on the doc or implementing solution in delta project? Databricks 13.2 has already implemented the liquid clustering, but we still have lack of information about this solution

dabao521 · 2023-10-24T23:09:30Z

@vkorukanti can you assign this issue to me?

zedtang · 2023-11-03T21:55:47Z

Hello everyone, we've added the design doc to this issue. Please feel free to review it. Thank you!

felipepessoto · 2023-11-03T22:18:49Z

@zedtang if you don't mind to share the doc using this option which allows to print and comment (if you select Commenter). Even if you leave Viewer permission only, at least that option allow us to print it.

https://support.google.com/drive/answer/2494822?hl=en&co=GENIE.Platform%3DDesktop#zippy=%2Cshare-a-file-publicly

Previous design docs were shared that way

zedtang · 2023-11-09T22:12:41Z

Hi @felipepessoto , I updated the link to give everyone commenter access!

zmeir · 2023-11-10T07:05:01Z

Hi, will it be possible to use liquid clustering on partitioned tables?

We propose to introduce a new feature **Clustered Table** to the Delta spec. The Clustered Table feature facilitates the physical clustering of rows that share similar values on a predefined set of clustering columns. This enhances query performance when selective filters are applied to these clustering columns through data skipping. More details can be found in the github issue [here](#1874). Closes #2264 GitOrigin-RevId: db124f01b8a8bfa06367700fbabb588c5b03497b

#1874 requests Liquid clustering, and this PR starts the first step to introduce ClusteringTableFeature and CLUSTERED_BY tags. When creating a clustered table, The feature clustering must exist in the table protocol's writerFeatures. When a clustering implementation clusters files, writers must incorporate a tag with CLUSTERED_BY as the key and the name of the clustering implementation as the corresponding value in add action. More detail can be found in the Delta protocol change PR #2264 The next step is to pave the way to integrate the table feature and clusterby tags when defining and clustering a clustered table. Closes #2281 GitOrigin-RevId: e210b491a324a0794ec9f3a9236bb1932a6677e3

ghost · 2023-11-29T00:55:41Z

In the design doc I don't see any info if clustering can be applied to existing Delta tables. As we use liquid from databricks already we observed that either you have to apply liquid to an empty table or you have to use the like command which basically copies and creates a new Delta table

I guess it will be really frustrating if we can't apply liquid to existing Delta tables :)

Thanks already everyone

auckenth · 2023-12-14T15:59:07Z

According to https://delta.io/pdfs/DLTDG_ER3.pdf (page 131) k-d trees are used to implement liquid clustering instead of space filling curves. Are there any pros and cons for an implementation using k-d trees?

imback82 · 2023-12-14T19:08:19Z

In the design doc I don't see any info if clustering can be applied to existing Delta tables. As we use liquid from databricks already we observed that either you have to apply liquid to an empty table or you have to use the like command which basically copies and creates a new Delta table

I guess it will be really frustrating if we can't apply liquid to existing Delta tables :)

Good point. I think we should be able to migrate existing tables in place by ALTER TABLE ... CLUSTER BY .... This should be straightforward for un-partitioned tables, but it will be tricky to implement it for partitioned tables.

dennyglee · 2023-12-14T21:49:44Z

According to https://delta.io/pdfs/DLTDG_ER3.pdf (page 131) k-d trees are used to implement liquid clustering instead of space filling curves. Are there any pros and cons for an implementation using k-d trees?

Oh sorry about that @auckenth - this is an error within the book. We will be updating the book chapter with the latest info per the design doc.

This PR is part of delta-io#1874. This PR implements a new data clustering algorithm based on Hilbert curve. No code uses this new implementation yet. Will implement incremental clustering using ZCube in follow-up PRs. Design can be found at: https://docs.google.com/document/d/1FWR3odjOw4v4-hjFy_hVaNdxHVs4WuK1asfB6M6XEMw/edit#heading=h.uubbjjd24plb. Closes delta-io#2314 GitOrigin-RevId: abafaa717ba8f7d8809114858c0fd2a25861fcb8

acruise · 2024-03-27T17:04:14Z

It'd be nice if someone could address comments and questions in the design doc. :)

adriangb · 2024-04-20T21:16:23Z

Sorry if this is not the right place to ask, but is there anyway with liquid cluster to enforce some level of clustering? In particular I want to cluster by something like project,day and implement retention such that I delete data older than X days which is different for each project. With hive style I can partition on those columns and delete entire files without rewrites.

zedtang · 2024-05-23T01:29:29Z

It'd be nice if someone could address comments and questions in the design doc. :)

Sorry for the delay, I replied in the design doc.

sezruby · 2024-06-27T20:24:50Z

@zedtang Are you planning to support "cluster on write" ?

zedtang · 2024-08-09T13:42:36Z

@zedtang Are you planning to support "cluster on write" ?

Hi @sezruby , May I ask what your use case is for this?

sezruby · 2024-08-13T00:41:11Z

@zedtang With the feature, we don't need additional OPTIMIZE run after ingesting large data.
We may integrate it with optimize write feature for non-partitioned data.

zedtang · 2024-08-27T00:15:01Z

It's probably tricky to implement clustering on write. (1) clustering works well only when a sufficient amount of data has accumulated and can be clustered together, so doing on every write (commonly small writes) may not produce well-clustered data at the end, and one has to run offline optimize jobs anyway (2) clustering on write will require some kind of sorting, which will come at a cost to write latency. so it's not an easy decision.

It would be very cool if someone proposes a design and shows that it will work well for some workloads without sacrificing write latency/throughput significantly

orthoxerox · 2024-10-25T07:21:23Z

@zedtang Are you planning to support "cluster on write" ?

Hi @sezruby , May I ask what your use case is for this?

Appending large amounts of data works well, because liquid clustering is incremental. The only drawback is that you need 2x storage for appended data if you want to preserve time travel.

But if you try to merge large amounts of data (but still small compared to the rest of the table, think 200GB merged into 10TB), you have basically three options now:

not use deletion vectors, MERGE touches the majority of the files, then OPTIMIZE rewrites the majority of the files, you have just turned 10TB into 30TB
use deletion vectors, MERGE writes 200GB of new data and writes the updates as deletion vectors, then OPTIMIZE applies the deletion vectors and rewrites the majority of the files, you have just turned 10TB into 20TB
use deletion vectors, MERGE writes 200GB of new data and writes the updates as deletion vectors, then you don't run OPTIMIZE immediately, but run it once every N merges to save space

I can think of two features either of which can help with this:

have MERGE cluster the data on write if deletion vectors are enabled, so you don't have to run OPTIMIZE to cluster the data
create an additional form of OPTIMIZE that doesn't touch the deletion vectors and can be run after each merge

orthoxerox · 2024-10-29T11:16:03Z

I've tested OPTIMIZE + liquid clustering + deletion vectors and Delta Lake wrote much less data than I expected:

I merged 100GB of data that had about 25% of updates into 1TB table. Right now my S3 bucket looks like this:

~1TB of old data
~50GB of parquets written during MERGE, which sounds like too little data, I would expect ~100GB
~50MB of deletion vectors written during MERGE
~125GB of parquets written during OPTIMIZE

If OPTIMIZE had to rewrite every file touched by a deletion vector, given that my updates are totally random, I would expect it to write 100GB + 1000GB of data, not 100GB + 25GB.

But on a smaller-sized table (100GB of old data + 10GB increment) I ended up with 220GB, just as I predicted. I am quite confused now.

tdas added the enhancement New feature or request label Jun 28, 2023

vkorukanti added this to the 3.0.0 milestone Jul 19, 2023

tdas removed this from the 3.0.0 milestone Sep 26, 2023

vkorukanti added this to the 3.1.0 milestone Oct 24, 2023

vkorukanti assigned dabao521 Oct 25, 2023

dabao521 mentioned this issue Nov 2, 2023

Introduce Clustered Table to delta spec #2264

Closed

5 tasks

dabao521 mentioned this issue Nov 9, 2023

[Clustered Table] Introduce ClusteringTableFeature and CLUSTERED_BY file tag #2281

Closed

5 tasks

tdas added this to Linux Foundation Delta Lake Roadmap Nov 14, 2023

tdas moved this to In Progress in Linux Foundation Delta Lake Roadmap Nov 15, 2023

weiluo-db mentioned this issue Nov 20, 2023

[Spark] Implement Hilbert clustering #2314

Closed

5 tasks

zedtang mentioned this issue Nov 22, 2023

[Spark] Delta SQL parser change to support CLUSTER BY #2328

Closed

5 tasks

wjones127 mentioned this issue Jan 7, 2024

Add liquid clustering delta-io/delta-rs#2043

Open

vkorukanti modified the milestones: 3.1.0, 3.2.0 Jan 30, 2024

zedtang mentioned this issue Feb 1, 2024

[Feature Request] Delta SQL parser support for CLUSTER BY syntax #2593

Closed

8 tasks

osopardo1 mentioned this issue Feb 22, 2024

Upgrade to Delta 3.1.0 Qbeast-io/qbeast-spark#275

Closed

7 tasks

keen85 mentioned this issue Mar 30, 2024

Blog post proposal: repartition Delta Table (with generated partition column) delta-io/website#387

Closed

scottsand-db moved this from In Progress to Done in Linux Foundation Delta Lake Roadmap Apr 29, 2024

scottsand-db moved this from Done to In Progress in Linux Foundation Delta Lake Roadmap Apr 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature Request] Liquid Clustering #1874

[Feature Request] Liquid Clustering #1874

tdas commented Jun 28, 2023 •

edited by vkorukanti

Loading

akshaysinghas commented Jun 29, 2023

felipepessoto commented Jul 10, 2023

kangkaisen commented Jul 12, 2023

nachiketrajput-upl commented Jul 18, 2023

loquisgon commented Aug 16, 2023

loquisgon commented Aug 16, 2023 •

edited

Loading

imback82 commented Aug 17, 2023

KamilKandzia commented Sep 19, 2023

dabao521 commented Oct 24, 2023

zedtang commented Nov 3, 2023

felipepessoto commented Nov 3, 2023 •

edited

Loading

zedtang commented Nov 9, 2023

zmeir commented Nov 10, 2023

ghost commented Nov 29, 2023

auckenth commented Dec 14, 2023

imback82 commented Dec 14, 2023

dennyglee commented Dec 14, 2023

acruise commented Mar 27, 2024

adriangb commented Apr 20, 2024

zedtang commented May 23, 2024

sezruby commented Jun 27, 2024

zedtang commented Aug 9, 2024

sezruby commented Aug 13, 2024

zedtang commented Aug 27, 2024

orthoxerox commented Oct 25, 2024

orthoxerox commented Oct 29, 2024 •

edited

Loading

[Feature Request] Liquid Clustering #1874

[Feature Request] Liquid Clustering #1874

Comments

tdas commented Jun 28, 2023 • edited by vkorukanti Loading

Overview

Motivation

Hive-style partitioning

ZORDER clustering

Detailed Design

akshaysinghas commented Jun 29, 2023

felipepessoto commented Jul 10, 2023

kangkaisen commented Jul 12, 2023

nachiketrajput-upl commented Jul 18, 2023

loquisgon commented Aug 16, 2023

loquisgon commented Aug 16, 2023 • edited Loading

imback82 commented Aug 17, 2023

KamilKandzia commented Sep 19, 2023

dabao521 commented Oct 24, 2023

zedtang commented Nov 3, 2023

felipepessoto commented Nov 3, 2023 • edited Loading

zedtang commented Nov 9, 2023

zmeir commented Nov 10, 2023

ghost commented Nov 29, 2023

auckenth commented Dec 14, 2023

imback82 commented Dec 14, 2023

dennyglee commented Dec 14, 2023

acruise commented Mar 27, 2024

adriangb commented Apr 20, 2024

zedtang commented May 23, 2024

sezruby commented Jun 27, 2024

zedtang commented Aug 9, 2024

sezruby commented Aug 13, 2024

zedtang commented Aug 27, 2024

orthoxerox commented Oct 25, 2024

orthoxerox commented Oct 29, 2024 • edited Loading

tdas commented Jun 28, 2023 •

edited by vkorukanti

Loading

loquisgon commented Aug 16, 2023 •

edited

Loading

felipepessoto commented Nov 3, 2023 •

edited

Loading

orthoxerox commented Oct 29, 2024 •

edited

Loading