koedlt changed the title from "[Feature Request][ Add another type of space filling curve as option when using Liquid Clustering" to "[Feature Request][Spark] Add another type of space filling curve as option when using Liquid Clustering" on Nov 22, 2024
Feature request
Which Delta project/connector is this regarding?
Overview
Today, Liquid Clustering (LC) provides good spatial locality by using Hilbert curves to map a multi-dimensional space onto a one-dimensional list of files.
Sometimes, tables are filtered much more often on some dimensions than on others. LC with Hilbert curves can hurt the performance of those queries, because the Hilbert curve spreads the data evenly across all clustering dimensions. For tables where some dimensions are filtered on far more often than others, another space-filling curve could improve performance.
This feature request proposes adding a curve other than the Hilbert curve as a clustering option.
Motivation
As an example, let's imagine a table with sensor measurements. Its properties are the following:
Schema: timestamp (org.apache.spark.sql.types.TimestampType), signal_name (org.apache.spark.sql.types.StringType), value (org.apache.spark.sql.types.StringType)
Number of rows: 1 trillion (10^12)
Cardinality of the signal_name column: 100k unique values
Queries hitting the table
The users of this table are engineers who are interested in a few different signals at a time, but never more than 50. Sometimes they are only interested in a specific time range, but often they want to query the full history of a signal.
Hilbert curve problem
The issue with the Hilbert curve is visualized here:
When filtering on a specific signal_name (in red) and not on timestamp, the query needs to read data that ends up in very different parts of the Hilbert curve (shown with the green circles). This means that we have to read many different Parquet files, which also contain data from other signals that are of no interest to this query.
Solution
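To make the contrast concrete, here is a minimal sketch on a toy 16 x 16 grid (signal on one axis, time bucket on the other) that is cut into 16 "files" of 16 cells each. It counts how many files a single-signal query touches under a Hilbert ordering and under a hypothetical signal-first ordering. The grid size, the file size, and all function names are assumptions made for illustration; none of this is Delta Lake code.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along a Hilbert curve over an n x n grid.

    n must be a power of two. This is the textbook xy-to-d conversion.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the next level is oriented correctly.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def signal_first_index(n, x, y):
    """Hypothetical prioritised order: signal (x) first, then time (y)."""
    return x * n + y


def files_touched(index_fn, n, cells_per_file, signal):
    """Set of file ids that hold at least one cell of the given signal."""
    return {index_fn(n, signal, t) // cells_per_file for t in range(n)}


N = 16               # toy grid: 16 signals x 16 time buckets (assumption)
CELLS_PER_FILE = 16  # toy file size (assumption)

hilbert_files = files_touched(hilbert_index, N, CELLS_PER_FILE, signal=5)
priority_files = files_touched(signal_first_index, N, CELLS_PER_FILE, signal=5)

print(len(hilbert_files), len(priority_files))  # prints: 4 1
```

In this toy setup, each of the four files touched under the Hilbert ordering holds only 4 relevant cells out of 16, whereas the signal-first ordering puts the whole signal history into one file.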
Give the user the option to use another space-filling curve with which they can prioritize certain dimensions over others. In the previous example, signal_name was the priority dimension. To generalise: Dim 1 would take priority over Dim 2 in the following picture:
Further details
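Conceptually, clustering comes down to computing a one-dimensional key for every row and range-partitioning the rows by that key, so a new curve would only change how the key is computed. The plain-Python sketch below (no Spark involved) illustrates this with a hypothetical priority key in which Dim 1 is the most significant component. The helpers `range_partition`, `files_matching`, and `priority_key` are names invented for this example and are not part of Delta Lake.

```python
from itertools import product


def range_partition(rows, key, num_files):
    """Sort rows by a 1-D clustering key and cut them into equal-sized files."""
    ordered = sorted(rows, key=key)
    size = -(-len(ordered) // num_files)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def files_matching(files, predicate):
    """Count the files that contain at least one row satisfying the predicate."""
    return sum(1 for f in files if any(predicate(r) for r in f))


def priority_key(row):
    """Hypothetical prioritised key: dim1 decides the order, dim2 breaks ties."""
    dim1, dim2 = row
    return (dim1, dim2)


# Toy table: one row per (dim1, dim2) cell of an 8 x 8 grid (assumption).
rows = list(product(range(8), range(8)))
files = range_partition(rows, priority_key, num_files=8)

dim1_files = files_matching(files, lambda r: r[0] == 3)  # filter on Dim 1
dim2_files = files_matching(files, lambda r: r[1] == 3)  # filter on Dim 2

print(dim1_files, dim2_files)  # prints: 1 8
```

The sketch also shows the trade-off of such a curve: filters on the priority dimension read a single file, while filters on the other dimension alone read every file, so this only pays off when the query pattern is as skewed as in the motivating example.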
The functionality could stay largely the same as what exists today; the only thing that should change is how the DataFrame is repartitioned. That means that in MultiDimClustering.cluster, a new case should be added that refers to a new object (next to ZOrderClustering and HilbertClustering). This is where the new curve would be implemented.
Willingness to contribute
The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?