From 30494708fbd582f2ec17c9522149dc0439be3e48 Mon Sep 17 00:00:00 2001 From: yuqi Date: Tue, 19 Dec 2023 16:12:00 +0800 Subject: [PATCH 01/21] Add docs about tables advanced feature like partitioning --- docs/advanced-table-feature.md | 65 +++++++++++++++++++++++++ docs/manage-metadata-using-gravitino.md | 2 + 2 files changed, 67 insertions(+) create mode 100644 docs/advanced-table-feature.md diff --git a/docs/advanced-table-feature.md b/docs/advanced-table-feature.md new file mode 100644 index 00000000000..e22ddfa98c8 --- /dev/null +++ b/docs/advanced-table-feature.md @@ -0,0 +1,65 @@ +#### Partitioned table + +Currently, Gravitino supports the following partitioning strategies: + +| Partitioning strategy | Json | Java | SQL syntax | Description | +|-----------------------|-----------------------------------------------------|--------------------------------|----------------------------|-----------------------------------------------------------------------------------------------------------------------------| +| Identity | `{"strategy":"identity","fieldName":["score"]}` | `Transforms.identity("score")` | `PARTITION BY score` | Partition by a field or reference | +| function | `{"strategy":"functionName","fieldName":["score"]}` | `Transforms.hour("score")` | `PARTITION BY hour(score)` | Partition by a function, currently, we support currently function, hour, year, day, bucket, month, truncate, list and range | + +The detail of function strategies is as follows: + +| Function strategy | Json | Java | SQL syntax | Description | +|-------------------|------------------------------------------------------------------|------------------------------------------------|------------------------------------|--------------------------------------------------------| +| Identity | `{"strategy":"identity","fieldName":["score"]}` | `Transforms.identity("score")` | `PARTITION BY score` | Partition by field `score` | +| Hour | `{"strategy":"hour","fieldName":["score"]}` | `Transforms.hour("score")` | `PARTITION BY hour(score)` | Partition by `hour` function in field `score` | +| Day | `{"strategy":"day","fieldName":["score"]}` | `Transforms.day("score")` | `PARTITION BY day(score)` | Partition by `day` function in field `score` | +| Month | `{"strategy":"month","fieldName":["score"]}` | `Transforms.month("score")` | `PARTITION BY month(score)` | Partition by `month` function in field `score` | +| Year | `{"strategy":"year","fieldName":["score"]}` | `Transforms.year("score")` | `PARTITION BY year(score)` | Partition by `year` function in field `score` | +| Bucket | `{"strategy":"bucket","numBuckets":10,"fieldNames":[["score"]]}` | `Transforms.bucket(10, "score")` | `PARTITION BY bucket(10, score)` | Partition by `bucket` function in field `score` | +| Truncate | `{"strategy":"truncate","width":20,"fieldName":["score"]}` | `Transforms.truncate(20, "score")` | `PARTITION BY truncate(20, score)` | Partition by `truncate` function in field `score` | +| List | `{"strategy":"list","fieldNames":[["dt"],["city"]]}` | `Transforms.list(new String[] {"dt", "city"})` | `PARTITION BY list(dt, city)` | Partition by `list` function in fields `dt` and `city` | +| Range | `{"strategy":"range","fieldName":["dt"]}` | `Transforms.range(20, "score")` | `PARTITION BY range(score)` | Partition by `range` function in field `score` | + +Except the strategies above, you can use other functions strategies to partition the table, for example, the strategy can be `{"strategy":"functionName","fieldName":["score"]}`. 
The `functionName` can be any function name that you can use in SQL, for example, `{"strategy":"functionName","fieldName":["score"]}` is equivalent to `PARTITION BY functionName(score)` in SQL. +For complex function, please refer to `FunctionPartitioningDTO`. + +#### Bucketed table + +- Strategy. It defines in which way we bucket the table. + +| Bucket strategy | Json | Java | Description | +|-----------------|---------|------------------|--------------------------| +| HASH | `HASH` | `Strategy.HASH` | Bucket table using hash | +| RANGE | `RANGE` | `Strategy.RANGE` | Bucket table using range | +| EVEN | `EVEN` | `Strategy.EVEN` | Bucket table using | + +- Number. It defines how many buckets we use to bucket the table. +- Function arguments. It defines which field or function should be used to bucket the table. Please refer to Java class `FunctionArg` and `DistributionDTO`. + +| Expression type | Json | Java | SQL syntax | Description | +|-----------------|-------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|-----------------|--------------------------------| +| Field | `{"type":"field","fieldName":["score"]}` | `FieldReferenceDTO.of("score")` | `score` | field reference value `score` | +| Function | `{"type":"function","functionName":"hour","fieldName":["score"]}` | `new FuncExpressionDTO.Builder()
.withFunctionName("hour")
.withFunctionArgs("score").build()` | `hour(score)` | function value `hour(score)` | +| Constant | `{"type":"constant","value":10, "dataType": "integer"}` | `new LiteralDTO.Builder()
.withValue("10")
.withDataType(Types.IntegerType.get())
.build()` | `10` | Integer constant `10` | + + +#### Sorted order table + +To define a sorted order table, you should use the following three components to construct a valid sorted order table. + +- Direction. It defines in which direction we sort the table. + +| Direction | Json | Java | Description | +| ---------- | ------ | -------------------------- |-------------------------------------------| +| Ascending | `asc` | `SortDirection.ASCENDING` | Sorted by a field or a function ascending | +| Descending | `desc` | `SortDirection.DESCENDING` | Sorted by a field or a function ascending | + +- Null ordering. It describes how to handle null value when ordering + +| Null ordering | Json | Java | Description | +| --------------------------------- | ------------- | -------------------------- |-----------------------------------| +| Put null value in the first place | `nulls_first` | `NullOrdering.NULLS_FIRST` | Put null value in the first place | +| Put null value int the last place | `nulls_last` | `NullOrdering.NULLS_LAST` | Put null value in the last place | + +- Sort term. It shows which field or function should be used to sort the table, please see the `Argument type` in the bucketed table. \ No newline at end of file diff --git a/docs/manage-metadata-using-gravitino.md b/docs/manage-metadata-using-gravitino.md index e807e8a0b80..de789b4a81e 100644 --- a/docs/manage-metadata-using-gravitino.md +++ b/docs/manage-metadata-using-gravitino.md @@ -733,6 +733,8 @@ In addition to the basic settings, Gravitino supports the following features: | Bucketed table | Equal to `CLUSTERED BY` in Apache Hive, some engine may use different words to describe it. | [Distribution](pathname:///docs/0.3.0/api/java/com/datastrato/gravitino/rel/expressions/distributions/Distribution.html) | | Sorted order table | Equal to `SORTED BY` in Apache Hive, some engine may use different words to describe it. | [SortOrder](pathname:///docs/0.3.0/api/java/com/datastrato/gravitino/rel/expressions/sorts/SortOrder.html) | +The detail doc about these three features is [here](advanced-table-feature.md). + :::tip **Not all catalogs may support those features.**. Please refer to the related document for more details. ::: From 1ac227085f02feabcfb1016fefd63e8b5d89dd77 Mon Sep 17 00:00:00 2001 From: yuqi Date: Tue, 19 Dec 2023 17:54:03 +0800 Subject: [PATCH 02/21] Add docs about tables advanced feature like partitioning --- docs/advanced-table-feature.md | 22 +++++++++++++++++----- 1 file changed, 17 insertions(+), 5 deletions(-) diff --git a/docs/advanced-table-feature.md b/docs/advanced-table-feature.md index e22ddfa98c8..cc3ccad53e2 100644 --- a/docs/advanced-table-feature.md +++ b/docs/advanced-table-feature.md @@ -1,7 +1,19 @@ +--- +title: "Advanced table feature" +slug: /advanced-table-feature +date: 2023-12-19 +keyword: partitioning bucket distribution sort order +license: Copyright 2023 Datastrato Pvt Ltd. This software is licensed under the Apache License version 2. +--- + #### Partitioned table Currently, Gravitino supports the following partitioning strategies: +:::note +The `score`, `dt` and `city` are the field names in the table. 
+::: + | Partitioning strategy | Json | Java | SQL syntax | Description | |-----------------------|-----------------------------------------------------|--------------------------------|----------------------------|-----------------------------------------------------------------------------------------------------------------------------| | Identity | `{"strategy":"identity","fieldName":["score"]}` | `Transforms.identity("score")` | `PARTITION BY score` | Partition by a field or reference | @@ -28,11 +40,11 @@ For complex function, please refer to `FunctionPartitioningDTO`. - Strategy. It defines in which way we bucket the table. -| Bucket strategy | Json | Java | Description | -|-----------------|---------|------------------|--------------------------| -| HASH | `HASH` | `Strategy.HASH` | Bucket table using hash | -| RANGE | `RANGE` | `Strategy.RANGE` | Bucket table using range | -| EVEN | `EVEN` | `Strategy.EVEN` | Bucket table using | +| Bucket strategy | Json | Java | Description | +|-----------------|---------|------------------|---------------------------------------------------------------------------------------------| +| HASH | `HASH` | `Strategy.HASH` | Bucket table using hash | +| RANGE | `RANGE` | `Strategy.RANGE` | Bucket table using range | +| EVEN | `EVEN` | `Strategy.EVEN` | Bucket table using even, The data will be bucketed equally according to the amount of data. | - Number. It defines how many buckets we use to bucket the table. - Function arguments. It defines which field or function should be used to bucket the table. Please refer to Java class `FunctionArg` and `DistributionDTO`. From 31677a91a2bbb8b7fc6a1364970b167614583644 Mon Sep 17 00:00:00 2001 From: yuqi Date: Tue, 19 Dec 2023 19:32:21 +0800 Subject: [PATCH 03/21] Resolve discussion --- docs/advanced-table-feature.md | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/docs/advanced-table-feature.md b/docs/advanced-table-feature.md index cc3ccad53e2..4294d99ceff 100644 --- a/docs/advanced-table-feature.md +++ b/docs/advanced-table-feature.md @@ -17,13 +17,12 @@ The `score`, `dt` and `city` are the field names in the table. 
| Partitioning strategy | Json | Java | SQL syntax | Description | |-----------------------|-----------------------------------------------------|--------------------------------|----------------------------|-----------------------------------------------------------------------------------------------------------------------------| | Identity | `{"strategy":"identity","fieldName":["score"]}` | `Transforms.identity("score")` | `PARTITION BY score` | Partition by a field or reference | -| function | `{"strategy":"functionName","fieldName":["score"]}` | `Transforms.hour("score")` | `PARTITION BY hour(score)` | Partition by a function, currently, we support currently function, hour, year, day, bucket, month, truncate, list and range | +| Function | `{"strategy":"functionName","fieldName":["score"]}` | `Transforms.hour("score")` | `PARTITION BY hour(score)` | Partition by a function, currently, we support currently function, hour, year, day, bucket, month, truncate, list and range | The detail of function strategies is as follows: | Function strategy | Json | Java | SQL syntax | Description | |-------------------|------------------------------------------------------------------|------------------------------------------------|------------------------------------|--------------------------------------------------------| -| Identity | `{"strategy":"identity","fieldName":["score"]}` | `Transforms.identity("score")` | `PARTITION BY score` | Partition by field `score` | | Hour | `{"strategy":"hour","fieldName":["score"]}` | `Transforms.hour("score")` | `PARTITION BY hour(score)` | Partition by `hour` function in field `score` | | Day | `{"strategy":"day","fieldName":["score"]}` | `Transforms.day("score")` | `PARTITION BY day(score)` | Partition by `day` function in field `score` | | Month | `{"strategy":"month","fieldName":["score"]}` | `Transforms.month("score")` | `PARTITION BY month(score)` | Partition by `month` function in field `score` | @@ -42,12 +41,12 @@ For complex function, please refer to `FunctionPartitioningDTO`. | Bucket strategy | Json | Java | Description | |-----------------|---------|------------------|---------------------------------------------------------------------------------------------| -| HASH | `HASH` | `Strategy.HASH` | Bucket table using hash | -| RANGE | `RANGE` | `Strategy.RANGE` | Bucket table using range | -| EVEN | `EVEN` | `Strategy.EVEN` | Bucket table using even, The data will be bucketed equally according to the amount of data. | +| HASH | `hash` | `Strategy.HASH` | Bucket table using hash | +| RANGE | `range` | `Strategy.RANGE` | Bucket table using range | +| EVEN | `even` | `Strategy.EVEN` | Bucket table using even, The data will be bucketed equally according to the amount of data. | - Number. It defines how many buckets we use to bucket the table. -- Function arguments. It defines which field or function should be used to bucket the table. Please refer to Java class `FunctionArg` and `DistributionDTO`. +- Function arguments. It defines which field or function should be used to bucket the table. Gravitino supports the following three kinds of arguments, for more, you can refer to Java class `FunctionArg` and `DistributionDTO` to use more complex function arguments. 
| Expression type | Json | Java | SQL syntax | Description | |-----------------|-------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|-----------------|--------------------------------| @@ -74,4 +73,4 @@ To define a sorted order table, you should use the following three components to | Put null value in the first place | `nulls_first` | `NullOrdering.NULLS_FIRST` | Put null value in the first place | | Put null value int the last place | `nulls_last` | `NullOrdering.NULLS_LAST` | Put null value in the last place | -- Sort term. It shows which field or function should be used to sort the table, please see the `Argument type` in the bucketed table. \ No newline at end of file +- Sort term. It shows which field or function should be used to sort the table, please refer to the `Expression type` in the bucketed table chapter. \ No newline at end of file From 164ddf06bb77fdf23910edaad1e931b65da0333e Mon Sep 17 00:00:00 2001 From: yuqi Date: Tue, 19 Dec 2023 20:04:34 +0800 Subject: [PATCH 04/21] Resolve discussion --- docs/advanced-table-feature.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/advanced-table-feature.md b/docs/advanced-table-feature.md index 4294d99ceff..89a65fff1ae 100644 --- a/docs/advanced-table-feature.md +++ b/docs/advanced-table-feature.md @@ -14,14 +14,14 @@ Currently, Gravitino supports the following partitioning strategies: The `score`, `dt` and `city` are the field names in the table. ::: -| Partitioning strategy | Json | Java | SQL syntax | Description | +| Partitioning strategy | Json | Java | Equivalent SQL semantics | Description | |-----------------------|-----------------------------------------------------|--------------------------------|----------------------------|-----------------------------------------------------------------------------------------------------------------------------| | Identity | `{"strategy":"identity","fieldName":["score"]}` | `Transforms.identity("score")` | `PARTITION BY score` | Partition by a field or reference | | Function | `{"strategy":"functionName","fieldName":["score"]}` | `Transforms.hour("score")` | `PARTITION BY hour(score)` | Partition by a function, currently, we support currently function, hour, year, day, bucket, month, truncate, list and range | The detail of function strategies is as follows: -| Function strategy | Json | Java | SQL syntax | Description | +| Function strategy | Json example | Java example | Equivalent SQL semantics | Description | |-------------------|------------------------------------------------------------------|------------------------------------------------|------------------------------------|--------------------------------------------------------| | Hour | `{"strategy":"hour","fieldName":["score"]}` | `Transforms.hour("score")` | `PARTITION BY hour(score)` | Partition by `hour` function in field `score` | | Day | `{"strategy":"day","fieldName":["score"]}` | `Transforms.day("score")` | `PARTITION BY day(score)` | Partition by `day` function in field `score` | @@ -48,11 +48,11 @@ For complex function, please refer to `FunctionPartitioningDTO`. - Number. It defines how many buckets we use to bucket the table. - Function arguments. It defines which field or function should be used to bucket the table. 
Gravitino supports the following three kinds of arguments, for more, you can refer to Java class `FunctionArg` and `DistributionDTO` to use more complex function arguments. -| Expression type | Json | Java | SQL syntax | Description | -|-----------------|-------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|-----------------|--------------------------------| -| Field | `{"type":"field","fieldName":["score"]}` | `FieldReferenceDTO.of("score")` | `score` | field reference value `score` | -| Function | `{"type":"function","functionName":"hour","fieldName":["score"]}` | `new FuncExpressionDTO.Builder()
.withFunctionName("hour")
.withFunctionArgs("score").build()` | `hour(score)` | function value `hour(score)` | -| Constant | `{"type":"constant","value":10, "dataType": "integer"}` | `new LiteralDTO.Builder()
.withValue("10")
.withDataType(Types.IntegerType.get())
.build()` | `10` | Integer constant `10` | +| Expression type | Json example | Java example | Equivalent SQL semantics | Description | +|-----------------|-------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------|--------------------------|--------------------------------| +| Field | `{"type":"field","fieldName":["score"]}` | `FieldReferenceDTO.of("score")` | `score` | field reference value `score` | +| Function | `{"type":"function","functionName":"hour","fieldName":["score"]}` | `new FuncExpressionDTO.Builder()
.withFunctionName("hour")
.withFunctionArgs("score").build()` | `hour(score)` | function value `hour(score)` | +| Constant | `{"type":"constant","value":10, "dataType": "integer"}` | `new LiteralDTO.Builder()
.withValue("10")
.withDataType(Types.IntegerType.get())
.build()` | `10` | Integer constant `10` | #### Sorted order table From bfd28028b1c3eab65de2dcbcbbd63fb26462f8b2 Mon Sep 17 00:00:00 2001 From: yuqi Date: Tue, 19 Dec 2023 20:35:06 +0800 Subject: [PATCH 05/21] Resolve discussion again --- docs/advanced-table-feature.md | 76 ------------------------- docs/manage-metadata-using-gravitino.md | 64 ++++++++++++++++++++- 2 files changed, 63 insertions(+), 77 deletions(-) delete mode 100644 docs/advanced-table-feature.md diff --git a/docs/advanced-table-feature.md b/docs/advanced-table-feature.md deleted file mode 100644 index 89a65fff1ae..00000000000 --- a/docs/advanced-table-feature.md +++ /dev/null @@ -1,76 +0,0 @@ ---- -title: "Advanced table feature" -slug: /advanced-table-feature -date: 2023-12-19 -keyword: partitioning bucket distribution sort order -license: Copyright 2023 Datastrato Pvt Ltd. This software is licensed under the Apache License version 2. ---- - -#### Partitioned table - -Currently, Gravitino supports the following partitioning strategies: - -:::note -The `score`, `dt` and `city` are the field names in the table. -::: - -| Partitioning strategy | Json | Java | Equivalent SQL semantics | Description | -|-----------------------|-----------------------------------------------------|--------------------------------|----------------------------|-----------------------------------------------------------------------------------------------------------------------------| -| Identity | `{"strategy":"identity","fieldName":["score"]}` | `Transforms.identity("score")` | `PARTITION BY score` | Partition by a field or reference | -| Function | `{"strategy":"functionName","fieldName":["score"]}` | `Transforms.hour("score")` | `PARTITION BY hour(score)` | Partition by a function, currently, we support currently function, hour, year, day, bucket, month, truncate, list and range | - -The detail of function strategies is as follows: - -| Function strategy | Json example | Java example | Equivalent SQL semantics | Description | -|-------------------|------------------------------------------------------------------|------------------------------------------------|------------------------------------|--------------------------------------------------------| -| Hour | `{"strategy":"hour","fieldName":["score"]}` | `Transforms.hour("score")` | `PARTITION BY hour(score)` | Partition by `hour` function in field `score` | -| Day | `{"strategy":"day","fieldName":["score"]}` | `Transforms.day("score")` | `PARTITION BY day(score)` | Partition by `day` function in field `score` | -| Month | `{"strategy":"month","fieldName":["score"]}` | `Transforms.month("score")` | `PARTITION BY month(score)` | Partition by `month` function in field `score` | -| Year | `{"strategy":"year","fieldName":["score"]}` | `Transforms.year("score")` | `PARTITION BY year(score)` | Partition by `year` function in field `score` | -| Bucket | `{"strategy":"bucket","numBuckets":10,"fieldNames":[["score"]]}` | `Transforms.bucket(10, "score")` | `PARTITION BY bucket(10, score)` | Partition by `bucket` function in field `score` | -| Truncate | `{"strategy":"truncate","width":20,"fieldName":["score"]}` | `Transforms.truncate(20, "score")` | `PARTITION BY truncate(20, score)` | Partition by `truncate` function in field `score` | -| List | `{"strategy":"list","fieldNames":[["dt"],["city"]]}` | `Transforms.list(new String[] {"dt", "city"})` | `PARTITION BY list(dt, city)` | Partition by `list` function in fields `dt` and `city` | -| Range | `{"strategy":"range","fieldName":["dt"]}` | 
`Transforms.range(20, "score")` | `PARTITION BY range(score)` | Partition by `range` function in field `score` | - -Except the strategies above, you can use other functions strategies to partition the table, for example, the strategy can be `{"strategy":"functionName","fieldName":["score"]}`. The `functionName` can be any function name that you can use in SQL, for example, `{"strategy":"functionName","fieldName":["score"]}` is equivalent to `PARTITION BY functionName(score)` in SQL. -For complex function, please refer to `FunctionPartitioningDTO`. - -#### Bucketed table - -- Strategy. It defines in which way we bucket the table. - -| Bucket strategy | Json | Java | Description | -|-----------------|---------|------------------|---------------------------------------------------------------------------------------------| -| HASH | `hash` | `Strategy.HASH` | Bucket table using hash | -| RANGE | `range` | `Strategy.RANGE` | Bucket table using range | -| EVEN | `even` | `Strategy.EVEN` | Bucket table using even, The data will be bucketed equally according to the amount of data. | - -- Number. It defines how many buckets we use to bucket the table. -- Function arguments. It defines which field or function should be used to bucket the table. Gravitino supports the following three kinds of arguments, for more, you can refer to Java class `FunctionArg` and `DistributionDTO` to use more complex function arguments. - -| Expression type | Json example | Java example | Equivalent SQL semantics | Description | -|-----------------|-------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------|--------------------------|--------------------------------| -| Field | `{"type":"field","fieldName":["score"]}` | `FieldReferenceDTO.of("score")` | `score` | field reference value `score` | -| Function | `{"type":"function","functionName":"hour","fieldName":["score"]}` | `new FuncExpressionDTO.Builder()
.withFunctionName("hour")
.withFunctionArgs("score").build()` | `hour(score)` | function value `hour(score)` | -| Constant | `{"type":"constant","value":10, "dataType": "integer"}` | `new LiteralDTO.Builder()
.withValue("10")
.withDataType(Types.IntegerType.get())
.build()` | `10` | Integer constant `10` | - - -#### Sorted order table - -To define a sorted order table, you should use the following three components to construct a valid sorted order table. - -- Direction. It defines in which direction we sort the table. - -| Direction | Json | Java | Description | -| ---------- | ------ | -------------------------- |-------------------------------------------| -| Ascending | `asc` | `SortDirection.ASCENDING` | Sorted by a field or a function ascending | -| Descending | `desc` | `SortDirection.DESCENDING` | Sorted by a field or a function ascending | - -- Null ordering. It describes how to handle null value when ordering - -| Null ordering | Json | Java | Description | -| --------------------------------- | ------------- | -------------------------- |-----------------------------------| -| Put null value in the first place | `nulls_first` | `NullOrdering.NULLS_FIRST` | Put null value in the first place | -| Put null value int the last place | `nulls_last` | `NullOrdering.NULLS_LAST` | Put null value in the last place | - -- Sort term. It shows which field or function should be used to sort the table, please refer to the `Expression type` in the bucketed table chapter. \ No newline at end of file diff --git a/docs/manage-metadata-using-gravitino.md b/docs/manage-metadata-using-gravitino.md index de789b4a81e..77dda3aa31e 100644 --- a/docs/manage-metadata-using-gravitino.md +++ b/docs/manage-metadata-using-gravitino.md @@ -733,7 +733,69 @@ In addition to the basic settings, Gravitino supports the following features: | Bucketed table | Equal to `CLUSTERED BY` in Apache Hive, some engine may use different words to describe it. | [Distribution](pathname:///docs/0.3.0/api/java/com/datastrato/gravitino/rel/expressions/distributions/Distribution.html) | | Sorted order table | Equal to `SORTED BY` in Apache Hive, some engine may use different words to describe it. | [SortOrder](pathname:///docs/0.3.0/api/java/com/datastrato/gravitino/rel/expressions/sorts/SortOrder.html) | -The detail doc about these three features is [here](advanced-table-feature.md). +#### Partitioned table + +Currently, Gravitino supports the following partitioning strategies: + +:::note +The `score`, `dt` and `city` are the field names in the table. 
+::: + +| Function strategy | Description | Json example | Java example | Equivalent SQL semantics | +|-------------------|--------------------------------------------------------|------------------------------------------------------------------|------------------------------------------------|------------------------------------| +| Identity | Partition by a field or reference | `{"strategy":"identity","fieldName":["score"]}` | `Transforms.identity("score")` | `PARTITION BY score` | +| Hour | Partition by `hour` function in field `score` | `{"strategy":"hour","fieldName":["score"]}` | `Transforms.hour("score")` | `PARTITION BY hour(score)` | +| Day | Partition by `day` function in field `score` | `{"strategy":"day","fieldName":["score"]}` | `Transforms.day("score")` | `PARTITION BY day(score)` | +| Month | Partition by `month` function in field `score` | `{"strategy":"month","fieldName":["score"]}` | `Transforms.month("score")` | `PARTITION BY month(score)` | +| Year | Partition by `year` function in field `score` | `{"strategy":"year","fieldName":["score"]}` | `Transforms.year("score")` | `PARTITION BY year(score)` | +| Bucket | Partition by `bucket` function in field `score` | `{"strategy":"bucket","numBuckets":10,"fieldNames":[["score"]]}` | `Transforms.bucket(10, "score")` | `PARTITION BY bucket(10, score)` | +| Truncate | Partition by `truncate` function in field `score` | `{"strategy":"truncate","width":20,"fieldName":["score"]}` | `Transforms.truncate(20, "score")` | `PARTITION BY truncate(20, score)` | +| List | Partition by `list` function in fields `dt` and `city` | `{"strategy":"list","fieldNames":[["dt"],["city"]]}` | `Transforms.list(new String[] {"dt", "city"})` | `PARTITION BY list(dt, city)` | +| Range | Partition by `range` function in field `score` | `{"strategy":"range","fieldName":["dt"]}` | `Transforms.range(20, "score")` | `PARTITION BY range(score)` | + +Except the strategies above, you can use other functions strategies to partition the table, for example, the strategy can be `{"strategy":"functionName","fieldName":["score"]}`. The `functionName` can be any function name that you can use in SQL, for example, `{"strategy":"toDate","fieldName":["score"]}` is equivalent to `PARTITION BY toDate(score)` in SQL. +For complex function, please refer to `FunctionPartitioningDTO`. + +#### Bucketed table + +- Strategy. It defines in which way we bucket the table. + +| Bucket strategy | Json | Java | Description | +|-----------------|---------|------------------|---------------------------------------------------------------------------------------------| +| HASH | `hash` | `Strategy.HASH` | Bucket table using hash | +| RANGE | `range` | `Strategy.RANGE` | Bucket table using range | +| EVEN | `even` | `Strategy.EVEN` | Bucket table using even, The data will be bucketed equally according to the amount of data. | + +- Number. It defines how many buckets we use to bucket the table. +- Function arguments. It defines which field or function should be used to bucket the table. Gravitino supports the following three kinds of arguments, for more, you can refer to Java class `FunctionArg` and `DistributionDTO` to use more complex function arguments. 
+ +| Expression type | Json example | Java example | Equivalent SQL semantics | Description | +|-----------------|-------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------|--------------------------|--------------------------------| +| Field | `{"type":"field","fieldName":["score"]}` | `FieldReferenceDTO.of("score")` | `score` | field reference value `score` | +| Function | `{"type":"function","functionName":"hour","fieldName":["score"]}` | `new FuncExpressionDTO.Builder()
.withFunctionName("hour")
.withFunctionArgs("score").build()` | `hour(score)` | function value `hour(score)` | +| Constant | `{"type":"constant","value":10, "dataType": "integer"}` | `new LiteralDTO.Builder()
.withValue("10")
.withDataType(Types.IntegerType.get())
.build()` | `10` | Integer constant `10` | + + +#### Sorted order table + +To define a sorted order table, you should use the following three components to construct a valid sorted order table. + +- Direction. It defines in which direction we sort the table. + +| Direction | Json | Java | Description | +| ---------- | ------ | -------------------------- |-------------------------------------------| +| Ascending | `asc` | `SortDirection.ASCENDING` | Sorted by a field or a function ascending | +| Descending | `desc` | `SortDirection.DESCENDING` | Sorted by a field or a function ascending | + +- Null ordering. It describes how to handle null value when ordering + +| Null ordering | Json | Java | Description | +| --------------------------------- | ------------- | -------------------------- |-----------------------------------| +| Put null value in the first place | `nulls_first` | `NullOrdering.NULLS_FIRST` | Put null value in the first place | +| Put null value int the last place | `nulls_last` | `NullOrdering.NULLS_LAST` | Put null value in the last place | + +- Sort term. It shows which field or function should be used to sort the table, please refer to the `Expression type` in the bucketed table chapter. + :::tip **Not all catalogs may support those features.**. Please refer to the related document for more details. From af0b3482b7cef2563985f281267fd06d15077783 Mon Sep 17 00:00:00 2001 From: yuqi Date: Tue, 19 Dec 2023 21:54:26 +0800 Subject: [PATCH 06/21] Update doc again --- docs/manage-metadata-using-gravitino.md | 147 +++++++++++++++++++----- 1 file changed, 118 insertions(+), 29 deletions(-) diff --git a/docs/manage-metadata-using-gravitino.md b/docs/manage-metadata-using-gravitino.md index 77dda3aa31e..754e9b20b6d 100644 --- a/docs/manage-metadata-using-gravitino.md +++ b/docs/manage-metadata-using-gravitino.md @@ -741,39 +741,100 @@ Currently, Gravitino supports the following partitioning strategies: The `score`, `dt` and `city` are the field names in the table. 
::: -| Function strategy | Description | Json example | Java example | Equivalent SQL semantics | -|-------------------|--------------------------------------------------------|------------------------------------------------------------------|------------------------------------------------|------------------------------------| -| Identity | Partition by a field or reference | `{"strategy":"identity","fieldName":["score"]}` | `Transforms.identity("score")` | `PARTITION BY score` | -| Hour | Partition by `hour` function in field `score` | `{"strategy":"hour","fieldName":["score"]}` | `Transforms.hour("score")` | `PARTITION BY hour(score)` | -| Day | Partition by `day` function in field `score` | `{"strategy":"day","fieldName":["score"]}` | `Transforms.day("score")` | `PARTITION BY day(score)` | -| Month | Partition by `month` function in field `score` | `{"strategy":"month","fieldName":["score"]}` | `Transforms.month("score")` | `PARTITION BY month(score)` | -| Year | Partition by `year` function in field `score` | `{"strategy":"year","fieldName":["score"]}` | `Transforms.year("score")` | `PARTITION BY year(score)` | -| Bucket | Partition by `bucket` function in field `score` | `{"strategy":"bucket","numBuckets":10,"fieldNames":[["score"]]}` | `Transforms.bucket(10, "score")` | `PARTITION BY bucket(10, score)` | -| Truncate | Partition by `truncate` function in field `score` | `{"strategy":"truncate","width":20,"fieldName":["score"]}` | `Transforms.truncate(20, "score")` | `PARTITION BY truncate(20, score)` | -| List | Partition by `list` function in fields `dt` and `city` | `{"strategy":"list","fieldNames":[["dt"],["city"]]}` | `Transforms.list(new String[] {"dt", "city"})` | `PARTITION BY list(dt, city)` | -| Range | Partition by `range` function in field `score` | `{"strategy":"range","fieldName":["dt"]}` | `Transforms.range(20, "score")` | `PARTITION BY range(score)` | +| Function strategy | Description | Source types | Result type | Json example | Java example | Equivalent SQL semantics | +|-------------------|--------------------------------------------------------------|-------------------------------------------------------------------------------------|-------------|------------------------------------------------------------------|-------------------------------------------------|------------------------------------| +| `identity` | Source value, unmodified | Any | Source type | `{"strategy":"identity","fieldName":["score"]}` | `Transforms.identity("score")` | `PARTITION BY score` | +| `hour` | Extract a timestamp hour, as hours from 1970-01-01 00:00:00 | timestamp`, timestamptz | int | `{"strategy":"hour","fieldName":["score"]}` | `Transforms.hour("score")` | `PARTITION BY hour(score)` | +| `day` | Extract a date or timestamp day, as days from 1970-01-01 | date, timestamp, timestamptz | int | `{"strategy":"day","fieldName":["score"]}` | `Transforms.day("score")` | `PARTITION BY day(score)` | +| `month` | Extract a date or timestamp month, as months from 1970-01-01 | date, timestamp, timestamptz | int | `{"strategy":"month","fieldName":["score"]}` | `Transforms.month("score")` | `PARTITION BY month(score)` | +| `year` | Extract a date or timestamp year, as years from 1970 | date, timestamp, timestamptz | int | `{"strategy":"year","fieldName":["score"]}` | `Transforms.year("score")` | `PARTITION BY year(score)` | +| `bucket[N]` | Hash of value, mod N | int, long, decimal, date, time, timestamp, timestamptz, string, uuid, fixed, binary | int | 
`{"strategy":"bucket","numBuckets":10,"fieldNames":[["score"]]}` | `Transforms.bucket(10, "score")` | `PARTITION BY bucket(10, score)` | +| `truncate[W]` | Value truncated to width W | int, long, decimal, string | Source type | `{"strategy":"truncate","width":20,"fieldName":["score"]}` | `Transforms.truncate(20, "score")` | `PARTITION BY truncate(20, score)` | +| `list` | Partition the table by a list value | Any | Any | `{"strategy":"list","fieldNames":[["dt"],["city"]]}` | `Transforms.list(new String[] {"dt", "city"})` | `PARTITION BY list(dt, city)` | +| `range` | Partition the table by a range value | Any | Any | `{"strategy":"range","fieldName":["dt"]}` | `Transforms.range(20, "score")` | `PARTITION BY range(score)` | Except the strategies above, you can use other functions strategies to partition the table, for example, the strategy can be `{"strategy":"functionName","fieldName":["score"]}`. The `functionName` can be any function name that you can use in SQL, for example, `{"strategy":"toDate","fieldName":["score"]}` is equivalent to `PARTITION BY toDate(score)` in SQL. For complex function, please refer to `FunctionPartitioningDTO`. +The following is an example of creating a partitioned table: + + + + +```json +[ + { + "strategy": "identity", + "fieldName": [ + "score" + ] + } +] +``` + + + + +```java +new Transform[] { + // Partition by score + Transforms.identity("score") + } +``` + + + + + #### Bucketed table -- Strategy. It defines in which way we bucket the table. +- Strategy. It defines in which way you bucket the table. -| Bucket strategy | Json | Java | Description | -|-----------------|---------|------------------|---------------------------------------------------------------------------------------------| -| HASH | `hash` | `Strategy.HASH` | Bucket table using hash | -| RANGE | `range` | `Strategy.RANGE` | Bucket table using range | -| EVEN | `even` | `Strategy.EVEN` | Bucket table using even, The data will be bucketed equally according to the amount of data. | +| Bucket strategy | Description | Source types | Result type | Json | Java | +|-----------------|----------------------------------------------------------------------------------------------------------------------|--------------|-------------|---------|------------------| +| hash | Bucket table using hash. The data will be distributed into buckets based on the hash value of the key. | Any | int | `hash` | `Strategy.HASH` | +| range | Bucket table using range. The data will be divided into buckets based on a specified range or interval of values. | Any | Source type | `range` | `Strategy.RANGE` | +| even | Bucket table using even. The data will be evenly distributed into buckets, ensuring an equal distribution of data. | Any | Source type | `even` | `Strategy.EVEN` | -- Number. It defines how many buckets we use to bucket the table. +- Number. It defines how many buckets you use to bucket the table. - Function arguments. It defines which field or function should be used to bucket the table. Gravitino supports the following three kinds of arguments, for more, you can refer to Java class `FunctionArg` and `DistributionDTO` to use more complex function arguments. 
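The reference table for these three argument kinds follows right after this sketch. As a rough illustration only — a fragment that simply mirrors the builder snippets used in that table and in the `DistributionDTO` example later in this section, so the exact builder signatures should be verified against the `FunctionArg` and `DistributionDTO` classes mentioned above — the three kinds might be constructed in Java like this:

```java
// Hedged sketch: mirrors the builder snippets shown in the argument table below;
// double-check the exact builder signatures in FieldReferenceDTO, FuncExpressionDTO
// and LiteralDTO before relying on them.
FieldReferenceDTO field = FieldReferenceDTO.of("score");      // plain field reference: score

FuncExpressionDTO function = new FuncExpressionDTO.Builder()  // function over a field: hour(score)
    .withFunctionName("hour")
    .withFunctionArgs("score")
    .build();

LiteralDTO constant = new LiteralDTO.Builder()                // integer literal constant: 10
    .withValue("10")
    .withDataType(Types.IntegerType.get())
    .build();
```

Any of these values can then be supplied as the distribution's arguments, as the bucketed-table example further below does with a plain field reference.
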
| Expression type | Json example | Java example | Equivalent SQL semantics | Description | |-----------------|-------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------|--------------------------|--------------------------------| -| Field | `{"type":"field","fieldName":["score"]}` | `FieldReferenceDTO.of("score")` | `score` | field reference value `score` | -| Function | `{"type":"function","functionName":"hour","fieldName":["score"]}` | `new FuncExpressionDTO.Builder()
.withFunctionName("hour")
.withFunctionArgs("score").build()` | `hour(score)` | function value `hour(score)` | -| Constant | `{"type":"constant","value":10, "dataType": "integer"}` | `new LiteralDTO.Builder()
.withValue("10")
.withDataType(Types.IntegerType.get())
.build()` | `10` | Integer constant `10` | +| field | `{"type":"field","fieldName":["score"]}` | `FieldReferenceDTO.of("score")` | `score` | field reference value `score` | +| function | `{"type":"function","functionName":"hour","fieldName":["score"]}` | `new FuncExpressionDTO.Builder()
.withFunctionName("hour")
.withFunctionArgs("score").build()` | `hour(score)` | function value `hour(score)` | +| constant | `{"type":"constant","value":10, "dataType": "integer"}` | `new LiteralDTO.Builder()
.withValue("10")
.withDataType(Types.IntegerType.get())
.build()` | `10` | Integer constant `10` | + + + + + +```json +{ + "strategy": "hash", + "number": 4, + "funcArgs": [ + { + "type": "field", + "fieldName": ["score"] + } + ] +} +``` + + + + +```java + new DistributionDTO.Builder() + .withStrategy(Strategy.HASH) + .withNumber(4) + .withArgs(FieldReferenceDTO.of("score")) + .build() +``` + + + #### Sorted order table @@ -783,18 +844,46 @@ To define a sorted order table, you should use the following three components to - Direction. It defines in which direction we sort the table. | Direction | Json | Java | Description | -| ---------- | ------ | -------------------------- |-------------------------------------------| -| Ascending | `asc` | `SortDirection.ASCENDING` | Sorted by a field or a function ascending | -| Descending | `desc` | `SortDirection.DESCENDING` | Sorted by a field or a function ascending | +|------------| ------ | -------------------------- |-------------------------------------------| +| ascending | `asc` | `SortDirection.ASCENDING` | Sorted by a field or a function ascending | +| descending | `desc` | `SortDirection.DESCENDING` | Sorted by a field or a function ascending | - Null ordering. It describes how to handle null value when ordering -| Null ordering | Json | Java | Description | -| --------------------------------- | ------------- | -------------------------- |-----------------------------------| -| Put null value in the first place | `nulls_first` | `NullOrdering.NULLS_FIRST` | Put null value in the first place | -| Put null value int the last place | `nulls_last` | `NullOrdering.NULLS_LAST` | Put null value in the last place | +| Null ordering Type | Json | Java | Description | +|--------------------| ------------- | -------------------------- |-----------------------------------| +| null_first | `nulls_first` | `NullOrdering.NULLS_FIRST` | Put null value in the first place | +| null_last | `nulls_last` | `NullOrdering.NULLS_LAST` | Put null value in the last place | -- Sort term. It shows which field or function should be used to sort the table, please refer to the `Expression type` in the bucketed table chapter. +- Sort term. It shows which field or function should be used to sort the table, please refer to the `Expression type` in the bucketed table chapter. + + + + +```json + { + "direction": "asc", + "nullOrder": "NULLS_LAST", + "sortTerm": { + "type": "field", + "fieldName": ["score"] + } +} +``` + + + + +```java + new SortOrderDTO.Builder() + .withDirection(SortDirection.ASCENDING) + .withNullOrder(NullOrdering.NULLS_LAST) + .withSortTerm(FieldReferenceDTO.of("score")) + .build() +``` + + + :::tip From d4c086ffc399886c554a1ba9aa76951bb8aa22f7 Mon Sep 17 00:00:00 2001 From: yuqi Date: Thu, 21 Dec 2023 21:27:21 +0800 Subject: [PATCH 07/21] Polish docs --- docs/manage-metadata-using-gravitino.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/manage-metadata-using-gravitino.md b/docs/manage-metadata-using-gravitino.md index 754e9b20b6d..c2e268023d4 100644 --- a/docs/manage-metadata-using-gravitino.md +++ b/docs/manage-metadata-using-gravitino.md @@ -738,7 +738,7 @@ In addition to the basic settings, Gravitino supports the following features: Currently, Gravitino supports the following partitioning strategies: :::note -The `score`, `dt` and `city` are the field names in the table. +The `score`, `dt`, and `city` appearing in the table below refer to the field names in a table. 
::: | Function strategy | Description | Source types | Result type | Json example | Java example | Equivalent SQL semantics | From 41582ddc72d2068dce00724a7c0ce5f48387251b Mon Sep 17 00:00:00 2001 From: yuqi Date: Mon, 25 Dec 2023 14:35:53 +0800 Subject: [PATCH 08/21] Resolve discussion again --- docs/manage-metadata-using-gravitino.md | 46 ++++++++++++------------- 1 file changed, 23 insertions(+), 23 deletions(-) diff --git a/docs/manage-metadata-using-gravitino.md b/docs/manage-metadata-using-gravitino.md index c2e268023d4..9e135cfc8f0 100644 --- a/docs/manage-metadata-using-gravitino.md +++ b/docs/manage-metadata-using-gravitino.md @@ -741,20 +741,20 @@ Currently, Gravitino supports the following partitioning strategies: The `score`, `dt`, and `city` appearing in the table below refer to the field names in a table. ::: -| Function strategy | Description | Source types | Result type | Json example | Java example | Equivalent SQL semantics | -|-------------------|--------------------------------------------------------------|-------------------------------------------------------------------------------------|-------------|------------------------------------------------------------------|-------------------------------------------------|------------------------------------| -| `identity` | Source value, unmodified | Any | Source type | `{"strategy":"identity","fieldName":["score"]}` | `Transforms.identity("score")` | `PARTITION BY score` | -| `hour` | Extract a timestamp hour, as hours from 1970-01-01 00:00:00 | timestamp`, timestamptz | int | `{"strategy":"hour","fieldName":["score"]}` | `Transforms.hour("score")` | `PARTITION BY hour(score)` | -| `day` | Extract a date or timestamp day, as days from 1970-01-01 | date, timestamp, timestamptz | int | `{"strategy":"day","fieldName":["score"]}` | `Transforms.day("score")` | `PARTITION BY day(score)` | -| `month` | Extract a date or timestamp month, as months from 1970-01-01 | date, timestamp, timestamptz | int | `{"strategy":"month","fieldName":["score"]}` | `Transforms.month("score")` | `PARTITION BY month(score)` | -| `year` | Extract a date or timestamp year, as years from 1970 | date, timestamp, timestamptz | int | `{"strategy":"year","fieldName":["score"]}` | `Transforms.year("score")` | `PARTITION BY year(score)` | -| `bucket[N]` | Hash of value, mod N | int, long, decimal, date, time, timestamp, timestamptz, string, uuid, fixed, binary | int | `{"strategy":"bucket","numBuckets":10,"fieldNames":[["score"]]}` | `Transforms.bucket(10, "score")` | `PARTITION BY bucket(10, score)` | -| `truncate[W]` | Value truncated to width W | int, long, decimal, string | Source type | `{"strategy":"truncate","width":20,"fieldName":["score"]}` | `Transforms.truncate(20, "score")` | `PARTITION BY truncate(20, score)` | -| `list` | Partition the table by a list value | Any | Any | `{"strategy":"list","fieldNames":[["dt"],["city"]]}` | `Transforms.list(new String[] {"dt", "city"})` | `PARTITION BY list(dt, city)` | -| `range` | Partition the table by a range value | Any | Any | `{"strategy":"range","fieldName":["dt"]}` | `Transforms.range(20, "score")` | `PARTITION BY range(score)` | +| Partitioning strategy | Description | Source types | Result type | Json example | Java example | Equivalent SQL semantics | 
+|-----------------------|--------------------------------------------------------------|-------------------------------------------------------------------------------------|-------------|------------------------------------------------------------------|-------------------------------------------------|------------------------------------| +| `identity` | Source value, unmodified | Any | Source type | `{"strategy":"identity","fieldName":["score"]}` | `Transforms.identity("score")` | `PARTITION BY score` | +| `hour` | Extract a timestamp hour, as hours from 1970-01-01 00:00:00 | timestamp`, timestamptz | int | `{"strategy":"hour","fieldName":["score"]}` | `Transforms.hour("score")` | `PARTITION BY hour(score)` | +| `day` | Extract a date or timestamp day, as days from 1970-01-01 | date, timestamp, timestamptz | int | `{"strategy":"day","fieldName":["score"]}` | `Transforms.day("score")` | `PARTITION BY day(score)` | +| `month` | Extract a date or timestamp month, as months from 1970-01-01 | date, timestamp, timestamptz | int | `{"strategy":"month","fieldName":["score"]}` | `Transforms.month("score")` | `PARTITION BY month(score)` | +| `year` | Extract a date or timestamp year, as years from 1970 | date, timestamp, timestamptz | int | `{"strategy":"year","fieldName":["score"]}` | `Transforms.year("score")` | `PARTITION BY year(score)` | +| `bucket[N]` | Hash of value, mod N | int, long, decimal, date, time, timestamp, timestamptz, string, uuid, fixed, binary | int | `{"strategy":"bucket","numBuckets":10,"fieldNames":[["score"]]}` | `Transforms.bucket(10, "score")` | `PARTITION BY bucket(10, score)` | +| `truncate[W]` | Value truncated to width W | int, long, decimal, string | Source type | `{"strategy":"truncate","width":20,"fieldName":["score"]}` | `Transforms.truncate(20, "score")` | `PARTITION BY truncate(20, score)` | +| `list` | Partition the table by a list value | Any | Any | `{"strategy":"list","fieldNames":[["dt"],["city"]]}` | `Transforms.list(new String[] {"dt", "city"})` | `PARTITION BY list(dt, city)` | +| `range` | Partition the table by a range value | Any | Any | `{"strategy":"range","fieldName":["dt"]}` | `Transforms.range(20, "score")` | `PARTITION BY range(score)` | Except the strategies above, you can use other functions strategies to partition the table, for example, the strategy can be `{"strategy":"functionName","fieldName":["score"]}`. The `functionName` can be any function name that you can use in SQL, for example, `{"strategy":"toDate","fieldName":["score"]}` is equivalent to `PARTITION BY toDate(score)` in SQL. -For complex function, please refer to `FunctionPartitioningDTO`. +For complex function, please refer to [FunctionPartitioningDTO](https://github.com/datastrato/gravitino/blob/main/common/src/main/java/com/datastrato/gravitino/dto/rel/partitions/FunctionPartitioningDTO.java). The following is an example of creating a partitioned table: @@ -788,7 +788,7 @@ new Transform[] { #### Bucketed table -- Strategy. It defines in which way you bucket the table. +- Strategy. It defines how your table data is distributed across partitions. | Bucket strategy | Description | Source types | Result type | Json | Java | |-----------------|----------------------------------------------------------------------------------------------------------------------|--------------|-------------|---------|------------------| @@ -797,13 +797,13 @@ new Transform[] { | even | Bucket table using even. The data will be evenly distributed into buckets, ensuring an equal distribution of data. 
| Any | Source type | `even` | `Strategy.EVEN` | - Number. It defines how many buckets you use to bucket the table. -- Function arguments. It defines which field or function should be used to bucket the table. Gravitino supports the following three kinds of arguments, for more, you can refer to Java class `FunctionArg` and `DistributionDTO` to use more complex function arguments. +- Function arguments. It defines the arguments of the strategy above, Gravitino supports the following three kinds of arguments, for more, you can refer to Java class [FunctionArg](https://github.com/datastrato/gravitino/blob/main/common/src/main/java/com/datastrato/gravitino/dto/rel/expressions/FunctionArg.java) and [DistributionDTO](https://github.com/datastrato/gravitino/blob/main/common/src/main/java/com/datastrato/gravitino/dto/rel/DistributionDTO.java) to use more complex function arguments. | Expression type | Json example | Java example | Equivalent SQL semantics | Description | |-----------------|-------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------|--------------------------|--------------------------------| | field | `{"type":"field","fieldName":["score"]}` | `FieldReferenceDTO.of("score")` | `score` | field reference value `score` | | function | `{"type":"function","functionName":"hour","fieldName":["score"]}` | `new FuncExpressionDTO.Builder()
.withFunctionName("hour")
.withFunctionArgs("score").build()` | `hour(score)` | function value `hour(score)` | -| constant | `{"type":"constant","value":10, "dataType": "integer"}` | `new LiteralDTO.Builder()
.withValue("10")
.withDataType(Types.IntegerType.get())
.build()` | `10` | Integer constant `10` | +| constant | `{"type":"literal","value":10, "dataType": "integer"}` | `new LiteralDTO.Builder()
.withValue("10")
.withDataType(Types.IntegerType.get())
.build()` | `10` | Integer constant `10` | @@ -843,17 +843,17 @@ To define a sorted order table, you should use the following three components to - Direction. It defines in which direction we sort the table. -| Direction | Json | Java | Description | -|------------| ------ | -------------------------- |-------------------------------------------| -| ascending | `asc` | `SortDirection.ASCENDING` | Sorted by a field or a function ascending | -| descending | `desc` | `SortDirection.DESCENDING` | Sorted by a field or a function ascending | +| Direction | Description | Json | Java | +|------------|-------------------------------------------| ------ | -------------------------- | +| ascending | Sorted by a field or a function ascending | `asc` | `SortDirection.ASCENDING` | +| descending | Sorted by a field or a function descending| `desc` | `SortDirection.DESCENDING` | - Null ordering. It describes how to handle null value when ordering -| Null ordering Type | Json | Java | Description | -|--------------------| ------------- | -------------------------- |-----------------------------------| -| null_first | `nulls_first` | `NullOrdering.NULLS_FIRST` | Put null value in the first place | -| null_last | `nulls_last` | `NullOrdering.NULLS_LAST` | Put null value in the last place | +| Null ordering Type | Description | Json | Java | +|--------------------|-----------------------------------| ------------- | -------------------------- | +| null_first | Put null value in the first place | `nulls_first` | `NullOrdering.NULLS_FIRST` | +| null_last | Put null value in the last place | `nulls_last` | `NullOrdering.NULLS_LAST` | - Sort term. It shows which field or function should be used to sort the table, please refer to the `Expression type` in the bucketed table chapter. From a08a184ec3a4203bfab87ce204f1c7ddd463a507 Mon Sep 17 00:00:00 2001 From: yuqi Date: Mon, 25 Dec 2023 14:42:52 +0800 Subject: [PATCH 09/21] Remove the source type and result type column --- docs/manage-metadata-using-gravitino.md | 32 ++++++++++++------------- 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/docs/manage-metadata-using-gravitino.md b/docs/manage-metadata-using-gravitino.md index 9e135cfc8f0..0cf1d55d571 100644 --- a/docs/manage-metadata-using-gravitino.md +++ b/docs/manage-metadata-using-gravitino.md @@ -741,17 +741,17 @@ Currently, Gravitino supports the following partitioning strategies: The `score`, `dt`, and `city` appearing in the table below refer to the field names in a table. 
::: -| Partitioning strategy | Description | Source types | Result type | Json example | Java example | Equivalent SQL semantics | -|-----------------------|--------------------------------------------------------------|-------------------------------------------------------------------------------------|-------------|------------------------------------------------------------------|-------------------------------------------------|------------------------------------| -| `identity` | Source value, unmodified | Any | Source type | `{"strategy":"identity","fieldName":["score"]}` | `Transforms.identity("score")` | `PARTITION BY score` | -| `hour` | Extract a timestamp hour, as hours from 1970-01-01 00:00:00 | timestamp`, timestamptz | int | `{"strategy":"hour","fieldName":["score"]}` | `Transforms.hour("score")` | `PARTITION BY hour(score)` | -| `day` | Extract a date or timestamp day, as days from 1970-01-01 | date, timestamp, timestamptz | int | `{"strategy":"day","fieldName":["score"]}` | `Transforms.day("score")` | `PARTITION BY day(score)` | -| `month` | Extract a date or timestamp month, as months from 1970-01-01 | date, timestamp, timestamptz | int | `{"strategy":"month","fieldName":["score"]}` | `Transforms.month("score")` | `PARTITION BY month(score)` | -| `year` | Extract a date or timestamp year, as years from 1970 | date, timestamp, timestamptz | int | `{"strategy":"year","fieldName":["score"]}` | `Transforms.year("score")` | `PARTITION BY year(score)` | -| `bucket[N]` | Hash of value, mod N | int, long, decimal, date, time, timestamp, timestamptz, string, uuid, fixed, binary | int | `{"strategy":"bucket","numBuckets":10,"fieldNames":[["score"]]}` | `Transforms.bucket(10, "score")` | `PARTITION BY bucket(10, score)` | -| `truncate[W]` | Value truncated to width W | int, long, decimal, string | Source type | `{"strategy":"truncate","width":20,"fieldName":["score"]}` | `Transforms.truncate(20, "score")` | `PARTITION BY truncate(20, score)` | -| `list` | Partition the table by a list value | Any | Any | `{"strategy":"list","fieldNames":[["dt"],["city"]]}` | `Transforms.list(new String[] {"dt", "city"})` | `PARTITION BY list(dt, city)` | -| `range` | Partition the table by a range value | Any | Any | `{"strategy":"range","fieldName":["dt"]}` | `Transforms.range(20, "score")` | `PARTITION BY range(score)` | +| Partitioning strategy | Description | Json example | Java example | Equivalent SQL semantics | +|-----------------------|--------------------------------------------------------------| ---------------------------------------------------------------- | ----------------------------------------------- |-----------------------------------| +| `identity` | Source value, unmodified | `{"strategy":"identity","fieldName":["score"]}` | `Transforms.identity("score")` | `PARTITION BY score` | +| `hour` | Extract a timestamp hour, as hours from 1970-01-01 00:00:00 | `{"strategy":"hour","fieldName":["score"]}` | `Transforms.hour("score")` | `PARTITION BY hour(score)` | +| `day` | Extract a date or timestamp day, as days from 1970-01-01 | `{"strategy":"day","fieldName":["score"]}` | `Transforms.day("score")` | `PARTITION BY day(score)` | +| `month` | Extract a date or timestamp month, as months from 1970-01-01 | `{"strategy":"month","fieldName":["score"]}` | `Transforms.month("score")` | `PARTITION BY month(score)` | +| `year` | Extract a date or timestamp year, as years from 1970 | `{"strategy":"year","fieldName":["score"]}` | `Transforms.year("score")` | `PARTITION BY year(score)` | +| 
`bucket[N]` | Hash of value, mod N | `{"strategy":"bucket","numBuckets":10,"fieldNames":[["score"]]}` | `Transforms.bucket(10, "score")` | `PARTITION BY bucket(10, score)` | +| `truncate[W]` | Value truncated to width W | `{"strategy":"truncate","width":20,"fieldName":["score"]}` | `Transforms.truncate(20, "score")` | `PARTITION BY truncate(20, score)` | +| `list` | Partition the table by a list value | `{"strategy":"list","fieldNames":[["dt"],["city"]]}` | `Transforms.list(new String[] {"dt", "city"})` | `PARTITION BY list(dt, city)` | +| `range` | Partition the table by a range value | `{"strategy":"range","fieldName":["dt"]}` | `Transforms.range(20, "score")` | `PARTITION BY range(score)` | Except the strategies above, you can use other functions strategies to partition the table, for example, the strategy can be `{"strategy":"functionName","fieldName":["score"]}`. The `functionName` can be any function name that you can use in SQL, for example, `{"strategy":"toDate","fieldName":["score"]}` is equivalent to `PARTITION BY toDate(score)` in SQL. For complex function, please refer to [FunctionPartitioningDTO](https://github.com/datastrato/gravitino/blob/main/common/src/main/java/com/datastrato/gravitino/dto/rel/partitions/FunctionPartitioningDTO.java). @@ -790,11 +790,11 @@ new Transform[] { - Strategy. It defines how your table data is distributed across partitions. -| Bucket strategy | Description | Source types | Result type | Json | Java | -|-----------------|----------------------------------------------------------------------------------------------------------------------|--------------|-------------|---------|------------------| -| hash | Bucket table using hash. The data will be distributed into buckets based on the hash value of the key. | Any | int | `hash` | `Strategy.HASH` | -| range | Bucket table using range. The data will be divided into buckets based on a specified range or interval of values. | Any | Source type | `range` | `Strategy.RANGE` | -| even | Bucket table using even. The data will be evenly distributed into buckets, ensuring an equal distribution of data. | Any | Source type | `even` | `Strategy.EVEN` | +| Bucket strategy | Description | Json | Java | +|-----------------|----------------------------------------------------------------------------------------------------------------------|----------|------------------| +| hash | Bucket table using hash. The data will be distributed into buckets based on the hash value of the key. | `hash` | `Strategy.HASH` | +| range | Bucket table using range. The data will be divided into buckets based on a specified range or interval of values. | `range` | `Strategy.RANGE` | +| even | Bucket table using even. The data will be evenly distributed into buckets, ensuring an equal distribution of data. | `even` | `Strategy.EVEN` | - Number. It defines how many buckets you use to bucket the table. - Function arguments. It defines the arguments of the strategy above, Gravitino supports the following three kinds of arguments, for more, you can refer to Java class [FunctionArg](https://github.com/datastrato/gravitino/blob/main/common/src/main/java/com/datastrato/gravitino/dto/rel/expressions/FunctionArg.java) and [DistributionDTO](https://github.com/datastrato/gravitino/blob/main/common/src/main/java/com/datastrato/gravitino/dto/rel/DistributionDTO.java) to use more complex function arguments. 
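For illustration, a minimal Java sketch that combines two of the partitioning transforms listed above with a hash distribution over `score` might look like the following; the class name is illustrative only, and the import paths are assumed from the package layout referenced elsewhere in this document, so they may need adjusting.

```java
import com.datastrato.gravitino.dto.rel.DistributionDTO;
import com.datastrato.gravitino.dto.rel.expressions.FieldReferenceDTO;
import com.datastrato.gravitino.rel.expressions.distributions.Strategy;
import com.datastrato.gravitino.rel.expressions.transforms.Transform;
import com.datastrato.gravitino.rel.expressions.transforms.Transforms;

// Illustrative sketch only: the class name is hypothetical and the import paths are assumed.
public class PartitionAndBucketExample {
  public static void main(String[] args) {
    // PARTITION BY city, day(dt): two strategies from the table above combined.
    Transform[] partitioning = new Transform[] {
        Transforms.identity("city"),
        Transforms.day("dt")
    };

    // CLUSTERED BY (score) INTO 4 BUCKETS using the hash strategy.
    DistributionDTO distribution = new DistributionDTO.Builder()
        .withStrategy(Strategy.HASH)
        .withNumber(4)
        .withArgs(FieldReferenceDTO.of("score"))
        .build();

    // Both values would then be passed to tableCatalog.createTable(...) together
    // with the table's columns, comment, and properties.
  }
}
```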
From 31ddcd4557ecf2864e20689f500a97f9474ac634 Mon Sep 17 00:00:00 2001 From: yuqi Date: Mon, 25 Dec 2023 15:03:55 +0800 Subject: [PATCH 10/21] Add description about default null ordering value --- docs/manage-metadata-using-gravitino.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/manage-metadata-using-gravitino.md b/docs/manage-metadata-using-gravitino.md index 004bdd5beff..4fdf395306c 100644 --- a/docs/manage-metadata-using-gravitino.md +++ b/docs/manage-metadata-using-gravitino.md @@ -858,6 +858,8 @@ To define a sorted order table, you should use the following three components to | null_first | Put null value in the first place | `nulls_first` | `NullOrdering.NULLS_FIRST` | | null_last | Put null value in the last place | `nulls_last` | `NullOrdering.NULLS_LAST` | +Note: If the direction value is `ascending`, the default ordering value is `nulls_first` and if the direction value is `descending`, the default ordering value is `nulls_last`. + - Sort term. It shows which field or function should be used to sort the table, please refer to the `Expression type` in the bucketed table chapter. From b70b3946db05cbfc3a7a5f9d09813877f337d887 Mon Sep 17 00:00:00 2001 From: yuqi Date: Mon, 25 Dec 2023 15:09:05 +0800 Subject: [PATCH 11/21] Use a separate doc to describe partitioning, bucketing and sorted table --- docs/manage-metadata-using-gravitino.md | 286 +---------------- ...table-partitioning-bucketing-sort-order.md | 287 ++++++++++++++++++ 2 files changed, 288 insertions(+), 285 deletions(-) create mode 100644 docs/table-partitioning-bucketing-sort-order.md diff --git a/docs/manage-metadata-using-gravitino.md b/docs/manage-metadata-using-gravitino.md index 4fdf395306c..c41c95b0edb 100644 --- a/docs/manage-metadata-using-gravitino.md +++ b/docs/manage-metadata-using-gravitino.md @@ -736,291 +736,7 @@ In addition to the basic settings, Gravitino supports the following features: | Bucketed table | Equal to `CLUSTERED BY` in Apache Hive, some engine may use different words to describe it. | [Distribution](pathname:///docs/0.3.0/api/java/com/datastrato/gravitino/rel/expressions/distributions/Distribution.html) | | Sorted order table | Equal to `SORTED BY` in Apache Hive, some engine may use different words to describe it. | [SortOrder](pathname:///docs/0.3.0/api/java/com/datastrato/gravitino/rel/expressions/sorts/SortOrder.html) | -#### Partitioned table - -Currently, Gravitino supports the following partitioning strategies: - -:::note -The `score`, `dt`, and `city` appearing in the table below refer to the field names in a table. 
-::: - -| Partitioning strategy | Description | Json example | Java example | Equivalent SQL semantics | -|-----------------------|--------------------------------------------------------------| ---------------------------------------------------------------- | ----------------------------------------------- |-----------------------------------| -| `identity` | Source value, unmodified | `{"strategy":"identity","fieldName":["score"]}` | `Transforms.identity("score")` | `PARTITION BY score` | -| `hour` | Extract a timestamp hour, as hours from 1970-01-01 00:00:00 | `{"strategy":"hour","fieldName":["score"]}` | `Transforms.hour("score")` | `PARTITION BY hour(score)` | -| `day` | Extract a date or timestamp day, as days from 1970-01-01 | `{"strategy":"day","fieldName":["score"]}` | `Transforms.day("score")` | `PARTITION BY day(score)` | -| `month` | Extract a date or timestamp month, as months from 1970-01-01 | `{"strategy":"month","fieldName":["score"]}` | `Transforms.month("score")` | `PARTITION BY month(score)` | -| `year` | Extract a date or timestamp year, as years from 1970 | `{"strategy":"year","fieldName":["score"]}` | `Transforms.year("score")` | `PARTITION BY year(score)` | -| `bucket[N]` | Hash of value, mod N | `{"strategy":"bucket","numBuckets":10,"fieldNames":[["score"]]}` | `Transforms.bucket(10, "score")` | `PARTITION BY bucket(10, score)` | -| `truncate[W]` | Value truncated to width W | `{"strategy":"truncate","width":20,"fieldName":["score"]}` | `Transforms.truncate(20, "score")` | `PARTITION BY truncate(20, score)` | -| `list` | Partition the table by a list value | `{"strategy":"list","fieldNames":[["dt"],["city"]]}` | `Transforms.list(new String[] {"dt", "city"})` | `PARTITION BY list(dt, city)` | -| `range` | Partition the table by a range value | `{"strategy":"range","fieldName":["dt"]}` | `Transforms.range(20, "score")` | `PARTITION BY range(score)` | - -Except the strategies above, you can use other functions strategies to partition the table, for example, the strategy can be `{"strategy":"functionName","fieldName":["score"]}`. The `functionName` can be any function name that you can use in SQL, for example, `{"strategy":"toDate","fieldName":["score"]}` is equivalent to `PARTITION BY toDate(score)` in SQL. -For complex function, please refer to [FunctionPartitioningDTO](https://github.com/datastrato/gravitino/blob/main/common/src/main/java/com/datastrato/gravitino/dto/rel/partitions/FunctionPartitioningDTO.java). - -The following is an example of creating a partitioned table: - - - - -```json -[ - { - "strategy": "identity", - "fieldName": [ - "score" - ] - } -] -``` - - - - -```java -new Transform[] { - // Partition by score - Transforms.identity("score") - } -``` - - - - - -#### Bucketed table - -- Strategy. It defines how your table data is distributed across partitions. - -| Bucket strategy | Description | Json | Java | -|-----------------|----------------------------------------------------------------------------------------------------------------------|----------|------------------| -| hash | Bucket table using hash. The data will be distributed into buckets based on the hash value of the key. | `hash` | `Strategy.HASH` | -| range | Bucket table using range. The data will be divided into buckets based on a specified range or interval of values. | `range` | `Strategy.RANGE` | -| even | Bucket table using even. The data will be evenly distributed into buckets, ensuring an equal distribution of data. | `even` | `Strategy.EVEN` | - -- Number. 
It defines how many buckets you use to bucket the table. -- Function arguments. It defines the arguments of the strategy above, Gravitino supports the following three kinds of arguments, for more, you can refer to Java class [FunctionArg](https://github.com/datastrato/gravitino/blob/main/common/src/main/java/com/datastrato/gravitino/dto/rel/expressions/FunctionArg.java) and [DistributionDTO](https://github.com/datastrato/gravitino/blob/main/common/src/main/java/com/datastrato/gravitino/dto/rel/DistributionDTO.java) to use more complex function arguments. - -| Expression type | Json example | Java example | Equivalent SQL semantics | Description | -|-----------------|-------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------|--------------------------|--------------------------------| -| field | `{"type":"field","fieldName":["score"]}` | `FieldReferenceDTO.of("score")` | `score` | field reference value `score` | -| function | `{"type":"function","functionName":"hour","fieldName":["score"]}` | `new FuncExpressionDTO.Builder()
.withFunctionName("hour")
.withFunctionArgs("score").build()` | `hour(score)` | function value `hour(score)` | -| constant | `{"type":"literal","value":10, "dataType": "integer"}` | `new LiteralDTO.Builder()
.withValue("10")
.withDataType(Types.IntegerType.get())
.build()` | `10` | Integer constant `10` | - - - - - -```json -{ - "strategy": "hash", - "number": 4, - "funcArgs": [ - { - "type": "field", - "fieldName": ["score"] - } - ] -} -``` - - - - -```java - new DistributionDTO.Builder() - .withStrategy(Strategy.HASH) - .withNumber(4) - .withArgs(FieldReferenceDTO.of("score")) - .build() -``` - - - - - -#### Sorted order table - -To define a sorted order table, you should use the following three components to construct a valid sorted order table. - -- Direction. It defines in which direction we sort the table. - -| Direction | Description | Json | Java | -|------------|-------------------------------------------| ------ | -------------------------- | -| ascending | Sorted by a field or a function ascending | `asc` | `SortDirection.ASCENDING` | -| descending | Sorted by a field or a function descending| `desc` | `SortDirection.DESCENDING` | - -- Null ordering. It describes how to handle null value when ordering - -| Null ordering Type | Description | Json | Java | -|--------------------|-----------------------------------| ------------- | -------------------------- | -| null_first | Put null value in the first place | `nulls_first` | `NullOrdering.NULLS_FIRST` | -| null_last | Put null value in the last place | `nulls_last` | `NullOrdering.NULLS_LAST` | - -Note: If the direction value is `ascending`, the default ordering value is `nulls_first` and if the direction value is `descending`, the default ordering value is `nulls_last`. - -- Sort term. It shows which field or function should be used to sort the table, please refer to the `Expression type` in the bucketed table chapter. - - - - -```json - { - "direction": "asc", - "nullOrder": "NULLS_LAST", - "sortTerm": { - "type": "field", - "fieldName": ["score"] - } -} -``` - - - - -```java - new SortOrderDTO.Builder() - .withDirection(SortDirection.ASCENDING) - .withNullOrder(NullOrdering.NULLS_LAST) - .withSortTerm(FieldReferenceDTO.of("score")) - .build() -``` - - - - - -:::tip -**Not all catalogs may support those features.**. Please refer to the related document for more details. 
-::: - -The following is an example of creating a partitioned, bucketed table and sorted order table: - - - - -```bash -curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ --H "Content-Type: application/json" -d '{ - "name": "table", - "columns": [ - { - "name": "id", - "type": "integer", - "nullable": true, - "comment": "Id of the user" - }, - { - "name": "name", - "type": "varchar(2000)", - "nullable": true, - "comment": "Name of the user" - }, - { - "name": "age", - "type": "short", - "nullable": true, - "comment": "Age of the user" - }, - { - "name": "score", - "type": "double", - "nullable": true, - "comment": "Score of the user" - } - ], - "comment": "Create a new Table", - "properties": { - "format": "ORC" - }, - "partitioning": [ - { - "strategy": "identity", - "fieldName": ["score"] - } - ], - "distribution": { - "strategy": "hash", - "number": 4, - "funcArgs": [ - { - "type": "field", - "fieldName": ["score"] - } - ] - }, - "sortOrders": [ - { - "direction": "asc", - "nullOrder": "NULLS_LAST", - "sortTerm": { - "type": "field", - "fieldName": ["name"] - } - } - ] -}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/tables -``` - - - - -```java -tableCatalog.createTable( - NameIdentifier.of("metalake", "hive_catalog", "schema", "table"), - new ColumnDTO[] { - ColumnDTO.builder() - .withComment("Id of the user") - .withName("id") - .withDataType(Types.IntegerType.get()) - .withNullable(true) - .build(), - ColumnDTO.builder() - .withComment("Name of the user") - .withName("name") - .withDataType(Types.VarCharType.of(1000)) - .withNullable(true) - .build(), - ColumnDTO.builder() - .withComment("Age of the user") - .withName("age") - .withDataType(Types.ShortType.get()) - .withNullable(true) - .build(), - - ColumnDTO.builder() - .withComment("Score of the user") - .withName("score") - .withDataType(Types.DoubleType.get()) - .withNullable(true) - .build(), - }, - "Create a new Table", - tablePropertiesMap, - new Transform[] { - // Partition by id - Transforms.identity("score") - }, - // CLUSTERED BY id - new DistributionDTO.Builder() - .withStrategy(Strategy.HASH) - .withNumber(4) - .withArgs(FieldReferenceDTO.of("id")) - .build(), - // SORTED BY name asc - new SortOrderDTO[] { - new SortOrderDTO.Builder() - .withDirection(SortDirection.ASCENDING) - .withNullOrder(NullOrdering.NULLS_LAST) - .withSortTerm(FieldReferenceDTO.of("name")) - .build() - } - ); -``` - - - +For More about partition, distribution and sort order, please refer to the related [doc](table-partitioning-bucketing-sort-order.md). :::note The code above is an example of creating a Hive table. For other catalogs, the code is similar, but the supported column type, table properties may be different. For more details, please refer to the related doc. diff --git a/docs/table-partitioning-bucketing-sort-order.md b/docs/table-partitioning-bucketing-sort-order.md new file mode 100644 index 00000000000..01bc6dcc077 --- /dev/null +++ b/docs/table-partitioning-bucketing-sort-order.md @@ -0,0 +1,287 @@ + + +## Partitioned table + +Currently, Gravitino supports the following partitioning strategies: + +:::note +The `score`, `dt`, and `city` appearing in the table below refer to the field names in a table. 
+::: + +| Partitioning strategy | Description | Json example | Java example | Equivalent SQL semantics | +|-----------------------|--------------------------------------------------------------| ---------------------------------------------------------------- | ----------------------------------------------- |-----------------------------------| +| `identity` | Source value, unmodified | `{"strategy":"identity","fieldName":["score"]}` | `Transforms.identity("score")` | `PARTITION BY score` | +| `hour` | Extract a timestamp hour, as hours from 1970-01-01 00:00:00 | `{"strategy":"hour","fieldName":["score"]}` | `Transforms.hour("score")` | `PARTITION BY hour(score)` | +| `day` | Extract a date or timestamp day, as days from 1970-01-01 | `{"strategy":"day","fieldName":["score"]}` | `Transforms.day("score")` | `PARTITION BY day(score)` | +| `month` | Extract a date or timestamp month, as months from 1970-01-01 | `{"strategy":"month","fieldName":["score"]}` | `Transforms.month("score")` | `PARTITION BY month(score)` | +| `year` | Extract a date or timestamp year, as years from 1970 | `{"strategy":"year","fieldName":["score"]}` | `Transforms.year("score")` | `PARTITION BY year(score)` | +| `bucket[N]` | Hash of value, mod N | `{"strategy":"bucket","numBuckets":10,"fieldNames":[["score"]]}` | `Transforms.bucket(10, "score")` | `PARTITION BY bucket(10, score)` | +| `truncate[W]` | Value truncated to width W | `{"strategy":"truncate","width":20,"fieldName":["score"]}` | `Transforms.truncate(20, "score")` | `PARTITION BY truncate(20, score)` | +| `list` | Partition the table by a list value | `{"strategy":"list","fieldNames":[["dt"],["city"]]}` | `Transforms.list(new String[] {"dt", "city"})` | `PARTITION BY list(dt, city)` | +| `range` | Partition the table by a range value | `{"strategy":"range","fieldName":["dt"]}` | `Transforms.range(20, "score")` | `PARTITION BY range(score)` | + +Except the strategies above, you can use other functions strategies to partition the table, for example, the strategy can be `{"strategy":"functionName","fieldName":["score"]}`. The `functionName` can be any function name that you can use in SQL, for example, `{"strategy":"toDate","fieldName":["score"]}` is equivalent to `PARTITION BY toDate(score)` in SQL. +For complex function, please refer to [FunctionPartitioningDTO](https://github.com/datastrato/gravitino/blob/main/common/src/main/java/com/datastrato/gravitino/dto/rel/partitions/FunctionPartitioningDTO.java). + +The following is an example of creating a partitioned table: + + + + +```json +[ + { + "strategy": "identity", + "fieldName": [ + "score" + ] + } +] +``` + + + + +```java +new Transform[] { + // Partition by score + Transforms.identity("score") + } +``` + + + + + +## Bucketed table + +- Strategy. It defines how your table data is distributed across partitions. + +| Bucket strategy | Description | Json | Java | +|-----------------|----------------------------------------------------------------------------------------------------------------------|----------|------------------| +| hash | Bucket table using hash. The data will be distributed into buckets based on the hash value of the key. | `hash` | `Strategy.HASH` | +| range | Bucket table using range. The data will be divided into buckets based on a specified range or interval of values. | `range` | `Strategy.RANGE` | +| even | Bucket table using even. The data will be evenly distributed into buckets, ensuring an equal distribution of data. | `even` | `Strategy.EVEN` | + +- Number. 
It defines how many buckets you use to bucket the table. +- Function arguments. It defines the arguments of the strategy above, Gravitino supports the following three kinds of arguments, for more, you can refer to Java class [FunctionArg](https://github.com/datastrato/gravitino/blob/main/common/src/main/java/com/datastrato/gravitino/dto/rel/expressions/FunctionArg.java) and [DistributionDTO](https://github.com/datastrato/gravitino/blob/main/common/src/main/java/com/datastrato/gravitino/dto/rel/DistributionDTO.java) to use more complex function arguments. + +| Expression type | Json example | Java example | Equivalent SQL semantics | Description | +|-----------------|-------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------|--------------------------|--------------------------------| +| field | `{"type":"field","fieldName":["score"]}` | `FieldReferenceDTO.of("score")` | `score` | field reference value `score` | +| function | `{"type":"function","functionName":"hour","fieldName":["score"]}` | `new FuncExpressionDTO.Builder()
.withFunctionName("hour")
.withFunctionArgs("score").build()` | `hour(score)` | function value `hour(score)` | +| constant | `{"type":"literal","value":10, "dataType": "integer"}` | `new LiteralDTO.Builder()
.withValue("10")
.withDataType(Types.IntegerType.get())
.build()` | `10` | Integer constant `10` | + + + + + +```json +{ + "strategy": "hash", + "number": 4, + "funcArgs": [ + { + "type": "field", + "fieldName": ["score"] + } + ] +} +``` + + + + +```java + new DistributionDTO.Builder() + .withStrategy(Strategy.HASH) + .withNumber(4) + .withArgs(FieldReferenceDTO.of("score")) + .build() +``` + + + + + +## Sorted order table + +To define a sorted order table, you should use the following three components to construct a valid sorted order table. + +- Direction. It defines in which direction we sort the table. + +| Direction | Description | Json | Java | +|------------|-------------------------------------------| ------ | -------------------------- | +| ascending | Sorted by a field or a function ascending | `asc` | `SortDirection.ASCENDING` | +| descending | Sorted by a field or a function descending| `desc` | `SortDirection.DESCENDING` | + +- Null ordering. It describes how to handle null value when ordering + +| Null ordering Type | Description | Json | Java | +|--------------------|-----------------------------------| ------------- | -------------------------- | +| null_first | Put null value in the first place | `nulls_first` | `NullOrdering.NULLS_FIRST` | +| null_last | Put null value in the last place | `nulls_last` | `NullOrdering.NULLS_LAST` | + +Note: If the direction value is `ascending`, the default ordering value is `nulls_first` and if the direction value is `descending`, the default ordering value is `nulls_last`. + +- Sort term. It shows which field or function should be used to sort the table, please refer to the `Expression type` in the bucketed table chapter. + + + + +```json + { + "direction": "asc", + "nullOrder": "NULLS_LAST", + "sortTerm": { + "type": "field", + "fieldName": ["score"] + } +} +``` + + + + +```java + new SortOrderDTO.Builder() + .withDirection(SortDirection.ASCENDING) + .withNullOrder(NullOrdering.NULLS_LAST) + .withSortTerm(FieldReferenceDTO.of("score")) + .build() +``` + + + + + +:::tip +**Not all catalogs may support those features.**. Please refer to the related document for more details. 
+::: + +The following is an example of creating a partitioned, bucketed table and sorted order table: + + + + +```bash +curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ +-H "Content-Type: application/json" -d '{ + "name": "table", + "columns": [ + { + "name": "id", + "type": "integer", + "nullable": true, + "comment": "Id of the user" + }, + { + "name": "name", + "type": "varchar(2000)", + "nullable": true, + "comment": "Name of the user" + }, + { + "name": "age", + "type": "short", + "nullable": true, + "comment": "Age of the user" + }, + { + "name": "score", + "type": "double", + "nullable": true, + "comment": "Score of the user" + } + ], + "comment": "Create a new Table", + "properties": { + "format": "ORC" + }, + "partitioning": [ + { + "strategy": "identity", + "fieldName": ["score"] + } + ], + "distribution": { + "strategy": "hash", + "number": 4, + "funcArgs": [ + { + "type": "field", + "fieldName": ["score"] + } + ] + }, + "sortOrders": [ + { + "direction": "asc", + "nullOrder": "NULLS_LAST", + "sortTerm": { + "type": "field", + "fieldName": ["name"] + } + } + ] +}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/tables +``` + + + + +```java +tableCatalog.createTable( + NameIdentifier.of("metalake", "hive_catalog", "schema", "table"), + new ColumnDTO[] { + ColumnDTO.builder() + .withComment("Id of the user") + .withName("id") + .withDataType(Types.IntegerType.get()) + .withNullable(true) + .build(), + ColumnDTO.builder() + .withComment("Name of the user") + .withName("name") + .withDataType(Types.VarCharType.of(1000)) + .withNullable(true) + .build(), + ColumnDTO.builder() + .withComment("Age of the user") + .withName("age") + .withDataType(Types.ShortType.get()) + .withNullable(true) + .build(), + + ColumnDTO.builder() + .withComment("Score of the user") + .withName("score") + .withDataType(Types.DoubleType.get()) + .withNullable(true) + .build(), + }, + "Create a new Table", + tablePropertiesMap, + new Transform[] { + // Partition by id + Transforms.identity("score") + }, + // CLUSTERED BY id + new DistributionDTO.Builder() + .withStrategy(Strategy.HASH) + .withNumber(4) + .withArgs(FieldReferenceDTO.of("id")) + .build(), + // SORTED BY name asc + new SortOrderDTO[] { + new SortOrderDTO.Builder() + .withDirection(SortDirection.ASCENDING) + .withNullOrder(NullOrdering.NULLS_LAST) + .withSortTerm(FieldReferenceDTO.of("name")) + .build() + } + ); +``` + + + \ No newline at end of file From 6e37e148524dcda1d8122b9364b3f13f82535b29 Mon Sep 17 00:00:00 2001 From: yuqi Date: Mon, 25 Dec 2023 15:12:19 +0800 Subject: [PATCH 12/21] Add document header for table-partitioning-bucketing-sort-order.md --- docs/table-partitioning-bucketing-sort-order.md | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/docs/table-partitioning-bucketing-sort-order.md b/docs/table-partitioning-bucketing-sort-order.md index 01bc6dcc077..dd0b20be74a 100644 --- a/docs/table-partitioning-bucketing-sort-order.md +++ b/docs/table-partitioning-bucketing-sort-order.md @@ -1,4 +1,13 @@ - +--- +title: "Table partitioning, bucketing and sorting order" +slug: /table-partitioning-bucketing-sort-order +date: 2023-12-25 +keyword: Table Partition Bucket Distribute Sort By +license: Copyright 2023 Datastrato Pvt Ltd. This software is licensed under the Apache License version 2. 
+--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; ## Partitioned table From 3f6c62268dce84cb5ff708a6091f7a2212568a3f Mon Sep 17 00:00:00 2001 From: yuqi Date: Mon, 25 Dec 2023 15:22:16 +0800 Subject: [PATCH 13/21] Add descriptions about default value of sort direction. --- docs/table-partitioning-bucketing-sort-order.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/table-partitioning-bucketing-sort-order.md b/docs/table-partitioning-bucketing-sort-order.md index dd0b20be74a..782acabb567 100644 --- a/docs/table-partitioning-bucketing-sort-order.md +++ b/docs/table-partitioning-bucketing-sort-order.md @@ -117,7 +117,7 @@ new Transform[] { To define a sorted order table, you should use the following three components to construct a valid sorted order table. -- Direction. It defines in which direction we sort the table. +- Direction. It defines in which direction we sort the table. The default value is `ascending`. | Direction | Description | Json | Java | |------------|-------------------------------------------| ------ | -------------------------- | From 993fdff989c72b4f27a53fc6505654f743f97c58 Mon Sep 17 00:00:00 2001 From: yuqi Date: Mon, 25 Dec 2023 21:28:39 +0800 Subject: [PATCH 14/21] Change some improper variants naming --- ...table-partitioning-bucketing-sort-order.md | 48 ++++++++----------- 1 file changed, 20 insertions(+), 28 deletions(-) diff --git a/docs/table-partitioning-bucketing-sort-order.md b/docs/table-partitioning-bucketing-sort-order.md index 782acabb567..5f519f98150 100644 --- a/docs/table-partitioning-bucketing-sort-order.md +++ b/docs/table-partitioning-bucketing-sort-order.md @@ -17,19 +17,19 @@ Currently, Gravitino supports the following partitioning strategies: The `score`, `dt`, and `city` appearing in the table below refer to the field names in a table. 
::: -| Partitioning strategy | Description | Json example | Java example | Equivalent SQL semantics | -|-----------------------|--------------------------------------------------------------| ---------------------------------------------------------------- | ----------------------------------------------- |-----------------------------------| -| `identity` | Source value, unmodified | `{"strategy":"identity","fieldName":["score"]}` | `Transforms.identity("score")` | `PARTITION BY score` | -| `hour` | Extract a timestamp hour, as hours from 1970-01-01 00:00:00 | `{"strategy":"hour","fieldName":["score"]}` | `Transforms.hour("score")` | `PARTITION BY hour(score)` | -| `day` | Extract a date or timestamp day, as days from 1970-01-01 | `{"strategy":"day","fieldName":["score"]}` | `Transforms.day("score")` | `PARTITION BY day(score)` | -| `month` | Extract a date or timestamp month, as months from 1970-01-01 | `{"strategy":"month","fieldName":["score"]}` | `Transforms.month("score")` | `PARTITION BY month(score)` | -| `year` | Extract a date or timestamp year, as years from 1970 | `{"strategy":"year","fieldName":["score"]}` | `Transforms.year("score")` | `PARTITION BY year(score)` | -| `bucket[N]` | Hash of value, mod N | `{"strategy":"bucket","numBuckets":10,"fieldNames":[["score"]]}` | `Transforms.bucket(10, "score")` | `PARTITION BY bucket(10, score)` | -| `truncate[W]` | Value truncated to width W | `{"strategy":"truncate","width":20,"fieldName":["score"]}` | `Transforms.truncate(20, "score")` | `PARTITION BY truncate(20, score)` | -| `list` | Partition the table by a list value | `{"strategy":"list","fieldNames":[["dt"],["city"]]}` | `Transforms.list(new String[] {"dt", "city"})` | `PARTITION BY list(dt, city)` | -| `range` | Partition the table by a range value | `{"strategy":"range","fieldName":["dt"]}` | `Transforms.range(20, "score")` | `PARTITION BY range(score)` | - -Except the strategies above, you can use other functions strategies to partition the table, for example, the strategy can be `{"strategy":"functionName","fieldName":["score"]}`. The `functionName` can be any function name that you can use in SQL, for example, `{"strategy":"toDate","fieldName":["score"]}` is equivalent to `PARTITION BY toDate(score)` in SQL. 
+| Partitioning strategy | Description | Json example | Java example | Equivalent SQL semantics | +|-----------------------|--------------------------------------------------------------|------------------------------------------------------------------|------------------------------------------------|------------------------------------| +| `identity` | Source value, unmodified | `{"strategy":"identity","fieldName":["score"]}` | `Transforms.identity("score")` | `PARTITION BY score` | +| `hour` | Extract a timestamp hour, as hours from 1970-01-01 00:00:00 | `{"strategy":"hour","fieldName":["dt"]}` | `Transforms.hour("dt")` | `PARTITION BY hour(dt)` | +| `day` | Extract a date or timestamp day, as days from 1970-01-01 | `{"strategy":"day","fieldName":["dt"]}` | `Transforms.day("dt")` | `PARTITION BY day(dt)` | +| `month` | Extract a date or timestamp month, as months from 1970-01-01 | `{"strategy":"month","fieldName":["dt"]}` | `Transforms.month("dt")` | `PARTITION BY month(dt)` | +| `year` | Extract a date or timestamp year, as years from 1970 | `{"strategy":"year","fieldName":["dt"]}` | `Transforms.year("dt")` | `PARTITION BY year(dt)` | +| `bucket[N]` | Hash of value, mod N | `{"strategy":"bucket","numBuckets":10,"fieldNames":[["score"]]}` | `Transforms.bucket(10, "score")` | `PARTITION BY bucket(10, score)` | +| `truncate[W]` | Value truncated to width W | `{"strategy":"truncate","width":20,"fieldName":["score"]}` | `Transforms.truncate(20, "score")` | `PARTITION BY truncate(20, score)` | +| `list` | Partition the table by a list value | `{"strategy":"list","fieldNames":[["dt"],["city"]]}` | `Transforms.list(new String[] {"dt", "city"})` | `PARTITION BY list(dt, city)` | +| `range` | Partition the table by a range value | `{"strategy":"range","fieldName":["dt"]}` | `Transforms.range(20, "score")` | `PARTITION BY range(score)` | + +Except the strategies above, you can use other functions strategies to partition the table, for example, the strategy can be `{"strategy":"functionName","fieldName":["score"]}`. The `functionName` can be any function name that you can use in SQL, for example, `{"strategy":"toDate","fieldName":["dt"]}` is equivalent to `PARTITION BY toDate(dt)` in SQL. For complex function, please refer to [FunctionPartitioningDTO](https://github.com/datastrato/gravitino/blob/main/common/src/main/java/com/datastrato/gravitino/dto/rel/partitions/FunctionPartitioningDTO.java). The following is an example of creating a partitioned table: @@ -75,11 +75,11 @@ new Transform[] { - Number. It defines how many buckets you use to bucket the table. - Function arguments. It defines the arguments of the strategy above, Gravitino supports the following three kinds of arguments, for more, you can refer to Java class [FunctionArg](https://github.com/datastrato/gravitino/blob/main/common/src/main/java/com/datastrato/gravitino/dto/rel/expressions/FunctionArg.java) and [DistributionDTO](https://github.com/datastrato/gravitino/blob/main/common/src/main/java/com/datastrato/gravitino/dto/rel/DistributionDTO.java) to use more complex function arguments. 
-| Expression type | Json example | Java example | Equivalent SQL semantics | Description | -|-----------------|-------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------|--------------------------|--------------------------------| -| field | `{"type":"field","fieldName":["score"]}` | `FieldReferenceDTO.of("score")` | `score` | field reference value `score` | -| function | `{"type":"function","functionName":"hour","fieldName":["score"]}` | `new FuncExpressionDTO.Builder()
.withFunctionName("hour")
.withFunctionArgs("score").build()` | `hour(score)` | function value `hour(score)` | -| constant | `{"type":"literal","value":10, "dataType": "integer"}` | `new LiteralDTO.Builder()
.withValue("10")
.withDataType(Types.IntegerType.get())
.build()` | `10` | Integer constant `10` | +| Expression type | Json example | Java example | Equivalent SQL semantics | Description | +|-----------------|----------------------------------------------------------------|-------------------------------------------------------------------------------------------|--------------------------|-----------------------------------| +| field | `{"type":"field","fieldName":["score"]}` | `FieldReferenceDTO.of("score")` | `score` | The field reference value `score` | +| function | `{"type":"function","functionName":"hour","fieldName":["dt"]}` | `new FuncExpressionDTO.Builder().withFunctionName("hour").withFunctionArgs("dt").build()` | `hour(dt)` | The function value `hour(dt)` | +| constant | `{"type":"literal","value":10, "dataType": "integer"}` | `new LiteralDTO.Builder().withValue("10").withDataType(Types.IntegerType.get()).build()` | `10` | The integer literal `10` | @@ -153,11 +153,7 @@ Note: If the direction value is `ascending`, the default ordering value is `null ```java - new SortOrderDTO.Builder() - .withDirection(SortDirection.ASCENDING) - .withNullOrder(NullOrdering.NULLS_LAST) - .withSortTerm(FieldReferenceDTO.of("score")) - .build() +SortOrders.of(FieldReferenceDTO.of("score"), SortDirection.ASCENDING, NullOrdering.NULLS_LAST); ``` @@ -283,11 +279,7 @@ tableCatalog.createTable( .build(), // SORTED BY name asc new SortOrderDTO[] { - new SortOrderDTO.Builder() - .withDirection(SortDirection.ASCENDING) - .withNullOrder(NullOrdering.NULLS_LAST) - .withSortTerm(FieldReferenceDTO.of("name")) - .build() + SortOrders.of(FieldReferenceDTO.of("score"), SortDirection.ASCENDING, NullOrdering.NULLS_LAST) } ); ``` From b1d3db62f7f6c11d992d8f7b7ab0c217cc247210 Mon Sep 17 00:00:00 2001 From: yuqi Date: Mon, 25 Dec 2023 21:31:40 +0800 Subject: [PATCH 15/21] Fix discussion again --- docs/table-partitioning-bucketing-sort-order.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/table-partitioning-bucketing-sort-order.md b/docs/table-partitioning-bucketing-sort-order.md index 5f519f98150..f257ee8ef46 100644 --- a/docs/table-partitioning-bucketing-sort-order.md +++ b/docs/table-partitioning-bucketing-sort-order.md @@ -27,7 +27,7 @@ The `score`, `dt`, and `city` appearing in the table below refer to the field na | `bucket[N]` | Hash of value, mod N | `{"strategy":"bucket","numBuckets":10,"fieldNames":[["score"]]}` | `Transforms.bucket(10, "score")` | `PARTITION BY bucket(10, score)` | | `truncate[W]` | Value truncated to width W | `{"strategy":"truncate","width":20,"fieldName":["score"]}` | `Transforms.truncate(20, "score")` | `PARTITION BY truncate(20, score)` | | `list` | Partition the table by a list value | `{"strategy":"list","fieldNames":[["dt"],["city"]]}` | `Transforms.list(new String[] {"dt", "city"})` | `PARTITION BY list(dt, city)` | -| `range` | Partition the table by a range value | `{"strategy":"range","fieldName":["dt"]}` | `Transforms.range(20, "score")` | `PARTITION BY range(score)` | +| `range` | Partition the table by a range value | `{"strategy":"range","fieldName":["dt"]}` | `Transforms.range("dt")` | `PARTITION BY range(dt)` | Except the strategies above, you can use other functions strategies to partition the table, for example, the strategy can be `{"strategy":"functionName","fieldName":["score"]}`. The `functionName` can be any function name that you can use in SQL, for example, `{"strategy":"toDate","fieldName":["dt"]}` is equivalent to `PARTITION BY toDate(dt)` in SQL. 
For complex function, please refer to [FunctionPartitioningDTO](https://github.com/datastrato/gravitino/blob/main/common/src/main/java/com/datastrato/gravitino/dto/rel/partitions/FunctionPartitioningDTO.java). From 108117a0cf8cea726d9ddbde429a64944ef38b28 Mon Sep 17 00:00:00 2001 From: yuqi Date: Wed, 27 Dec 2023 10:51:43 +0800 Subject: [PATCH 16/21] Optimize code. --- docs/manage-metadata-using-gravitino.md | 3 +- ...table-partitioning-bucketing-sort-order.md | 72 +++++++++---------- 2 files changed, 38 insertions(+), 37 deletions(-) diff --git a/docs/manage-metadata-using-gravitino.md b/docs/manage-metadata-using-gravitino.md index c41c95b0edb..85947a1550c 100644 --- a/docs/manage-metadata-using-gravitino.md +++ b/docs/manage-metadata-using-gravitino.md @@ -736,7 +736,8 @@ In addition to the basic settings, Gravitino supports the following features: | Bucketed table | Equal to `CLUSTERED BY` in Apache Hive, some engine may use different words to describe it. | [Distribution](pathname:///docs/0.3.0/api/java/com/datastrato/gravitino/rel/expressions/distributions/Distribution.html) | | Sorted order table | Equal to `SORTED BY` in Apache Hive, some engine may use different words to describe it. | [SortOrder](pathname:///docs/0.3.0/api/java/com/datastrato/gravitino/rel/expressions/sorts/SortOrder.html) | -For More about partition, distribution and sort order, please refer to the related [doc](table-partitioning-bucketing-sort-order.md). + +For more information, please see the related document on [partitioning, bucketing, and sorting](table-partitioning-bucketing-sort-order.md). :::note The code above is an example of creating a Hive table. For other catalogs, the code is similar, but the supported column type, table properties may be different. For more details, please refer to the related doc. diff --git a/docs/table-partitioning-bucketing-sort-order.md b/docs/table-partitioning-bucketing-sort-order.md index f257ee8ef46..171e05fbc9c 100644 --- a/docs/table-partitioning-bucketing-sort-order.md +++ b/docs/table-partitioning-bucketing-sort-order.md @@ -2,7 +2,7 @@ title: "Table partitioning, bucketing and sorting order" slug: /table-partitioning-bucketing-sort-order date: 2023-12-25 -keyword: Table Partition Bucket Distribute Sort By +keyword: Table Partition Bucket Distribute Sort By license: Copyright 2023 Datastrato Pvt Ltd. This software is licensed under the Apache License version 2. --- @@ -10,27 +10,27 @@ import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; ## Partitioned table - +As well as Currently, Gravitino supports the following partitioning strategies: :::note -The `score`, `dt`, and `city` appearing in the table below refer to the field names in a table. +The `score`, `createTime`, and `city` appearing in the table below refer to the field names in a table. 
::: -| Partitioning strategy | Description | Json example | Java example | Equivalent SQL semantics | -|-----------------------|--------------------------------------------------------------|------------------------------------------------------------------|------------------------------------------------|------------------------------------| -| `identity` | Source value, unmodified | `{"strategy":"identity","fieldName":["score"]}` | `Transforms.identity("score")` | `PARTITION BY score` | -| `hour` | Extract a timestamp hour, as hours from 1970-01-01 00:00:00 | `{"strategy":"hour","fieldName":["dt"]}` | `Transforms.hour("dt")` | `PARTITION BY hour(dt)` | -| `day` | Extract a date or timestamp day, as days from 1970-01-01 | `{"strategy":"day","fieldName":["dt"]}` | `Transforms.day("dt")` | `PARTITION BY day(dt)` | -| `month` | Extract a date or timestamp month, as months from 1970-01-01 | `{"strategy":"month","fieldName":["dt"]}` | `Transforms.month("dt")` | `PARTITION BY month(dt)` | -| `year` | Extract a date or timestamp year, as years from 1970 | `{"strategy":"year","fieldName":["dt"]}` | `Transforms.year("dt")` | `PARTITION BY year(dt)` | -| `bucket[N]` | Hash of value, mod N | `{"strategy":"bucket","numBuckets":10,"fieldNames":[["score"]]}` | `Transforms.bucket(10, "score")` | `PARTITION BY bucket(10, score)` | -| `truncate[W]` | Value truncated to width W | `{"strategy":"truncate","width":20,"fieldName":["score"]}` | `Transforms.truncate(20, "score")` | `PARTITION BY truncate(20, score)` | -| `list` | Partition the table by a list value | `{"strategy":"list","fieldNames":[["dt"],["city"]]}` | `Transforms.list(new String[] {"dt", "city"})` | `PARTITION BY list(dt, city)` | -| `range` | Partition the table by a range value | `{"strategy":"range","fieldName":["dt"]}` | `Transforms.range("dt")` | `PARTITION BY range(dt)` | - -Except the strategies above, you can use other functions strategies to partition the table, for example, the strategy can be `{"strategy":"functionName","fieldName":["score"]}`. The `functionName` can be any function name that you can use in SQL, for example, `{"strategy":"toDate","fieldName":["dt"]}` is equivalent to `PARTITION BY toDate(dt)` in SQL. -For complex function, please refer to [FunctionPartitioningDTO](https://github.com/datastrato/gravitino/blob/main/common/src/main/java/com/datastrato/gravitino/dto/rel/partitions/FunctionPartitioningDTO.java). +| Partitioning strategy | Description | JSON example | Java example | Equivalent SQL semantics | +|-----------------------|----------------------------------------------------------------|------------------------------------------------------------------|--------------------------------------------------------|---------------------------------------| +| `identity` | Source value, unmodified. | `{"strategy":"identity","fieldName":["score"]}` | `Transforms.identity("score")` | `PARTITION BY score` | +| `hour` | Extract a timestamp hour, as hours from '1970-01-01 00:00:00'. | `{"strategy":"hour","fieldName":["createTime"]}` | `Transforms.hour("createTime")` | `PARTITION BY hour(createTime)` | +| `day` | Extract a date or timestamp day, as days from '1970-01-01'. 
| `{"strategy":"day","fieldName":["createTime"]}` | `Transforms.day("createTime")` | `PARTITION BY day(createTime)` | +| `month` | Extract a date or timestamp month, as months from '1970-01-01' | `{"strategy":"month","fieldName":["createTime"]}` | `Transforms.month("createTime")` | `PARTITION BY month(createTime)` | +| `year` | Extract a date or timestamp year, as years from 1970. | `{"strategy":"year","fieldName":["createTime"]}` | `Transforms.year("createTime")` | `PARTITION BY year(createTime)` | +| `bucket[N]` | Hash of value, mod N. | `{"strategy":"bucket","numBuckets":10,"fieldNames":[["score"]]}` | `Transforms.bucket(10, "score")` | `PARTITION BY bucket(10, score)` | +| `truncate[W]` | Value truncated to width W. | `{"strategy":"truncate","width":20,"fieldName":["score"]}` | `Transforms.truncate(20, "score")` | `PARTITION BY truncate(20, score)` | +| `list` | Partition the table by a list value. | `{"strategy":"list","fieldNames":[["createTime"],["city"]]}` | `Transforms.list(new String[] {"createTime", "city"})` | `PARTITION BY list(createTime, city)` | +| `range` | Partition the table by a range value. | `{"strategy":"range","fieldName":["createTime"]}` | `Transforms.range("createTime")` | `PARTITION BY range(createTime)` | + +As well as the strategies mentioned before, you can use other functions strategies to partition the table, for example, the strategy can be `{"strategy":"functionName","fieldName":["score"]}`. The `functionName` can be any function name that you can use in SQL, for example, `{"strategy":"toDate","fieldName":["createTime"]}` is equivalent to `PARTITION BY toDate(createTime)` in SQL. +For complex functions, please refer to [FunctionPartitioningDTO](https://github.com/datastrato/gravitino/blob/main/common/src/main/java/com/datastrato/gravitino/dto/rel/partitions/FunctionPartitioningDTO.java). The following is an example of creating a partitioned table: @@ -64,18 +64,18 @@ new Transform[] { ## Bucketed table -- Strategy. It defines how your table data is distributed across partitions. +- Strategy. It defines how Gravitino will distribute table data across partitions. -| Bucket strategy | Description | Json | Java | -|-----------------|----------------------------------------------------------------------------------------------------------------------|----------|------------------| -| hash | Bucket table using hash. The data will be distributed into buckets based on the hash value of the key. | `hash` | `Strategy.HASH` | -| range | Bucket table using range. The data will be divided into buckets based on a specified range or interval of values. | `range` | `Strategy.RANGE` | -| even | Bucket table using even. The data will be evenly distributed into buckets, ensuring an equal distribution of data. | `even` | `Strategy.EVEN` | +| Bucket strategy | Description | JSON | Java | +|-----------------|-------------------------------------------------------------------------------------------------------------------------------|----------|------------------| +| hash | Bucket table using hash. Gravitino will distribute table data into buckets based on the hash value of the key. | `hash` | `Strategy.HASH` | +| range | Bucket table using range. Gravitino will distribute table data into buckets based on a specified range or interval of values. | `range` | `Strategy.RANGE` | +| even | Bucket table using even. Gravitino will distribute table data, ensuring an equal distribution of data. | `even` | `Strategy.EVEN` | - Number. 
It defines how many buckets you use to bucket the table. - Function arguments. It defines the arguments of the strategy above, Gravitino supports the following three kinds of arguments, for more, you can refer to Java class [FunctionArg](https://github.com/datastrato/gravitino/blob/main/common/src/main/java/com/datastrato/gravitino/dto/rel/expressions/FunctionArg.java) and [DistributionDTO](https://github.com/datastrato/gravitino/blob/main/common/src/main/java/com/datastrato/gravitino/dto/rel/DistributionDTO.java) to use more complex function arguments. -| Expression type | Json example | Java example | Equivalent SQL semantics | Description | +| Expression type | JSON example | Java example | Equivalent SQL semantics | Description | |-----------------|----------------------------------------------------------------|-------------------------------------------------------------------------------------------|--------------------------|-----------------------------------| | field | `{"type":"field","fieldName":["score"]}` | `FieldReferenceDTO.of("score")` | `score` | The field reference value `score` | | function | `{"type":"function","functionName":"hour","fieldName":["dt"]}` | `new FuncExpressionDTO.Builder().withFunctionName("hour").withFunctionArgs("dt").build()` | `hour(dt)` | The function value `hour(dt)` | @@ -117,23 +117,23 @@ new Transform[] { To define a sorted order table, you should use the following three components to construct a valid sorted order table. -- Direction. It defines in which direction we sort the table. The default value is `ascending`. +- Direction. It defines in which direction Gravitino sorts the table. The default value is `ascending`. -| Direction | Description | Json | Java | -|------------|-------------------------------------------| ------ | -------------------------- | -| ascending | Sorted by a field or a function ascending | `asc` | `SortDirection.ASCENDING` | -| descending | Sorted by a field or a function descending| `desc` | `SortDirection.DESCENDING` | +| Direction | Description | JSON | Java | +|------------|---------------------------------------------| ------ | -------------------------- | +| ascending | Sorted by a field or a function ascending. | `asc` | `SortDirection.ASCENDING` | +| descending | Sorted by a field or a function descending. | `desc` | `SortDirection.DESCENDING` | -- Null ordering. It describes how to handle null value when ordering +- Null ordering. It describes how to handle null values when ordering -| Null ordering Type | Description | Json | Java | -|--------------------|-----------------------------------| ------------- | -------------------------- | -| null_first | Put null value in the first place | `nulls_first` | `NullOrdering.NULLS_FIRST` | -| null_last | Put null value in the last place | `nulls_last` | `NullOrdering.NULLS_LAST` | +| Null ordering Type | Description | JSON | Java | +|--------------------|-----------------------------------------| ------------- | -------------------------- | +| null_first | Puts the null value in the first place. | `nulls_first` | `NullOrdering.NULLS_FIRST` | +| null_last | Puts the null value in the last place. | `nulls_last` | `NullOrdering.NULLS_LAST` | Note: If the direction value is `ascending`, the default ordering value is `nulls_first` and if the direction value is `descending`, the default ordering value is `nulls_last`. -- Sort term. It shows which field or function should be used to sort the table, please refer to the `Expression type` in the bucketed table chapter. 
+- Sort term. It shows which field or function Gravitino uses to sort the table, please refer to the `Expression type` in the bucketed table section. @@ -164,7 +164,7 @@ SortOrders.of(FieldReferenceDTO.of("score"), SortDirection.ASCENDING, NullOrderi **Not all catalogs may support those features.**. Please refer to the related document for more details. ::: -The following is an example of creating a partitioned, bucketed table and sorted order table: +The following is an example of creating a partitioned, bucketed table, and sorted order table: From c0503f8fd942bdf923320d3acd93ed0eec5be8f9 Mon Sep 17 00:00:00 2001 From: yuqi Date: Tue, 2 Jan 2024 09:58:54 +0800 Subject: [PATCH 17/21] Fix Jerry's comments and format some code --- ...table-partitioning-bucketing-sort-order.md | 75 +++++++++---------- 1 file changed, 36 insertions(+), 39 deletions(-) diff --git a/docs/table-partitioning-bucketing-sort-order.md b/docs/table-partitioning-bucketing-sort-order.md index 171e05fbc9c..e12e5e1808c 100644 --- a/docs/table-partitioning-bucketing-sort-order.md +++ b/docs/table-partitioning-bucketing-sort-order.md @@ -9,8 +9,7 @@ license: Copyright 2023 Datastrato Pvt Ltd. This software is licensed under the import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; -## Partitioned table -As well as +## Table partitioning Currently, Gravitino supports the following partitioning strategies: :::note @@ -62,7 +61,7 @@ new Transform[] { -## Bucketed table +## Table bucketing - Strategy. It defines how Gravitino will distribute table data across partitions. @@ -113,7 +112,7 @@ new Transform[] { -## Sorted order table +## Sort ordering To define a sorted order table, you should use the following three components to construct a valid sorted order table. @@ -167,9 +166,9 @@ SortOrders.of(FieldReferenceDTO.of("score"), SortDirection.ASCENDING, NullOrderi The following is an example of creating a partitioned, bucketed table, and sorted order table: - + -```bash +```shell curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ -H "Content-Type: application/json" -d '{ "name": "table", @@ -239,49 +238,47 @@ curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ tableCatalog.createTable( NameIdentifier.of("metalake", "hive_catalog", "schema", "table"), new ColumnDTO[] { - ColumnDTO.builder() - .withComment("Id of the user") - .withName("id") - .withDataType(Types.IntegerType.get()) - .withNullable(true) - .build(), - ColumnDTO.builder() - .withComment("Name of the user") - .withName("name") - .withDataType(Types.VarCharType.of(1000)) - .withNullable(true) - .build(), - ColumnDTO.builder() - .withComment("Age of the user") - .withName("age") - .withDataType(Types.ShortType.get()) - .withNullable(true) - .build(), - - ColumnDTO.builder() - .withComment("Score of the user") - .withName("score") - .withDataType(Types.DoubleType.get()) - .withNullable(true) - .build(), + ColumnDTO.builder() + .withComment("Id of the user") + .withName("id") + .withDataType(Types.IntegerType.get()) + .withNullable(true) + .build(), + ColumnDTO.builder() + .withComment("Name of the user") + .withName("name") + .withDataType(Types.VarCharType.of(1000)) + .withNullable(true) + .build(), + ColumnDTO.builder() + .withComment("Age of the user") + .withName("age") + .withDataType(Types.ShortType.get()) + .withNullable(true) + .build(), + ColumnDTO.builder() + .withComment("Score of the user") + .withName("score") + .withDataType(Types.DoubleType.get()) + .withNullable(true) + .build(), }, "Create a new Table", 
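    // tablePropertiesMap is assumed to be defined elsewhere, for example:
    // Map<String, String> tablePropertiesMap = Collections.singletonMap("format", "ORC");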
tablePropertiesMap, new Transform[] { // Partition by id - Transforms.identity("score") + Transforms.identity("score") }, // CLUSTERED BY id new DistributionDTO.Builder() - .withStrategy(Strategy.HASH) - .withNumber(4) - .withArgs(FieldReferenceDTO.of("id")) - .build(), + .withStrategy(Strategy.HASH) + .withNumber(4) + .withArgs(FieldReferenceDTO.of("id")) + .build(), // SORTED BY name asc new SortOrderDTO[] { - SortOrders.of(FieldReferenceDTO.of("score"), SortDirection.ASCENDING, NullOrdering.NULLS_LAST) - } - ); + SortOrders.of(FieldReferenceDTO.of("score"), SortDirection.ASCENDING, NullOrdering.NULLS_LAST) + }); ``` From b993c01f3978209d528bd0159901ce41ec681145 Mon Sep 17 00:00:00 2001 From: yuqi Date: Tue, 2 Jan 2024 10:26:50 +0800 Subject: [PATCH 18/21] Polish docs again --- docs/manage-metadata-using-gravitino.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/manage-metadata-using-gravitino.md b/docs/manage-metadata-using-gravitino.md index 85947a1550c..fdc12b269f8 100644 --- a/docs/manage-metadata-using-gravitino.md +++ b/docs/manage-metadata-using-gravitino.md @@ -730,11 +730,11 @@ The following is the table property that Gravitino supports: In addition to the basic settings, Gravitino supports the following features: -| Feature | Description | Java doc | -|---------------------|----------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------| -| Partitioned table | Equal to `PARTITION BY` in Apache Hive and other engine that support partitioning. | [Partition](pathname:///docs/0.3.0/api/java/com/datastrato/gravitino/dto/rel/partitions/Partitioning.html) | -| Bucketed table | Equal to `CLUSTERED BY` in Apache Hive, some engine may use different words to describe it. | [Distribution](pathname:///docs/0.3.0/api/java/com/datastrato/gravitino/rel/expressions/distributions/Distribution.html) | -| Sorted order table | Equal to `SORTED BY` in Apache Hive, some engine may use different words to describe it. | [SortOrder](pathname:///docs/0.3.0/api/java/com/datastrato/gravitino/rel/expressions/sorts/SortOrder.html) | +| Feature | Description | Java doc | +|---------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------| +| Table partitioning | Equal to `PARTITION BY` in Apache Hive, It is a partitioning strategy that is used to split a table into parts based on partition keys. Some table engine may not support this feature | [Partition](pathname:///docs/0.3.0/api/java/com/datastrato/gravitino/dto/rel/partitions/Partitioning.html) | +| Table bucketing | Equal to `CLUSTERED BY` in Apache Hive, Bucketing a.k.a (Clustering) is a technique to split the data into more manageable files/parts, (By specifying the number of buckets to create). The value of the bucketing column will be hashed by a user-defined number into buckets. 
| [Distribution](pathname:///docs/0.3.0/api/java/com/datastrato/gravitino/rel/expressions/distributions/Distribution.html) | +| Table sort ordering | Equal to `SORTED BY` in Apache Hive, sort ordering is a method to sort the data by specific ways such as by a column or a function and then store table data. it will highly improve the query performance under certain scenarios. | [SortOrder](pathname:///docs/0.3.0/api/java/com/datastrato/gravitino/rel/expressions/sorts/SortOrder.html) | For more information, please see the related document on [partitioning, bucketing, and sorting](table-partitioning-bucketing-sort-order.md). From a266e9594bd90150969ead06764c9b21ae9915e6 Mon Sep 17 00:00:00 2001 From: yuqi Date: Tue, 2 Jan 2024 14:16:41 +0800 Subject: [PATCH 19/21] 1. Add the necessary messages needed by table partitioning 2. Change the language of some code blocks from `bash` to `shell` --- docs/manage-metadata-using-gravitino.md | 80 +++++++++---------- ...table-partitioning-bucketing-sort-order.md | 13 ++- 2 files changed, 51 insertions(+), 42 deletions(-) diff --git a/docs/manage-metadata-using-gravitino.md b/docs/manage-metadata-using-gravitino.md index fdc12b269f8..0dd7e38158d 100644 --- a/docs/manage-metadata-using-gravitino.md +++ b/docs/manage-metadata-using-gravitino.md @@ -31,9 +31,9 @@ You can create a metalake by sending a `POST` request to the `/api/metalakes` en The following is an example of creating a metalake: - + -```bash +```shell curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ -H "Content-Type: application/json" -d '{"name":"metalake","comment":"comment","properties":{}}' \ http://localhost:8090/api/metalakes @@ -61,9 +61,9 @@ GravitinoMetaLake newMetalake = gravitinoClient.createMetalake( You can create a metalake by sending a `GET` request to the `/api/metalakes/{metalake_name}` endpoint or just use the Gravitino Java client. The following is an example of loading a metalake: - + -```bash +```shell curl -X GET -H "Accept: application/vnd.gravitino.v1+json" \ -H "Content-Type: application/json" http://localhost:8090/api/metalakes/metalake ``` @@ -86,9 +86,9 @@ GravitinoMetaLake loaded = gravitinoClient.loadMetalake( You can modify a metalake by sending a `PUT` request to the `/api/metalakes/{metalake_name}` endpoint or just use the Gravitino Java client. The following is an example of altering a metalake: - + -```bash +```shell curl -X PUT -H "Accept: application/vnd.gravitino.v1+json" \ -H "Content-Type: application/json" -d '{ "updates": [ @@ -136,9 +136,9 @@ Currently, Gravitino supports the following changes to a metalake: You can remove a metalake by sending a `DELETE` request to the `/api/metalakes/{metalake_name}` endpoint or just use the Gravitino Java client. The following is an example of dropping a metalake: - + -```bash +```shell curl -X DELETE -H "Accept: application/vnd.gravitino.v1+json" \ -H "Content-Type: application/json" http://localhost:8090/api/metalakes/metalake ``` @@ -166,9 +166,9 @@ Drop a metalake only removes metadata about the metalake and catalogs, schemas, You can list metalakes by sending a `GET` request to the `/api/metalakes` endpoint or just use the Gravitino Java client. The following is an example of listing all metalake name: - + -```bash +```shell curl -X GET -H "Accept: application/vnd.gravitino.v1+json" \ -H "Content-Type: application/json" http://localhost:8090/api/metalakes ``` @@ -198,9 +198,9 @@ The code below is an example of creating a Hive catalog. 
For other catalogs, the You can create a catalog by sending a `POST` request to the `/api/metalakes/{metalake_name}/catalogs` endpoint or just use the Gravitino Java client. The following is an example of creating a catalog: - + -```bash +```shell curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ -H "Content-Type: application/json" -d '{ "name": "catalog", @@ -256,9 +256,9 @@ Currently, Gravitino supports the following catalog providers: You can load a catalog by sending a `GET` request to the `/api/metalakes/{metalake_name}/catalogs/{catalog_name}` endpoint or just use the Gravitino Java client. The following is an example of loading a catalog: - + -```bash +```shell curl -X GET -H "Accept: application/vnd.gravitino.v1+json" \ -H "Content-Type: application/json" http://localhost:8090/api/metalakes/metalake/catalogs/catalog ``` @@ -284,9 +284,9 @@ Catalog catalog = gravitinoMetaLake.loadCatalog(NameIdentifier.of("metalake", "c You can modify a catalog by sending a `PUT` request to the `/api/metalakes/{metalake_name}/catalogs/{catalog_name}` endpoint or just use the Gravitino Java client. The following is an example of altering a catalog: - + -```bash +```shell curl -X PUT -H "Accept: application/vnd.gravitino.v1+json" \ -H "Content-Type: application/json" -d '{ "updates": [ @@ -334,9 +334,9 @@ Currently, Gravitino supports the following changes to a catalog: You can remove a catalog by sending a `DELETE` request to the `/api/metalakes/{metalake_name}/catalogs/{catalog_name}` endpoint or just use the Gravitino Java client. The following is an example of dropping a catalog: - + -```bash +```shell curl -X DELETE -H "Accept: application/vnd.gravitino.v1+json" \ -H "Content-Type: application/json" \ http://localhost:8090/api/metalakes/metalake/catalogs/catalog @@ -368,9 +368,9 @@ You can list all catalogs under a metalake by sending a `GET` request to the `/a a metalake: - + -```bash +```shell curl -X GET -H "Accept: application/vnd.gravitino.v1+json" \ -H "Content-Type: application/json" \ http://localhost:8090/api/metalakes/metalake/catalogs @@ -403,9 +403,9 @@ Users should create a metalake and a catalog before creating a schema. You can create a schema by sending a `POST` request to the `/api/metalakes/{metalake_name}/catalogs/{catalog_name}/schemas` endpoint or just use the Gravitino Java client. The following is an example of creating a schema: - + -```bash +```shell curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ -H "Content-Type: application/json" -d '{ "name": "schema", @@ -460,9 +460,9 @@ Currently, Gravitino supports the following schema property: You can create a schema by sending a `GET` request to the `/api/metalakes/{metalake_name}/catalogs/{catalog_name}/schemas/{schema_name}` endpoint or just use the Gravitino Java client. The following is an example of loading a schema: - + -```bash +```shell curl -X GET \-H "Accept: application/vnd.gravitino.v1+json" \ -H "Content-Type: application/json" \ http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema @@ -488,9 +488,9 @@ Schema schema = supportsSchemas.loadSchema(NameIdentifier.of("metalake", "catalo You can change a schema by sending a `PUT` request to the `/api/metalakes/{metalake_name}/catalogs/{catalog_name}/schemas/{schema_name}` endpoint or just use the Gravitino Java client. 
The following is an example of modifying a schema: - + -```bash +```shell curl -X PUT -H "Accept: application/vnd.gravitino.v1+json" \ -H "Content-Type: application/json" -d '{ "updates": [ @@ -536,9 +536,9 @@ Currently, Gravitino supports the following changes to a schema: You can remove a schema by sending a `DELETE` request to the `/api/metalakes/{metalake_name}/catalogs/{catalog_name}/schemas/{schema_name}` endpoint or just use the Gravitino Java client. The following is an example of dropping a schema: - + -```bash +```shell // cascade can be true or false curl -X DELETE -H "Accept: application/vnd.gravitino.v1+json" \ -H "Content-Type: application/json" \ @@ -571,9 +571,9 @@ You can alter all schemas under a catalog by sending a `GET` request to the `/ap - + -```bash +```shell curl -X GET -H "Accept: application/vnd.gravitino.v1+json" \ -H "Content-Type: application/json" http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas ``` @@ -604,9 +604,9 @@ Users should create a metalake, a catalog and a schema before creating a table. You can create a table by sending a `POST` request to the `/api/metalakes/{metalake_name}/catalogs/{catalog_name}/schemas/{schema_name}/tables` endpoint or just use the Gravitino Java client. The following is an example of creating a table: - + -```bash +```shell curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ -H "Content-Type: application/json" -d '{ "name": "table", @@ -748,9 +748,9 @@ The code above is an example of creating a Hive table. For other catalogs, the c You can load a table by sending a `GET` request to the `/api/metalakes/{metalake_name}/catalogs/{catalog_name}/schemas/{schema_name}/tables/{table_name}` endpoint or just use the Gravitino Java client. The following is an example of loading a table: - + -```bash +```shell curl -X GET -H "Accept: application/vnd.gravitino.v1+json" \ -H "Content-Type: application/json" \ http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/tables/table @@ -777,9 +777,9 @@ tableCatalog.loadTable(NameIdentifier.of("metalake", "hive_catalog", "schema", " You can modify a table by sending a `PUT` request to the `/api/metalakes/{metalake_name}/catalogs/{catalog_name}/schemas/{schema_name}/tables/{table_name}` endpoint or just use the Gravitino Java client. The following is an example of modifying a table: - + -```bash +```shell curl -X PUT -H "Accept: application/vnd.gravitino.v1+json" \ -H "Content-Type: application/json" -d '{ "updates": [ @@ -834,9 +834,9 @@ Currently, Gravitino supports the following changes to a table: You can remove a table by sending a `DELETE` request to the `/api/metalakes/{metalake_name}/catalogs/{catalog_name}/schemas/{schema_name}/tables/{table_name}` endpoint or just use the Gravitino Java client. The following is an example of dropping a table: - + -```bash +```shell ## purge can be true or false, if purge is true, Gravitino will remove the data of the table. curl -X DELETE -H "Accept: application/vnd.gravitino.v1+json" \ @@ -873,9 +873,9 @@ Apache Hive support both, `dropTable` will only remove the metadata of a table a You can list all tables in a schema by sending a `GET` request to the `/api/metalakes/{metalake_name}/catalogs/{catalog_name}/schemas/{schema_name}/tables` endpoint or just use the Gravitino Java client. 
The following is an example of list all tables in a schema: - + -```bash +```shell curl -X GET -H "Accept: application/vnd.gravitino.v1+json" \ -H "Content-Type: application/json" \ http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/tables diff --git a/docs/table-partitioning-bucketing-sort-order.md b/docs/table-partitioning-bucketing-sort-order.md index e12e5e1808c..cd0263db0cb 100644 --- a/docs/table-partitioning-bucketing-sort-order.md +++ b/docs/table-partitioning-bucketing-sort-order.md @@ -10,7 +10,10 @@ import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; ## Table partitioning -Currently, Gravitino supports the following partitioning strategies: + +To create a partitioned table, you should provide the following two components to construct a valid partitioned table. + +- Partitioning strategy. It defines how Gravitino will distribute table data across partitions. currently Gravitino supports the following partitioning strategies. :::note The `score`, `createTime`, and `city` appearing in the table below refer to the field names in a table. @@ -31,6 +34,10 @@ The `score`, `createTime`, and `city` appearing in the table below refer to the As well as the strategies mentioned before, you can use other functions strategies to partition the table, for example, the strategy can be `{"strategy":"functionName","fieldName":["score"]}`. The `functionName` can be any function name that you can use in SQL, for example, `{"strategy":"toDate","fieldName":["createTime"]}` is equivalent to `PARTITION BY toDate(createTime)` in SQL. For complex functions, please refer to [FunctionPartitioningDTO](https://github.com/datastrato/gravitino/blob/main/common/src/main/java/com/datastrato/gravitino/dto/rel/partitions/FunctionPartitioningDTO.java). +- Field names: It defines which fields Gravitino uses to partition the table. + +- Other messages may also be needed. For example, if the partitioning strategy is `bucket`, you should provide the number of buckets; if the partitioning strategy is `truncate`, you should provide the width of the truncate. + The following is an example of creating a partitioned table: @@ -63,6 +70,8 @@ new Transform[] { ## Table bucketing +To create a bucketed table, you should use the following three components to construct a valid bucketed table. + - Strategy. It defines how Gravitino will distribute table data across partitions. | Bucket strategy | Description | JSON | Java | @@ -132,7 +141,7 @@ To define a sorted order table, you should use the following three components to Note: If the direction value is `ascending`, the default ordering value is `nulls_first` and if the direction value is `descending`, the default ordering value is `nulls_last`. -- Sort term. It shows which field or function Gravitino uses to sort the table, please refer to the `Expression type` in the bucketed table section. +- Sort term. It shows which field or function Gravitino uses to sort the table, please refer to the `Function arguments` in the table bucketing section. 
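As a concrete illustration of the extra parameters mentioned above (a bucket count for the `bucket` strategy, a width for the `truncate` strategy), the following is a minimal Java sketch. It assumes the `Transforms.identity`, `Transforms.bucket`, and `Transforms.truncate` factory methods with the argument order shown in the partitioning strategy table; the field names and numbers are only examples.

```java
// Minimal sketch (field names and numbers are illustrative):
// strategies that need an extra parameter take it as the first argument.
new Transform[] {
    Transforms.identity("city"),     // PARTITION BY city              -- no extra parameter
    Transforms.bucket(10, "score"),  // PARTITION BY bucket(10, score) -- number of buckets
    Transforms.truncate(20, "name")  // PARTITION BY truncate(20, name) -- truncate width
}
```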
From cc5c45439da7139111f2dc2caf1964c86dfeb036 Mon Sep 17 00:00:00 2001 From: yuqi Date: Tue, 2 Jan 2024 15:43:03 +0800 Subject: [PATCH 20/21] Change to use api method --- docs/table-partitioning-bucketing-sort-order.md | 17 +++++------------ 1 file changed, 5 insertions(+), 12 deletions(-) diff --git a/docs/table-partitioning-bucketing-sort-order.md b/docs/table-partitioning-bucketing-sort-order.md index cd0263db0cb..426ce7b81f6 100644 --- a/docs/table-partitioning-bucketing-sort-order.md +++ b/docs/table-partitioning-bucketing-sort-order.md @@ -110,11 +110,7 @@ To create a bucketed table, you should use the following three components to con ```java - new DistributionDTO.Builder() - .withStrategy(Strategy.HASH) - .withNumber(4) - .withArgs(FieldReferenceDTO.of("score")) - .build() +Distributions.of(Strategy.HASH, 4, NamedReference.field("score")); ``` @@ -279,14 +275,11 @@ tableCatalog.createTable( Transforms.identity("score") }, // CLUSTERED BY id - new DistributionDTO.Builder() - .withStrategy(Strategy.HASH) - .withNumber(4) - .withArgs(FieldReferenceDTO.of("id")) - .build(), + Distributions.of(Strategy.HASH, 4, NamedReference.field("id")),, // SORTED BY name asc - new SortOrderDTO[] { - SortOrders.of(FieldReferenceDTO.of("score"), SortDirection.ASCENDING, NullOrdering.NULLS_LAST) + new SortOrder[] { + SortOrders.of( + NamedReference.field("age"), SortDirection.ASCENDING, NullOrdering.NULLS_LAST), }); ``` From 983dbab99a33275280155bfb3ffc7907e5b73e8f Mon Sep 17 00:00:00 2001 From: Jerry Shao Date: Tue, 2 Jan 2024 22:41:10 +0800 Subject: [PATCH 21/21] Update table-partitioning-bucketing-sort-order.md --- docs/table-partitioning-bucketing-sort-order.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/table-partitioning-bucketing-sort-order.md b/docs/table-partitioning-bucketing-sort-order.md index 426ce7b81f6..638e32f9a73 100644 --- a/docs/table-partitioning-bucketing-sort-order.md +++ b/docs/table-partitioning-bucketing-sort-order.md @@ -1,5 +1,5 @@ --- -title: "Table partitioning, bucketing and sorting order" +title: "Table partitioning, bucketing and sort ordering" slug: /table-partitioning-bucketing-sort-order date: 2023-12-25 keyword: Table Partition Bucket Distribute Sort By @@ -284,4 +284,4 @@ tableCatalog.createTable( ``` - \ No newline at end of file +
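
For reference, the factory methods introduced in the last two patches compose as follows. This is a hedged sketch with illustrative field names and bucket count, assuming the `Distributions`, `SortOrders`, and `NamedReference` APIs exactly as they appear in the diff above.

```java
// Sketch only: distribution and sort order built with the factory-method style API.
Distribution distribution =
    Distributions.of(Strategy.HASH, 4, NamedReference.field("id"));   // CLUSTERED BY id INTO 4 BUCKETS

SortOrder[] sortOrders = new SortOrder[] {
    SortOrders.of(
        NamedReference.field("age"), SortDirection.ASCENDING, NullOrdering.NULLS_LAST) // SORTED BY age ASC NULLS LAST
};
```

Both values can then be passed to `createTable` in place of the builder-style `DistributionDTO` and `SortOrderDTO` arguments, as the patch above does.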