Skip to content

Commit

Permalink
[#1859] docs: Add user doc for partition management (#2029)
Browse files Browse the repository at this point in the history
### What changes were proposed in this pull request?

Add user doc for partition management

### Why are the changes needed?

Fix: #1859 

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

local tested
  • Loading branch information
mchades authored Feb 4, 2024
1 parent 941eeeb commit ad25e8c
Show file tree
Hide file tree
Showing 2 changed files with 381 additions and 0 deletions.
379 changes: 379 additions & 0 deletions docs/manage-table-partition-using-gravitino.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,379 @@
---
title: "Manage table partition using Gravitino"
slug: /manage-table-partition-using-gravitino
date: 2024-02-03
keyword: table partition management
license: Copyright 2024 Datastrato Pvt Ltd. This software is licensed under the Apache License version 2.
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

## Introduction

Although many catalogs inherently manage partitions automatically, there are scenarios where manual partition management is necessary. Usage scenarios like managing the TTL (Time-To-Live) of partition data, gathering statistics on partition metadata, and optimizing queries through partition pruning. For these reasons, Gravitino provides capabilities of partition management.

### Requirements and limitations

- Partition management is based on the partitioned table, so please ensure that you are operating on a partitioned table.

The following table shows the partition operations supported across various catalogs in Gravitino:

| Operation | Hive catalog | Iceberg catalog | Jdbc-Mysql catalog | Jdbc-PostgreSQL catalog |
|-----------------------|--------------------------------------------------------------------|--------------------------------------------------------------------|--------------------|-------------------------|
| Add Partition | YES | NO | NO | NO |
| Get Partition by Name | YES | NO | NO | NO |
| List Partition Names | YES | NO | NO | NO |
| List Partitions | YES | NO | NO | NO |
| Drop Partition | [Coming Soon](https://github.com/datastrato/gravitino/issues/1655) | [Coming Soon](https://github.com/datastrato/gravitino/issues/1655) | NO | NO |

:::tip[WELCOME FEEDBACK]
If you need additional partition management support for a specific catalog, please feel free to [create an issue](https://github.com/datastrato/gravitino/issues/new/choose) on the [Gravitino repository](https://github.com/datastrato/gravitino).
:::

## Partition operations

### Add partition

You must match the partition types you want to add with the table's [partitioning](./table-partitioning-bucketing-sort-order-indexes.md#table-partitioning) types; Gravitino currently supports adding the following partition types:

| Partition Type | Description |
|----------------|------------------------------------------------------------------------------------------------------------------------------------------------|
| identity | An identity partition represents a result of identity [partitioning](./table-partitioning-bucketing-sort-order-indexes.md#table-partitioning). |
| range | A range partition represents a result of range [partitioning](./table-partitioning-bucketing-sort-order-indexes.md#table-partitioning). |
| list | A list partition represents a result of list [partitioning](./table-partitioning-bucketing-sort-order-indexes.md#table-partitioning). |

For JSON examples:

<Tabs groupId="partitions">
<TabItem value="identity" label="identity">

```json
{
"type": "identity",
"name": "dt=2008-08-08/country=us",
"fieldNames": [
[
"dt"
],
[
"country"
]
],
"values": [
{
"type": "literal",
"dataType": "date",
"value": "2008-08-08"
},
{
"type": "literal",
"dataType": "string",
"value": "us"
}
]
}
```

:::note
The values of the field `values` must be the same ordering as the values of `fieldNames`.

When adding an identity partition to a partitioned Hive table, the specified partition name is ignored. This is because Hive generates the partition name based on field names and values.
:::

</TabItem>
<TabItem value="range" label="range">

```json
{
"type": "range",
"name": "p20200321",
"upper": {
"type": "literal",
"dataType": "date",
"value": "2020-03-21"
},
"lower": {
"type": "literal",
"dataType": "null",
"value": "null"
}
}
```

</TabItem>
<TabItem value="list" label="list">

```json
{
"type": "list",
"name": "p202204_California",
"lists": [
[
{
"type": "literal",
"dataType": "date",
"value": "2022-04-01"
},
{
"type": "literal",
"dataType": "string",
"value": "Los Angeles"
}
],
[
{
"type": "literal",
"dataType": "date",
"value": "2022-04-01"
},
{
"type": "literal",
"dataType": "string",
"value": "San Francisco"
}
]
]
}
```

:::note
Each list in the lists must have the same length. The values in each list must correspond to the field definitions in the list [partitioning](./table-partitioning-bucketing-sort-order-indexes.md#table-partitioning).
:::

</TabItem>
</Tabs>

For Java examples:

<Tabs groupId="partitions">
<TabItem value="identity" label="Identity">

```java
Partition partition =
Partitions.identity(
"dt=2008-08-08/country=us",
new String[][] {{"dt"}, {"country"}},
new Literal[] {
Literals.dateLiteral(LocalDate.parse("2008-08-08")), Literals.stringLiteral("us")
},
Maps.newHashMap());
```

:::note
The values are in the same order as the field names.

When adding an identity partition to a partitioned Hive table, the specified partition name is ignored. This is because Hive generates the partition name based on field names and values.
:::

</TabItem>
<TabItem value="range" label="Range">

```java
Partition partition =
Partitions.range(
"p20200321",
Literals.dateLiteral(LocalDate.parse("2020-03-21")),
Literals.NULL,
Maps.newHashMap());
```

</TabItem>

<TabItem value="list" label="List">

```java
Partition partition =
Partitions.list(
"p202204_California",
new Literal[][] {
{
Literals.dateLiteral(LocalDate.parse("2022-04-01")),
Literals.stringLiteral("Los Angeles")
},
{
Literals.dateLiteral(LocalDate.parse("2022-04-01")),
Literals.stringLiteral("San Francisco")
}
},
Maps.newHashMap());
```

:::note
Each list in the lists must have the same length. The values in each list must correspond to the field definitions in the list [partitioning](./table-partitioning-bucketing-sort-order-indexes.md#table-partitioning).
:::

</TabItem>
</Tabs>

You can add a partition to a partitioned table by sending a `POST` request to the `/api/metalakes/{metalake_name}/catalogs/{catalog_name}/schemas/{schema_name}/tables/{partitioned_table_name}/partitions` endpoint or by using the Gravitino Java client.
The following is an example of adding a identity partition to a Hive partitioned table:

<Tabs>
<TabItem value="shell" label="Shell">

```shell
curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" -d '{
"partitions": [
{
"type": "identity",
"fieldNames": [
[
"dt"
],
[
"country"
]
],
"values": [
{
"type": "literal",
"dataType": "date",
"value": "2008-08-08"
},
{
"type": "literal",
"dataType": "string",
"value": "us"
}
]
}
]
}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/tables/table/partitions
```

</TabItem>
<TabItem value="java" label="Java">

```java
GravitinoClient gravitinoClient = GravitinoClient
.builder("http://127.0.0.1:8090")
.build();

// Assume that you have a partitioned table named "metalake.catalog.schema.table".
Partition addedPartition =
gravitinoClient
.loadMetalake(NameIdentifier.of("metalake"))
.loadCatalog(NameIdentifier.of("metalake", "catalog"))
.asTableCatalog()
.loadTable(NameIdentifier.of("metalake", "catalog", "schema", "table"))
.supportPartitions()
.addPartition(
Partitions.identity(
new String[][] {{"dt"}, {"country"}},
new Literal[] {
Literals.dateLiteral(LocalDate.parse("2008-08-08")), Literals.stringLiteral("us")},
Maps.newHashMap()));
```

</TabItem>
</Tabs>

### Get a partition by name

You can get a partition by its name via sending a `GET` request to the `/api/metalakes/{metalake_name}/catalogs/{catalog_name}/schemas/{schema_name}/tables/{partitioned_table_name}/partitions/{partition_name}` endpoint or by using the Gravitino Java client.
The following is an example of getting a partition by its name:

<Tabs>
<TabItem value="shell" label="Shell">

```shell
curl -X GET -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" \
http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/tables/table/partitions/p20200321
```

:::tip
If the partition name contains special characters, you should use [URL encoding](https://en.wikipedia.org/wiki/Percent-encoding#Reserved_characters). For example, if the partition name is `dt=2008-08-08/country=us` you should use `dt%3D2008-08-08%2Fcountry%3Dus` in the URL.
:::

</TabItem>
<TabItem value="java" label="Java">

```java
GravitinoClient gravitinoClient = GravitinoClient
.builder("http://127.0.0.1:8090")
.build();

// Assume that you have a partitioned table named "metalake.catalog.schema.table".
Partition Partition =
gravitinoClient
.loadMetalake(NameIdentifier.of("metalake"))
.loadCatalog(NameIdentifier.of("metalake", "catalog"))
.asTableCatalog()
.loadTable(NameIdentifier.of("metalake", "catalog", "schema", "table"))
.supportPartitions()
.getPartition("partition_name");
```

</TabItem>
</Tabs>

### List partition names under a partitioned table

You can list all partition names under a partitioned table by sending a `GET` request to the `/api/metalakes/{metalake_name}/catalogs/{catalog_name}/schemas/{schema_name}/tables/{partitioned_table_name}/partitions` endpoint or by using the Gravitino Java client.
The following is an example of listing all partition names under a partitioned table:

<Tabs>
<TabItem value="shell" label="Shell">

```shell
curl -X GET -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" \
http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/tables/table/partitions
```

</TabItem>
<TabItem value="java" label="Java">

```java
GravitinoClient gravitinoClient = GravitinoClient
.builder("http://127.0.0.1:8090")
.build();

// Assume that you have a partitioned table named "metalake.catalog.schema.table".
String[] partitionNames =
gravitinoClient
.loadMetalake(NameIdentifier.of("metalake"))
.loadCatalog(NameIdentifier.of("metalake", "catalog"))
.asTableCatalog()
.loadTable(NameIdentifier.of("metalake", "catalog", "schema", "table"))
.supportPartitions()
.listPartitionNames();
```

</TabItem>
</Tabs>

### List partitions under a partitioned table

If you want to get more detailed information about the partitions under a partitioned table, you can list all partitions under a partitioned table by sending a `GET` request to the `/api/metalakes/{metalake_name}/catalogs/{catalog_name}/schemas/{schema_name}/tables/{partitioned_table_name}/partitions` endpoint or by using the Gravitino Java client.
The following is an example of listing all partitions under a partitioned table:

<Tabs>
<TabItem value="shell" label="Shell">

```shell
curl -X GET -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" \
http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/tables/table/partitions?details=true
```

</TabItem>
<TabItem value="java" label="Java">

```java
// Assume that you have a partitioned table named "metalake.catalog.schema.table".
Partition[] partitions =
gravitinoClient
.loadMetalake(NameIdentifier.of("metalake"))
.loadCatalog(NameIdentifier.of("metalake", "catalog"))
.asTableCatalog()
.loadTable(NameIdentifier.of("metalake", "catalog", "schema", "table"))
.supportPartitions()
.listPartitions();
```

</TabItem>
</Tabs>
2 changes: 2 additions & 0 deletions docs/table-partitioning-bucketing-sort-order-indexes.md
Original file line number Diff line number Diff line change
Expand Up @@ -43,6 +43,8 @@ For function partitioning, you should provide the function name and the function

- In some cases, you require other information. For example, if the partitioning strategy is `bucket`, you should provide the number of buckets; if the partitioning strategy is `truncate`, you should provide the width of the truncate.

Once a partitioned table is created, you can [manage its partitions using Gravitino](./manage-table-partition-using-gravitino.md).

## Table bucketing

To create a bucketed table, you should use the following three components to construct a valid bucketed table.
Expand Down

0 comments on commit ad25e8c

Please sign in to comment.