(glue): add support for partition indexes #17589
Comments
This isn't something that can be done via CloudFormation, right? If it can, please link me to the corresponding CloudFormation resource, as I did not find any. It can still be done in the CDK, though it requires use of a
No, I could not find anything in the CloudFormation docs. Is there a good tutorial for doing that via a custom resource?
To unblock yourself, take a look at the documentation for This particular issue is on my radar, and if we're lucky I'll be able to submit a PR for it this week.
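For readers following the custom resource route discussed above: whichever construct is used, it ends up calling Glue's `CreatePartitionIndex` API. The sketch below only builds the request payload for that call, so it runs without the CDK or the AWS SDK. The helper name `buildCreatePartitionIndexParams` and its checks are hypothetical and not part of any library; the field names (`DatabaseName`, `TableName`, `PartitionIndex.IndexName`, `PartitionIndex.Keys`) follow the Glue API.

```ts
// Request shape for Glue's CreatePartitionIndex API (required fields only).
interface CreatePartitionIndexParams {
  DatabaseName: string;
  TableName: string;
  PartitionIndex: {
    IndexName: string;
    Keys: string[];
  };
}

// Hypothetical helper (not a CDK or SDK function): assembles the payload a
// custom resource would send to Glue when creating one partition index.
function buildCreatePartitionIndexParams(
  databaseName: string,
  tableName: string,
  indexName: string,
  keys: string[],
): CreatePartitionIndexParams {
  if (keys.length === 0) {
    throw new Error('A partition index needs at least one key');
  }
  return {
    DatabaseName: databaseName,
    TableName: tableName,
    PartitionIndex: {
      IndexName: indexName,
      Keys: [...keys], // copy so later mutation of the input does not leak in
    },
  };
}

// Example payload for an index over the `month` partition key.
const exampleParams = buildCreatePartitionIndexParams('my_database', 'table', 'my-index', ['month']);
console.log(JSON.stringify(exampleParams, null, 2));
```

A custom resource would pass an object like `exampleParams` as the parameters of its create call, and it also needs IAM permissions on the table, the database, and the catalog.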
This PR adds support for creating partition indexes on tables via custom resources. It offers two different ways to create indexes:

```ts
// via table definition
const table = new glue.Table(this, 'Table', {
  database,
  bucket,
  tableName: 'table',
  columns,
  partitionKeys,
  partitionIndexes: [{
    indexName: 'my-index',
    keyNames: ['month'],
  }],
  dataFormat: glue.DataFormat.CSV,
});
```

```ts
// or as a function
table.addPartitionIndex({
  indexName: 'my-other-index',
  keyNames: ['month', 'year'],
});
```

I also refactored the format of some tests, which accounts for the large diff in `test.table.ts`.

Motivation:

Creating partition indexes on a table is something you can do via the console, but it is not an exposed property in CloudFormation. In this case, I think it makes sense to support this feature via custom resources, as it will significantly reduce the customer pain of either provisioning a custom resource with correct permissions or manually going into the console after resource creation. Supporting this feature allows for synth-time checks and dependency chaining for multiple indexes (reason detailed in the FAQ), which removes a rather sharp edge for users provisioning custom resource indexes themselves.

FAQ:

Why do we need to chain dependencies between different Partition Index Custom Resources?
- Because Glue only allows 1 index to be created or deleted simultaneously per table. Without dependencies, the resources will try to create partition indexes simultaneously and the second SDK call will be dropped.

Why is it called `partitionIndexes`? Is that really how you pluralize index?
- [Yesish](https://www.nasdaq.com/articles/indexes-or-indices-whats-the-deal-2016-05-12). If you hate it, it can be `partitionIndices`.

Why is `keyNames` of type `string[]` and not `Column[]`? `partitionKeys` is of type `Column[]` and partition indexes must be a subset of partition keys...
- This could be a debate. But my argument is that the pattern I see for defining a Table is to define partition keys inline and not declare them each as variables. It would be pretty clunky from a UX perspective:

```ts
const key1 = { name: 'mykey', type: glue.Schema.STRING };
const key2 = { name: 'mykey2', type: glue.Schema.STRING };
const key3 = { name: 'mykey3', type: glue.Schema.STRING };
new glue.Table(this, 'table', {
  database,
  bucket,
  tableName: 'table',
  columns,
  partitionKeys: [key1, key2, key3],
  partitionIndexes: [key1, key2],
  dataFormat: glue.DataFormat.CSV,
});
```

Why are there 2 different checks for having > 3 partition indexes?
- It's possible someone decides to define 3 indexes in the definition and then tries to add another with `table.addPartitionIndex()`. This would be a nasty deploy-time error; it's better if it fails at synth time. It's also possible someone decides to define 4 indexes in the definition. It's better to fail fast here before we create 3 custom resources.

What if I deploy a table, manually add 3 partition indexes, and then try to call `table.addPartitionIndex()` and update the stack? Will that still be a synth-time failure?
- Sorry, no.

Why do we need to generate names?
- We don't. I just thought it would be helpful.

Why is `grantToUnderlyingResources` public?
- I thought it would be helpful. Some permissions need to be added to the table, the database, and the catalog.

Closes #17589.

----

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
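To make the synth-time checks and dependency chaining from the FAQ concrete, here is a minimal, self-contained sketch. It is not the code from this PR: `planPartitionIndexes`, `PartitionIndexSpec`, `PlannedIndex`, and the `index<N>` naming scheme are hypothetical names chosen for illustration. It assumes the rules stated above: at most 3 indexes per table, every key name must be one of the table's partition keys, and each index waits for the previous one because Glue handles only one index operation per table at a time.

```ts
// Hypothetical input shape, loosely mirroring the `partitionIndexes` prop above.
interface PartitionIndexSpec {
  indexName?: string;
  keyNames: string[];
}

// Hypothetical output: one entry per index, chained to its predecessor.
interface PlannedIndex {
  indexName: string;
  keyNames: string[];
  dependsOn: string | undefined;
}

const MAX_PARTITION_INDEXES = 3; // limit referenced in the FAQ

function planPartitionIndexes(
  partitionKeyNames: string[],
  specs: PartitionIndexSpec[],
): PlannedIndex[] {
  // Fail fast on the total count before planning any individual index.
  if (specs.length > MAX_PARTITION_INDEXES) {
    throw new Error(
      `At most ${MAX_PARTITION_INDEXES} partition indexes are allowed, got ${specs.length}`,
    );
  }

  const planned: PlannedIndex[] = [];
  for (const spec of specs) {
    if (spec.keyNames.length === 0) {
      throw new Error('A partition index needs at least one key name');
    }
    // Index keys must be a subset of the table's partition keys.
    const unknown = spec.keyNames.filter((k) => !partitionKeyNames.includes(k));
    if (unknown.length > 0) {
      throw new Error(`Not partition keys of this table: ${unknown.join(', ')}`);
    }
    const previous = planned[planned.length - 1];
    planned.push({
      // Generated fallback name; the naming scheme here is an assumption.
      indexName: spec.indexName ?? `index${planned.length + 1}`,
      keyNames: [...spec.keyNames],
      // Chain onto the previous index so creations run one at a time.
      dependsOn: previous?.indexName,
    });
  }
  return planned;
}

// Two indexes: the second waits for the first.
const examplePlan = planPartitionIndexes(['year', 'month'], [
  { indexName: 'my-index', keyNames: ['month'] },
  { indexName: 'my-other-index', keyNames: ['month', 'year'] },
]);
console.log(examplePlan.map((p) => `${p.indexName} <- ${p.dependsOn ?? '(none)'}`));
```

In a real construct, `dependsOn` would translate into a dependency between the custom resources. As the FAQ notes, none of this can detect indexes that were added manually outside the stack.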
Description
Glue supports partition indexes on tables to speed up queries, but the CDK has no support for creating them. I can create them manually via the AWS CLI or in the console afterwards, but I would prefer to manage them via the CDK.
Use Case
Managing Glue tables via the CDK and benefiting from increased query speed due to partition indexes.
Proposed Solution
N/A
Other information
No response