Add Enumerated Data Types #4051

davisp · 2023-04-20T22:36:03Z

This PR adds enumerated data types. Enumerated data types work by
adding an Enumeration to the ArraySchema, setting an enumeration name on
an attribute, and then adding the attribute to the ArraySchema.

An Enumeration object contains a short list of options and a vector of
values. An attribute that has an enumeration name set must have an
integral type that is wide enough to index all of the enumerated values.

Changes to the values of an enumeration (any of adding, renaming, or
removing) can be accomplished via ArraySchemaEvolution.
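
To make the index-width requirement concrete, here is a minimal standalone sketch. It does not use the TileDB API: `EnumerationSketch`, `index_type_is_wide_enough`, and `decode` are hypothetical names invented for illustration, and the exact width rule TileDB enforces may differ from the one assumed here.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for an enumeration: a name plus its ordered values.
struct EnumerationSketch {
  std::string name;
  std::vector<std::string> values;
};

// True if every position in `enmr.values` is representable by index type T.
// Assumption: the largest index needed is values.size() - 1.
template <typename T>
bool index_type_is_wide_enough(const EnumerationSketch& enmr) {
  static_assert(
      std::numeric_limits<T>::is_integer, "index type must be integral");
  if (enmr.values.empty()) {
    return true;
  }
  return enmr.values.size() - 1 <=
         static_cast<std::uintmax_t>(std::numeric_limits<T>::max());
}

// Translate stored attribute cells (indices) back into enumeration values.
template <typename T>
std::vector<std::string> decode(
    const EnumerationSketch& enmr, const std::vector<T>& cells) {
  std::vector<std::string> out;
  out.reserve(cells.size());
  for (T cell : cells) {
    if (cell < 0 ||
        static_cast<std::uintmax_t>(cell) >= enmr.values.size()) {
      throw std::out_of_range("cell does not index an enumeration value");
    }
    out.push_back(enmr.values[static_cast<std::size_t>(cell)]);
  }
  return out;
}
```

With three values, a `uint8_t` attribute is wide enough; with 300 values it is not, and a `uint16_t` attribute would be needed instead.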

TYPE: FEATURE
DESC: Enumerated data types

davisp · 2023-04-20T22:41:25Z

@KiterLuc Could you take a look at the enumeration loading logic here:

https://github.com/TileDB-Inc/TileDB/pull/4051/files#diff-e19b0b9e093512958297fb53f9e1ac2e0f8ac7320c0ce5ea19794a67e3acc4f8R849-R874

(Link is to the changes in tiledb/sm/query/query.cc)

I'm pretty sure this will run on the cloud side of a remote query, so it should work. What I'm mostly unsure about is how it interacts with queries that have to be submitted multiple times before completion. I think it should work, but there may also be a better place for that logic.

tiledb/sm/c_api/tiledb_experimental.h

KiterLuc

Initial feedback while scanning to load everything into memory... I'll do another pass focussing more on storage format on the next revision.

test/src/unit-cppapi-enumerations.cc

tiledb/api/c_api/enumeration/enumeration_api.cc

KiterLuc · 2023-04-21T07:32:45Z

test/src/unit-cppapi-enumerations.cc

+TEST_CASE("C++ API: Enumeration creation fixed size", "[cppapi][enumeration]") {
+  TestData td;
+
+  std::vector<uint32_t> values = {1, 2, 3, 4, 5};


NIT: I really like how all test cases have these 'paragraphs'. Would it be possible to add a one-line comment to each so that it's easy to scan the file and see what each test case intends to do without reading the whole code? Later down the line it makes it easier to maintain the tests or scan for missing coverage.

These tests will probably change drastically. I just did the first thing that came to mind to test while I was developing. I like to write the implementation and then come back later with my brain in testing mode to write out a comprehensive test suite.

I've massively rewritten this test suite, so it's completely different now. I'm going to let you check and see if it's documented enough to your liking. The bulk of these tests have moved to unit-enumerations.cc, which uses the internal C++ APIs (as opposed to the API in tiledb/sm/cpp_api/). All of the tests are short and have descriptive names that say what's being tested.

tiledb/sm/cpp_api/array_schema_experimental.h

tiledb/sm/cpp_api/array_experimental.h

tiledb/sm/cpp_api/enumeration_experimental.h

tiledb/sm/array_schema/enumeration.h

tiledb/api/c_api/enumeration/enumeration_api.cc

tiledb/sm/array_schema/enumeration.h

davisp · 2023-05-09T22:50:55Z

@KiterLuc It took me longer than expected to address your earlier feedback about all the magic number conversions in query_ast.cc, so I didn't get nearly as far as I was planning today.

I've also not gone through and addressed all of your earlier comments around things like documentation. However, I'm pretty sure all of the magic numbers and things like UINT32_MAX instead of TILEDB_VAR_NUM have been fixed (though I haven't audited to be absolutely sure). Other than that, there are a couple of internal APIs that are not yet exposed through the C and CPP APIs.

Re-reading the storage stuff and then going through test/src/unit-enumerations.cc should hopefully be a pretty comprehensive overview of the core behavior where everything else is basically just bookkeeping and plumbing.

davisp · 2023-05-09T22:56:03Z

Also, here's my current coverage report as of a few seconds before posting this. N.B. Gists truncate after the first 1M bytes, so the last few diffs aren't included.

https://gistpreview.github.io/?95c129f102d34cda4f38e7ff9331b84e

scripts/generate-coverage-report.py

test/src/unit-cppapi-enumerations.cc

test/src/unit-misc-util-safe-integral-casts.cc

KiterLuc · 2023-05-15T14:19:05Z

format_spec/enumeration.md

+
+| **Field** | **Type** | **Description** |
+| :--- | :--- | :--- |
+| Version number | `uint32_t` | Format version number of the generic tile |


Is this accurate? The format version of the generic tile should be in the generic tile header... So I'm not sure why it would be included here again?

This was implemented quite poorly before by attempting to re-use the format_version as the enumerations version. Since then I've split the naming and versioning so that things are more clear.

You were quite right that the format version is already part of the GenericTileIO header so if we really needed it, we could have gotten it there.

This, however, is a version for the Enumeration data itself. It now starts at 0, and if we ever have to change the (de)serialization implementation for the contents of the GenericTileIO "payload", we can do that by using this Enumeration-specific version.
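
The idea of a payload-level version that is independent of the generic tile format version can be sketched as follows. This is only an illustration under stated assumptions: the byte layout, `kEnumerationVersion`, `serialize_sketch`, and `deserialize_sketch` are hypothetical, the payload holds just a name, and values are written in native byte order. It is not the layout defined in format_spec/enumeration.md.

```cpp
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical payload version, tracked separately from the array format
// version stored in the generic tile header.
constexpr std::uint32_t kEnumerationVersion = 0;

// Assumed layout: [uint32 version][uint32 name length][name bytes].
std::vector<std::uint8_t> serialize_sketch(const std::string& name) {
  const std::size_t header = 2 * sizeof(std::uint32_t);
  std::vector<std::uint8_t> buf(header + name.size());
  const std::uint32_t version = kEnumerationVersion;
  const auto len = static_cast<std::uint32_t>(name.size());
  std::memcpy(buf.data(), &version, sizeof(version));
  std::memcpy(buf.data() + sizeof(version), &len, sizeof(len));
  std::memcpy(buf.data() + header, name.data(), name.size());
  return buf;
}

// Reads the version first, so a reader can reject (or later branch on)
// payloads written by a newer serializer.
std::string deserialize_sketch(const std::vector<std::uint8_t>& buf) {
  const std::size_t header = 2 * sizeof(std::uint32_t);
  if (buf.size() < header) {
    throw std::runtime_error("truncated enumeration payload");
  }
  std::uint32_t version = 0;
  std::uint32_t len = 0;
  std::memcpy(&version, buf.data(), sizeof(version));
  if (version > kEnumerationVersion) {
    throw std::runtime_error("unsupported enumeration version");
  }
  std::memcpy(&len, buf.data() + sizeof(version), sizeof(len));
  if (buf.size() - header < len) {
    throw std::runtime_error("truncated enumeration name");
  }
  return std::string(
      reinterpret_cast<const char*>(buf.data()) + header, len);
}
```

Because the version is the first field of the payload, a future change to the payload encoding only needs a new branch in the reader; the generic tile header stays untouched.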

tiledb/sm/array_schema/array_schema.h

tiledb/sm/array_schema/array_schema_evolution.h

tiledb/sm/array_schema/enumeration.h

tiledb/sm/array_schema/enumeration.cc

davisp · 2023-07-07T21:16:51Z

@teo-tsirpanis You mean just drop something into the examples directory to show how it works?

teo-tsirpanis · 2023-07-07T21:29:52Z

@davisp yes. You can do it later if it's too much work but would be nice to have.

eddelbuettel · 2023-07-07T21:36:51Z

@teo-tsirpanis I tossed a simple example in one of our channels yesterday; I will DM you that. It would be better for @davisp to add an official 'blessed' example or two, as the API changed a little and got 'richer'. I may be behind the curve.

teo-tsirpanis · 2023-07-07T22:09:23Z

tiledb/sm/c_api/tiledb_experimental.h

+/**
+ * Retrieves an attribute's enumeration given the attribute name (key).
+ *
+ * **Example:**
+ *
+ * The following retrieves the enumeration for attribute `attr_0`.
+ *
+ * @code{.c}
+ * tiledb_enumeration_t* enumeration;
+ * tiledb_array_get_enumeration(
+ *     ctx, array, "attr_0", &enumeration);
+ * // Make sure to free the retrieved enumeration in the end.
+ * @endcode
+ *
+ * @param ctx The TileDB context.
+ * @param array The TileDB array.
+ * @param name The name (key) of the attribute from which to
+ * retrieve the enumeration.
+ * @param enumeration The enumeration object to retrieve.
+ * @return `TILEDB_OK` for success and `TILEDB_ERR` for error.
+ */
+TILEDB_EXPORT capi_return_t tiledb_array_get_enumeration(
+    tiledb_ctx_t* ctx,
+    const tiledb_array_t* array,
+    const char* name,
+    tiledb_enumeration_t** enumeration) TILEDB_NOEXCEPT;
+
+/**
+ * Load all enumerations for the array.
+ *
+ * **Example:**
+ *
+ * @code{.c}
+ * tiledb_array_load_all_enumerations(ctx, array, 0);
+ * @endcode
+ *
+ * @param ctx The TileDB context.
+ * @param array The TileDB array.
+ * @param latest_only If non-zero, only load enumerations for the latest schema.
+ * @return `TILEDB_OK` for success and `TILEDB_ERR` for error.
+ */
+TILEDB_EXPORT capi_return_t tiledb_array_load_all_enumerations(
+    tiledb_ctx_t* ctx,
+    const tiledb_array_t* array,
+    int latest_only) TILEDB_NOEXCEPT;
+


Why are these functions part of the array API? Shouldn't they belong to array schema?

The ArraySchema has no access to the ContextResources/ArrayDirectory used for actually loading things from disk so I added it to the Array instead of forcing users to do something along the lines of schema->load_enumeration(array->array_directory()) or some such.
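
A rough sketch of that design choice, using hypothetical types rather than the real TileDB classes: the schema only records which enumerations exist and under which filename, while the array owns the means to read storage and therefore does the (lazy, cached) loading. `SchemaSketch`, `ArraySketch`, and the `Reader` callback are all assumptions made for illustration.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

using EnumerationValues = std::vector<std::string>;

// Hypothetical schema: knows enumeration names and filenames, but has no
// way to perform I/O on its own.
struct SchemaSketch {
  std::map<std::string, std::string> enumeration_filenames;  // name -> file
  std::map<std::string, std::shared_ptr<EnumerationValues>> loaded;
};

// Hypothetical array: owns storage access, so loading lives here.
class ArraySketch {
 public:
  using Reader = std::function<EnumerationValues(const std::string&)>;

  ArraySketch(SchemaSketch schema, Reader reader)
      : schema_(std::move(schema))
      , reader_(std::move(reader)) {
  }

  // Load the named enumeration on first use and cache it on the schema.
  std::shared_ptr<EnumerationValues> get_enumeration(
      const std::string& name) {
    auto cached = schema_.loaded.find(name);
    if (cached != schema_.loaded.end()) {
      return cached->second;
    }
    auto file = schema_.enumeration_filenames.find(name);
    if (file == schema_.enumeration_filenames.end()) {
      throw std::invalid_argument("unknown enumeration: " + name);
    }
    auto values = std::make_shared<EnumerationValues>(reader_(file->second));
    schema_.loaded.emplace(name, values);
    return values;
  }

 private:
  SchemaSketch schema_;
  Reader reader_;
};
```

Callers ask the array for an enumeration by name and never have to pass storage handles into the schema, which is the ergonomic point made above.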

davisp · 2023-07-10T17:23:23Z

@teo-tsirpanis I've added an example here: befe066

Let me know if you'd like to add anything else to it.

teo-tsirpanis · 2023-07-10T18:01:30Z

Thanks!

test/src/unit-capi-enumerations.cc

KiterLuc · 2023-07-11T08:15:18Z

test/src/unit-capi-enumerations.cc

+
+#include <iostream>
+
+TEST_CASE(


Is there any reason these tests can't live in tiledb/api/capi?

These are all testing APIs from the Attribute, ArraySchema, and Array classes, which aren't yet migrated to the new API subdirectory. I accidentally replied in the wrong spot yesterday, so this comment has been moved.

format_spec/enumeration.md

test/src/unit-cppapi-enumerations.cc

test/src/unit-enum-helpers.cc

tiledb/sm/array_schema/enumeration.h

tiledb/sm/cpp_api/enumeration_experimental.h

…ne link policy (#4975) PR #4051 took object library `generic_tile_io` out of conformance with the policy that each OL should link standalone. This PR corrects this. [sc-47341] --- TYPE: NO_HISTORY DESC: Bring object library `generic_tile_io` into conformance with standalone link policy

[SC-51428](https://app.shortcut.com/tiledb-inc/story/51428/enumeration-path-map-does-not-exist-in-the-array-schema-format-spec) I noticed that the array schema format specification does not include the enumeration name-path map introduced in #4051. This PR updates the documentation.

I used the term "enumeration filename" to describe the string written after the enumeration name because [it is just the file's name](https://github.com/TileDB-Inc/TileDB/blob/78ac1d2ec338fd468eb63481e85049215908e39f/tiledb/sm/array/array_directory.cc#L1324-L1326), and updated previous usages of "enumeration pathname" or "enumeration URI" in code.

---
TYPE: NO_HISTORY
DESC: Added documentation for the enumeration path map in array schemas, present since format version 20.

davisp requested a review from KiterLuc April 20, 2023 22:36

teo-tsirpanis reviewed Apr 21, 2023

tiledb/sm/c_api/tiledb_experimental.h

KiterLuc reviewed Apr 21, 2023

teo-tsirpanis reviewed Apr 21, 2023

tiledb/api/c_api/enumeration/enumeration_api.cc

teo-tsirpanis reviewed Apr 24, 2023

tiledb/sm/array_schema/enumeration.h

davisp force-pushed the pd/experiment/enums branch 4 times, most recently from c15874e to a8f12ea Compare May 5, 2023 21:38

davisp force-pushed the pd/experiment/enums branch 2 times, most recently from 5ae0fcf to 4928106 Compare May 9, 2023 22:46

davisp force-pushed the pd/experiment/enums branch from 4928106 to 16ef8d2 Compare May 11, 2023 21:51

davisp requested a review from KiterLuc May 12, 2023 19:39

davisp force-pushed the pd/experiment/enums branch 4 times, most recently from 32a8576 to 4c08bb3 Compare May 15, 2023 16:31

KiterLuc reviewed May 15, 2023

davisp force-pushed the pd/experiment/enums branch 9 times, most recently from 7937d8f to 534d4a2 Compare June 14, 2023 20:42

teo-tsirpanis reviewed Jul 7, 2023

KiterLuc reviewed Jul 12, 2023

KiterLuc approved these changes Jul 13, 2023

KiterLuc force-pushed the pd/experiment/enums branch from 7a226f2 to 5e509ba Compare July 14, 2023 08:28

davisp force-pushed the pd/experiment/enums branch 5 times, most recently from 18ac027 to 3c93077 Compare July 19, 2023 20:50

davisp force-pushed the pd/experiment/enums branch from 3c93077 to 1bd816f Compare July 20, 2023 15:36

davisp merged commit c0d7c6a into dev Jul 20, 2023

ihnorton deleted the pd/experiment/enums branch July 24, 2023 12:53

ihnorton mentioned this pull request Jul 28, 2023

Filter pipeline support for datatype conversions based on filtered output datatype. #4165

Merged

eddelbuettel mentioned this pull request Jul 31, 2023

Support for enumerated types TileDB-Inc/TileDB-R#562

Merged

anastasop mentioned this pull request Sep 4, 2023

Add enumerations TileDB-Inc/TileDB-Go#269

Merged

eric-hughes-tiledb mentioned this pull request May 1, 2024

Fix the documentation for the array_schema object library #4935

Merged

eric-hughes-tiledb mentioned this pull request May 13, 2024

[NOT FOR MERGE] Bring object library generic_tile_io into conformance with standalone link policy #4972

Closed

eric-hughes-tiledb mentioned this pull request May 14, 2024

Bring object library generic_tile_io into conformance with standalone link policy #4975

Merged

This was referenced Jul 19, 2024

Document enumeration path map in the spec. #5203

Merged

Add a history of storage format versions et. al. #5205

Merged
