Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Enumerated Data Types #4051

Merged
merged 1 commit into from
Jul 20, 2023
Merged

Add Enumerated Data Types #4051

merged 1 commit into from
Jul 20, 2023

Conversation

davisp
Copy link
Contributor

@davisp davisp commented Apr 20, 2023

This PR adds the Enumerated data types. Enumerated data types work by
adding an Enumeration to the ArraySchema, setting an enumeration name on
an attribute, and then adding the attribute to the ArraySchema.

An Enumeration object contains a short list of options and a vector of
values. An attribute that has an enumeration name set must have an
integral type that is wide enough to index all of the enumerated values.

Changes to the values of an enumeration (any of adding, renaming, or
removing) can be accomplished via ArraySchemaEvolution.


TYPE: FEATURE
DESC: Enumerated data types

@davisp davisp requested a review from KiterLuc April 20, 2023 22:36
@davisp
Copy link
Contributor Author

davisp commented Apr 20, 2023

@KiterLuc Could you take a look at the enumeration loading logic here:

https://github.com/TileDB-Inc/TileDB/pull/4051/files#diff-e19b0b9e093512958297fb53f9e1ac2e0f8ac7320c0ce5ea19794a67e3acc4f8R849-R874

(Link is to the changes in tiledb/sm/query/query.cc)

I'm pretty sure this will run on the cloud side of a remote query so it should work? I'm mostly not sure how this would interact when we have to submit a query multiple times before completion. While I think it should work, I'm also not sure there's not a better place for that logic.

Copy link
Contributor

@KiterLuc KiterLuc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial feedback while scanning to load everything into memory... I'll do another pass focussing more on storage format on the next revision.

test/src/unit-cppapi-enumerations.cc Outdated Show resolved Hide resolved
test/src/unit-cppapi-enumerations.cc Outdated Show resolved Hide resolved
test/src/unit-cppapi-enumerations.cc Outdated Show resolved Hide resolved
tiledb/api/c_api/enumeration/enumeration_api.cc Outdated Show resolved Hide resolved
TEST_CASE("C++ API: Enumeration creation fixed size", "[cppapi][enumeration]") {
TestData td;

std::vector<uint32_t> values = {1, 2, 3, 4, 5};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: I really like how all test cases have these 'paragraph'. Would it be possible to add a one liner comment to each so that it's easy to scan the file and see what each test case intends on doing without reading the whole code? Later down the line it makes it easier to maintain the tests or scan for missing coverage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests will probably change drastically. I just did the first thing that came to mind to test while I was developing. I like to write the implementation and then come back later with my brain in testing mode to write out a comprehensive test suite.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've massively rewritten this test suite so its completely different so I'm going to let you check and see if its documented enough to your liking. The bulk of these tests are moved to unit-enumerations.cc which uses the internal C++ APIs (as opposed to the API in tiledb/sm/cpp_api/). All of the tests are short and have descriptive names that describe what's being tested.

tiledb/sm/cpp_api/array_schema_experimental.h Outdated Show resolved Hide resolved
tiledb/sm/cpp_api/array_experimental.h Show resolved Hide resolved
tiledb/sm/cpp_api/enumeration_experimental.h Outdated Show resolved Hide resolved
tiledb/sm/array_schema/enumeration.h Outdated Show resolved Hide resolved
tiledb/sm/array_schema/enumeration.h Outdated Show resolved Hide resolved
@davisp davisp force-pushed the pd/experiment/enums branch 4 times, most recently from c15874e to a8f12ea Compare May 5, 2023 21:38
@davisp davisp force-pushed the pd/experiment/enums branch 2 times, most recently from 5ae0fcf to 4928106 Compare May 9, 2023 22:46
@davisp
Copy link
Contributor Author

davisp commented May 9, 2023

@KiterLuc It took me longer to address you're earlier feed back about all the magic number conversions in query_ast.cc so I didn't get nearly as far as I was planning today.

I've also not gone through and address all of your earlier comments around things like documentation, however I'm pretty sure all of the magic numbers and things like UINT32_MAX instead of TILEDB_VAR_NUM have all been fixed (though I haven't audited to be absolutely sure). Other than that, there's a couple missing internal APIs exposed to the C and CPP APIs.

Re-reading the storage stuff and then going through test/src/unit-enumerations.cc should hopefully be a pretty comprehensive overview of the core behavior where everything else is basically just bookkeeping and plumbing.

@davisp
Copy link
Contributor Author

davisp commented May 9, 2023

Also, here's my current coverage report as of a few seconds before posting this. N.B. Gists truncate after the first 1M bytes so the last few diff's aren't included.

https://gistpreview.github.io/?95c129f102d34cda4f38e7ff9331b84e

@davisp davisp force-pushed the pd/experiment/enums branch from 4928106 to 16ef8d2 Compare May 11, 2023 21:51
@davisp davisp requested a review from KiterLuc May 12, 2023 19:39
@davisp davisp force-pushed the pd/experiment/enums branch 4 times, most recently from 32a8576 to 4c08bb3 Compare May 15, 2023 16:31
scripts/generate-coverage-report.py Outdated Show resolved Hide resolved
test/src/unit-cppapi-enumerations.cc Outdated Show resolved Hide resolved
test/src/unit-cppapi-enumerations.cc Show resolved Hide resolved
test/src/unit-misc-util-safe-integral-casts.cc Outdated Show resolved Hide resolved

| **Field** | **Type** | **Description** |
| :--- | :--- | :--- |
| Version number | `uint32_t` | Format version number of the generic tile |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this accurate? The format version of the generic tile should be in the generic tile header... So I'm not sure why it would be included here again?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was implemented quite poorly before by attempting to re-use the format_version as the enumerations version. Since then I've split the naming and versioning so that things are more clear.

You were quite right that the format version is already part of the GenericTileIO header so if we really needed it, we could have gotten it there.

This however is a version for the Enumeration data itself. It now starts at 0, and if we ever have to change this (de)?serialization implementation for the contents of the GenericTileIO "payload" we can do that by using this Enumerations specific version.

tiledb/sm/array_schema/array_schema.h Outdated Show resolved Hide resolved
tiledb/sm/array_schema/array_schema_evolution.h Outdated Show resolved Hide resolved
tiledb/sm/array_schema/enumeration.h Outdated Show resolved Hide resolved
tiledb/sm/array_schema/enumeration.cc Show resolved Hide resolved
@davisp davisp force-pushed the pd/experiment/enums branch 9 times, most recently from 7937d8f to 534d4a2 Compare June 14, 2023 20:42
@davisp
Copy link
Contributor Author

davisp commented Jul 7, 2023

@teo-tsirpanis You mean just drop something into the examples directory to show how it works?

@teo-tsirpanis
Copy link
Member

@davisp yes. You can do it later if it's too much work but would be nice to have.

@eddelbuettel
Copy link
Contributor

@teo-tsirpanis I tossed a simple example in one our channels yesterday, I will DM you that. It would be better for @davisp to add an official 'blessed' example or two as the API changed a little and got 'richer'. I may be behind the curve.

Comment on lines +424 to +469
/**
* Retrieves an attribute's enumeration given the attribute name (key).
*
* **Example:**
*
* The following retrieves the first attribute in the schema.
*
* @code{.c}
* tiledb_attribute_t* attr;
* tiledb_array_schema_get_enumeration(
* ctx, array_schema, "attr_0", &enumeration);
* // Make sure to delete the retrieved attribute in the end.
* @endcode
*
* @param ctx The TileDB context.
* @param array The TileDB array.
* @param name The name (key) of the attribute from which to
* retrieve the enumeration.
* @param enumeration The enumeration object to retrieve.
* @return `TILEDB_OK` for success and `TILEDB_ERR` for error.
*/
TILEDB_EXPORT capi_return_t tiledb_array_get_enumeration(
tiledb_ctx_t* ctx,
const tiledb_array_t* array,
const char* name,
tiledb_enumeration_t** enumeration) TILEDB_NOEXCEPT;

/**
* Load all enumerations for the array.
*
* **Example:**
*
* @code{.c}
* tiledb_array_load_all_enumerations(ctx, array);
* @endcode
*
* @param ctx The TileDB context.
* @param array The TileDB array.
* @param latest_only If non-zero, only load enumerations for the latest schema.
* @return `TILEDB_OK` for success and `TILEDB_ERR` for error.
*/
TILEDB_EXPORT capi_return_t tiledb_array_load_all_enumerations(
tiledb_ctx_t* ctx,
const tiledb_array_t* array,
int latest_only) TILEDB_NOEXCEPT;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these functions part of the array API? Shouldn't they belong to array schema?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ArraySchema has no access to the ContextResources/ArrayDirectory used for actually loading things from disk so I added it to the Array instead of forcing users to do something along the lines of schema->load_enumeration(array->array_directory()) or some such.

@davisp
Copy link
Contributor Author

davisp commented Jul 10, 2023

@teo-tsirpanis I've added an example here: befe066

Let me know if you'd like to add anything else to it.

@teo-tsirpanis
Copy link
Member

Thanks!

test/src/unit-capi-enumerations.cc Outdated Show resolved Hide resolved

#include <iostream>

TEST_CASE(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason these tests can't live in tiledb/api/capi?

Copy link
Contributor Author

@davisp davisp Jul 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are all testing API's from the Attribute, ArraySchema, and Array classes which aren't yet migrated to the new API subdirectory. Accidentally replied in the wrong spot yesterday, so this comment is moved.

format_spec/enumeration.md Show resolved Hide resolved
test/src/unit-cppapi-enumerations.cc Outdated Show resolved Hide resolved
test/src/unit-enum-helpers.cc Outdated Show resolved Hide resolved
tiledb/sm/array_schema/enumeration.h Show resolved Hide resolved
tiledb/sm/array_schema/enumeration.h Show resolved Hide resolved
tiledb/sm/array_schema/enumeration.h Outdated Show resolved Hide resolved
tiledb/sm/cpp_api/enumeration_experimental.h Show resolved Hide resolved
tiledb/sm/cpp_api/enumeration_experimental.h Show resolved Hide resolved
@KiterLuc KiterLuc force-pushed the pd/experiment/enums branch from 7a226f2 to 5e509ba Compare July 14, 2023 08:28
@davisp davisp force-pushed the pd/experiment/enums branch 5 times, most recently from 18ac027 to 3c93077 Compare July 19, 2023 20:50
This PR adds the Enumerated data types. Enumerated data types work by
adding an Enumeration to the ArraySchema, setting an enumeration name on
an attribute, and then adding the attribute to the ArraySchema.

An Enumeration object contains a short list of options and a vector of
values. An attribute that has an enumeration name set must have an
integral type that is wide enough to index all of the enumerated values.

Changes to the values of an enumeration (any of adding, renaming, or
removing) can be accomplished via ArraySchemaEvolution.
@davisp davisp force-pushed the pd/experiment/enums branch from 3c93077 to 1bd816f Compare July 20, 2023 15:36
@davisp davisp merged commit c0d7c6a into dev Jul 20, 2023
@ihnorton ihnorton deleted the pd/experiment/enums branch July 24, 2023 12:53
eric-hughes-tiledb added a commit that referenced this pull request May 13, 2024
…ce with standalone link policy

PR #4051 took object library `generic_tile_io` out of conformance with the policy that each OL should link standalone. This PR corrects this.

Note: In its present state this PR is not suitable for review or merge. It's branched from a branch that itself has not merged yet and needs to be rebased before review is feasible.
eric-hughes-tiledb added a commit that referenced this pull request May 14, 2024
…ne link policy

PR #4051 took object library `generic_tile_io` out of conformance with the policy that each OL should link standalone. This PR corrects this.
KiterLuc pushed a commit that referenced this pull request May 16, 2024
…ne link policy (#4975)

PR #4051 took object library `generic_tile_io` out of conformance with
the policy that each OL should link standalone. This PR corrects this.

[sc-47341]

---
TYPE: NO_HISTORY
DESC: Bring object library `generic_tile_io` into conformance with
standalone link policy
KiterLuc pushed a commit that referenced this pull request Jul 25, 2024
[SC-51428](https://app.shortcut.com/tiledb-inc/story/51428/enumeration-path-map-does-not-exist-in-the-array-schema-format-spec)

I noticed that the array schema format specification does not include
the enumeration name-path map introduced in #4051. This PR updates the
documentation.

I used the term "enumeration filename" to describe the string written
after the enumeration name because [it is just the file's
name](https://github.com/TileDB-Inc/TileDB/blob/78ac1d2ec338fd468eb63481e85049215908e39f/tiledb/sm/array/array_directory.cc#L1324-L1326),
and updated previous usages of "enumeration pathname" or "enumeration
URI" in code.

---
TYPE: NO_HISTORY
DESC: Added documentation for the enumeration path map in array scehmas,
present since format version 20.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants