Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Load enumerations for all array schemas in a single request. #5349

Merged
merged 12 commits into from
Oct 17, 2024

Conversation

shaunrd0
Copy link
Contributor

This fixes a performance regression added in #5291 while loading enumerations for all array schemas. The regression impacts arrays with multiple evolved schemas. Previously #5291 introduced a loop over REST requests for-each schema, this PR loads all enumerations for all schemas in a single request.


TYPE: C_API
DESC: Introduce tiledb_array_load_enumerations_all_schemas

TYPE: CPP_API
DESC: Introduce ArrayExperimental::load_enumerations_all_schemas

@shaunrd0 shaunrd0 requested a review from ypatia October 15, 2024 18:07
@shaunrd0 shaunrd0 force-pushed the smr/sc-55029/load-enmrs-all-schemas branch from 227549e to 5285bb7 Compare October 15, 2024 18:52
}
}

return ret;
}

std::vector<shared_ptr<const Enumeration>> Array::get_enumerations(
Copy link
Contributor Author

@shaunrd0 shaunrd0 Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted changes to this function from #5237.

get_enumerations_all_schemas() {
return array_->get_enumerations_all_schemas();
}

std::vector<shared_ptr<const tiledb::sm::Enumeration>> get_enumerations(
Copy link
Contributor Author

@shaunrd0 shaunrd0 Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted changes to this function from #5237.

@@ -663,6 +663,13 @@ capi_return_t tiledb_array_load_all_enumerations(const tiledb_array_t* array) {
return TILEDB_OK;
}

capi_return_t tiledb_array_load_enumerations_all_schemas(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding this API instead of changing parameters of tiledb_array_load_all_enumerations allows this PR to fix the regression without changes to REST or other repositories. We will have a new binding to wrap but that's just for completeness, this will only be used by the client to submit the request to REST. Had we changed the parameters like I did initially REST would fail to link against this branch until we updated the modified binding for the APIs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, we can't really change a C API after we have released, can we? Or that's some exception for experimentals? What we can do if it's incorrect though is mark it for deprecation. But rather than incorrect, is tiledb_array_load_all_enumerations just fetching the enumerations for the latest schema only?
Then IMO it's good choice to add another API anyway since it's serving a different purpose and has different performance impact.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC for experimental APIs we can change them since they are experimental, but it's a breaking change and I was trying to avoid doing that. Since tiledb_array_load_all_enumerations only loads for the latest schema I thought a new API would not be out of the question.

Yeah I think the name tiledb_array_load_all_enumerations is a little misleading as you probably noticed but I updated those docs to clarify and if we want to rename or restructure those methods I think we could do that separately and go through a deprecation cycle.


// For now, reopen the array if it's found to be using array open v1.
// Once we update to use post_array_schema_from_rest we will always have
// array_schemas_all initialized and this could go away.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just want to point these comments out. As I mentioned in the comment I could serialize allArraySchemas with the enumerations intact and that would be fine, but that would only be required until post_array_schema_from_rest is finished. Array open v3 is now the default so I thought this was better than locking us into serializing full schemas here.

If we want to just serialize all the schemas instead I can do that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not do array operations under the hood. I have a more radical idea. How about returning a message at the C API level that this operation is unsupported for array v1 opened arrays and redirect the user to use v2 instead?

@ihnorton Are we ok to introduce new C APIs assuming array v2/query v3 now that it's default, and just be backwards compatible forever for the pre-existing ones?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, I added exceptions in both the C API and in tiledb::sm::Array::load_all_enumerations in case it is called internally like we do in unit-enumerations.cc. I modified the existing tests to check for both exceptions if we are not using array open v2. The exception could be removed and tests updated as part of #5181.

@shaunrd0
Copy link
Contributor Author

Just marked as a draft until 59c529a is reverted, this is ready for review. There is a passing run with Query / Array open v2 here: https://gitlab.com/TileDB-Inc/TileDB-Internal/-/jobs/8093304584

Copy link
Member

@ypatia ypatia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, a few comments and we're good. Thanks for that complex fix!

throw ArrayException("Unable to load enumerations; Array is not open.");
}

std::unordered_map<std::string, std::vector<shared_ptr<const Enumeration>>>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: a general comment, those complex types are easier to read if modeled as structs so that we avoid .second type of names that are hard to remember what they refer to. No need to change it here, just stating my preference :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call! I did not change it here but I'll keep it in mind for the future.. I have been confused by .first and .second in the past so I agree, a structure would have been better.

@@ -663,6 +663,13 @@ capi_return_t tiledb_array_load_all_enumerations(const tiledb_array_t* array) {
return TILEDB_OK;
}

capi_return_t tiledb_array_load_enumerations_all_schemas(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, we can't really change a C API after we have released, can we? Or that's some exception for experimentals? What we can do if it's incorrect though is mark it for deprecation. But rather than incorrect, is tiledb_array_load_all_enumerations just fetching the enumerations for the latest schema only?
Then IMO it's good choice to add another API anyway since it's serving a different purpose and has different performance impact.

* @param ctx The context to use.
* @param array The array to load enumerations for.
*/
static void load_enumerations_all_schemas(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be experimental for some reason?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put this in experimental because the other enumerations APIs were still experimental and it felt weird to add a new API that was so similar outside of experimental


// For now, reopen the array if it's found to be using array open v1.
// Once we update to use post_array_schema_from_rest we will always have
// array_schemas_all initialized and this could go away.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not do array operations under the hood. I have a more radical idea. How about returning a message at the C API level that this operation is unsupported for array v1 opened arrays and redirect the user to use v2 instead?

@ihnorton Are we ok to introduce new C APIs assuming array v2/query v3 now that it's default, and just be backwards compatible forever for the pre-existing ones?

@shaunrd0 shaunrd0 force-pushed the smr/sc-55029/load-enmrs-all-schemas branch from 5dcc0cd to 3e1787d Compare October 16, 2024 21:03
Copy link
Member

@ypatia ypatia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this!

tiledb/api/c_api/array/array_api_experimental.h Outdated Show resolved Hide resolved
@shaunrd0 shaunrd0 force-pushed the smr/sc-55029/load-enmrs-all-schemas branch from 093c2d7 to 8ec2827 Compare October 17, 2024 13:32
@shaunrd0 shaunrd0 marked this pull request as ready for review October 17, 2024 13:33
@shaunrd0 shaunrd0 merged commit b1dd2b4 into dev Oct 17, 2024
63 checks passed
@shaunrd0 shaunrd0 deleted the smr/sc-55029/load-enmrs-all-schemas branch October 17, 2024 14:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants