[#91]feat(catalog): Hive schema entity serde and store support #208

Merged
merged 6 commits into from
Aug 14, 2023

Conversation

mchades
Contributor

@mchades mchades commented Aug 10, 2023

What changes were proposed in this pull request?

  1. Add the CommonSchema class and the Schema proto for serde
  2. Serialize and store HiveSchema as CommonSchema
  3. Make Hive schema operations support the Graviton store

Why are the changes needed?

So that we can store the additional entity information in our own storage.

Fix: #91

Does this PR introduce any user-facing change?

Adds some checks for the Graviton store.

How was this patch tested?

Unit tests added.

@mchades mchades requested a review from jerryshao August 10, 2023 06:29
@github-actions

github-actions bot commented Aug 10, 2023

Code Coverage Report

Overall Project    59.58% -0.9%    🟢
Files changed      53.11%          🔴

Module          Coverage
core            67.23% -1.04%    🔴
catalog-hive    59.5% -3.73%     🟢

Files
Module          File                          Coverage
core            ProtoEntitySerDe.java         63.14%           🟢
                BaseSchema.java               58.13% -6.88%    🔴
                GravitonEnv.java              0% -4.59%        🔴
                SchemaEntitySerDe.java        0%               🔴
catalog-hive    HiveCatalogOperations.java    69.56% -8.21%    🟢

@jerryshao
Contributor

jerryshao commented Aug 10, 2023

@yuqi1129, please help review this.

Done

@mchades
Contributor Author

mchades commented Aug 10, 2023

OK, is it possible to move the Hive-related operations out of the backend transaction method? I still think it's a little weird.

@yuqi1129 If we move the Hive-related operations out of the backend transaction method, then once a Hive-related operation fails, the Graviton store will not roll back.
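
The rollback argument can be illustrated with a minimal sketch. It uses a hypothetical in-memory store (`InMemoryStore`) and a hypothetical, always-failing Hive call (`createHiveDatabase`); neither is Graviton's real API. The store commits a transaction only if the whole action completes, so a Hive failure inside the action discards the store write, while a Hive failure after the transaction leaves an orphaned entry behind.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;

// Hypothetical in-memory stand-in for the entity store; not the real EntityStore API.
class InMemoryStore {
  private Map<String, String> committed = new HashMap<>();
  private Map<String, String> working = committed;

  // Runs the action against a working copy and commits only if it completes normally.
  <R> R executeInTransaction(Callable<R> action) throws Exception {
    working = new HashMap<>(committed);
    try {
      R result = action.call();
      committed = working;
      return result;
    } finally {
      working = committed; // on failure the working copy is discarded
    }
  }

  void put(String key, String value) {
    working.put(key, value);
  }

  boolean contains(String key) {
    return committed.containsKey(key);
  }
}

class RollbackDemo {
  // Hypothetical stand-in for the Hive metastore call; it always fails in this sketch.
  static void createHiveDatabase(String name) {
    throw new IllegalStateException("Hive metastore unavailable");
  }

  // Hive call inside the transaction: the Hive failure rolls back the store write.
  static boolean createInsideTransaction(InMemoryStore store, String name) {
    try {
      store.executeInTransaction(() -> {
        store.put(name, "schema");
        createHiveDatabase(name);
        return null;
      });
    } catch (Exception e) {
      // ignored for the purpose of the demo
    }
    return store.contains(name);
  }

  // Hive call after the transaction: the store write is already committed when Hive fails.
  static boolean createOutsideTransaction(InMemoryStore store, String name) {
    try {
      store.executeInTransaction(() -> {
        store.put(name, "schema");
        return null;
      });
      createHiveDatabase(name);
    } catch (Exception e) {
      // ignored for the purpose of the demo
    }
    return store.contains(name);
  }
}
```

In this sketch `createInsideTransaction` leaves the store empty, whereas `createOutsideTransaction` leaves a schema entry with no matching Hive database.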

yuqi1129
yuqi1129 previously approved these changes Aug 10, 2023

return commonSchema;
}
}
Contributor

I personally don't like the CommonSchema class here. CommonSchema has nothing to do with HiveSchema. Semantically you'll have both CommonSchema and HiveSchema, but CommonSchema is actually just a placeholder for schema info, not an entity.

If it is a placeholder, then it should have a different name; if it is a base class of HiveSchema, then the class organization should be changed.

Contributor Author

My original intention was to use CommonSchema as a generic class for storing basic information. Any entity that does not require additional information can be serialized and deserialized using CommonSchema.

Another main reason for needing the CommonSchema class is that BaseSchema is an abstract class and cannot be instantiated.

Contributor

If so, I think we have several options:

  1. Remove the abstract modifier from the BaseSchema class.
  2. Create a serde that maps HiveSchema directly to the Schema proto.

Basically, each entity (with an identifier) should map to a metadata object. Semantically, CommonSchema and HiveSchema are two different entities if there is no inheritance between them.

Contributor Author

Got it. I went with option one.
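
Option one can be pictured roughly as follows. This is a simplified sketch: the fields, constructors, and the `SchemaDemo` helper are assumptions for illustration, not the actual Graviton classes. The point is only that a concrete BaseSchema can be instantiated by a serde, and that HiveSchema stays related to it by inheritance.

```java
// Simplified, hypothetical shapes; the real classes carry more fields (audit info, etc.).
class BaseSchema {
  private final long id;
  private final String name;

  BaseSchema(long id, String name) {
    this.id = id;
    this.name = name;
  }

  long id() {
    return id;
  }

  String name() {
    return name;
  }
}

// HiveSchema extends the now-concrete base class, so it is the same kind of entity
// rather than a sibling of a separate placeholder class.
class HiveSchema extends BaseSchema {
  private final String comment; // assumed to live in Hive, not in the Graviton store

  HiveSchema(long id, String name, String comment) {
    super(id, name);
    this.comment = comment;
  }

  String comment() {
    return comment;
  }
}

class SchemaDemo {
  // With the abstract modifier gone, a serde can build the base type directly.
  static BaseSchema fromStored(long id, String name) {
    return new BaseSchema(id, name);
  }
}
```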

uint64 catalog_id = 2;
string name = 3;
optional string comment = 4;
map<string, string> properties = 5;
Contributor

I don't think we need to store comment/properties; won't they be stored in Hive?

Contributor Author

removed

.setName(schemaEntity.name())
.setAuditInfo(AuditInfoSerDe.ser(schemaEntity.auditInfo()))
.setComment(schemaEntity.getComment())
.putAllProperties(schemaEntity.getProperties())
Contributor

Why don't you check the nullability of comment and properties?

Contributor Author

comment and properties are removed for now.
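
For context on the nullability question: setters on generated protobuf Java builders throw a NullPointerException when passed null, so optional values need a guard before they are set. The sketch below shows that guard with a hypothetical hand-written `SchemaBuilder` that mimics this behavior; it is not the generated Schema proto builder from this PR.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for a generated builder: setters reject null
// instead of treating it as "unset".
class SchemaBuilder {
  String name;
  String comment; // null means "not set"
  final Map<String, String> properties = new HashMap<>();

  SchemaBuilder setName(String value) {
    name = Objects.requireNonNull(value);
    return this;
  }

  SchemaBuilder setComment(String value) {
    comment = Objects.requireNonNull(value);
    return this;
  }

  SchemaBuilder putAllProperties(Map<String, String> values) {
    properties.putAll(Objects.requireNonNull(values));
    return this;
  }
}

class NullSafeSer {
  // Only touches the optional fields when the entity actually has a value for them.
  static SchemaBuilder ser(String name, String comment, Map<String, String> properties) {
    SchemaBuilder builder = new SchemaBuilder().setName(name);
    if (comment != null) {
      builder.setComment(comment);
    }
    if (properties != null && !properties.isEmpty()) {
      builder.putAllProperties(properties);
    }
    return builder;
  }
}
```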

"com.datastrato.graviton.meta.rel.BaseSchema.CommonSchema",
"com.datastrato.graviton.proto.SchemaEntitySerDe",
"com.datastrato.graviton.catalog.hive.HiveSchema",
"com.datastrato.graviton.proto.SchemaEntitySerDe");
Contributor

  1. Why do you have both CommonSchema and HiveSchema here?
  2. If you put CommonSchema and SchemaEntitySerDe here in the core module, why do you run into a class-not-found issue?

Contributor Author

Let's review the serde procedure for HiveSchema:

  • Serialize:
      1. Find the SchemaEntitySerDe class from the HiveSchema class.
      2. SchemaEntitySerDe serializes the HiveSchema entity to a Schema proto.
  • Deserialize:
      1. Find the Schema proto class from the class parameter passed in (HiveSchema).
      2. Deserialize the byte array into a Schema proto instance.
      3. Find the SchemaEntitySerDe class from the class parameter passed in (HiveSchema).
      4. SchemaEntitySerDe deserializes the Schema proto into a BaseSchema instance.

Given the description above, we really don't need a map key for CommonSchema (BaseSchema), so I have removed CommonSchema.
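
The procedure above can be sketched as a class-name lookup. Everything here is a simplified assumption for illustration: `SchemaProto` is a hand-written stand-in for the generated proto message (encoded as plain text instead of protobuf bytes), there is a single proto type so the entity-to-proto lookup is omitted, and the method signatures are not those of the real ProtoEntitySerDe.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hand-written stand-in for the generated Schema proto message.
class SchemaProto {
  final long id;
  final String name;

  SchemaProto(long id, String name) {
    this.id = id;
    this.name = name;
  }

  byte[] toByteArray() {
    return (id + ":" + name).getBytes(StandardCharsets.UTF_8);
  }

  static SchemaProto parseFrom(byte[] bytes) {
    String[] parts = new String(bytes, StandardCharsets.UTF_8).split(":", 2);
    return new SchemaProto(Long.parseLong(parts[0]), parts[1]);
  }
}

class BaseSchema {
  final long id;
  final String name;

  BaseSchema(long id, String name) {
    this.id = id;
    this.name = name;
  }
}

class HiveSchema extends BaseSchema {
  HiveSchema(long id, String name) {
    super(id, name);
  }
}

interface SerDe<T> {
  SchemaProto toProto(T entity);

  T fromProto(SchemaProto proto);
}

// One serde typed on BaseSchema, reusable by any schema subclass that needs no extra fields.
class SchemaEntitySerDe implements SerDe<BaseSchema> {
  public SchemaProto toProto(BaseSchema entity) {
    return new SchemaProto(entity.id, entity.name);
  }

  public BaseSchema fromProto(SchemaProto proto) {
    return new BaseSchema(proto.id, proto.name);
  }
}

class ProtoEntitySerDe {
  // Entity class name -> serde instance; only HiveSchema is registered in this sketch.
  private static final Map<String, SerDe<?>> REGISTRY = new HashMap<>();

  static {
    REGISTRY.put(HiveSchema.class.getName(), new SchemaEntitySerDe());
  }

  private static SerDe<?> lookup(Class<?> clazz) {
    SerDe<?> serDe = REGISTRY.get(clazz.getName());
    if (serDe == null) {
      throw new IllegalArgumentException("No serde registered for " + clazz.getName());
    }
    return serDe;
  }

  // Serialize: find the serde from the entity's class, then entity -> proto -> bytes.
  @SuppressWarnings("unchecked")
  static <T> byte[] serialize(T entity) {
    SerDe<T> serDe = (SerDe<T>) lookup(entity.getClass());
    return serDe.toProto(entity).toByteArray();
  }

  // Deserialize: bytes -> proto, then find the serde from the class parameter, then proto -> entity.
  static Object deserialize(byte[] bytes, Class<?> clazz) {
    SchemaProto proto = SchemaProto.parseFrom(bytes);
    return lookup(clazz).fromProto(proto);
  }
}
```

Note that in this sketch, deserializing with `HiveSchema.class` yields a BaseSchema instance, which matches the last deserialization step described above.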

client.dropDatabase(ident.name(), false, false, cascade);
store.executeInTransaction(
() -> {
store.delete(ident, SCHEMA);
Contributor

do you need to delete all tables under this schema?

Contributor Author

Yes, a delete-tables operation has been added.

} catch (EntityAlreadyExistsException e) {
throw new SchemaAlreadyExistsException(
String.format(
"Hive schema (database) '%s' already exists in Graviton store", ident.name()));
Contributor

Add a blank line after each exception to make the code clearer.

Contributor Author

added

@@ -24,13 +24,20 @@ public class ProtoEntitySerDe implements EntitySerDe {
 "com.datastrato.graviton.meta.BaseMetalake",
 "com.datastrato.graviton.proto.BaseMetalakeSerDe",
 "com.datastrato.graviton.meta.CatalogEntity",
-"com.datastrato.graviton.proto.CatalogEntitySerDe");
+"com.datastrato.graviton.proto.CatalogEntitySerDe",
+"com.datastrato.graviton.catalog.hive.HiveSchema",
Contributor

Here you're using HiveSchema with SchemaEntitySerDe, but the type parameter of SchemaEntitySerDe is BaseSchema. Do we need to unify on BaseSchema?

Contributor Author

This mapping indicates that HiveSchema can use SchemaEntitySerDe.
In the future, other catalogs may have an xxxSchema that also needs no additional stored information and can use SchemaEntitySerDe too; at that point we can simply add the mapping xxxSchema -> SchemaEntitySerDe.

So the BaseSchema type parameter of SchemaEntitySerDe exists so that the serde logic can be reused.

// TODO. We should also fetch the customized HiveSchema entity fields from our own
// underlying storage, like id, auditInfo, etc.
EntityStore store = GravitonEnv.getInstance().entityStore();
BaseSchema baseSchema = store.get(ident, SCHEMA, HiveSchema.class);
Contributor

Why do we use HiveSchema.class as a parameter, but the returned type is BaseSchema?

Contributor

We need the specific type of the schema when converting it from a byte array, which is why we need HiveSchema.class and not BaseSchema.class.

Contributor Author

To avoid confusion, I changed it to BaseSchema baseSchema = store.get(ident, SCHEMA, BaseSchema.class) here and added a serde mapping for BaseSchema.
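
One way to see why the class token and the returned type are easier to reason about when they agree is the sketch below. It assumes a hypothetical `TypedStore` whose `get` casts the deserialized object to the requested class; whether the real EntityStore does this is not shown in this thread. With a serde that always builds a BaseSchema, asking for `BaseSchema.class` is consistent, while asking for `HiveSchema.class` is not.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

class BaseSchema {
  final String name;

  BaseSchema(String name) {
    this.name = name;
  }
}

class HiveSchema extends BaseSchema {
  HiveSchema(String name) {
    super(name);
  }
}

// Hypothetical store: the class token selects the deserializer and fixes the return type.
class TypedStore {
  private final Map<Class<?>, Supplier<? extends BaseSchema>> deserializers = new HashMap<>();

  void register(Class<?> type, Supplier<? extends BaseSchema> deserializer) {
    deserializers.put(type, deserializer);
  }

  <E> E get(Class<E> type) {
    Supplier<? extends BaseSchema> deserializer = deserializers.get(type);
    if (deserializer == null) {
      throw new IllegalArgumentException("No serde mapping for " + type.getName());
    }
    // Class.cast throws ClassCastException if the produced object is not an instance of the token.
    return type.cast(deserializer.get());
  }
}
```

In this sketch, registering the same BaseSchema-producing deserializer under both keys makes `get(BaseSchema.class)` succeed and `get(HiveSchema.class)` fail with a ClassCastException.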

@jerryshao jerryshao merged commit bf288fa into apache:main Aug 14, 2023
Successfully merging this pull request may close these issues.

[Subtask] Add Hive metadata entity serde and storage support
3 participants