Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] Support Relational Entity Store #1811

Closed
10 tasks done
xloya opened this issue Jan 31, 2024 · 13 comments
Closed
10 tasks done

[FEATURE] Support Relational Entity Store #1811

xloya opened this issue Jan 31, 2024 · 13 comments
Assignees
Labels
epic Key feature

Comments

@xloya xloya added the feature New feature or request label Jan 31, 2024
@jerryshao
Copy link
Contributor

Yes, it can be supported. Can you please draft a design doc about this, so that we can discuss on this?

@xloya
Copy link
Contributor Author

xloya commented Jan 31, 2024

Yes, it can be supported. Can you please draft a design doc about this, so that we can discuss on this?

Okay, I'll try to take the doc out this week

@coolderli
Copy link
Contributor

coolderli commented Jan 31, 2024

@xloya @jerryshao

Implementing a relation model is a litter complex. I think we should keep the interface com.datastrato.gravitino.EntityStore not changed.
For now, I think we can use type string, key string, value string instead of the key- value model.

In key-value mode, we store entities like this;

key                                                                    value
ta_{ml_id}_{ca_id}_{sc_id}_{table_id}    -----    table_info

In relational mode, we can store entities like this:

Type      key                                                       value
ta        {ml_id}_{ca_id}_{sc_id}_{table_id}   ----- table_info

The ListTable can be translated to a SQL like:

select value from xx where type = 'ta' and key like '{ml_id}_{ca_id}_{sc_id}%'

We can create a unique index on the type and key for better performance.
What do you think about this?

@xloya
Copy link
Contributor Author

xloya commented Jan 31, 2024

@xloya @jerryshao

Implementing a relation model is a litter complex. I think we should keep the interface com.datastrato.gravitino.EntityStore not changed. For now, I think we can use type string, key string, value string instead of the key- value model.

In key-value mode, we store entities like this;

key                                                                    value
ta_{ml_id}_{ca_id}_{sc_id}_{table_id}    -----    table_info

In relational mode, we can store entities like this:

Type      key                                                       value
ta        {ml_id}_{ca_id}_{sc_id}_{table_id}   ----- table_info

The ListTable can be translated to a SQL like:

select value from xx where type = 'ta' and key like '{ml_id}_{ca_id}_{sc_id}%'

We can create a unique index on the type and key for better performance. What do you think about this?

I'm not sure if using a relational database to emulate a KV store is a good practice. There are currently no examples of such use in a production environment. I may need to look at other's opinions. In addition, here is the doc that I simply compiled to support relational entity store: https://docs.google.com/document/d/1egaCTa5MvF-6seKQ-mxoGAS-0tFPzvyBeCt1Wq6bCl8/edit. I personally think that using a relational database to store relational data may be more in line with the meaning of components, and the performance may be better than fuzzy query.

@jerryshao
Copy link
Contributor

We don't have to emulate the behavior of kv storage. The design above is all about optimizing the key for kv storage. If you're using RDBMS, you can design a relational schema, don't need to stick to kv's storage layout.

@shaofengshi
Copy link
Contributor

In Apache Kylin early versions, it uses HBase to store the metadata, which is a key-value structure. Later, we have to support MySQL and other relational database as the metadata storage as many users don't want to deploy HBase. After doing some investigation, we came out a simple design, which is a string typed "key" column, with a binary typed "value" column. It has worked for several years.

@shaofengshi
Copy link
Contributor

In Apache Kylin early versions, it uses HBase to store the metadata, which is a key-value structure. Later, we have to support MySQL and other relational database as the metadata storage as many users don't want to deploy HBase. After doing some investigation, we came out a simple design, which is a string typed "key" column, with a binary typed "value" column. It has worked for several years.

We know that's not the right case to use a SQL database, but that's a compromise among the complexity, ease of maintainence and others. (the metadata of Kylin is complicated, many types, and each type may have many properties, and may change from version to version)

@coolderli
Copy link
Contributor

In Apache Kylin early versions, it uses HBase to store the metadata, which is a key-value structure. Later, we have to support MySQL and other relational database as the metadata storage as many users don't want to deploy HBase. After doing some investigation, we came out a simple design, which is a string typed "key" column, with a binary typed "value" column. It has worked for several years.

We know that's not the right case to use a SQL database, but that's a compromise among the complexity, ease of maintainence and others. (the metadata of Kylin is complicated, many types, and each type may have many properties, and may change from version to version)

@jerryshao @justinmclean @yuqi1129 @xloya @YxAc @zhoukangcn

I think the suggestion from @shaofengshi is very valuable. There will be more entity types in the future. If we use a relation structure we have to modify the schema of entity store when changes occur. Another issue is that it is difficult to migrate from a relational structure to a key-value structure.

Before proceeding further with this work, I would like to hear your thoughts. Thanks.

@jerryshao
Copy link
Contributor

I think we can bring this out to discussion in tomorrow's meeting. Both designs have pros and cons, maybe we can negotiate to find out a balance point.

@xloya
Copy link
Contributor Author

xloya commented Feb 19, 2024

In my opinion, the issues Kylin as an OLAP engine may face with metadata management are similar, but not necessarily identical, to metadata management systems such as Gravitino, Metacat, Hive Metastore, and Amoro.
Internally, we use MySQL as the relational backend of the metadata management system, which has been used very maturely in the Hive Metastore production environment and Metacat. I think the most criticized thing about HMS is the impact of excessive transaction usage on read performance, not the issue of Schema changes.
Regarding NetEase's Amoro project, I saw that they also use MySQL as a relational database. There doesn't seem to be any issue stating that they want to use MySQL as the KV store.

@jerryshao
Copy link
Contributor

From my point, I think it is not the point that we use RDBMS or not as a metadata storage, the point is how do we design a SCHEMA that fits RDBMS well, shall we normalize all the fields, that's the thing we should discuss.

@YxAc
Copy link
Contributor

YxAc commented Feb 19, 2024

It is true there are practices that use MySQL as KV, and in metadata scenarios, there are several reasons why I prefer to use relational tables:

  1. HMS's large scale production practice conveys confidence. Futhermore, Ali Cloud DLF, the unified metadata underlying layer also uses relational databases.

  2. The main concern for using KV database is that it is difficult to partition a hot spot of a large table. Using relational table with sharding can solve this problem to a certain extent.

  3. No need to migrate to KV database in the future, the complexity of KV maintenance is high, and there is low-cost both in terms of high availability and maintenance for MySQL.

@jerryshao jerryshao added this to the Gravitino 0.5.0 milestone Feb 20, 2024
@xloya
Copy link
Contributor Author

xloya commented Feb 22, 2024

Post the design doc of the table schema to here after online discussion on the multi-version issue of the relational model: https://docs.google.com/document/d/1vSicaybSJc7wHq6sAk4JR0eBy_PoO8HORSOihF4Ui90

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
epic Key feature
Projects
None yet
Development

No branches or pull requests

5 participants