[FEATURE] tags on model group&version #1303

Open
wujunshen opened this issue Sep 8, 2023 · 0 comments
Labels
enhancement New feature or request feature


Goals

At present, users can upload and deploy many models and then run training, prediction, and other operations on them.

However, managing these models is hard. If users want to categorize models by business scenario and quickly filter for the models that apply to a specific scenario, doing so today takes considerable effort.

Therefore, we introduce the concept of a tag. Users can attach specific tags to models so that they can manage the models better and quickly find the ones they need.

ml-commons currently has two system-level indexes, model group and model version. We can let users tag both, and let the two indexes share tags with each other.

For example, if we add a tag to a model group, all the model versions in that group can use the tag, and each model version can give the same tag a different value. When searching, users can query by tag value to quickly find the model version they need, which is more convenient and faster than querying the models directly and then inspecting each result by eye.

Solution

A tag consists of 3 parts:

  • tag key
  • tag type (only two types are supported, String and Number; Number is a float in Java)
  • tag value

When performing CRUD (Create/Read/Update/Delete) operations on a model group, we operate on the key and type. When performing CRUD operations on a model version, we operate on the key and value.
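The split between the two views of a tag can be sketched as two small record types. This is only an illustration: the class names GroupTag and VersionTag and the TAG_TYPES constant are hypothetical and do not come from ml-commons.

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical names for illustration; the proposal only fixes the
# three parts of a tag: key, type, and value.
TAG_TYPES = ("String", "Number")


@dataclass(frozen=True)
class GroupTag:
    """Tag as handled on a model group: key plus type."""
    key: str
    type: str

    def __post_init__(self) -> None:
        # Only the two types named in the proposal are accepted.
        if self.type not in TAG_TYPES:
            raise ValueError(f"unsupported tag type: {self.type!r}")


@dataclass(frozen=True)
class VersionTag:
    """Tag as handled on a model version: key plus value."""
    key: str
    value: Union[str, float]


group_tags = [GroupTag("tag1", "String"), GroupTag("tag2", "Number")]
version_tags = [VersionTag("tag1", "abc"), VersionTag("tag2", 1.0)]
```

Keeping the type only on the group side means a model version never declares a type itself; it has to be looked up from the group when a value is written or queried.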

Changes to Index Mapping

Since the model group and model version indexes already exist in OpenSearch (.plugins-ml-model-group and .plugins-ml-model), we first need to modify their mapping definitions to accommodate the new tags field.

In both indexes, the tags field can be defined either as a list whose elements are nested tag objects (with the attributes key and type in .plugins-ml-model-group, and key and value in .plugins-ml-model), or as a map of key-value pairs (key to type in .plugins-ml-model-group, and key to value in .plugins-ml-model).

If we design tags as a list:

Advantages

  • The structure is clearer: each tag is an independent object containing all relevant information.
  • It is easy to query, because the attributes of the tag object can be queried directly.
  • It is easier to add and delete tags.

Disadvantages

  • It occupies more storage space.
  • Aggregation analysis is harder, because the list must be expanded first.

If we design tags as a map:

Advantages

  • It uses less storage space.
  • Aggregation analysis is easier, because statistics can be computed directly on the key.

Disadvantages

  • The structure is less clear: keys and values are scattered, so the information is not centralized.
  • Queries are relatively complex, because the key must be retrieved first and then the value.
  • Adding and deleting tags requires updating both the key and the value.

We recommend defining the tags field as a list, for the following reasons:

  • It won't take up too much extra storage space.
  • Tags need to be added and deleted frequently, and list operations are simpler.
  • Querying the tag object directly is more intuitive, and the information is centralized.
  • A map object that holds many key-value pairs can become too large and cause storage pressure.
  • The need for aggregate statistics is limited, so the more complex map structure is not necessary.
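To make the two candidate layouts concrete, here is the same pair of model group tags written both ways, with helper functions that convert between them. The helper names list_to_map and map_to_list are made up for this sketch.

```python
# The same two model-group tags in the two candidate layouts.
tags_as_list = [
    {"key": "tag1", "type": "String"},
    {"key": "tag2", "type": "Number"},
]
tags_as_map = {"tag1": "String", "tag2": "Number"}


def list_to_map(tags):
    """Collapse a list of tag objects into a key -> type map."""
    return {tag["key"]: tag["type"] for tag in tags}


def map_to_list(tags):
    """Expand a key -> type map into a list of tag objects."""
    return [{"key": key, "type": tag_type} for key, tag_type in tags.items()]
```

In the list layout every tag uses the same two attribute names, so the index mapping stays fixed however many tags exist. In the map layout each tag key becomes its own field name.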

The mapping of .plugins-ml-model-group index may be:

{    
    "mappings": {
        //other fields
        "properties": {
            //other fields
            "tags": {
                "properties": {
                    "key": {
                        "type": "keyword"
                    },
                    "type": {
                        "type": "keyword"
                    }
                }
            },
            //other fields
        }
    }
}

The mapping of .plugins-ml-model index may be:

{    
    "mappings": {
        //other fields
        "properties": {
            //other fields
            "tags": {
                "properties": {
                    "key": {
                        "type": "keyword"
                    },
                    "value_s": {
                        "type": "keyword"
                    },
                    "value_n": {
                        "type": "float"
                    }
                }
            },
            //other fields
        }
    }
}
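
One point that may be worth verifying for the list design: both mappings above declare tags with plain properties and no "type": "nested". In OpenSearch, an array of objects under a plain object field is flattened at index time, so the pairing between key and value inside one tag object is not preserved. The plain-Python simulation below only illustrates that effect. It is not OpenSearch code, and whether the effect matters depends on how tags end up being queried.

```python
def flatten(tags):
    """Mimic how a non-nested object array is indexed: one multi-valued
    field per attribute, with the pairing between attributes lost."""
    flat = {}
    for tag in tags:
        for field, value in tag.items():
            flat.setdefault(f"tags.{field}", []).append(value)
    return flat


def matches_flattened(tags, key, value_s):
    """Key and value are checked independently of each other."""
    flat = flatten(tags)
    return key in flat.get("tags.key", []) and value_s in flat.get("tags.value_s", [])


def matches_per_object(tags, key, value_s):
    """Key and value must occur in the same tag object."""
    return any(t.get("key") == key and t.get("value_s") == value_s for t in tags)


doc_tags = [
    {"key": "tag1", "value_s": "abc"},
    {"key": "tag2", "value_s": "xyz"},
]
```

With doc_tags, a search for key "tag1" with value "xyz" matches under the flattened view even though no single tag has that combination. Mapping tags as nested is the usual way to keep each tag object independent, at the cost of having to use nested queries.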

Querying the .plugins-ml-model-group index may return:

"_source": {
    // other fields
    "tags": [
       {
         "key": "tag1",
         "type": "String"
       },
       ...
       {
         "key": "tag10",
         "type": "Number"
       }
    ]
}

Querying the .plugins-ml-model index may return:

"_source": {
    // other fields
    "tags": [
       {
         "key": "tag1",
         "value_s": "abc"     
       },
       ...
       {
         "key": "tag10",
         "value_n": 100
       }
    ]
}

Why do we define the two fields "value_s" and "value_n"? The explanation is as follows.

Consider the following scenario:

If a user adds a tag with a certain type, and the same tag previously existed with a different type, then adding a tag value to a model version document could cause an exception, because the value's type conflicts with the field type recorded in the index mapping.

In our testing, this problem did not occur when "value_s" and "value_n" were both defined in the mapping of the .plugins-ml-model index from the beginning. "value_s" holds the value of a tag whose type is String, and "value_n" holds the value of a tag whose type is Number.
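
A minimal sketch of how a value could be routed to one of the two fields according to the tag type. The function name build_version_tag is hypothetical; only the field names key, value_s, and value_n come from the proposed mapping.

```python
def build_version_tag(key, value, tag_type):
    """Build one element of the tags list for a model version document.

    The value goes into value_s for String tags and into value_n for
    Number tags, so each field always holds a single, fixed data type.
    """
    if tag_type == "String":
        return {"key": key, "value_s": str(value)}
    if tag_type == "Number":
        # float() raises ValueError for input such as "abc".
        return {"key": key, "value_n": float(value)}
    raise ValueError(f"unsupported tag type: {tag_type!r}")
```

Because a String value never lands in value_n and a Number value never lands in value_s, a tag whose type changed over time cannot conflict with the data type already recorded for either field.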

Let's take an example. Assume that we define the mapping of the .plugins-ml-model index as follows:

{    
    "mappings": {
        //other fields
        "properties": {
            //other fields
            "tags": {
                "properties": {
                    "key": {
                        "type": "keyword"
                    },
                    "value_s": {
                        "type": "keyword"
                    },
                    "value_n": {
                        "type": "float"
                    }
                }
            },
            //other fields
        }
    }
}

With this mapping in place, adding a new document that carries two tags, one with a string value and one with a numeric value, works without problems. For example:

PUT /<index_name>/_doc/1/
requestbody:
{
    //other fields
    "tags": [
        {
            "key": "tag1",
            "value_s": "abc"
        },
        {
            "key": "tag2",
            "value_n": 1.0
        }
    ]
}

The insertion returns success, after which the search API is called, returning the two tags that have been inserted.

GET /_plugins/_ml/models/_search
requestbody:
{
  "query": {
    "bool": {
      "must": [
        {
          "match_all": {}
        }
      ],
      "must_not": {
        "exists": {
          "field": "chunk_number"
        }
      }
    }
  },
  "from": 0,
  "size": 10
}
responsebody:{
    "took": 2,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [
            {
                "_index": ".plugins-ml-model",
                "_id": "1",
                "_version": 1,
                "_seq_no": 0,
                "_primary_term": 1,
                "_score": 1.0,
                "_source": {
                    "tags": [
                        {
                            "value_s": "abc",
                            "key": "tag1"
                        },
                        {
                            "key": "tag2",
                            "value_n": 1.0
                        }
                    ]
                }
            }
        ]
    }
}

With tags defined as a list, our experiments show that match_all, match, term, terms, and range queries can all find the tags we want. If a user modifies the value of a tag, even changing the type of the value, our queries still return the desired result, because a change of type also changes the field being queried: once the tag is a String we query the "value_s" field, and once it is a Number we query the "value_n" field.

There are some additional benefits to using these 2 fields, which we describe in detail below.

In the model group scenario, updating a tag generally means updating its type field. Suppose we change the type from String to Number. Without these 2 fields, the value of the tag would also have to be updated wherever the tag is used in the .plugins-ml-model index; otherwise the existing value would be incompatible with the new Number type and would cause an error. With these 2 fields, all existing value_s content can remain in place. After the type is changed to Number, the value_n field simply has no content yet, so subsequent queries on value_n find no data until a value is set.

Next, consider the model version scenario, where the value of a tag of type String is changed to a number, for example from "abc" to 1.0. We first need to check whether the type of this tag, as defined in the .plugins-ml-model-group index, is String or Number. If the type is still String, queries use the value_s field, so they will not find the tag value that was updated to 1.0. If the type has been changed to Number, queries use the value_n field and will find the tag whose value became 1.0, not a tag with the same key whose type is still String.
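
The lookup described here, reading the tag type from the model group and then choosing which value field to query, might look roughly like this. The function name build_tag_query is hypothetical, and the query body is a sketch that assumes tags are indexed as in the proposed mapping.

```python
def build_tag_query(group_tags, key, value):
    """Build a search body that filters model versions by one tag.

    group_tags is the tags list stored on the model group; its type
    entry decides whether value_s or value_n is queried.
    """
    declared = {tag["key"]: tag["type"] for tag in group_tags}
    if key not in declared:
        raise KeyError(f"tag {key!r} is not defined on the model group")
    value_field = "tags.value_s" if declared[key] == "String" else "tags.value_n"
    # If tags were mapped as nested, these clauses would need to be
    # wrapped in a nested query with path "tags".
    return {
        "query": {
            "bool": {
                "must": [
                    {"term": {"tags.key": key}},
                    {"term": {value_field: value}},
                ]
            }
        }
    }


group_tags = [
    {"key": "tag1", "type": "String"},
    {"key": "tag2", "type": "Number"},
]
query = build_tag_query(group_tags, "tag2", 1.0)
```

Here the query for tag2 targets tags.value_n. If the group later changed tag2 to String, the same call would target tags.value_s instead and would no longer see the old numeric value.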

To summarize, defining these 2 fields keeps subsequent tag queries and updates in the model group and model version scenarios compatible with each other: they do not cause compatibility errors, and they do not make query results drift further from what we expect.

With the mappings for these 2 indexes redefined, let's look at how to manipulate tags in the CRUD operations of model group and model version in the following scenarios, and design the APIs that may need to be added or modified.

Model Group Scenario

Define the tag key in the model group's API, and perform CRUD operations on the key.

In fact, we do not need to redesign the query operation for this case, because the current query API already exists and does not need to be redefined or modified. The same holds for the model version scenario, so the query operation is not considered there either.

Add Tag

When registering or modifying a model group, we need to be able to create a new tag.

model group register

Registering a new group in the model group index saves all the selected tag information (key) in the model group index.

See https://opensearch.org/docs/latest/ml-commons-plugin/model-access-control#registering-a-model-group

The content of RequestBody is:

{
// other fields
"tags": [
        {
            "key": "tag1",
            "type": "String"
        },
        {
            "key": "tag2",
            "type": "Number"
        }
 ],
 // other fields
}

The earlier sections explain why the tags field in this request body is designed as a list.

model group update

Here we have 2 options: reuse the original model group update API, or create a new API dedicated to tags.

  • reuse the original API

When we use the model group update API, we specify the tag keys to be added in the request body as a list.

The model group update API is documented at https://opensearch.org/docs/latest/ml-commons-plugin/model-access-control#updating-a-model-group

A possible requestbody would look like this:

PUT /_plugins/_ml/model_groups/<model_group_id>
requestbody:
{
"tags": [
    {
      "key": "tag1",
      "type": "String"
    },
    {
      "key": "tag2",
      "type": "Number"
    }
 ]
}

Note that the tags field here holds the latest list of tags after the user has added, deleted, and modified tags in the UI and is about to update the model group. It includes new and modified tags; tags to be deleted do not appear in this list.

The advantages and disadvantages of this approach are as follows:

Advantages

* The development effort is relatively small.

Disadvantages

* The API may not know which tags are being added, modified, or deleted in this model group update.

  • create the new API

The URL and request body might look like this:

PUT /_plugins/_ml/model_groups/<model_group_id>/tags
requestbody:
{
"tags": [
    {
      "key": "tag1",
      "type": "String"
    },
    {
      "key": "tag2",
      "type": "Number"
    }
 ]
}

This latest tags list is likewise the result of the user adding, deleting, and modifying tags in the UI.

The advantages and disadvantages of this approach are exactly the opposite of those above.

We prefer the first approach (reusing the original API), because it requires few code changes: the update only needs to write the latest tags list to the model group.

Update Tag

When modifying a model group, we need to be able to update an existing tag.

model group update

We have already discussed how to update the tag's type in this scenario when describing the benefits of using the "value_s" and "value_n" fields.

A possible requestbody would look like this:

{  
"tags": [
    {
      "key": "tag1",
      "type": "Number"
    },
    {
      "key": "tag2",
      "type": "String"
    }
 ]
}

This latest tags list is likewise the result of the user adding, deleting, and modifying tags in the UI.

Delete tag

When modifying a model group, we can also delete an existing tag.

model group update

A possible requestbody would look like this:

{  
"tags": [
    {
      "key": "tag1",
      "type": "String"
    },
    {
      "key": "tag2",
      "type": "Number"
    }
 ]
}

This latest tags list is likewise the result of the user adding, deleting, and modifying tags in the UI. Therefore, the tags that do not appear in the latest list are the ones to delete.
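
Since the request body always carries the full latest list, the server side can still work out which tags were added, modified, or deleted by comparing that list with the stored one. A minimal sketch, with the hypothetical helper name diff_tags:

```python
def diff_tags(current, latest):
    """Compare the stored model-group tags with the latest list sent
    in an update request and classify each tag key."""
    current_types = {tag["key"]: tag["type"] for tag in current}
    latest_types = {tag["key"]: tag["type"] for tag in latest}
    return {
        # Keys that appear only in the latest list.
        "added": sorted(k for k in latest_types if k not in current_types),
        # Keys present in both lists whose type changed.
        "modified": sorted(
            k for k in latest_types
            if k in current_types and latest_types[k] != current_types[k]
        ),
        # Keys that are missing from the latest list.
        "deleted": sorted(k for k in current_types if k not in latest_types),
    }


current = [
    {"key": "tag1", "type": "String"},
    {"key": "tag2", "type": "Number"},
    {"key": "tag3", "type": "String"},
]
latest = [
    {"key": "tag1", "type": "Number"},
    {"key": "tag2", "type": "Number"},
    {"key": "tag4", "type": "String"},
]
changes = diff_tags(current, latest)
```

For this input, tag4 is reported as added, tag1 as modified, and tag3 as deleted. A comparison like this could offset the main disadvantage of reusing the original update API, at the cost of reading the stored document before each update.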

Model Version Scenario

The Model Group Scenario section already explains why the query operation does not need to be considered. Here we cover the scenarios for adding, updating, and deleting tags.

Add Tag

When registering or modifying a model version, we need to be able to create a new tag.

model version register

Registering a new model version in the model version index saves all the tag information (key and value) for the tags that were selected from the model group index.

See https://opensearch.org/docs/latest/ml-commons-plugin/api/#registering-a-model

The content of RequestBody is:

{
// other fields
"tags": [
    {
      "key": "tag1",
      "value_s": "abc"
    },
    {
      "key": "tag2",
      "value_n": 1.0
    }
 ],
 // other fields
}

model version update

Here there are also 2 options: reuse the original model version update API, or create a new API dedicated to tags.

  • reuse the original API

OpenSearch does not yet provide an API to modify a model version; this API is currently under development.

When we use this update API, we specify the tag keys to be added in the request body as a list. If there are no tags to add, we do not declare the tags field in the request body.

A possible requestbody would look like this:

PUT _plugins/_ml/models/<model_id>/update
requestbody:
{
// other fields
"tags": [
    {
      "key": "tag1",
      "value_s": "abc"
    },
    {
      "key": "tag2",
      "value_n": 1.0
    }
 ]
 // other fields
}

  • create the new API

The URL and request body might look like this:

PUT /_plugins/_ml/model/<model_id>/tags
requestbody:
{
    "tags": [
        {
            "key": "tag1",
            "value_s": "abc"
        },
        {
            "key": "tag2",
            "value_n": 1.0
        }
 ]
}

As with the model group, we prefer the first approach (reusing the original API), for the same reasons.

As described previously, the tags list in this request body is the result of the user adding, deleting, and modifying tags in the UI.

Update Tag

When modifying a model version, we need to be able to update an existing tag.

model version update

As noted above, this API is still under development, but tag updates can be implemented on top of it once it is available.

A possible requestbody would look like this:

PUT _plugins/_ml/models/<model_id>/update
requestbody:
{
"tags": [
    {
      "key": "tag1",
      "value_s": "abc"
    },
    {
      "key": "tag2",
      "value_n": 1.0
    }
 ]
}

As described previously, the tags list in this request body is the result of the user adding, deleting, and modifying tags in the UI.

We need to pay attention to tags whose value has been modified: how do we detect a value that was changed from String to Number, or from Number to String, when such a change should be an error?

Possible impacts:

Suppose we have defined a tag in the model group index with the type Number. When we add or update the value of this tag in a related model version, a String value entered by the user is unacceptable.

Measures to address negative impacts:

We already discussed how to update the tag's value in this scenario when describing the benefits of the "value_s" and "value_n" fields. If the value is changed to a String, the new String value is written to the "value_s" field; otherwise, the original "value_n" field is simply updated. Subsequent queries are not affected.
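
If a type mismatch should be reported rather than silently stored, the tags in a model version request could be checked against the model group's tag definitions before the update is applied. This is a sketch under that assumption; the names FIELD_FOR_TYPE and find_type_mismatches are hypothetical.

```python
# Which value field a tag of each declared type is expected to use.
FIELD_FOR_TYPE = {"String": "value_s", "Number": "value_n"}


def find_type_mismatches(group_tags, version_tags):
    """Return a message for every model-version tag whose populated
    value field does not match the type declared on the model group."""
    declared = {tag["key"]: tag["type"] for tag in group_tags}
    problems = []
    for tag in version_tags:
        key = tag["key"]
        if key not in declared:
            problems.append(f"{key}: not defined on the model group")
            continue
        expected_field = FIELD_FOR_TYPE[declared[key]]
        if expected_field not in tag:
            problems.append(f"{key}: expected field {expected_field}")
    return problems


group_tags = [
    {"key": "tag1", "type": "String"},
    {"key": "tag2", "type": "Number"},
]
version_tags = [
    {"key": "tag1", "value_s": "abc"},
    {"key": "tag2", "value_s": "oops"},
]
problems = find_type_mismatches(group_tags, version_tags)
```

Here tag2 is declared as Number on the group but arrives with value_s, so it is the only tag reported. Whether such a request is rejected or accepted is a policy choice that the proposal leaves open.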

Delete tag

When modifying a model version, we need to be able to delete an existing tag.

model version update

A possible requestbody would look like this:

PUT _plugins/_ml/models/<model_id>/update
requestbody:
{
//other fields
"tags": [
    {
      "key": "tag1",
      "value_s": "abc"
    },
    {
      "key": "tag2",
      "value_n": 1.0
    }
 ]
}

As described previously, the tags list in this request body is the result of the user adding, deleting, and modifying tags in the UI. Therefore, the tags that do not appear in the latest list are the ones to delete.

Additional Context?

Not yet, feel free to help us add to the list, thanks!
