
[ML] add _cat/ml/trained_models API #51529

Merged

Conversation


@benwtrent benwtrent commented Jan 28, 2020

This adds _cat/ml/trained_models.

Certain pieces of data are in the config but do not exist in the stats response. Additionally, it would be nice to conditionally show which data frame analytics job created the model (if the job still exists).

Examples:

# GET _cat/ml/trained_models?v
id                           heap_size operations ingest.pipelines
ddddd-1580216177138          3.5mb     196        0
flight-regress-1580215685537 1.7mb     102        0
lang_ident_model_1           1mb       39629      0

# GET _cat/ml/trained_models?h=*&v
id                           created_by heap_size operations license  create_time              version description                                                    data_frame_analytics_id ingest.pipelines ingest.count ingest.time ingest.current ingest.failed
ddddd-1580216177138              _xpack 3.5mb     196        PLATINUM 2020-01-28T12:56:17.138Z 8.0.0                                                                  ddddd                   0                0            0s          0              0
flight-regress-1580215685537     _xpack 1.7mb     102        PLATINUM 2020-01-28T12:48:05.537Z 8.0.0                                                                  flight-regress          0                0            0s          0              0
lang_ident_model_1               _xpack 1mb       39629      BASIC    2019-12-05T12:28:34.594Z 7.6.0   Model used for identifying language from arbitrary input text. __none__                0                0            0s          0              0

# GET _cat/ml/trained_models?help
id                      |                       | the trained model id                                                          
created_by              | c,createdBy           | who created the model                                                         
heap_size               | hs,modelHeapSize      | the estimated heap size to keep the model in memory                           
operations              | o,modelOperations     | the estimated number of operations to use the model                           
license                 | l                     | The license level of the model                                                
create_time             | ct                    | The time the model was created                                                
version                 | v                     | The version of Elasticsearch when the model was created                       
description             | d                     | The model description                                                         
data_frame_analytics_id | df,dataFrameAnalytics | The data frame analytics config id that created the model (if still available)
ingest.pipelines        | ip,ingestPipelines    | The number of pipelines referencing the model                                 
ingest.count            | ic,ingestCount        | The total number of docs processed by the model                               
ingest.time             | it,ingestTime         | The total time spent processing docs with this model                          
ingest.current          | icurr,ingestCurrent   | The total documents currently being handled by the model                      
ingest.failed           | if,ingestFailed       | The total count of failed ingest attempts with this model                     

The tricky code here is finding the data frame analytics configs that match up with the trained models. If folks request thousands of trained models in this call, and each one has 10+ unique tags, our data frame analytics query could end up too large. I think it is right to throw in that situation (which happens automatically if the paging params are out of bounds). Folks can narrow this request with trained model ids and paging params.
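For instance, a bounded request might look like this (a sketch based on the paging and matching params discussed in this PR; the `flight*` id pattern and the exact query-string spelling are illustrative):

```console
# Page through a bounded slice of matching models instead of fetching all of them
GET _cat/ml/trained_models/flight*?from=0&size=100&allow_no_match=true&v
```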

closes #51414

@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

Member

@davidkyle davidkyle left a comment


Code LGTM. I'd like a 3rd opinion on what fields are returned and the cat params from a DFAer.

Why not return all trained models instead of just DFA models, then mark which come from a DFA? This could be done with tags, although that would be a problem for existing DFA models that don't have the tag. Maybe do the opposite: tag user-generated models and filter by that tag?

Additionally, conditionally knowing what data frame analytics job created the model
Yes that would be nice

}
GetTrainedModelsStatsAction.Request statsRequest = new GetTrainedModelsStatsAction.Request(modelId);
GetTrainedModelsAction.Request modelsAction = new GetTrainedModelsAction.Request(modelId, false, null);
if (restRequest.hasParam(PageParams.FROM.getPreferredName()) || restRequest.hasParam(PageParams.SIZE.getPreferredName())) {
Member


This is the only cat action that supports paging

Member Author


Correct, and it probably should, since there is a limit of 10k documents when reading configs from an index. Our _cat APIs are the only ones that return data stored in indices (I think).

]
},
"params":{
"allow_no_match":{
Member


cat indices does not have the allow_no_indices option. Is this conventional?

Member Author


This matches up with our GET <resource> pattern.

I could remove the option and make it always true.
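For context, the rest-api-spec entry under discussion might look roughly like this (a sketch, not the exact spec file; types and descriptions are paraphrased):

```json
{
  "params": {
    "allow_no_match": {
      "type": "boolean",
      "required": false,
      "description": "Whether to ignore if a wildcard expression matches no trained models (default: true)"
    },
    "from": {
      "type": "int",
      "description": "skips a number of trained models"
    },
    "size": {
      "type": "int",
      "description": "specifies a max number of trained models to get"
    }
  }
}
```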

Set<String> potentialAnalyticsIds = new HashSet<>();
// Analytics Configs are created by the XPackUser
trainedModelConfigs.stream()
.filter(c -> XPackUser.NAME.equals(c.getCreatedBy()))
Member


I think we want a better way of differentiating user models and DFA models in the future. Maybe a reserved tag for DFA models

Member Author


@davidkyle possibly. Users cannot set the created_by, and XPackUser.NAME is a reserved name.

But I see your point for models we provide as a resource. Those weren't created by a DFA.
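The created_by check quoted above can be sketched in isolation. This is a minimal stand-in, not the real Elasticsearch classes: `ModelConfig` is a hypothetical record replacing `TrainedModelConfig`, and `"_xpack"` is the internal user name visible in the `created_by` column of the example output.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PotentialAnalyticsIds {
    static final String XPACK_USER = "_xpack"; // reserved internal user name

    // Hypothetical stand-in for a trained model config: only the fields we filter on.
    record ModelConfig(String modelId, String createdBy) {}

    // Models created by the internal _xpack user may have come from a data frame
    // analytics job, so their ids are the candidates to look up analytics configs for.
    static Set<String> potentialAnalyticsIds(List<ModelConfig> configs) {
        Set<String> ids = new HashSet<>();
        configs.stream()
            .filter(c -> XPACK_USER.equals(c.createdBy()))
            .map(ModelConfig::modelId)
            .forEach(ids::add);
        return ids;
    }

    public static void main(String[] args) {
        List<ModelConfig> configs = List.of(
            new ModelConfig("flight-regress-1580215685537", "_xpack"),
            new ModelConfig("custom-model", "some_user"));
        // Only the _xpack-created model survives the filter.
        System.out.println(potentialAnalyticsIds(configs));
    }
}
```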


client.execute(GetTrainedModelsStatsAction.INSTANCE,
statsRequest,
ActionListener.wrap(groupedListener::onResponse, groupedListener::onFailure));
Member


Do you need the wrap?

Member Author


Yes, generic types get upset if there is no wrapper.

Map<String, String> analyticsMap = analyticsConfigs.stream()
.map(DataFrameAnalyticsConfig::getId)
.collect(Collectors.toMap(Function.identity(), Function.identity()));
logger.warn("ANALYTICS MAP " + analyticsMap);
Member


left over debug?
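For reference, the `Collectors.toMap(identity, identity)` idiom quoted above in a standalone form (hypothetical ids; in practice a `Set` would express the same membership check):

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class AnalyticsIdMap {
    // Build an id -> id lookup so that membership checks and value lookups share
    // one structure. Note Collectors.toMap throws on duplicate keys, so the input
    // ids must be unique.
    static Map<String, String> idMap(List<String> analyticsIds) {
        return analyticsIds.stream()
            .collect(Collectors.toMap(Function.identity(), Function.identity()));
    }

    public static void main(String[] args) {
        System.out.println(idMap(List.of("ddddd", "flight-regress")));
    }
}
```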

@benwtrent
Member Author

I'd like a 3rd opinion on what fields are returned and the cat params from a DFAer.

🤔 interesting, like a data_frame prefixed set of options. I will see what I can do

@davidkyle
Member

I'd like a 3rd opinion on what fields are returned and the cat params from a DFAer.

🤔 interesting, like a data_frame prefixed set of options. I will see what I can do

I meant another review from someone who has worked on the DFA code, but what you're suggesting also sounds good.

@benwtrent
Member Author

Why not return all trained models instead of just DFA models, then mark which come from a DFA.

The _cat API does that. Those that don't have a DFA are flagged as such (dataframe id is __none__).

@benwtrent
Member Author

@elasticmachine update branch

Member

@davidkyle davidkyle left a comment


LGTM

@benwtrent benwtrent removed the request for review from dimitris-athanasiou February 4, 2020 12:01
@Winterflower

What does "the estimated number of operations to use the model" mean?
Sorry, super n00b question, but what operations are we talking about here? The number of times a model has been "called" by a job?

@benwtrent
Member Author

benwtrent commented Feb 4, 2020

What does this mean the estimated number of operations to use the model?
Sorry super n00b question, but what operations are we talking about here? The number of times a model has been "called" by a job?

It is a way to help users gauge model "complexity": an estimate of the number of arithmetic operations needed to run inference with the model.

Having both memory and arithmetic-operation estimates lets users distinguish "simple" from "complex" models.

@Winterflower

@Winterflower

What does this mean the estimated number of operations to use the model?
Sorry super n00b question, but what operations are we talking about here? The number of times a model has been "called" by a job?

It is a way to help users measure model "complexity". It is an estimation of the number of arithmetic operations to use the model in inference.

Having memory + arithmetic operations allows users to make decisions around "simple" vs "complex" models.

@Winterflower

Thanks for the reply @benwtrent ! I originally assumed that you were using the number of operations as a proxy for "model complexity" (as in when we say that a neural network has a higher complexity than a linear model), which is where the confusion seemed to arise. But Valeriy has clarified that you are indeed using the operations number to estimate computational complexity not informational complexity.

@benwtrent benwtrent merged commit 374eca7 into elastic:master Feb 5, 2020
@benwtrent benwtrent deleted the feature/ml-_cat-trainedmodels-api branch February 5, 2020 12:09
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Feb 5, 2020
This adds _cat/ml/trained_models.
benwtrent added a commit that referenced this pull request Feb 5, 2020
* [ML] add _cat/ml/trained_models API (#51529)

This adds _cat/ml/trained_models.
Successfully merging this pull request may close these issues.

Create GET _cat/ml/trained_models API