[ML] Add PyTorch model configuration #71035
Conversation
Pinging @elastic/ml-core (Team:ML)
@@ -173,9 +180,12 @@
// Model
namedWriteables.add(new NamedWriteableRegistry.Entry(TrainedModel.class, Tree.NAME.getPreferredName(), Tree::new));
namedWriteables.add(new NamedWriteableRegistry.Entry(TrainedModel.class, Ensemble.NAME.getPreferredName(), Ensemble::new));
namedWriteables.add(new NamedWriteableRegistry.Entry(LangIdentNeuralNetwork.class,
namedWriteables.add(new NamedWriteableRegistry.Entry(TrainedModel.class,
Because the model definitions are always streamed as compressed strings, we never look up the named writeable for TrainedModel.
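For context, a minimal, self-contained sketch of what "streamed as a compressed string" means here, using plain JDK gzip and base64 rather than the PR's actual compressor class (the helper name is hypothetical):

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPOutputStream;

public class CompressedDefinitionSketch {

    // Hypothetical helper: gzip a JSON model definition and base64-encode it,
    // so it can travel as an opaque string instead of a parsed object graph.
    static String compressDefinition(String definitionJson) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
            gzip.write(definitionJson.getBytes(StandardCharsets.UTF_8));
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    public static void main(String[] args) throws Exception {
        System.out.println(compressDefinition("{\"trained_model\":{\"tree\":{}}}"));
    }
}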
// if (ExceptionsHelper.requireNonNull(estimatedOperations, ESTIMATED_OPERATIONS) < 0) {
//     throw new IllegalArgumentException("[" + ESTIMATED_OPERATIONS.getPreferredName() + "] must be greater than or equal to 0");
// }
Why is this commented out?
🤷
public long estimatedNumOperations() {
    return 0;
}

@Override
public long ramBytesUsed() {
    return SHALLOW_SIZE;
}
I know this is temporary, but we will definitely need this populated in the future. That way we know whether there are enough free resources to assign a model to the node.

Also, I suggest simply making estimatedNumOperations return 1 if we are not setting it for now.
I've put this on my TODO list. Yes, the model will use memory, but that is native memory, not JVM heap, so for the purpose of accounting the shallow size is right. Perhaps we add a long nativeRamBytesUsed() method for models loaded in a native process.
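A hedged sketch of what that split might look like. nativeRamBytesUsed() is the hypothetical method floated above, not an existing interface; only Lucene's Accountable is real here:

import org.apache.lucene.util.Accountable;

// Hypothetical interface: ramBytesUsed() (inherited from Lucene's Accountable)
// keeps reporting JVM-heap usage, while memory consumed inside the native
// (C++) process is reported separately.
public interface NativelyAccountable extends Accountable {

    // Bytes the loaded model occupies in the native process, not the JVM heap.
    long nativeRamBytesUsed();
}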
Before release this needs to be integrated with the MlMemoryTracker and NodeLoadDetector classes. They will need to track the memory requirement for the 3 types of things we now have (anomaly detector jobs, data frame analytics jobs and native trained models), and then the node selectors will need to take into account the sum of all the requirements on each node.

I guess it's a non-trivial problem to know how much memory a PyTorch model will require when loaded given only the size of its .pt file to work with. We'll have to do some experiments and see if there's an approximate formula that gives reasonable results. We also need to account for the fact that the .pt file is held in the C++ process's heap while it loads the model, so the total requirement will be sizeof(loaded model) + sizeof(.pt file) + sizeof(static overhead) + sizeof(code).
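A rough sketch of the accounting described above. Every constant is an illustrative placeholder that the proposed experiments would have to calibrate, not a measured value:

// Hypothetical estimate of the native memory needed to load a PyTorch model,
// given only the size of its .pt file. All constants are placeholders.
public class PyTorchMemoryEstimate {

    private static final long STATIC_OVERHEAD_BYTES = 10L * 1024 * 1024;   // placeholder
    private static final long CODE_SIZE_BYTES = 200L * 1024 * 1024;        // placeholder
    private static final double LOADED_MODEL_MULTIPLIER = 2.0;             // placeholder

    // The .pt file is held in the C++ process's heap while the model loads,
    // so it is counted on top of the loaded model itself.
    static long estimateBytes(long ptFileSizeBytes) {
        long loadedModel = (long) (ptFileSizeBytes * LOADED_MODEL_MULTIPLIER);
        return loadedModel + ptFileSizeBytes + STATIC_OVERHEAD_BYTES + CODE_SIZE_BYTES;
    }

    public static void main(String[] args) {
        // e.g. a ~420 MB BERT .pt file
        System.out.println(estimateBytes(420L * 1024 * 1024));
    }
}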
Resolved review threads (outdated):
...c/main/java/org/elasticsearch/xpack/core/ml/inference/trainedmodel/pytorch/PyTorchModel.java
...c/main/java/org/elasticsearch/xpack/core/ml/inference/trainedmodel/pytorch/PyTorchModel.java
.../plugin/core/src/main/java/org/elasticsearch/xpack/core/ml/inference/TrainedModelConfig.java
...l/src/main/java/org/elasticsearch/client/ml/inference/trainedmodel/pytorch/PyTorchModel.java
...lugin/ml/src/main/java/org/elasticsearch/xpack/ml/action/TransportPutTrainedModelAction.java
...lugin/ml/src/main/java/org/elasticsearch/xpack/ml/action/TransportPutTrainedModelAction.java
/^ id \s+ heap_size \s+ operations \s+ create_time \s+ ingest\.pipelines \s+ data_frame\.id \n
(a\-regression\-model\-0 \s+ \w+ \s+ \d+ \s+ .*? \s+ \d+ \s+ .*? \n)+ $/

/id\s+heap_size\s+operations\s+create_time\s+type\s+ingest\.pipelines\s+data_frame\.id\s*\n
(a\-regression\-model\-0\s+\w+\s+\d+\s+\S+\s+tree_ensemble\s+\d+\s+\w+\n)+$/
The spaces are there for readability.
It may be more readable, but it makes the test very brittle. I've found this is still quite readable, as each column is delineated by a \s+ and the regex is easier to grok.
> It may be more readable but it makes the test very brittle.

I don't understand how it makes the test more brittle. Tests don't fail more or less often because of the spaces, and to me it seems WAY easier to know which column my current regex clause is messing with when there is whitespace.
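For what it's worth, whitespace inside a pattern can be made insignificant by compiling the regex in comments mode, which is what makes the spaced style possible without changing what the pattern matches. A minimal illustration with plain java.util.regex (whether the YAML test runner actually uses this flag is an assumption here):

import java.util.regex.Pattern;

public class VerboseRegexDemo {
    public static void main(String[] args) {
        // With Pattern.COMMENTS, unescaped whitespace in the pattern is
        // ignored, so spacing can be used purely for readability.
        Pattern spaced = Pattern.compile(
            "^ id \\s+ heap_size \\s+ operations $", Pattern.COMMENTS);
        Pattern compact = Pattern.compile("^id\\s+heap_size\\s+operations$");

        String header = "id    heap_size  operations";
        System.out.println(spaced.matcher(header).matches());  // true
        System.out.println(compact.matcher(header).matches()); // true
    }
}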
Resolved review thread (outdated):
x-pack/plugin/src/yamlRestTest/resources/rest-api-spec/test/ml/trained_model_cat_apis.yml
LGTM2
This reverts commit 83fc8c98c7592690790157b9ad1a0c6e4e781.
Adds the model_type field to TrainedModelConfig for distinguishing between models
The feature branch contains changes to configure PyTorch models with a TrainedModelConfig and defines a format for storing the binary models. The _start and _stop deployment actions control the model lifecycle, and the model can be evaluated directly with the _infer endpoint. Two types of NLP task are supported: Named Entity Recognition and Fill Mask. The feature branch consists of these PRs: #73523, #72218, #71679, #71323, #71035, #71177, #70713
Adds the model_type field to TrainedModelConfig for distinguishing between models that can be loaded via the model loading service and those that require a native process. model_type now appears in the CAT trained models action.

Existing models without a model_type must be either a tree ensemble or the lang ident model. I've added the field to the lang ident config, so all models with a null model_type must be a tree ensemble. model_type is set on creation of new models, either by the user or by interrogating the TrainedModelDefinition. I didn't want to break the API for existing users by requiring model_type to be set, since it can be set automatically.

The new class PyTorchModel implements TrainedModel and has a simple definition, which is just the ID of the PyTorch model it uses. Loading the PyTorch model and checking it exists when the TrainedModelConfig is PUT is a TODO.
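A minimal sketch of the shape described above. The TrainedModel stand-in below assumes only the two methods visible in the diff snippets earlier; the real class in the PR also handles parsing and wire serialization:

import org.apache.lucene.util.RamUsageEstimator;

// Slimmed-down stand-in for the PR's TrainedModel interface (assumption:
// only the two methods shown in the diff snippets above).
interface TrainedModel {
    long estimatedNumOperations();
    long ramBytesUsed();
}

public class PyTorchModel implements TrainedModel {

    private static final long SHALLOW_SIZE =
        RamUsageEstimator.shallowSizeOfInstance(PyTorchModel.class);

    // The whole definition is just the ID of the stored PyTorch model.
    private final String modelId;

    public PyTorchModel(String modelId) {
        this.modelId = modelId;
    }

    public String getModelId() {
        return modelId;
    }

    @Override
    public long estimatedNumOperations() {
        return 0; // placeholder, per the review discussion above
    }

    @Override
    public long ramBytesUsed() {
        return SHALLOW_SIZE; // JVM-heap footprint only; native memory is tracked elsewhere
    }
}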