[ML] Add PyTorch model configuration #71035

Merged
davidkyle merged 9 commits into elastic:feature/pytorch-inference on Apr 1, 2021

Conversation

davidkyle (Member)

Adds the model_type field to TrainedModelConfig for distinguishing between models that can be loaded via the model loading service and those that require a native process. model_type now appears in the CAT trained models action.

Existing models without a model_type must be either a tree ensemble or the lang ident model. I've added the field to the lang ident config, so all models with a null model_type must be a tree ensemble. model_type is set on creation of new models, either by the user or by interrogating the TrainedModelDefinition. I didn't want to break the API for existing users by requiring that model_type be set, since it can be set automatically.
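
As a rough illustration of the automatic path, deriving model_type from the definition might look something like the sketch below; the helper name, return type and the non-tree-ensemble string values are assumptions rather than code from this PR (only tree_ensemble appears verbatim in the cat API test further down).

// Hypothetical sketch only: derive model_type from the definition when the user did not set it.
static String deriveModelType(TrainedModel trainedModel) {
    if (trainedModel instanceof Tree || trainedModel instanceof Ensemble) {
        return "tree_ensemble";   // value asserted in the cat API test in this PR
    }
    if (trainedModel instanceof LangIdentNeuralNetwork) {
        return "lang_ident";      // assumed value for the lang ident model
    }
    return "pytorch";             // assumed value for models that require a native process
}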

The new class PyTorchModel implements TrainedModel and has a simple definition which is just the ID of the PyTorch model it uses. Loading the PyTorch model and checking that it exists when the TrainedModelConfig is PUT is a TODO.
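
For orientation, a minimal sketch of what such a TrainedModel implementation could look like; the field name modelId, the constructor and the getter are assumptions, other TrainedModel interface methods are elided, and the two overrides mirror the diff discussed below.

import org.apache.lucene.util.RamUsageEstimator;

// Sketch only, not the merged class: the definition carries nothing but the PyTorch model's ID.
public class PyTorchModel implements TrainedModel {

    private static final long SHALLOW_SIZE = RamUsageEstimator.shallowSizeOfInstance(PyTorchModel.class);

    private final String modelId;

    public PyTorchModel(String modelId) {
        this.modelId = modelId;
    }

    public String getModelId() {
        return modelId;
    }

    @Override
    public long estimatedNumOperations() {
        return 0;               // no estimate yet for native models
    }

    @Override
    public long ramBytesUsed() {
        return SHALLOW_SIZE;    // JVM footprint only; native memory is a separate concern
    }

    // other TrainedModel methods elided in this sketch
}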

@davidkyle davidkyle added >feature :ml Machine learning labels Mar 30, 2021
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Mar 30, 2021
@elasticmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

@@ -173,9 +180,12 @@
// Model
namedWriteables.add(new NamedWriteableRegistry.Entry(TrainedModel.class, Tree.NAME.getPreferredName(), Tree::new));
namedWriteables.add(new NamedWriteableRegistry.Entry(TrainedModel.class, Ensemble.NAME.getPreferredName(), Ensemble::new));
namedWriteables.add(new NamedWriteableRegistry.Entry(LangIdentNeuralNetwork.class,
namedWriteables.add(new NamedWriteableRegistry.Entry(TrainedModel.class,
davidkyle (Member Author)

Because the model definitions are always streamed as compressed strings, we never look up the named writeable for TrainedModel.

Comment on lines 174 to 176
// if (ExceptionsHelper.requireNonNull(estimatedOperations, ESTIMATED_OPERATIONS) < 0) {
// throw new IllegalArgumentException("[" + ESTIMATED_OPERATIONS.getPreferredName() + "] must be greater than or equal to 0");
// }
Member

Why is this commented out?

davidkyle (Member Author)

🤷

Comment on lines +85 to +92
public long estimatedNumOperations() {
return 0;
}

@Override
public long ramBytesUsed() {
return SHALLOW_SIZE;
}
Member

I know this is temporary, but we will definitely need this populated in the future. This way we know whether there are enough free resources to assign a model to the node.

Also, I suggest simply making estimatedNumOperations return 1 if we are not setting it for now.

davidkyle (Member Author)

I've put this on my TODO list. Yes, the model will use memory, but that is native memory, not JVM heap, so for the purpose of accounting the shallow size is right. Perhaps we add a long nativeRamBytesUsed() method for models loaded in a native process.
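
Taken together, the two suggestions in this thread might end up looking roughly like the following on a native model class; nativeRamBytesUsed is purely hypothetical and is not part of this PR.

@Override
public long estimatedNumOperations() {
    return 1;                   // reviewer's suggestion: report at least one operation until a real estimate exists
}

@Override
public long ramBytesUsed() {
    return SHALLOW_SIZE;        // JVM accounting only
}

// Hypothetical: memory the loaded model occupies inside the native process,
// kept separate from the JVM heap accounting above.
public long nativeRamBytesUsed() {
    return 0;                   // TODO populate once a native estimate is available
}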

Contributor

Before release this needs to be integrated with the MlMemoryTracker and NodeLoadDetector classes. They will need to track memory requirement for the 3 types of things we now have (anomaly detector jobs, data frame analytics jobs and native trained models), and then the node selectors will need to take into account the sum of all the requirements on each node.

I guess it's a non-trivial problem to know how much memory a PyTorch model will require when loaded given only the size of its .pt file to work with. We'll have to do some experiments and see if there's an approximate formula that gives reasonable results. We also need to account for the fact that the .pt file is held in the C++ process's heap while it loads the model, so total requirement will be sizeof(loaded model) + sizeof(.pt file) + sizeof(static overhead) + sizeof(code).
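
A back-of-the-envelope sketch of the kind of estimate described above; both constants are placeholders to be replaced by whatever the experiments suggest.

// Illustrative only: the real formula has to come from experiments with actual .pt files.
static long estimatePyTorchModelMemoryBytes(long ptFileSizeBytes) {
    long staticOverheadBytes = 10L * 1024 * 1024;   // placeholder for static overhead + code
    long loadedModelBytes = 2 * ptFileSizeBytes;    // placeholder multiplier for the loaded model
    // the .pt file sits in the C++ process's heap while the model loads, so count it as well
    return loadedModelBytes + ptFileSizeBytes + staticOverheadBytes;
}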

Comment on lines 90 to 92
/^ id \s+ heap_size \s+ operations \s+ create_time \s+ ingest\.pipelines \s+ data_frame\.id \n
(a\-regression\-model\-0 \s+ \w+ \s+ \d+ \s+ .*? \s+ \d+ \s+ .*? \n)+ $/
/id\s+heap_size\s+operations\s+create_time\s+type\s+ingest\.pipelines\s+data_frame\.id\s*\n
(a\-regression\-model\-0\s+\w+\s+\d+\s+\S+\s+tree_ensemble\s+\d+\s+\w+\n)+$/
Member

Spaces are there for readability.

davidkyle (Member Author)

It may be more readable, but it makes the test very brittle. I've found this is still quite readable as each column is delineated by a \s+ and the regex is easier to grok.

Member

It may be more readable but it makes the test very brittle.

I don't understand how it makes the test more brittle. Tests don't fail more or less often because of the spaces, and to me it seems WAY easier to know which column my current regex clause is dealing with when there are whitespaces.

@dimitris-athanasiou (Contributor) left a comment

LGTM2

@davidkyle davidkyle changed the base branch from feature/pytorch-inference to master April 1, 2021 15:01
@davidkyle davidkyle changed the base branch from master to feature/pytorch-inference April 1, 2021 15:01
@davidkyle (Member Author)

Adds the model_type field to TrainedModelConfig for distinguishing between models that can be loaded via the model loading service and those that require a native process.

@davidkyle davidkyle merged commit 99ed8b0 into elastic:feature/pytorch-inference Apr 1, 2021
@davidkyle davidkyle deleted the config branch April 1, 2021 16:01
davidkyle added a commit that referenced this pull request Jun 3, 2021
The feature branch contains changes to configure PyTorch models with a 
TrainedModelConfig and defines a format to store the binary models. 
The _start and _stop deployment actions control the model lifecycle 
and the model can be directly evaluated with the _infer endpoint. 
Two types of NLP tasks are supported: Named Entity Recognition and Fill Mask.

The feature branch consists of these PRs: #73523, #72218, #71679
#71323, #71035, #71177, #70713
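
To make the lifecycle above concrete, here is a sketch using the low-level Java REST client; the endpoint paths, the model ID and the request body are assumptions based on this description, not a reference for the final API.

import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class PyTorchDeploymentSketch {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            String modelId = "my-ner-model";   // hypothetical model ID

            // start a deployment of the stored PyTorch model (path is a guess)
            client.performRequest(new Request("POST",
                "/_ml/trained_models/" + modelId + "/deployment/_start"));

            // evaluate the model directly via the _infer endpoint (path and body are guesses)
            Request infer = new Request("POST",
                "/_ml/trained_models/" + modelId + "/deployment/_infer");
            infer.setJsonEntity("{\"docs\": [{\"text_field\": \"Elasticsearch was created by Shay Banon\"}]}");
            Response response = client.performRequest(infer);
            System.out.println(response.getStatusLine());

            // stop the deployment when finished
            client.performRequest(new Request("POST",
                "/_ml/trained_models/" + modelId + "/deployment/_stop"));
        }
    }
}

The point is the shape of the lifecycle, start a deployment, run inference against it, stop it, rather than the exact URLs, which belong to the linked PRs.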