[ML] Add PyTorch model configuration #71035

Merged
davidkyle merged 9 commits into elastic:feature/pytorch-inference on Apr 1, 2021

Conversation

davidkyle (Member)

Adds the model_type field to TrainedModelConfig for distinguishing between models that can be loaded via the model loading service and those that require a native process. model_type now appears in the CAT trained models action.

Existing models without a model_type must be either a tree ensemble or the lang ident model. I've added the field to the lang ident config, so all models with a null model_type must be a tree ensemble. model_type is set on creation of new models, either by the user or by interrogating the TrainedModelDefinition. I didn't want to break the API for existing users by requiring that model_type be set, since it can be set automatically.
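
As a rough illustration of the automatic path, deriving model_type from the definition might look something like the sketch below; the helper name, return type and the non-tree-ensemble string values are assumptions rather than code from this PR (only tree_ensemble appears verbatim in the cat API test further down).

// Hypothetical sketch only: derive model_type from the definition when the user did not set it.
static String deriveModelType(TrainedModel trainedModel) {
    if (trainedModel instanceof Tree || trainedModel instanceof Ensemble) {
        return "tree_ensemble";   // value asserted in the cat API test in this PR
    }
    if (trainedModel instanceof LangIdentNeuralNetwork) {
        return "lang_ident";      // assumed value for the lang ident model
    }
    return "pytorch";             // assumed value for models that require a native process
}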

The new class PyTorchModel implements TrainedModel and has a simple definition which is just the ID of the PyTorch model it uses. Loading the PyTorch model and checking that it exists when the TrainedModelConfig is PUT is a TODO.
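
For orientation, a minimal sketch of what such a TrainedModel implementation could look like; the field name modelId, the constructor and the getter are assumptions, other TrainedModel interface methods are elided, and the two overrides mirror the diff discussed below.

import org.apache.lucene.util.RamUsageEstimator;

// Sketch only, not the merged class: the definition carries nothing but the PyTorch model's ID.
public class PyTorchModel implements TrainedModel {

    private static final long SHALLOW_SIZE = RamUsageEstimator.shallowSizeOfInstance(PyTorchModel.class);

    private final String modelId;

    public PyTorchModel(String modelId) {
        this.modelId = modelId;
    }

    public String getModelId() {
        return modelId;
    }

    @Override
    public long estimatedNumOperations() {
        return 0;               // no estimate yet for native models
    }

    @Override
    public long ramBytesUsed() {
        return SHALLOW_SIZE;    // JVM footprint only; native memory is a separate concern
    }

    // other TrainedModel methods elided in this sketch
}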

@davidkyle davidkyle added >feature :ml Machine learning labels Mar 30, 2021
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Mar 30, 2021
@elasticmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

@@ -173,9 +180,12 @@
// Model
namedWriteables.add(new NamedWriteableRegistry.Entry(TrainedModel.class, Tree.NAME.getPreferredName(), Tree::new));
namedWriteables.add(new NamedWriteableRegistry.Entry(TrainedModel.class, Ensemble.NAME.getPreferredName(), Ensemble::new));
namedWriteables.add(new NamedWriteableRegistry.Entry(LangIdentNeuralNetwork.class,
namedWriteables.add(new NamedWriteableRegistry.Entry(TrainedModel.class,
davidkyle (Member Author)

Because the model definitions are always streamed as compressed strings, we never look up the named writeable for TrainedModel.

Comment on lines 174 to 176
// if (ExceptionsHelper.requireNonNull(estimatedOperations, ESTIMATED_OPERATIONS) < 0) {
// throw new IllegalArgumentException("[" + ESTIMATED_OPERATIONS.getPreferredName() + "] must be greater than or equal to 0");
// }
Member

Why is this commented out?

davidkyle (Member Author)

🤷

Comment on lines +85 to +92
public long estimatedNumOperations() {
return 0;
}

@Override
public long ramBytesUsed() {
return SHALLOW_SIZE;
}
Member

I know this is temporary, but we will definitely need this populated in the future. This way we know whether there are enough free resources to assign a model to the node.

Also, I suggest simply making estimatedNumOperations return 1 if we are not setting it for now.

davidkyle (Member Author)

I've put this on my TODO list. Yes, the model will use memory, but that is native memory, not JVM heap, so for the purpose of accounting the shallow size is right. Perhaps we add a long nativeRamBytesUsed() method for models loaded in a native process.
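
Taken together, the two suggestions in this thread might end up looking roughly like the following on a native model class; nativeRamBytesUsed is purely hypothetical and is not part of this PR.

@Override
public long estimatedNumOperations() {
    return 1;                   // reviewer's suggestion: report at least one operation until a real estimate exists
}

@Override
public long ramBytesUsed() {
    return SHALLOW_SIZE;        // JVM accounting only
}

// Hypothetical: memory the loaded model occupies inside the native process,
// kept separate from the JVM heap accounting above.
public long nativeRamBytesUsed() {
    return 0;                   // TODO populate once a native estimate is available
}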

Contributor

Before release this needs to be integrated with the MlMemoryTracker and NodeLoadDetector classes. They will need to track memory requirement for the 3 types of things we now have (anomaly detector jobs, data frame analytics jobs and native trained models), and then the node selectors will need to take into account the sum of all the requirements on each node.

I guess it's a non-trivial problem to know how much memory a PyTorch model will require when loaded given only the size of its .pt file to work with. We'll have to do some experiments and see if there's an approximate formula that gives reasonable results. We also need to account for the fact that the .pt file is held in the C++ process's heap while it loads the model, so total requirement will be sizeof(loaded model) + sizeof(.pt file) + sizeof(static overhead) + sizeof(code).
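
A back-of-the-envelope sketch of the kind of estimate described above; both constants are placeholders to be replaced by whatever the experiments suggest.

// Illustrative only: the real formula has to come from experiments with actual .pt files.
static long estimatePyTorchModelMemoryBytes(long ptFileSizeBytes) {
    long staticOverheadBytes = 10L * 1024 * 1024;   // placeholder for static overhead + code
    long loadedModelBytes = 2 * ptFileSizeBytes;    // placeholder multiplier for the loaded model
    // the .pt file sits in the C++ process's heap while the model loads, so count it as well
    return loadedModelBytes + ptFileSizeBytes + staticOverheadBytes;
}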

Comment on lines 90 to 92
/^ id \s+ heap_size \s+ operations \s+ create_time \s+ ingest\.pipelines \s+ data_frame\.id \n
(a\-regression\-model\-0 \s+ \w+ \s+ \d+ \s+ .*? \s+ \d+ \s+ .*? \n)+ $/
/id\s+heap_size\s+operations\s+create_time\s+type\s+ingest\.pipelines\s+data_frame\.id\s*\n
(a\-regression\-model\-0\s+\w+\s+\d+\s+\S+\s+tree_ensemble\s+\d+\s+\w+\n)+$/
Member

Spaces are there for readability.

davidkyle (Member Author)

It may be more readable, but it makes the test very brittle. I've found this is still quite readable as each column is delineated by a \s+ and the regex is easier to grok.

Member

It may be more readable but it makes the test very brittle.

I don't understand how it makes the test more brittle. Tests don't fail more or less often because of the spaces, and to me it seems WAY easier to know which column my current regex clause is dealing with when there are whitespaces.

@dimitris-athanasiou (Contributor) left a comment

LGTM2

@davidkyle davidkyle changed the base branch from feature/pytorch-inference to master April 1, 2021 15:01
@davidkyle davidkyle changed the base branch from master to feature/pytorch-inference April 1, 2021 15:01
@davidkyle (Member Author)

Adds the model_type field to TrainedModelConfig for distinguishing between models that can be loaded via the model loading service and those that require a native process.

@davidkyle davidkyle merged commit 99ed8b0 into elastic:feature/pytorch-inference Apr 1, 2021
@davidkyle davidkyle deleted the config branch April 1, 2021 16:01
davidkyle added a commit that referenced this pull request Jun 3, 2021
The feature branch contains changes to configure PyTorch models with a 
TrainedModelConfig and defines a format to store the binary models. 
The _start and _stop deployment actions control the model lifecycle 
and the model can be directly evaluated with the _infer endpoint. 
Two types of NLP tasks are supported: Named Entity Recognition and Fill Mask.

The feature branch consists of these PRs: #73523, #72218, #71679
#71323, #71035, #71177, #70713
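
To make the lifecycle above concrete, here is a sketch using the low-level Java REST client; the endpoint paths, the model ID and the request body are assumptions based on this description, not a reference for the final API.

import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class PyTorchDeploymentSketch {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            String modelId = "my-ner-model";   // hypothetical model ID

            // start a deployment of the stored PyTorch model (path is a guess)
            client.performRequest(new Request("POST",
                "/_ml/trained_models/" + modelId + "/deployment/_start"));

            // evaluate the model directly via the _infer endpoint (path and body are guesses)
            Request infer = new Request("POST",
                "/_ml/trained_models/" + modelId + "/deployment/_infer");
            infer.setJsonEntity("{\"docs\": [{\"text_field\": \"Elasticsearch was created by Shay Banon\"}]}");
            Response response = client.performRequest(infer);
            System.out.println(response.getStatusLine());

            // stop the deployment when finished
            client.performRequest(new Request("POST",
                "/_ml/trained_models/" + modelId + "/deployment/_stop"));
        }
    }
}

The point is the shape of the lifecycle, start a deployment, run inference against it, stop it, rather than the exact URLs, which belong to the linked PRs.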