[ML] Natural Language Processing tasks and models #73523

Merged
davidkyle merged 21 commits into elastic:feature/pytorch-inference from bert-tokenizer
Jun 2, 2021

Conversation

Member

@davidkyle davidkyle commented May 28, 2021

Following on from #72218, which defined how large PyTorch models can be stored, this PR introduces the concept of Natural Language Processing tasks and defines a way to evaluate BERT models.

Mask Fill and Named Entity Recognition tasks are implemented here, but others can easily be added now that the framework is in place. In particular, this PR implements tokenisation of input text for BERT models and defines a structure for post-graph processing.
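
As a rough illustration of the tokenisation step, BERT vocabularies are normally applied with greedy longest-match-first WordPiece splitting. The sketch below is not the tokenizer added in this PR: the class name `WordPieceSketch`, the toy vocabulary and the whitespace-only pre-splitting are assumptions made for the example, and a real BERT tokenizer also normalises case and splits on punctuation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative only: greedy longest-match-first WordPiece splitting.
// Class name, vocabulary and whitespace pre-splitting are assumptions, not code from this PR.
class WordPieceSketch {

    static final String UNKNOWN = "[UNK]";
    static final String CONTINUATION = "##";

    // Splits a single whitespace-delimited word into vocabulary pieces.
    static List<String> tokenizeWord(String word, Set<String> vocab) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            // Shrink the candidate from the right until it is found in the vocabulary.
            while (start < end) {
                String candidate = word.substring(start, end);
                if (start > 0) {
                    // Pieces that do not start the word carry the continuation prefix.
                    candidate = CONTINUATION + candidate;
                }
                if (vocab.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) {
                // Nothing matched at this position: the whole word becomes the unknown token.
                return List.of(UNKNOWN);
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    // Wraps the pieces of every word in the [CLS] and [SEP] special tokens.
    static List<String> tokenize(String text, Set<String> vocab) {
        List<String> tokens = new ArrayList<>();
        tokens.add("[CLS]");
        for (String word : text.trim().split("\\s+")) {
            tokens.addAll(tokenizeWord(word, vocab));
        }
        tokens.add("[SEP]");
        return tokens;
    }
}
```

With a vocabulary containing `play` and `##ing`, the word `playing` splits into `play` and `##ing`, while a word with no matching pieces maps to `[UNK]`.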

Once the PyTorch model is uploaded, a trained model config referencing it must be PUT:

PUT _ml/trained_models/bert-model-for-maskfill
{
    "description": "Mask fill model",
    "model_type": "pytorch",
    "inference_config": {
        "classification": {
            "num_top_classes": 1
        }
    },
    "input": {
        "field_names": ["text_field"]
    },
    "location": {
        "index": {
            "model_id": "bert-model-for-maskfill",
            "name": "big_model"
        }
    }
}

The model is then deployed:

POST _ml/trained_models/deployment/bert-model-for-maskfill/_start

Mask Fill Example

POST _ml/trained_models/deployment/bert-model-for-maskfill/_infer
{
  "input": "Paris is the [MASK] of France."
}

Returns:

[
  {
    "token" : "capital",
    "score" : 0.9861745037766138,
    "sequence" : "Paris is the capital of France."
  },
  {
    "token" : "center",
    "score" : 0.00372138405614492,
    "sequence" : "Paris is the center of France."
  },
  {
    "token" : "Capital",
    "score" : 0.003259749401778711,
    "sequence" : "Paris is the Capital of France."
  },
  {
    "token" : "centre",
    "score" : 0.002157122475609145,
    "sequence" : "Paris is the centre of France."
  },
  {
    "token" : "city",
    "score" : 9.026127599384262E-4,
    "sequence" : "Paris is the city of France."
  }
]
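
A response of this shape can be produced by a post-processing step that applies a softmax to the model's logits at the `[MASK]` position, keeps the highest-scoring tokens and substitutes each one back into the input. The following is a minimal sketch under those assumptions; `FillMaskSketch`, `Candidate` and the tiny vocabulary are hypothetical and do not reproduce the `FillMaskProcessor` in this PR.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative only: turns logits at the mask position into ranked candidates.
// All names here are hypothetical and are not taken from this PR.
class FillMaskSketch {

    static final String MASK = "[MASK]";

    // One candidate replacement for the masked token.
    static final class Candidate {
        final String token;
        final double score;
        final String sequence;

        Candidate(String token, double score, String sequence) {
            this.token = token;
            this.score = score;
            this.sequence = sequence;
        }
    }

    // Numerically stable softmax: subtract the largest logit before exponentiating.
    static double[] softmax(double[] logits) {
        double max = Double.NEGATIVE_INFINITY;
        for (double logit : logits) {
            max = Math.max(max, logit);
        }
        double sum = 0.0;
        double[] probs = new double[logits.length];
        for (int i = 0; i < logits.length; i++) {
            probs[i] = Math.exp(logits[i] - max);
            sum += probs[i];
        }
        for (int i = 0; i < probs.length; i++) {
            probs[i] /= sum;
        }
        return probs;
    }

    // Returns the k most probable vocabulary entries for the masked position.
    static List<Candidate> topK(String input, double[] maskLogits, String[] vocab, int k) {
        double[] probs = softmax(maskLogits);
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < probs.length; i++) {
            ids.add(i);
        }
        // Sort vocabulary ids by descending probability.
        ids.sort(Comparator.comparingDouble((Integer i) -> probs[i]).reversed());
        List<Candidate> result = new ArrayList<>();
        for (int id : ids.subList(0, Math.min(k, ids.size()))) {
            result.add(new Candidate(vocab[id], probs[id], input.replace(MASK, vocab[id])));
        }
        return result;
    }
}
```

The scores across the whole vocabulary sum to 1, which is consistent with the response above, where the five returned scores add up to slightly less than 1.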

NER Example

POST _ml/trained_models/deployment/bert-model-fine-tuned-for-ner/_infer
{
  "input": "Today's GAH is live from Amsterdam, BC, London, Munich and Texas"
}

Returns:

[
  {
    "label" : "organisation",
    "score" : 0.940775243737086,
    "word" : "GAH"
  },
  {
    "label" : "location",
    "score" : 0.9987588832004948,
    "word" : "Amsterdam"
  },
  {
    "label" : "location",
    "score" : 0.9958452874139202,
    "word" : "BC"
  },
  {
    "label" : "location",
    "score" : 0.9981461858828271,
    "word" : "London"
  },
  {
    "label" : "location",
    "score" : 0.9991212183928049,
    "word" : "Munich"
  },
  {
    "label" : "location",
    "score" : 0.9994121461792658,
    "word" : "Texas"
  }
]
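
For NER, the post-processing can be pictured as choosing the highest-scoring label for each token and dropping tokens classified as outside any entity. The sketch below assumes the per-token scores are already probabilities and uses `O` as the outside label; `NerSketch`, `Entity` and the label set are hypothetical and do not show how this PR implements the task.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: picks the best label per token and skips non-entities.
// All names and the "O" outside label are assumptions, not code from this PR.
class NerSketch {

    static final String OUTSIDE = "O";

    // One recognised entity, mirroring the fields of the response above.
    static final class Entity {
        final String label;
        final double score;
        final String word;

        Entity(String label, double score, String word) {
            this.label = label;
            this.score = score;
            this.word = word;
        }
    }

    // scores[i][j] is the probability that tokens[i] carries labels[j].
    static List<Entity> extract(String[] tokens, double[][] scores, String[] labels) {
        List<Entity> entities = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            // Find the index of the most probable label for this token.
            int best = 0;
            for (int j = 1; j < labels.length; j++) {
                if (scores[i][j] > scores[i][best]) {
                    best = j;
                }
            }
            // Tokens labelled as outside any entity are not reported.
            if (!OUTSIDE.equals(labels[best])) {
                entities.add(new Entity(labels[best], scores[i][best], tokens[i]));
            }
        }
        return entities;
    }
}
```

This sketch treats every token as a whole word; a fuller version would also merge WordPiece continuation tokens back into the word they belong to.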

Feature branch PR

Co-authored-by: Dimitris Athanasiou [email protected]

@davidkyle davidkyle added >feature :ml Machine learning labels May 28, 2021
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label May 28, 2021
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@benwtrent
Member

run elasticsearch-ci/part-1

@mark-vieira
Contributor

jenkins test this please

Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment

Looks good. Just a couple of test-related comments.

import org.elasticsearch.test.ESTestCase;


public class TaskTypeTests extends ESTestCase {

This one is left empty. Should we add some tests here?

import java.util.List;
import java.util.stream.Collectors;

public class FillMaskProcessor implements NlpTask.Processor {

We should add some tests for this one.

Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment

LGTM. Just a question about the name of the fill mask results field. Good to merge, though, even if you decide to change that.

@@ -26,25 +26,25 @@
public static final String NAME = "fill_mask_result";
public static final String DEFAULT_RESULTS_FIELD = "results";

Should this also be predictions?

@@ -27,40 +26,49 @@
this.bertRequestBuilder = new BertRequestBuilder(tokenizer);
}

@Override
public void validateInputs(String inputs) {

nice!

@davidkyle davidkyle merged commit 8e51034 into elastic:feature/pytorch-inference Jun 2, 2021
@davidkyle davidkyle deleted the bert-tokenizer branch June 2, 2021 10:13
davidkyle added a commit that referenced this pull request Jun 3, 2021
The feature branch contains changes to configure PyTorch models with a 
TrainedModelConfig and defines a format to store the binary models. 
The _start and _stop deployment actions control the model lifecycle 
and the model can be directly evaluated with the _infer endpoint. 
Two types of NLP tasks are supported: Named Entity Recognition and Fill Mask.

The feature branch consists of these PRs: #73523, #72218, #71679,
#71323, #71035, #71177, #70713