[ML] Natural Language Processing tasks and models #73523

davidkyle · 2021-05-28T11:39:12Z · edited by dimitris-athanasiou

Following on from #72218, which defined how large PyTorch models can be stored, this PR introduces the concept of Natural Language Processing tasks and defines a way to evaluate BERT models.

Mask Fill and Named Entity Recognition tasks are implemented here, but others can easily be added now that the framework is in place. In particular, this PR implements tokenisation of input text for BERT models and defines a structure for post-graph processing.
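For readers unfamiliar with BERT tokenisation, here is a minimal sketch of the greedy longest-match-first WordPiece split that BERT vocabularies are built around. It is an illustration only, not the code in this PR: the class name `WordPieceSketch`, the tiny vocabulary, and the `doLowerCase` parameter (named after the `do_lower_case` setting mentioned in the commits) are assumptions, and the sketch skips punctuation splitting and does not protect special tokens such as [MASK] from lower-casing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of WordPiece tokenisation; not the PR's tokenizer.
class WordPieceSketch {
    static final String UNKNOWN = "[UNK]";
    static final String CONTINUATION = "##";

    // Splits one whitespace-delimited word into the longest vocabulary pieces,
    // scanning left to right. Pieces after the first carry the "##" prefix.
    static List<String> tokenizeWord(String word, Set<String> vocab) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            while (start < end) {
                String candidate = word.substring(start, end);
                if (start > 0) {
                    candidate = CONTINUATION + candidate;
                }
                if (vocab.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) {
                // No piece matches at this position: the whole word is unknown.
                return List.of(UNKNOWN);
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    // Wraps the word pieces in the [CLS] and [SEP] tokens that BERT expects.
    static List<String> tokenize(String text, Set<String> vocab, boolean doLowerCase) {
        String normalized = doLowerCase ? text.toLowerCase(Locale.ROOT) : text;
        List<String> tokens = new ArrayList<>();
        tokens.add("[CLS]");
        for (String word : normalized.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                tokens.addAll(tokenizeWord(word, vocab));
            }
        }
        tokens.add("[SEP]");
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> vocab = Set.of("paris", "un", "##aff", "##able");
        System.out.println(tokenize("Paris unaffable", vocab, true));
    }
}
```

In a real pipeline each token would then be mapped to its vocabulary id before being sent to the model.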

Once the PyTorch model is uploaded, a trained model config referencing it must be PUT:

PUT _ml/trained_models/bert-model-for-maskfill
{
    "description": "Mask fill model",
    "model_type": "pytorch",
    "inference_config": {
        "classification": {
            "num_top_classes": 1
        }
    },
    "input": {
        "field_names": ["text_field"]
    },
    "location": {
        "index": {
            "model_id": "bert-model-for-maskfill",
            "name": "big_model"
        }
    }
}

And the model deployed:

POST _ml/trained_models/deployment/bert-model-for-maskfill/_start

Mask Fill Example

POST _ml/trained_models/deployment/bert-model-for-maskfill/_infer
{
  "input": "Paris is the [MASK] of France."
}

Returns:

[
  {
    "token" : "capital",
    "score" : 0.9861745037766138,
    "sequence" : "Paris is the capital of France."
  },
  {
    "token" : "center",
    "score" : 0.00372138405614492,
    "sequence" : "Paris is the center of France."
  },
  {
    "token" : "Capital",
    "score" : 0.003259749401778711,
    "sequence" : "Paris is the Capital of France."
  },
  {
    "token" : "centre",
    "score" : 0.002157122475609145,
    "sequence" : "Paris is the centre of France."
  },
  {
    "token" : "city",
    "score" : 9.026127599384262E-4,
    "sequence" : "Paris is the city of France."
  }
]
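The response above lists the highest-scoring candidate tokens in descending order of score. The commit list mentions a top k function implemented with a priority queue; the sketch below shows one common way to do that with a bounded min-heap. It is an assumption about the general technique, not a copy of the PR's code, and the class and method names (`TopKSketch`, `topK`) are hypothetical.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical illustration of heap-based top-k selection; not the PR's code.
class TopKSketch {
    // Returns the indices of the k largest scores, best first.
    static int[] topK(double[] scores, int k) {
        if (k <= 0) {
            return new int[0];
        }
        // Min-heap ordered by score: the head is the weakest of the current best k.
        PriorityQueue<Integer> heap =
            new PriorityQueue<>(Comparator.comparingDouble((Integer i) -> scores[i]));
        for (int i = 0; i < scores.length; i++) {
            if (heap.size() < k) {
                heap.add(i);
            } else if (scores[i] > scores[heap.peek()]) {
                // The new score beats the weakest kept score, so replace it.
                heap.poll();
                heap.add(i);
            }
        }
        // Polling yields ascending scores, so fill the result from the back.
        int[] result = new int[heap.size()];
        for (int pos = result.length - 1; pos >= 0; pos--) {
            result[pos] = heap.poll();
        }
        return result;
    }

    public static void main(String[] args) {
        double[] scores = {0.1, 0.9, 0.3, 0.7, 0.2};
        System.out.println(java.util.Arrays.toString(topK(scores, 3)));
    }
}
```

The heap never holds more than k entries, so selecting from a vocabulary of n scores costs O(n log k) rather than the O(n log n) of a full sort.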

NER Example

POST _ml/trained_models/deployment/bert-model-fine-tuned-for-ner/_infer
{
  "input": "Today's GAH is live from Amsterdam, BC, London, Munich and Texas"
}

Returns:

[
  {
    "label" : "organisation",
    "score" : 0.940775243737086,
    "word" : "GAH"
  },
  {
    "label" : "location",
    "score" : 0.9987588832004948,
    "word" : "Amsterdam"
  },
  {
    "label" : "location",
    "score" : 0.9958452874139202,
    "word" : "BC"
  },
  {
    "label" : "location",
    "score" : 0.9981461858828271,
    "word" : "London"
  },
  {
    "label" : "location",
    "score" : 0.9991212183928049,
    "word" : "Munich"
  },
  {
    "label" : "location",
    "score" : 0.9994121461792658,
    "word" : "Texas"
  }
]
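Each NER result above pairs a label with a score between 0 and 1. A common way to derive such a pair from a token classification model is to apply softmax to the per-token class logits and report the most probable class. The sketch below illustrates that idea only; whether the PR's NER processor computes scores this way is an assumption, and the class name `NerLabelSketch`, the label set, and the logits are made up for the example.

```java
// Hypothetical illustration of turning class logits into a label and score;
// not the PR's NER processor.
class NerLabelSketch {
    // Numerically stable softmax: subtract the maximum before exponentiating.
    static double[] softmax(double[] logits) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : logits) {
            max = Math.max(max, v);
        }
        double sum = 0.0;
        double[] probs = new double[logits.length];
        for (int i = 0; i < logits.length; i++) {
            probs[i] = Math.exp(logits[i] - max);
            sum += probs[i];
        }
        for (int i = 0; i < probs.length; i++) {
            probs[i] /= sum;
        }
        return probs;
    }

    // Index of the largest value; the first one wins on ties.
    static int argMax(double[] values) {
        int best = 0;
        for (int i = 1; i < values.length; i++) {
            if (values[i] > values[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] labels = {"O", "location", "organisation"}; // assumed label set
        double[] logits = {0.1, 5.0, 0.2};                   // made-up model output
        double[] probs = softmax(logits);
        int best = argMax(probs);
        System.out.println(labels[best] + " " + probs[best]);
    }
}
```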

Feature branch PR

Co-authored-by: Dimitris Athanasiou

elasticmachine · 2021-05-28T11:39:16Z

Pinging @elastic/ml-core (Team:ML)

benwtrent · 2021-05-28T11:53:48Z

run elasticsearch-ci/part-1

mark-vieira · 2021-05-28T16:21:35Z

jenkins test this please

dimitris-athanasiou

Looks good. Just a couple of test related comments.

dimitris-athanasiou · 2021-06-01T11:26:56Z

x-pack/plugin/ml/src/test/java/org/elasticsearch/xpack/ml/inference/nlp/TaskTypeTests.java

+import org.elasticsearch.test.ESTestCase;
+
+
+public class TaskTypeTests extends ESTestCase {


This one is left empty. Should we add some tests here?

dimitris-athanasiou · 2021-06-01T11:27:37Z

x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/inference/nlp/FillMaskProcessor.java

+import java.util.List;
+import java.util.stream.Collectors;
+
+public class FillMaskProcessor implements NlpTask.Processor {


We should add some tests for this one

dimitris-athanasiou

LGTM. Just a question about the name of the fill mask results field. Good to merge, though, even if you decide to change that.

dimitris-athanasiou · 2021-06-01T16:08:20Z

...in/core/src/main/java/org/elasticsearch/xpack/core/ml/inference/results/FillMaskResults.java

@@ -26,25 +26,25 @@
    public static final String NAME = "fill_mask_result";
    public static final String DEFAULT_RESULTS_FIELD = "results";


Should this also be predictions?

dimitris-athanasiou · 2021-06-01T16:09:31Z

x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/inference/nlp/FillMaskProcessor.java

@@ -27,40 +26,49 @@
        this.bertRequestBuilder = new BertRequestBuilder(tokenizer);
    }

+    @Override
+    public void validateInputs(String inputs) {


The feature branch contains changes to configure PyTorch models with a TrainedModelConfig and defines a format for storing the binary models. The _start and _stop deployment actions control the model lifecycle, and the model can be evaluated directly with the _infer endpoint. Two types of NLP tasks are supported: Named Entity Recognition and Fill Mask. The feature branch consists of these PRs: #73523, #72218, #71679, #71323, #71035, #71177, #70713.

davidkyle added the >feature and :ml (Machine learning) labels May 28, 2021

elasticmachine added the Team:ML (Meta label for the ML team) label May 28, 2021

davidkyle force-pushed the bert-tokenizer branch from 45c44a4 to 2b5b0e2 June 1, 2021 08:10

davidkyle mentioned this pull request Jun 1, 2021

[ML] Add Trained Model Post-Processors #69571

Closed

dimitris-athanasiou reviewed Jun 1, 2021

dimitris-athanasiou approved these changes Jun 1, 2021

davidkyle and others added 20 commits June 1, 2021 21:23

WIP

e3fafa2

Add the tokenization pipeline

7b74033

Pass 'inputs' to infer request instead of the big whole doc

ea665e4

Add special tokens and do_lower_case setting

d38a054

Add pipeline post processor

558ce9f

Fixing tests

568fabd

Implement NER result processor

8ce7ede

Add fill_mask processor

f3aef86

Move results into core and add tests

5edff63

Drop Pipeline terminology

b76f14e

Remove big config file

5d2491f

Use a common BERT request builder

4b26720

Add top k function

9aa0457

Handle punctuation chars next to the [MASK] token

0f0424b

Ner Processor tests

bc050ae

tidy up

1788374

Heap based top k

92c4123

Implement top k using a priority queue

b7a4a7f

Fixes

b744aa3

Fill Mask test

5abec25

Check for error from pytorch results

ff2a6c1

davidkyle force-pushed the bert-tokenizer branch from 199e247 to ff2a6c1 June 1, 2021 20:24

davidkyle merged commit 8e51034 into elastic:feature/pytorch-inference Jun 2, 2021

davidkyle deleted the bert-tokenizer branch June 2, 2021 10:13

davidkyle mentioned this pull request Jun 2, 2021

[ML] Merge the pytorch-inference feature branch #73660

Merged
