match_dict.json format

melisa-qordoba edited this page Sep 23, 2020 · 1 revision

Here is a minimal match_dict.json:

{
  "extract-revenge": {
    "patterns": [
      {
        "LEMMA": "extract",
        "TEMPLATE_ID": 1
      }
    ],
    "suggestions": [
      [
        {
          "TEXT": "exact",
          "FROM_TEMPLATE_ID": 1
        }
      ]
    ],
    "match_hook": [
      {
        "name": "succeeded_by_phrase",
        "args": "revenge",
        "match_if_predicate_is": true
      }
    ],
    "test": {
      "positive": [
        "And at the same time extract revenge on those he so despises?",
        "Watch as Tampa Bay extracts revenge against his former Los Angeles Rams team."
      ],
      "negative": ["Mother flavours her custards with lemon extract."]
    }
  }
}
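
A file like this can be read with Python's standard json module. One caveat is that json.loads silently keeps the last of two identical keys, so a duplicated rule name would go unnoticed. The sketch below rejects duplicates instead; the helper names reject_duplicate_keys and load_match_dict are hypothetical and not part of replaCy.

```python
import json


def reject_duplicate_keys(pairs):
    """Build a dict from (key, value) pairs, refusing repeated keys."""
    result = {}
    for key, value in pairs:
        if key in result:
            raise ValueError(f"duplicate key in match dict: {key!r}")
        result[key] = value
    return result


def load_match_dict(text):
    """Parse match_dict JSON text, raising ValueError on any duplicate key."""
    # object_pairs_hook is called once for every JSON object, nested or not.
    return json.loads(text, object_pairs_hook=reject_duplicate_keys)


# Two rules that accidentally share the same name.
duplicated = """
{
  "extract-revenge": {"patterns": [{"LEMMA": "extract", "TEMPLATE_ID": 1}]},
  "extract-revenge": {"patterns": [{"LEMMA": "exact"}]}
}
"""

try:
    load_match_dict(duplicated)
except ValueError as err:
    print(err)
```
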
  • The top-level key, extract-revenge, must be unique (as must any dictionary key). The name is used as a unique identifier but is never shown.

  • The inner keys are as follows:

    • patterns - A spaCy Matcher pattern (strictly, a superset of one): a list of token dicts such as [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}] (shown in Python notation; in the JSON file, write true rather than True). The extension is that any of the dicts may also carry "TEMPLATE_ID": int. This labels that part of the match as a template to be inflected, such as a verb to conjugate or a noun to pluralize. In the example above, the lemma extract is labeled with a TEMPLATE_ID of 1.
    • suggestions - a list of lists of dicts. The dicts have 1-2 keys:
      • just "TEXT" (str), which will be used in the suggestion,
      • just "PATTERN_REF" (int), which will copy the PATTERN_REF's token from the matched text,
      • both "TEXT": "sometext" and "FROM_TEMPLATE_ID": int, which applies the conjugation or pluralization of the matched token labeled with that TEMPLATE_ID to "TEXT". In the example above, suggestions is [[{"TEXT":"exact","FROM_TEMPLATE_ID":1}]], so exact is inflected to agree with whatever form of extract was matched; a match on extracts yields the suggestion exacts,
      • both "PATTERN_REF" (int) and "INFLECTION" (str), an explicit POS tag. Use this to reference the PATTERN_REF's token from the pattern but conjugate it to a different form (so far I have only seen this used for grammar rules). Example: {"PATTERN_REF": 1, "INFLECTION": "VBN"} takes the second token from the matched pattern and conjugates it into the past participle.
    • match_hook - (despite the singular name) A list of "match hooks". These are Python functions which refine matches. See the following section.
    • test - has positive and negative keys. positive is a list of strings which this rule SHOULD match against, negative is a list of strings which SHOULD NOT match. Used for testing now, but we have plans to infer rules from this section.
    • (optional) comment - a string for other humans to read; ignored by replaCy
    • (optional) anything - you can add any extra structure here, and replaCy will attempt to tag matching spans with this information using spaCy's custom extension attribute namespace, span._ (see the spaCy docs). For example, you can add the key oogly with value "boogly" for the match "LOWER": "secret password". Then, after span = rmatcher("This is the secret password.")[0], span._.oogly == "boogly". replaCy also tries to choose sensible default values for these user-defined extensions. If a rule has the key-value pair "coolness": 10, replaCy infers that coolness is an int and, when it registers coolness on all spaCy spans, makes span._.coolness default to 0. This way you can check if span._.coolness > THRESHOLD on any span without causing an AttributeError. You can change the default the way you would change any spaCy custom attribute, e.g.
      from spacy.tokens import Span
    
      Span.set_extension("coolness", default=9000)
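
The default-value behavior described in the last bullet can be pictured with a small sketch. This is not replaCy's actual code: only the int case (10 gives a default of 0) is documented here, and extending it to other types by calling the type with no arguments is an assumption. The function name infer_default is hypothetical.

```python
def infer_default(value):
    """Return an 'empty' value of the same type as value (assumed behavior)."""
    # Only int -> 0 is documented; the other types are an assumption
    # made for illustration.
    try:
        return type(value)()
    except TypeError:
        # Types that cannot be built without arguments fall back to None.
        return None


# Extra keys as they might appear on a rule in match_dict.json.
rule_extras = {"coolness": 10, "oogly": "boogly"}

defaults = {name: infer_default(value) for name, value in rule_extras.items()}
print(defaults)  # {'coolness': 0, 'oogly': ''}
```

With defaults like these registered, a comparison such as span._.coolness > THRESHOLD is safe on spans that no rule has tagged.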

Between match hooks and custom span attributes, replaCy is incredibly powerful, and allows you to control your NLP application's behavior from a single JSON file.
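
As a rough illustration of the patterns and suggestions entries described above, the following sketch separates the replaCy-only TEMPLATE_ID key from the plain spaCy token attributes and names which of the four documented shapes a suggestion dict has. The function names and the kind labels are hypothetical, and whether replaCy strips TEMPLATE_ID in exactly this way is an assumption.

```python
# The four suggestion shapes listed above, keyed by the set of keys present.
SUGGESTION_KINDS = {
    frozenset({"TEXT"}): "literal text",
    frozenset({"PATTERN_REF"}): "copy of a matched token",
    frozenset({"TEXT", "FROM_TEMPLATE_ID"}): "text inflected like a template",
    frozenset({"PATTERN_REF", "INFLECTION"}): "matched token, re-inflected",
}


def to_spacy_pattern(pattern):
    """Return the pattern without the replaCy-only TEMPLATE_ID keys."""
    return [
        {key: value for key, value in token.items() if key != "TEMPLATE_ID"}
        for token in pattern
    ]


def suggestion_kind(item):
    """Classify one suggestion dict, or raise ValueError for an unknown shape."""
    kind = SUGGESTION_KINDS.get(frozenset(item))
    if kind is None:
        raise ValueError(f"unrecognized suggestion keys: {sorted(item)}")
    return kind


# The patterns and suggestions entries from the extract-revenge example.
rule = {
    "patterns": [{"LEMMA": "extract", "TEMPLATE_ID": 1}],
    "suggestions": [[{"TEXT": "exact", "FROM_TEMPLATE_ID": 1}]],
}

print(to_spacy_pattern(rule["patterns"]))
# [{'LEMMA': 'extract'}]
for alternative in rule["suggestions"]:
    print([suggestion_kind(item) for item in alternative])
# ['text inflected like a template']
```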
