match_dict.json format
melisa-qordoba edited this page Sep 23, 2020 · 1 revision
Here is a minimal match_dict.json:
{
  "extract-revenge": {
    "patterns": [
      {
        "LEMMA": "extract",
        "TEMPLATE_ID": 1
      }
    ],
    "suggestions": [
      [
        {
          "TEXT": "exact",
          "FROM_TEMPLATE_ID": 1
        }
      ]
    ],
    "match_hook": [
      {
        "name": "succeeded_by_phrase",
        "args": "revenge",
        "match_if_predicate_is": true
      }
    ],
    "test": {
      "positive": [
        "And at the same time extract revenge on those he so despises?",
        "Watch as Tampa Bay extracts revenge against his former Los Angeles Rams team."
      ],
      "negative": ["Mother flavours her custards with lemon extract."]
    }
  }
}
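As a rough illustration of the layout, the sketch below parses the same JSON with Python's standard json module and lists what each rule contains. The helper name summarize_rules is hypothetical and not part of replaCy; the sketch only inspects the structure and does not run any matching.

```python
import json

# The minimal match dict from above, inlined so the sketch runs on its own.
MATCH_DICT_JSON = """
{
  "extract-revenge": {
    "patterns": [{"LEMMA": "extract", "TEMPLATE_ID": 1}],
    "suggestions": [[{"TEXT": "exact", "FROM_TEMPLATE_ID": 1}]],
    "match_hook": [
      {"name": "succeeded_by_phrase", "args": "revenge", "match_if_predicate_is": true}
    ],
    "test": {
      "positive": [
        "And at the same time extract revenge on those he so despises?",
        "Watch as Tampa Bay extracts revenge against his former Los Angeles Rams team."
      ],
      "negative": ["Mother flavours her custards with lemon extract."]
    }
  }
}
"""


def summarize_rules(match_dict):
    """Hypothetical helper: describe each rule without running any matching."""
    summary = {}
    for rule_name, rule in match_dict.items():
        tests = rule.get("test", {})
        summary[rule_name] = {
            "pattern_tokens": len(rule.get("patterns", [])),
            "suggestions": len(rule.get("suggestions", [])),
            "hooks": [hook["name"] for hook in rule.get("match_hook", [])],
            "positive_tests": len(tests.get("positive", [])),
            "negative_tests": len(tests.get("negative", [])),
        }
    return summary


match_dict = json.loads(MATCH_DICT_JSON)
rule_summary = summarize_rules(match_dict)
print(rule_summary)
```

A summary like this can be handy for spotting rules that have no test sentences before handing the file to replaCy.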
- The top-level key, extract-revenge, must be unique (as must any dictionary key). The name is used as a unique identifier but is never shown.
- The inner keys are as follows:
  - patterns - A list of spaCy Matcher patterns (actually, a superset of a spaCy Matcher pattern), which may look like e.g. [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]. The added syntax which makes it a superset is being able to add "TEMPLATE_ID": int to some of the dicts. This labels that part of the match as a template to be inflected, such as a verb to conjugate or a noun to pluralize. In the above example, we label the lemma extract as having TEMPLATE_ID of 1.
  - suggestions - A list of lists of dicts. The dicts have 1-2 keys:
    - just "TEXT" (str), which will be used in the suggestion,
    - just "PATTERN_REF" (int), which will copy the PATTERN_REF's token from the matched text,
    - both "TEXT": "sometext" and "FROM_TEMPLATE_ID": int, which will apply the conjugation/pluralization of the token labeled with that TEMPLATE_ID to "TEXT". In the above example, suggestions is [[{"TEXT":"exact","FROM_TEMPLATE_ID":1}]], which means we will match the conjugation of exact to the conjugation of the matched form of extract (e.g. extracts), from the step above,
    - both "PATTERN_REF" (int) and "INFLECTION" (str), an explicit POS tag. Used when you want to reference the PATTERN_REF's token from the pattern, but conjugate it to a different form (so far I have only seen this used for grammar rules). Example: {"PATTERN_REF": 1, "INFLECTION": "VBN"} will take the second token from the matched pattern and conjugate it into the past participle.
  - match_hook - (despite the singular name) A list of "match hooks". These are Python functions which refine matches. See the following section.
  - test - Has positive and negative keys. positive is a list of strings which this rule SHOULD match against; negative is a list of strings which it SHOULD NOT match. Used for testing now, but we have plans to infer rules from this section.
  - (optional) comment - A string for other humans to read; ignored by replaCy.
  - (optional) anything - You can add any extra structure here, and replaCy will attempt to tag matching spans with this information using the spaCy custom extension attributes namespace span._ (spaCy docs). For example, you can add the key oogly with value "boogly" for the match "LOWER": "secret password". Then if you call span = rmatcher("This is the secret password.")[0], then span._.oogly == "boogly". replaCy tries to be cool about default values with user-defined extensions. If you have a match with the key-value pair "coolness": 10, replaCy will infer that coolness is an int. When it adds coolness to all spaCy spans, it will make it so span._.coolness defaults to 0. This way, you can check all spans for if span._.coolness > THRESHOLD and not cause an AttributeError. You can change this the way you would change any spaCy custom attribute, e.g.
    from spacy.tokens import Span
    # force=True overwrites the extension if it is already registered
    Span.set_extension("coolness", default=9000, force=True)
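The default-value behavior described for custom keys might work along the following lines. This is only a sketch of the idea, not replaCy's actual implementation: the helper infer_default is hypothetical, and the text above only confirms that an int defaults to 0; the empty-string default for str values is an assumption made here for illustration.

```python
def infer_default(example_value):
    """Hypothetical helper: return the "empty" value of the example's type.

    Calling a built-in type with no arguments yields its zero value,
    e.g. int() is 0 and str() is "". Assumed behavior, not replaCy's code.
    """
    return type(example_value)()


# Extra keys as they might appear on a rule in match_dict.json.
rule_extras = {"coolness": 10, "oogly": "boogly"}

# One default per key, so every span can be queried safely.
defaults = {key: infer_default(value) for key, value in rule_extras.items()}
print(defaults)
```

With defaults like these registered on every span, a check such as span._.coolness > THRESHOLD is safe even on spans that no rule with a coolness key has matched.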
Between match hooks and custom span attributes, replaCy allows you to control much of your NLP application's behavior from a single JSON file.
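To make the suggestion forms described earlier more concrete, here is a minimal sketch of how one suggestion (a list of dicts) could be turned into text. Everything in it is an assumption for illustration: render_suggestion and copy_trailing_s are hypothetical names, matched tokens are represented as plain strings rather than spaCy tokens, and the "inflection" is a toy that only copies a trailing "s". replaCy's real inflection is not shown in this page, and the INFLECTION form is deliberately left out.

```python
def copy_trailing_s(text, template_word):
    """Toy stand-in for real inflection: copy a trailing "s" only."""
    if template_word.endswith("s") and not text.endswith("s"):
        return text + "s"
    return text


def render_suggestion(suggestion, matched_words, template_words):
    """Hypothetical helper: turn one suggestion (a list of dicts) into text.

    matched_words: the matched tokens as plain strings, in pattern order.
    template_words: maps each TEMPLATE_ID to the matched word it labels.
    """
    words = []
    for item in suggestion:
        if "INFLECTION" in item:
            # Explicit POS-tag inflection needs a real inflection library.
            raise NotImplementedError("INFLECTION is outside this sketch")
        if "PATTERN_REF" in item:
            # Copy the referenced token from the matched text unchanged.
            words.append(matched_words[item["PATTERN_REF"]])
        elif "FROM_TEMPLATE_ID" in item:
            # Inflect TEXT to agree with the word labeled by the template id.
            template_word = template_words[item["FROM_TEMPLATE_ID"]]
            words.append(copy_trailing_s(item["TEXT"], template_word))
        else:
            # Plain TEXT is used verbatim.
            words.append(item["TEXT"])
    return " ".join(words)


# The suggestion from the minimal example, applied to a match on "extracts".
suggestion = [{"TEXT": "exact", "FROM_TEMPLATE_ID": 1}]
rendered = render_suggestion(suggestion, ["extracts"], {1: "extracts"})
print(rendered)
```

Here the matched word "extracts" carries TEMPLATE_ID 1, so the toy inflection turns "exact" into "exacts"; a match on the bare form "extract" would leave "exact" unchanged.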