Regex phrase matcher #1312
Changes from 19 commits
@@ -0,0 +1,2 @@
mojito, lemonade, sweet berry wine
tea, club mate
@@ -0,0 +1,21 @@
{
    "rasa_nlu_data": {
        "lookup_tables": [
            {
                "name": "plates",
                "file_path": "data/test/lookup_tables/plates.txt"
            },
            {
                "name": "drinks",
                "file_path": "data/test/lookup_tables/drinks.txt"
            }
        ],
        "common_examples": [
            {
                "text": "hey",
                "intent": "greet",
                "entities": []
            }
        ]
    }
}
@@ -0,0 +1,11 @@
## intent:restaurant_search
- i'm looking for a [sushi](food) place to eat
- I want to grab [tacos](food)
- I am searching for a [pizza](food) spot
- I would like to drink [sweet berry wine](beverage) with my meal

## lookup:plates
- data/test/lookup_tables/plates.txt

## lookup:drinks
- data/test/lookup_tables/drinks.txt
@@ -0,0 +1,2 @@
tacos, beef, mapo tofu
burrito, lettuce wrap
@@ -37,13 +37,16 @@ Examples are grouped by intent, and entities are annotated as markdown links.
## regex:zipcode
- [0-9]{5}

## lookup:streets
- path/to/streets.txt

The training data for Rasa NLU is structured into different parts:
-examples, synonyms, and regex features.
+examples, synonyms, regex features, and lookup tables.

Synonyms will map extracted entities to the same name, for example mapping "my savings account" to simply "savings".
However, this only happens *after* the entities have been extracted, so you need to provide examples with the synonyms present so that Rasa can learn to pick them up.

Lookup tables may be specified as txt files containing comma-separated words or phrases. Upon loading the training data, these files are used to generate case-insensitive regex patterns that are added to the regex features.

JSON Format
-----------
@@ -58,6 +61,7 @@ The most important one is ``common_examples``.
    "rasa_nlu_data": {
        "common_examples": [],
        "regex_features" : [],
+       "lookup_tables" : [],
        "entity_synonyms": []
    }
}
@@ -230,6 +234,36 @@ for these extractors. Currently, all intent classifiers make use of available regex features in the
training data!


Lookup Tables
-------------
Lookup tables in the form of external files can also be specified in the training data. The externally supplied lookup tables must be in a comma-separated format. For example, ``data/lookup_tables/streets.txt`` may contain:

    main street, washington ave, elm street, ...

Review comment: Rather than streets, maybe just include one of the test files (plates or drinks)?
Reply: fixed.

It can then be loaded along with ``data/lookup_tables/cities.txt`` as:

Review comment: not sure we really need to show how to load two, one should be enough.
Reply: ok, changed.

.. code-block:: json

    {
        "rasa_nlu_data": {
            "lookup_tables": [
                {
                    "name": "streets",
                    "file_path": "data/lookup_tables/streets.txt"
                },
                {
                    "name": "cities",
                    "file_path": "data/lookup_tables/cities.txt"
                }
            ]
        }
    }

When lookup tables are supplied in training data, the contents are combined into a large, case-insensitive regex pattern that looks for exact matches in the training examples. These regexes match over multiple tokens, so ``main street`` would match ``meet me at 1223 main street at 5 pm`` as ``[0 0 0 0 1 1 0 0 0]``. These regexes are processed identically to the regular regex patterns directly specified in the training data. A few lookup tables for common entities are specified in ``rasa_nlu/data/lookups/``.
Review comment: wait, are they? I don't see a
Reply: I'm still working on them.. may add later. For now will remove this comment.
Review comment: Yeah, I'm not sure we'll add them to the NLU repo, or host them elsewhere tbh.

.. note::
    For lookup tables to be effective, there must be a few examples of matches in your training data. Otherwise the model will not learn to use the lookup table match features.

Review comment: I think we should probably add a warning of some sort here not to add gigantic lookup tables, basically a short summary of what you mentioned in Slack.
Reply: added.

Organization
------------
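The multi-token matching described in the Lookup Tables section above can be sketched in a few lines of Python. ``token_match_flags`` is a hypothetical helper written for illustration, not code from this PR: it marks each whitespace token with ``1`` if it overlaps a regex match, reproducing the ``[0 0 0 0 1 1 0 0 0]`` example.

```python
import re

def token_match_flags(pattern, text):
    """Mark each whitespace token with 1 if it overlaps a regex match, else 0."""
    spans = [m.span() for m in re.finditer(pattern, text)]
    flags = []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)  # character offset of this token
        end = start + len(token)
        pos = end
        # token is flagged if any match span overlaps its character range
        overlap = any(s < end and e > start for (s, e) in spans)
        flags.append(1 if overlap else 0)
    return flags

# "(?i)" makes the lookup pattern case-insensitive, as in the docs above
pattern = r"(?i)(main street|washington ave|elm street)"
print(token_match_flags(pattern, "meet me at 1223 main street at 5 pm"))
# [0, 0, 0, 0, 1, 1, 0, 0, 0]
```

Note that ``main street`` spans two tokens, so both ``main`` and ``street`` are flagged even though they come from a single regex match.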
@@ -7,15 +7,16 @@
import logging

from rasa_nlu.training_data import Message, TrainingData
-from rasa_nlu.training_data.util import check_duplicate_synonym
+from rasa_nlu.training_data.util import check_duplicate_synonym, generate_lookup_regex
from rasa_nlu.utils import build_entity

from rasa_nlu.training_data.formats.readerwriter import TrainingDataReader, TrainingDataWriter

INTENT = "intent"
SYNONYM = "synonym"
REGEX = "regex"
-available_sections = [INTENT, SYNONYM, REGEX]
+LOOKUP = "lookup"
+available_sections = [INTENT, SYNONYM, REGEX, LOOKUP]
ent_regex = re.compile(r'\[(?P<entity_text>[^\]]+)'
                       r'\]\((?P<entity>\w*?)'
                       r'(?:\:(?P<value>[^)]+))?\)')  # [entity_text](entity_type(:entity_synonym)?)

@@ -48,7 +49,6 @@ def reads(self, s, **kwargs):
                self._set_current_section(header[0], header[1])
            else:
                self._parse_item(line)
        return TrainingData(self.training_examples, self.entity_synonyms, self.regex_features)

    @staticmethod

@@ -81,8 +81,11 @@ def _parse_item(self, line):
            self.training_examples.append(parsed)
        elif self.current_section == SYNONYM:
            self._add_synonym(item, self.current_title)
-        else:
+        elif self.current_section == REGEX:
            self.regex_features.append({"name": self.current_title, "pattern": item})
+        elif self.current_section == LOOKUP:
+            lookup_regex = generate_lookup_regex(item)
+            self.regex_features.append({"name": self.current_title, "pattern": lookup_regex})

Review comment: this doesn't seem to append the lookup table to the training data object. If that object is dumped again, it will be missing the json entry for the lookup table.
Reply: Right. It appends it as a regex pattern on the next line, however. So while the json would not have the same lookup table path, it would instead still be saved as a regex. Would it make more sense to keep the lookup table as is and do the conversion to regex in the regex featurizer instead?
Review comment: mhm, so ideally we should be able to do this:
Review comment: so yes, if we can move that somewhere else, that would probably fix this.
Review comment: this is important because otherwise the conversion between file formats isn't seamless anymore (converting a markdown file to json format would expand the lookup table into the regex).
Reply: ok, I think what I'll do is have
Review comment: yes 👍
Reply: @tmbo ok I pushed some changes to implement this. 1. Updated the tests 2. Confirmed that it still retains the lookup table file_name format when dumping markdown or json. 3. Confirmed it works with

    def _find_entities_in_training_example(self, example):
        """Extracts entities from a markdown intent example."""
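The outcome the reviewers converge on above — store the lookup table as data on the training-data object and expand it into a regex only at featurization time, so format conversion stays lossless — could be sketched like this. ``TrainingDataSketch``, its attributes, and ``featurizer_patterns`` are illustrative stand-ins, not the PR's final code:

```python
class TrainingDataSketch(object):
    """Minimal stand-in showing lookup tables stored as data, not regexes."""

    def __init__(self):
        self.regex_features = []
        self.lookup_tables = []  # e.g. [{"name": ..., "elements": <file path>}]

    def add_lookup_table(self, name, file_path):
        # Store the reference only; no regex expansion at load time, so dumping
        # back to markdown/json keeps the original lookup table file path.
        self.lookup_tables.append({"name": name, "elements": file_path})


def featurizer_patterns(training_data):
    """At featurization time, expand each lookup table into a regex pattern."""
    patterns = list(training_data.regex_features)
    for table in training_data.lookup_tables:
        with open(table["elements"]) as f:
            # newline-separated entries, as decided later in the review
            elements = [line.strip() for line in f if line.strip()]
        patterns.append({"name": table["name"],
                         "pattern": "(?i)(" + "|".join(elements) + ")"})
    return patterns
```

The design point is the split of responsibilities: the reader records *what* the lookup table is, and only the featurizer decides *how* to turn it into match features.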
@@ -6,6 +6,7 @@
from __future__ import unicode_literals

import logging
+import sys

logger = logging.getLogger(__name__)


@@ -24,3 +25,23 @@ def check_duplicate_synonym(entity_synonyms, text, syn, context_str=""):
    if text in entity_synonyms and entity_synonyms[text] != syn:
        logger.warning("Found inconsistent entity synonyms while {0}, overwriting {1}->{2} "
                       "with {1}->{3} during merge".format(context_str, text, entity_synonyms[text], syn))


+def generate_lookup_regex(file_path, print_data_size=True):

Reply: ah yes, good catch. Originally I had this as an optional functionality but now just print to logger regardless.

+    """creates a regex out of the contents of a lookup table file"""
+    lookup_elements = []
+    with open(file_path, 'r') as f:

Review comment: io.open
Reply: fixed.

+        for l in f.readlines():

Review comment: for line in f:
Reply: fixed.

+            new_elements = [e.strip() for e in l.split(',')]

Review comment: i'm actually wondering if maybe we should just do a file where you only provide one word/phrase per line, rather than allowing new lines and comma separated?
Reply: The way it is now we can do either, or a combination of both. I can't think of a good argument for restricting that. Figured that the user might have different delimiters and wanted to minimize the amount of pre-processing that needed to happen.
Review comment: Yeah, I just think it's a bit odd haha, if we tell the user to do one thing it might make things less confusing. If we're leaving this the way it is, then we definitely need to mention in the docs that you can either separate by newline or by comma or both :P @tmbo would still like your opinion on this though.
Review comment: I see the point of @twhughes, but I'd go for restricting it to newlines for the moment as well. If new use cases come up that require separation on other characters, we can still add it again.
Review comment: alright @twhughes, once you've made that change I'll give this PR another review.
Reply: @akelad ok, I made the change to the code and docs. It only accepts newline-separated lookups now (commas will be treated as part of the phrase).

+            if '' in new_elements:
+                new_elements.remove('')
+            lookup_elements += new_elements
+    regex_string = '(?i)(' + '|'.join(lookup_elements) + ')'

Review comment: hmm, did we not decide on using word boundaries?
Reply: yes, I hadn't pushed the change yet.
Reply: fixed.

+    """log info about the lookup table"""
+    num_words = len(lookup_elements)
+    regex_size = sys.getsizeof(regex_string)
+    logger.info("found {} words in lookup table '{}'"
+                " with a size of {:.2e} bytes".format(num_words, file_path, regex_size))
+    return regex_string
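Folding the review feedback above together (``io.open`` instead of ``open``, iterating the file directly instead of ``readlines()``, newline-separated entries only, and word boundaries around the pattern), a self-contained version of ``generate_lookup_regex`` might look roughly like this. This is a sketch under those assumptions, not the merged code:

```python
import io
import logging
import sys

logger = logging.getLogger(__name__)


def generate_lookup_regex(file_path):
    """Creates a case-insensitive regex from a newline-separated lookup file."""
    lookup_elements = []
    with io.open(file_path, 'r', encoding='utf-8') as f:
        for line in f:                 # review: iterate the file directly
            element = line.strip()     # review: one word/phrase per line
            if element:
                lookup_elements.append(element)
    # review: wrap in \b word boundaries so e.g. "main street" does not
    # match inside "main streets" or "domain street"
    regex_string = r'(?i)\b(' + '|'.join(lookup_elements) + r')\b'
    logger.info("found {} words in lookup table '{}'"
                " with a size of {:.2e} bytes"
                .format(len(lookup_elements), file_path,
                        sys.getsizeof(regex_string)))
    return regex_string
```

Note that entries are not regex-escaped here, matching the diff as written; escaping each element with ``re.escape`` would be a further hardening step if lookup entries may contain regex metacharacters.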
Review comment: I think we should maybe have a relevant intent example here, otherwise this lookup table makes no sense.
Reply: what do you mean by this? Like a lookup table that would actually be used in the banking example? I updated it to a lookup table named ``accounts`` that has path ``path/to/accounts.txt``.
Review comment: So what I meant is adding one or two lines of intent examples further up that would have some sentences where a lookup table could be relevant. I'm not sure it's relevant to the current example. So maybe add a new intent that would make good use of a lookup table. Does that make sense?
Reply: Yes, I think it makes sense. I changed it to a lookup table of account names and added a comment ``<!-- lookup table of account names for improving entity extraction (savings, checking, ...) -->``, so now I think it is relevant without having to add more examples. For example, the synonym ``pink pig`` in the example is also not directly relevant. Let me know if you agree.