-
Notifications
You must be signed in to change notification settings - Fork 4.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Regex phrase matcher #1312
Regex phrase matcher #1312
Conversation
…x_phrase_matcher
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few things, the main one being how we should structure the lookup table. In my opinion it's better to have a word/phrase per line, rather than comma and new line separation. @tmbo your input on this is welcome as well
docs/dataformat.rst
Outdated
@@ -37,13 +37,16 @@ Examples are grouped by intent, and entities are annotated as markdown links. | |||
## regex:zipcode | |||
- [0-9]{5} | |||
|
|||
## lookup:streets | |||
- path/to/streets.txt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should maybe have a relevant intent example here, otherwise this lookup table makes no sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do you mean by this? Like a lookup table that would actually be used in the banking example? I updated it to a lookup table named accounts
that has path path/to/accounts.txt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So what I meant is adding one/two lines of intent examples further up that would have some sentences where a lookup table could be relevant. I'm not sure it's relevant to the current example there is. So maybe add a new intent that would make good use of a lookup table. Does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think it makes sense. I changed it to a lookup table of accounts names and added a comment
<!-- lookup table of account names for improving entity extraction (savings, checking, ...) -->
So now I think it is relevant without having to add more examples. For example the synonym pink pig
in the example is also not directly relevant. Let me know if you agree.
docs/dataformat.rst
Outdated
------------- | ||
Lookup tables in the form of external files can also be specified in the training data. The externally supplied lookup tables must be in a comma-separated format. For example, ``data/lookup_tables/streets.txt`` may contain | ||
|
||
main street, washington ave, elm street, ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than streets, maybe just include one of the test files (plates or drinks)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
docs/dataformat.rst
Outdated
|
||
main street, washington ave, elm street, ... | ||
|
||
And can be loaded in along with ``data/lookup_tables/cities.txt`` as: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure we really need to show how to load two, one should be enough
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, changed
docs/dataformat.rst
Outdated
} | ||
} | ||
|
||
When lookup tables are supplied in training data, the contents are combined into a large, case-insensitive regex pattern that looks for exact matches in the training examples. These regexes match over multiple tokens, so ``main street`` would match ``meet me at 1223 main street at 5 pm`` as ``[0 0 0 0 1 1 0 0 0]``. These regexes are processed identically to the regular regex patterns directly specified in the training data. A few lookup tables for common entities are specified in ``rasa_nlu/data/lookups/`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wait are they? I don't see a rasa_nlu/data/lookups/
😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still working on them.. may add later. For now will remove this comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I'm not sure we'll add them to the NLU repo, or host them elsewhere tbh
rasa_nlu/training_data/util.py
Outdated
@@ -24,3 +25,23 @@ def check_duplicate_synonym(entity_synonyms, text, syn, context_str=""): | |||
if text in entity_synonyms and entity_synonyms[text] != syn: | |||
logger.warning("Found inconsistent entity synonyms while {0}, overwriting {1}->{2}" | |||
"with {1}->{2} during merge".format(context_str, text, entity_synonyms[text], syn)) | |||
|
|||
|
|||
def generate_lookup_regex(file_path, print_data_size=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
print_data_size
isn't used anywhere, please remove it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah yes, good catch. Originally I had this as an optional functionality but now just print to logger regardless.
rasa_nlu/training_data/util.py
Outdated
if '' in new_elements: | ||
new_elements.remove('') | ||
lookup_elements += new_elements | ||
regex_string = '(?i)(' + '|'.join(lookup_elements) + ')' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, did we not decide on using word boundaries?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I hadn't pushed the change yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
rasa_nlu/training_data/util.py
Outdated
def generate_lookup_regex(file_path, print_data_size=True): | ||
"""creates a regex out of the contents of a lookup table file""" | ||
lookup_elements = [] | ||
with open(file_path, 'r') as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
io.open
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
rasa_nlu/training_data/util.py
Outdated
"""creates a regex out of the contents of a lookup table file""" | ||
lookup_elements = [] | ||
with open(file_path, 'r') as f: | ||
for l in f.readlines(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for line in f:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
rasa_nlu/training_data/util.py
Outdated
lookup_elements = [] | ||
with open(file_path, 'r') as f: | ||
for l in f.readlines(): | ||
new_elements = [e.strip() for e in l.split(',')] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm actually wondering if maybe we should just do a file where you only provide one word/phrase per line, rather than allowing new lines and comma separated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way it is now we can do either or a combination of both. I can't think of a good argument for restricting that. Figured that the user might have different delimiters and wanted to minimize the amount of pre-processing that needed to happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I just think it's a bit odd haha, if we tell the user to do one thing it might make things less confusing. If we're leaving this the way it is, then we definitely need to mention in the docs that you can either separate by newline or by comma or both :P @tmbo would still like your opinion on this though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the point of @twhughes, but I'd go for restricting it to newlines for the moment as well. If new use cases come up that require separation on other characters, we can still add it again.
The reason is that once we add it, we need to support it for the future, and I'd rather avoid needing to support features where we are unsure about the usefulness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alright @twhughes once you've made that change i'll give this PR another review
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@akelad ok I made the change to the code and docs. only accepts newline-separated lookups now (commas will be treated as part of the phrase)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looking good, just two minor things, and let's see what @tmbo says
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looking good, the only thing you need to look at is escaping the values before building the regex
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just the one small thing about the error message, then we can merge!
try: | ||
f = io.open(lookup_elements, 'r') | ||
except IOError: | ||
raise ValueError("Could not load lookup table {}".format(lookup_elements)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you change this to "Could not load lookup table {}. Make sure you've provided the correct path"
Great work 🚀 |
…asaHQ#1312) Bumps [github.com/mattn/go-isatty](https://github.com/mattn/go-isatty) from 0.0.14 to 0.0.16. - [Release notes](https://github.com/mattn/go-isatty/releases) - [Commits](mattn/go-isatty@v0.0.14...v0.0.16) --- updated-dependencies: - dependency-name: github.com/mattn/go-isatty dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Proposed changes:
Lookup tables may now be specified in the training data. Individual lookup elements may be included directly as lists of strings. Alternatively the externally supplied lookup tables may be specified in the form of external files separated by newlines.
For example
or in markdown format:
External data files may be supplied as well. For example,
data/lookup_tables/streets.txt
may containAnd can be loaded in along with additional elements as:
or, equivalently, in markdown format as:
When lookup tables are supplied in training data, the contents are combined into a large, case-insensitive regex pattern that looks for exact matches in the training examples. The regex will only match phrases that are surrounded by word boundaries, such as spaces, newlines, commas, periods, etc.
These regexes match over multiple tokens, so if
main street
was specified in the lookup table, this would match the tokens ofmeet me at 1223 main street
as0 0 0 0 1 1
.The generated regexes are processed identically to the regular regex patterns directly specified in the training data.
Status (please check what you already did):