
Regex phrase matcher #1312

Merged: 71 commits into master from regex_phrase_matcher on Sep 11, 2018
Commits (71)
803508b
working prototype of lookup section
twhughes Aug 13, 2018
c48b22b
ported to rasa_nlu training data
twhughes Aug 13, 2018
62ecff6
added tests
twhughes Aug 13, 2018
f5b8cbf
removed a print statement
twhughes Aug 13, 2018
0b3e267
added to docs
twhughes Aug 13, 2018
f34a588
replaced print() by logger.info()
twhughes Aug 13, 2018
b0d7d0e
fixed minor formatting issues
twhughes Aug 13, 2018
a8fe358
fixed pep errors
twhughes Aug 13, 2018
2f835f9
fixed some of the code climate warnings
twhughes Aug 13, 2018
9a65ccc
add lookup tables of all US street and cities
twhughes Aug 14, 2018
a158890
fixed pep8 errors
twhughes Aug 15, 2018
05e7c1b
explicitly remove empty regex strings
twhughes Aug 23, 2018
c64b71c
switch single quotes to double quotes
twhughes Aug 23, 2018
7823150
fixed a typo
twhughes Aug 23, 2018
4853ecb
removed empty lookup table elements by default
twhughes Aug 23, 2018
f5badd1
trying to fix codeclimate errors
twhughes Aug 23, 2018
6d53408
pesky codeclimate
twhughes Aug 23, 2018
14c9fbd
fixed new pep8 error
twhughes Aug 24, 2018
7c3bb39
Merge branch 'master' of https://github.com/rasaHQ/rasa_nlu into rege…
twhughes Aug 27, 2018
1bc3b19
Merge branch 'master' into regex_phrase_matcher
akelad Aug 29, 2018
fd7b446
Merge branch 'regex_phrase_matcher' of https://github.com/rasaHQ/rasa…
twhughes Aug 29, 2018
c4fee01
changed dataformat.rst
twhughes Aug 29, 2018
923307f
added warnings for large lookups. got rid of print_size arg
twhughes Aug 29, 2018
f4657c1
added word boundaries
twhughes Aug 29, 2018
4f2b51e
changed file IO stuff
twhughes Aug 29, 2018
0e6cdff
updated tests to have word boundaries
twhughes Aug 29, 2018
a9a809e
changed markdown example
twhughes Aug 29, 2018
450c615
fixed pep8 errors (line too long)
twhughes Aug 29, 2018
34473ac
fixed even more pep8 errors
twhughes Aug 29, 2018
94adcba
fixed more pep8 errors
twhughes Aug 29, 2018
a4e405d
added comment in dataformat example
twhughes Aug 30, 2018
16640f0
changed to debug logging
twhughes Aug 30, 2018
3cf1aed
made things newline-separated
twhughes Aug 30, 2018
30ba83d
Merge branch 'master' into regex_phrase_matcher
akelad Aug 30, 2018
ece424b
moved lookup table regex creation to regex featurizer
twhughes Aug 30, 2018
a49aed4
fixed akelas minor comments
twhughes Aug 30, 2018
bc465f3
changed docs
twhughes Aug 30, 2018
9c899a1
merge
twhughes Aug 30, 2018
a5e9e9e
merge
twhughes Aug 30, 2018
d744b50
removed unnecessary imports
twhughes Aug 30, 2018
adc4e7d
removed outdated training data test
twhughes Aug 30, 2018
2f22c31
fixed pep8
twhughes Aug 30, 2018
e9e5389
fixed pep8 again
twhughes Aug 30, 2018
255b857
blank line had whitespace..
twhughes Aug 30, 2018
669965c
added json direct elements
twhughes Sep 5, 2018
c4bef3d
working markdown
twhughes Sep 5, 2018
7772517
markdown works and passes tests:
twhughes Sep 5, 2018
1aed3dd
pep8
twhughes Sep 5, 2018
b291b9e
markdown writing
twhughes Sep 5, 2018
d0d7d91
merge
twhughes Sep 5, 2018
9144448
cleaning
twhughes Sep 5, 2018
cadff87
pep8 formatting
twhughes Sep 5, 2018
96d2082
spacing
twhughes Sep 5, 2018
6387f8c
added unicode checking for python2.7
twhughes Sep 6, 2018
a25553e
reverted some pep8 errors. got rid of unnecessary files
twhughes Sep 6, 2018
562d6a0
remove pytest cache
twhughes Sep 6, 2018
4d96f61
cleaned and made tests work
twhughes Sep 6, 2018
972add6
clear distinction between markdown and json in dataformat
twhughes Sep 11, 2018
fae32dd
added error handling for opening lookup table file
twhughes Sep 11, 2018
bf1e9e1
fixed indentation
twhughes Sep 11, 2018
2b5a674
got rid of multi-line comment
twhughes Sep 11, 2018
0f9f3d8
lookup_tables=None as default argument
twhughes Sep 11, 2018
dc3b90b
removed commented out pdb.set_trace()
twhughes Sep 11, 2018
9c59053
merged changelog
twhughes Sep 11, 2018
0bec88e
escape re special characters in lookup table elements added test
twhughes Sep 11, 2018
621ad79
Update CHANGELOG.rst
twhughes Sep 11, 2018
baadb4e
Update rasa.py
twhughes Sep 11, 2018
fa4448c
Update rasa.py
twhughes Sep 11, 2018
d64ce26
added suggestion on file exception
twhughes Sep 11, 2018
a4fe9bf
fixed syntax
twhughes Sep 11, 2018
06ced24
Merge branch 'regex_phrase_matcher' of https://github.com/rasaHQ/rasa…
twhughes Sep 11, 2018
1 change: 1 addition & 0 deletions CHANGELOG.rst
@@ -11,6 +11,7 @@ This project adheres to `Semantic Versioning`_ starting with version 0.7.0.

Added
-----
- ability to specify lookup tables in training data

Changed
-------
2 changes: 2 additions & 0 deletions data/test/lookup_tables/drinks.txt
@@ -0,0 +1,2 @@
mojito, lemonade, sweet berry wine
tea, club mate
21 changes: 21 additions & 0 deletions data/test/lookup_tables/lookup_table.json
@@ -0,0 +1,21 @@
{
"rasa_nlu_data": {
"lookup_tables": [
{
"name": "plates",
"file_path": "data/test/lookup_tables/plates.txt"
},
{
"name": "drinks",
"file_path": "data/test/lookup_tables/drinks.txt"
}
],
"common_examples": [
{
"text": "hey",
"intent": "greet",
"entities": []
}
]
}
}
11 changes: 11 additions & 0 deletions data/test/lookup_tables/lookup_table.md
@@ -0,0 +1,11 @@
## intent:restaurant_search
- i'm looking for a [sushi](food) place to eat
- I want to grab [tacos](food)
- I am searching for a [pizza](food) spot
- I would like to drink [sweet berry wine](beverage) with my meal

## lookup:plates
- data/test/lookup_tables/plates.txt

## lookup:drinks
- data/test/lookup_tables/drinks.txt
2 changes: 2 additions & 0 deletions data/test/lookup_tables/plates.txt
@@ -0,0 +1,2 @@
tacos, beef, mapo tofu
burrito, lettuce wrap
36 changes: 35 additions & 1 deletion docs/dataformat.rst
@@ -37,13 +37,16 @@ Examples are grouped by intent, and entities are annotated as markdown links.
## regex:zipcode
- [0-9]{5}

## lookup:streets
- path/to/streets.txt
Contributor:
I think we should maybe have a relevant intent example here, otherwise this lookup table makes no sense

Contributor Author (@twhughes, Aug 29, 2018):
what do you mean by this? Like a lookup table that would actually be used in the banking example? I updated it to a lookup table named accounts that has path path/to/accounts.txt

Contributor:
So what I meant is adding one/two lines of intent examples further up that would have some sentences where a lookup table could be relevant. I'm not sure it's relevant to the current example there is. So maybe add a new intent that would make good use of a lookup table. Does that make sense?

Contributor Author:
Yes, I think it makes sense. I changed it to a lookup table of accounts names and added a comment

<!-- lookup table of account names for improving entity extraction (savings, checking, ...) -->

So now I think it is relevant without having to add more examples. For example the synonym pink pig in the example is also not directly relevant. Let me know if you agree.


The training data for Rasa NLU is structured into different parts:
examples, synonyms, and regex features.
examples, synonyms, regex features, and lookup tables.

Synonyms will map extracted entities to the same name, for example mapping "my savings account" to simply "savings".
However, this only happens *after* the entities have been extracted, so you need to provide examples with the synonyms present so that Rasa can learn to pick them up.

Lookup tables may be specified as txt files containing comma-separated words or phrases. Upon loading the training data, these files are used to generate case-insensitive regex patterns that are added to the regex features.
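The loading step described here can be sketched as a small helper (hypothetical name `lookup_regex_from_file`, not the actual Rasa function; the `(?i)(a|b|...)` pattern shape matches what this PR's tests assert):

```python
import io


def lookup_regex_from_file(file_path):
    """Build a case-insensitive regex from a comma-separated lookup file."""
    elements = []
    with io.open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            # strip whitespace and drop empty entries
            elements += [e.strip() for e in line.split(",") if e.strip()]
    # (?i) makes the whole alternation case-insensitive
    return "(?i)(" + "|".join(elements) + ")"
```

Applied to the ``drinks.txt`` test file above, this would produce ``(?i)(mojito|lemonade|sweet berry wine|tea|club mate)``.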

JSON Format
-----------
@@ -58,6 +61,7 @@ The most important one is ``common_examples``.
"rasa_nlu_data": {
"common_examples": [],
"regex_features" : [],
"lookup_tables" : [],
"entity_synonyms": []
}
}
@@ -230,6 +234,36 @@ for these extractors. Currently, all intent classifiers make use of available re
training data!


Lookup Tables
-------------
Lookup tables in the form of external files can also be specified in the training data. The externally supplied lookup tables must be in a comma-separated format. For example, ``data/lookup_tables/streets.txt`` may contain

main street, washington ave, elm street, ...
Contributor:
Rather than streets, maybe just include one of the test files (plates or drinks)?

Contributor Author:
fixed.


And can be loaded in along with ``data/lookup_tables/cities.txt`` as:
Contributor:
not sure we really need to show how to load two, one should be enough

Contributor Author:
ok, changed


.. code-block:: json

{
"rasa_nlu_data": {
"lookup_tables": [
{
"name": "streets",
"file_path": "data/lookup_tables/streets.txt"
},
{
"name": "cities",
"file_path": "data/lookup_tables/cities.txt"
}
]
}
}

When lookup tables are supplied in training data, the contents are combined into a large, case-insensitive regex pattern that looks for exact matches in the training examples. These regexes match over multiple tokens, so ``main street`` would match ``meet me at 1223 main street at 5 pm`` as ``[0 0 0 0 1 1 0 0 0]``. These regexes are processed identically to the regular regex patterns directly specified in the training data. A few lookup tables for common entities are specified in ``rasa_nlu/data/lookups/``
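The multi-token matching described above can be sketched like this (illustrative only; the real feature extraction lives in the `RegexFeaturizer`):

```python
import re


def token_match_vector(pattern, text):
    """Mark each whitespace token that falls inside a regex match with 1."""
    tokens = text.split()
    flags = [0] * len(tokens)
    # record each token's character span in the original string
    spans, offset = [], 0
    for t in tokens:
        start = text.index(t, offset)
        spans.append((start, start + len(t)))
        offset = start + len(t)
    for m in re.finditer(pattern, text):
        for i, (s, e) in enumerate(spans):
            if s >= m.start() and e <= m.end():
                flags[i] = 1
    return flags


print(token_match_vector(r"(?i)(main street)",
                         "meet me at 1223 main street at 5 pm"))
# [0, 0, 0, 0, 1, 1, 0, 0, 0]
```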
Contributor:
wait are they? I don't see a rasa_nlu/data/lookups/ 😄

Contributor Author:
I'm still working on them.. may add later. For now will remove this comment.

Contributor:
Yeah I'm not sure we'll add them to the NLU repo, or host them elsewhere tbh


.. note::
For lookup tables to be effective, there must be a few examples of matches in your training data. Otherwise the model will not learn to use the lookup table match features.
Contributor:
I think we should probably add a warning of some sort here not to add gigantic lookup tables, basically a short summary of what you mentioned in Slack

Contributor Author:
added


Organization
------------

11 changes: 7 additions & 4 deletions rasa_nlu/training_data/formats/markdown.py
@@ -7,15 +7,16 @@
import logging

from rasa_nlu.training_data import Message, TrainingData
from rasa_nlu.training_data.util import check_duplicate_synonym
from rasa_nlu.training_data.util import check_duplicate_synonym, generate_lookup_regex
from rasa_nlu.utils import build_entity

from rasa_nlu.training_data.formats.readerwriter import TrainingDataReader, TrainingDataWriter

INTENT = "intent"
SYNONYM = "synonym"
REGEX = "regex"
available_sections = [INTENT, SYNONYM, REGEX]
LOOKUP = "lookup"
available_sections = [INTENT, SYNONYM, REGEX, LOOKUP]
ent_regex = re.compile(r'\[(?P<entity_text>[^\]]+)'
r'\]\((?P<entity>\w*?)'
r'(?:\:(?P<value>[^)]+))?\)') # [entity_text](entity_type(:entity_synonym)?)
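To see what `ent_regex` captures, a standalone reproduction of the pattern above against one of the markdown examples from this PR:

```python
import re

# same pattern as ent_regex: [entity_text](entity_type(:entity_synonym)?)
ent_regex = re.compile(r'\[(?P<entity_text>[^\]]+)'
                       r'\]\((?P<entity>\w*?)'
                       r'(?:\:(?P<value>[^)]+))?\)')

m = ent_regex.search("I want to grab [tacos](food)")
print(m.group("entity_text"), m.group("entity"), m.group("value"))
# tacos food None
```

With a synonym annotation such as ``[pink pig](animal:piggy)``, the optional third group would capture ``piggy`` as the value.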
@@ -48,7 +49,6 @@ def reads(self, s, **kwargs):
self._set_current_section(header[0], header[1])
else:
self._parse_item(line)

return TrainingData(self.training_examples, self.entity_synonyms, self.regex_features)

@staticmethod
@@ -81,8 +81,11 @@ def _parse_item(self, line):
self.training_examples.append(parsed)
elif self.current_section == SYNONYM:
self._add_synonym(item, self.current_title)
else:
elif self.current_section == REGEX:
self.regex_features.append({"name": self.current_title, "pattern": item})
elif self.current_section == LOOKUP:
lookup_regex = generate_lookup_regex(item)
Member:
this doesn't seem to append the lookup table to the training data object. if that object is dumped again, it will be missing the json entry for the lookup table

Contributor Author:
Right. It appends it as a regex pattern, however, on the next line. So while the json would not have the same lookup table path, it would instead still be saved as a regex. Would it make more sense to keep the lookup table as is and do the conversion to regex in the regex featurizer instead?

Member:
mhm so ideally we should be able to do this:

  • load from file
  • export loaded training data object to file
  • have the same file

so yes, if we can move that somewhere else, that would probably fix this.

Member:
this is important because otherwise the conversion between file formats isn't seamless anymore (so converting a markdown file to json format would expand the lookup table into the regex)

Contributor Author:
ok I think what I'll do is have training_data.lookup_tables be a list of file paths to lookup tables, the same way it's entered in the training data currently. Then, in the RegexFeaturizer I will just load the files, construct the regex patterns, and load these into RegexFeaturizer.known_patterns. Does this sound like a reasonable approach?

Member:
yes 👍

Contributor Author:
@tmbo ok I pushed some changes to implement this. 1. Updated the tests 2. Confirmed that it still retains the lookup table file_name format when dumping markdown or json. 3. Confirmed it works with train_test function.

self.regex_features.append({"name": self.current_title, "pattern": lookup_regex})

def _find_entities_in_training_example(self, example):
"""Extracts entities from a markdown intent example."""
20 changes: 19 additions & 1 deletion rasa_nlu/training_data/formats/rasa.py
@@ -10,7 +10,7 @@
from rasa_nlu.training_data.formats.readerwriter import (
JsonTrainingDataReader,
TrainingDataWriter)
from rasa_nlu.training_data.util import transform_entity_synonyms
from rasa_nlu.training_data.util import transform_entity_synonyms, generate_lookup_regex
from rasa_nlu.utils import json_to_string

logger = logging.getLogger(__name__)
@@ -27,6 +27,12 @@ def read_from_json(self, js, **kwargs):
entity_examples = data.get("entity_examples", [])
entity_synonyms = data.get("entity_synonyms", [])
regex_features = data.get("regex_features", [])
lookup_tables = data.get("lookup_tables", [])

# generates regexes from lookup tables and adds to regex features
lookup_regexes = [{'name': t['name'],
'pattern': generate_lookup_regex(t['file_path'])} for t in lookup_tables]
regex_features += lookup_regexes
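The comprehension above simply turns each lookup table entry into a regex-feature dict; schematically (with `generate_lookup_regex` stubbed out for illustration — the real helper, shown in util.py below, reads the file contents):

```python
def generate_lookup_regex(file_path):
    # stub for illustration only; the real helper builds the
    # alternation from the file at file_path
    return "(?i)(tacos|beef|mapo tofu|burrito|lettuce wrap)"


lookup_tables = [{"name": "plates",
                  "file_path": "data/test/lookup_tables/plates.txt"}]
regex_features = []
lookup_regexes = [{'name': t['name'],
                   'pattern': generate_lookup_regex(t['file_path'])}
                  for t in lookup_tables]
regex_features += lookup_regexes
# regex_features now holds one feature dict per lookup table,
# indistinguishable from hand-written regex_features entries
```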

entity_synonyms = transform_entity_synonyms(entity_synonyms)

@@ -119,6 +125,14 @@ def _rasa_nlu_data_schema():
}
}

lookup_table_schema = {
"type": "object",
"properties": {
"name": {"type": "string"},
"file_path": {"type": "string"},
}
}

return {
"type": "object",
"properties": {
@@ -140,6 +154,10 @@
"entity_examples": {
"type": "array",
"items": training_example_schema
},
"lookup_tables": {
"type": "array",
"items": lookup_table_schema
}
}
}
21 changes: 21 additions & 0 deletions rasa_nlu/training_data/util.py
@@ -6,6 +6,7 @@
from __future__ import unicode_literals

import logging
import sys

logger = logging.getLogger(__name__)

@@ -24,3 +25,23 @@ def check_duplicate_synonym(entity_synonyms, text, syn, context_str=""):
if text in entity_synonyms and entity_synonyms[text] != syn:
logger.warning("Found inconsistent entity synonyms while {0}, overwriting {1}->{2} "
"with {1}->{3} during merge".format(context_str, text, entity_synonyms[text], syn))


def generate_lookup_regex(file_path, print_data_size=True):
Contributor:
print_data_size isn't used anywhere, please remove it

Contributor Author:
ah yes, good catch. Originally I had this as an optional functionality but now just print to logger regardless.

"""creates a regex out of the contents of a lookup table file"""
lookup_elements = []
with open(file_path, 'r') as f:
Contributor:
io.open

Contributor Author:
fixed

for l in f.readlines():
Contributor:
for line in f:

Contributor Author:
fixed

new_elements = [e.strip() for e in l.split(',')]
Contributor:
i'm actually wondering if maybe we should just do a file where you only provide one word/phrase per line, rather than allowing new lines and comma separated?

Contributor Author:
The way it is now we can do either or a combination of both. I can't think of a good argument for restricting that. Figured that the user might have different delimiters and wanted to minimize the amount of pre-processing that needed to happen.

Contributor:
Yeah I just think it's a bit odd haha, if we tell the user to do one thing it might make things less confusing. If we're leaving this the way it is, then we definitely need to mention in the docs that you can either separate by newline or by comma or both :P @tmbo would still like your opinion on this though

Member:
I see the point of @twhughes, but I'd go for restricting it to newlines for the moment as well. If new use cases come up that require separation on other characters, we can still add it again.
The reason is that once we add it, we need to support it for the future, and I'd rather avoid needing to support features where we are unsure about the usefulness.

Contributor:
alright @twhughes once you've made that change i'll give this PR another review

Contributor Author:
@akelad ok I made the change to the code and docs. only accepts newline-separated lookups now (commas will be treated as part of the phrase)

if '' in new_elements:
new_elements.remove('')
lookup_elements += new_elements
regex_string = '(?i)(' + '|'.join(lookup_elements) + ')'
Contributor:
hmm, did we not decide on using word boundaries?

Contributor Author (@twhughes, Aug 29, 2018):
yes, I hadn't pushed the change yet.

Contributor Author:
fixed


"""log info about the lookup table"""
num_words = len(lookup_elements)
regex_size = sys.getsizeof(regex_string)
logger.info("found {} words in lookup table '{}'"
" with a size of {:.2e} bytes".format(num_words, file_path, regex_size))
twhughes marked this conversation as resolved.

return regex_string
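Pulling together the later review outcomes on this function (io.open, newline-separated entries only, re.escape on elements, word boundaries), the eventual shape of the helper is presumably close to the following sketch; it is not guaranteed to match the merged code line for line:

```python
import io
import re


def generate_lookup_regex(file_path):
    """Create a case-insensitive, word-bounded regex from a
    newline-separated lookup table file (one phrase per line)."""
    elements = []
    with io.open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            element = line.strip()
            if element:
                # escape regex metacharacters inside lookup phrases
                elements.append(re.escape(element))
    # \b on both sides so e.g. 'tea' does not fire inside 'teapot'
    return "(?i)(" + "|".join(r"\b" + e + r"\b" for e in elements) + ")"
```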
16 changes: 16 additions & 0 deletions tests/base/test_training_data.py
@@ -130,6 +130,22 @@ def test_markdown_single_sections():
'Chinese': 'chinese'}


def test_lookup_table_json():
td_lookup = training_data.load_data('data/test/lookup_tables/lookup_table.json')
assert td_lookup.regex_features[0]['name'] == 'drinks'
assert td_lookup.regex_features[0]['pattern'] == '(?i)(mojito|lemonade|sweet berry wine|tea|club mate)'
assert td_lookup.regex_features[1]['name'] == 'plates'
assert td_lookup.regex_features[1]['pattern'] == '(?i)(tacos|beef|mapo tofu|burrito|lettuce wrap)'


def test_lookup_table_md():
td_lookup = training_data.load_data('data/test/lookup_tables/lookup_table.md')
assert td_lookup.regex_features[0]['name'] == 'drinks'
assert td_lookup.regex_features[0]['pattern'] == '(?i)(mojito|lemonade|sweet berry wine|tea|club mate)'
assert td_lookup.regex_features[1]['name'] == 'plates'
assert td_lookup.regex_features[1]['pattern'] == '(?i)(tacos|beef|mapo tofu|burrito|lettuce wrap)'


def test_repeated_entities():
data = """
{