Regex phrase matcher #1312

Merged
merged 71 commits into from
Sep 11, 2018

Conversation

twhughes
Contributor

@twhughes twhughes commented Aug 13, 2018

Proposed changes:

Lookup tables may now be specified in the training data. Lookup elements may be included directly as a list of strings. Alternatively, a lookup table may be supplied as an external file containing one element per line.

For example

{
    "rasa_nlu_data": {
        "lookup_tables": [
            {
                "name": "streets",
                "elements":  ["main street", "washington ave", "elm street", "rocky road"]
            }
        ]
    }
}

or in markdown format:

  ## lookup:streets
    - main street
    - washington ave
    - elm street
    - rocky road

External data files may be supplied as well. For example, data/lookup_tables/streets.txt may contain

main street
washington ave
elm street
rocky road

The file can then be loaded by giving its path as the elements value:

{
    "rasa_nlu_data": {
        "lookup_tables": [
            {
                "name": "streets",
                "elements":  "data/lookup_tables/streets.txt"
            }
        ]
    }
}

or, equivalently, in markdown format as:

  ## lookup:streets
  data/lookup_tables/streets.txt
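
A minimal sketch of how the two forms of `elements` shown above could be resolved into one list of phrases. The helper name `load_lookup_elements` is hypothetical and is not the function added in this PR; it assumes a list value holds the phrases directly and a string value is a path to a newline-separated file.

```python
import io


def load_lookup_elements(elements):
    """Return lookup phrases from an inline list or a newline-separated file.

    Hypothetical helper for illustration only.
    """
    if isinstance(elements, list):
        # Inline form: the phrases are given directly.
        return [e.strip() for e in elements if e.strip()]
    # File form: treat the value as a path with one phrase per line.
    with io.open(elements, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]
```

Blank lines are skipped in both forms, so a trailing newline in the file does not produce an empty phrase.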

When lookup tables are supplied in training data, the contents are combined into a large, case-insensitive regex pattern that looks for exact matches in the training examples. The regex will only match phrases that are surrounded by word boundaries, such as spaces, newlines, commas, periods, etc.
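
As a rough illustration of the pattern described above, the sketch below joins the phrases into one case-insensitive alternation wrapped in `\b` word boundaries. The function name `build_lookup_regex` and the exact pattern layout are assumptions for this example and may differ from the merged code.

```python
import re


def build_lookup_regex(elements):
    """Combine phrases into one case-insensitive, word-bounded pattern.

    Illustrative only; the name and exact pattern are assumptions.
    """
    # Escape each phrase so its characters are matched literally.
    escaped = [re.escape(e) for e in elements]
    return re.compile(r"(?i)\b(" + "|".join(escaped) + r")\b")


pattern = build_lookup_regex(["main street", "washington ave"])
print(bool(pattern.search("Meet me on Main Street.")))    # True
print(bool(pattern.search("the domain streetcar line")))  # False
```

The second call finds nothing because `main street` only occurs inside the longer words `domain streetcar`, where there is no word boundary on either side.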

These regexes match over multiple tokens, so if main street was specified in the lookup table, this would match the tokens of meet me at 1223 main street as 0 0 0 0 1 1.
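
The token flags in this example can be reproduced with a small sketch. It assumes whitespace tokenization and marks a token when its character span overlaps a regex match; the helper `token_match_flags` is hypothetical, and the actual featurizer may tokenize and align differently.

```python
import re


def token_match_flags(text, pattern):
    """Flag each whitespace-separated token that overlaps a match of `pattern`."""
    matches = [m.span() for m in re.finditer(pattern, text)]
    flags = []
    for tok in re.finditer(r"\S+", text):
        start, end = tok.span()
        # Two spans overlap when each one starts before the other ends.
        hit = any(start < m_end and end > m_start for m_start, m_end in matches)
        flags.append(1 if hit else 0)
    return flags


print(token_match_flags("meet me at 1223 main street", r"(?i)\b(main street)\b"))
# [0, 0, 0, 0, 1, 1]
```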

The generated regexes are processed identically to the regular regex patterns directly specified in the training data.

Status (please check what you already did):

  • made PR ready for code review
  • added some tests for the functionality
  • updated the documentation
  • updated the changelog

@twhughes twhughes requested a review from tmbo August 13, 2018 12:32
@twhughes twhughes requested a review from amn41 August 13, 2018 12:41
@tmbo tmbo requested review from akelad and removed request for amn41 August 27, 2018 14:51
Contributor

@akelad akelad left a comment

Few things, the main one being how we should structure the lookup table. In my opinion it's better to have a word/phrase per line, rather than comma and new line separation. @tmbo your input on this is welcome as well

@@ -37,13 +37,16 @@ Examples are grouped by intent, and entities are annotated as markdown links.
## regex:zipcode
- [0-9]{5}

## lookup:streets
- path/to/streets.txt
Contributor

I think we should maybe have a relevant intent example here, otherwise this lookup table makes no sense

Contributor Author

@twhughes twhughes Aug 29, 2018

what do you mean by this? Like a lookup table that would actually be used in the banking example? I updated it to a lookup table named accounts that has path path/to/accounts.txt

Contributor

So what I meant is adding one/two lines of intent examples further up that would have some sentences where a lookup table could be relevant. I'm not sure it's relevant to the current example there is. So maybe add a new intent that would make good use of a lookup table. Does that make sense?

Contributor Author

Yes, I think it makes sense. I changed it to a lookup table of accounts names and added a comment

<!-- lookup table of account names for improving entity extraction (savings, checking, ...) -->

So now I think it is relevant without having to add more examples. For example the synonym pink pig in the example is also not directly relevant. Let me know if you agree.

-------------
Lookup tables in the form of external files can also be specified in the training data. The externally supplied lookup tables must be in a comma-separated format. For example, ``data/lookup_tables/streets.txt`` may contain

main street, washington ave, elm street, ...
Contributor

Rather than streets, maybe just include one of the test files (plates or drinks)?

Contributor Author

fixed.


main street, washington ave, elm street, ...

And can be loaded in along with ``data/lookup_tables/cities.txt`` as:
Contributor

not sure we really need to show how to load two, one should be enough

Contributor Author

ok, changed

}
}

When lookup tables are supplied in training data, the contents are combined into a large, case-insensitive regex pattern that looks for exact matches in the training examples. These regexes match over multiple tokens, so ``main street`` would match ``meet me at 1223 main street at 5 pm`` as ``[0 0 0 0 1 1 0 0 0]``. These regexes are processed identically to the regular regex patterns directly specified in the training data. A few lookup tables for common entities are specified in ``rasa_nlu/data/lookups/``
Contributor

wait are they? I don't see a rasa_nlu/data/lookups/ 😄

Contributor Author

I'm still working on them.. may add later. For now will remove this comment.

Contributor

Yeah I'm not sure we'll add them to the NLU repo, or host them elsewhere tbh

@@ -24,3 +25,23 @@ def check_duplicate_synonym(entity_synonyms, text, syn, context_str=""):
if text in entity_synonyms and entity_synonyms[text] != syn:
logger.warning("Found inconsistent entity synonyms while {0}, overwriting {1}->{2}"
"with {1}->{2} during merge".format(context_str, text, entity_synonyms[text], syn))


def generate_lookup_regex(file_path, print_data_size=True):
Contributor

print_data_size isn't used anywhere, please remove it

Contributor Author

ah yes, good catch. Originally I had this as an optional functionality but now just print to logger regardless.

if '' in new_elements:
new_elements.remove('')
lookup_elements += new_elements
regex_string = '(?i)(' + '|'.join(lookup_elements) + ')'
Contributor

hmm, did we not decide on using word boundaries?

Contributor Author

@twhughes twhughes Aug 29, 2018

yes, I hadn't pushed the change yet.

Contributor Author

fixed

def generate_lookup_regex(file_path, print_data_size=True):
    """creates a regex out of the contents of a lookup table file"""
    lookup_elements = []
    with open(file_path, 'r') as f:
Contributor

io.open

Contributor Author

fixed

    """creates a regex out of the contents of a lookup table file"""
    lookup_elements = []
    with open(file_path, 'r') as f:
        for l in f.readlines():
Contributor

for line in f:

Contributor Author

fixed

rasa_nlu/training_data/util.py (outdated, resolved)
    lookup_elements = []
    with open(file_path, 'r') as f:
        for l in f.readlines():
            new_elements = [e.strip() for e in l.split(',')]
Contributor

i'm actually wondering if maybe we should just do a file where you only provide one word/phrase per line, rather than allowing new lines and comma separated?

Contributor Author

The way it is now we can do either or a combination of both. I can't think of a good argument for restricting that. Figured that the user might have different delimiters and wanted to minimize the amount of pre-processing that needed to happen.

Contributor

Yeah I just think it's a bit odd haha, if we tell the user to do one thing it might make things less confusing. If we're leaving this the way it is, then we definitely need to mention in the docs that you can either separate by newline or by comma or both :P @tmbo would still like your opinion on this though

Member

I see the point of @twhughes, but I'd go for restricting it to newlines for the moment as well. If new use cases come up that require separation on other characters, we can still add it again.
The reason is that once we add it, we need to support it for the future, and I'd rather avoid needing to support features where we are unsure about the usefulness.

Contributor

alright @twhughes once you've made that change i'll give this PR another review

Contributor Author

@akelad ok I made the change to the code and docs. only accepts newline-separated lookups now (commas will be treated as part of the phrase)

Contributor

@akelad akelad left a comment

looking good, just two minor things, and let's see what @tmbo says

docs/dataformat.rst (outdated, resolved)
rasa_nlu/featurizers/regex_featurizer.py (outdated, resolved)
Member

@tmbo tmbo left a comment

looking good, the only thing you need to look at is escaping the values before building the regex
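
To illustrate why escaping matters here, the sketch below compares an unescaped and an escaped pattern for a phrase containing a period. The phrases are made up for the example, and the pattern layout is an assumption rather than the PR's exact code; `re.escape` is the standard-library call for this.

```python
import re

phrase = "st. louis"

# Unescaped: the "." is a regex wildcard and matches any character.
unescaped = re.compile(r"(?i)\b(" + phrase + r")\b")
# Escaped: every character of the phrase must match literally.
escaped = re.compile(r"(?i)\b(" + re.escape(phrase) + r")\b")

text = "the stx louis depot"
print(bool(unescaped.search(text)))  # True: "." matched the "x"
print(bool(escaped.search(text)))    # False: a literal period is required

# A phrase with an unbalanced parenthesis makes the unescaped pattern invalid.
try:
    re.compile(r"(?i)\b(" + "elm st (north" + r")\b")
except re.error:
    print("unescaped parenthesis makes the pattern invalid")
```

Without escaping, a lookup file can therefore produce false matches or fail to compile, depending on which characters its phrases contain.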

data/test/lookup_tables/lookup_table.md (outdated, resolved)
rasa_nlu/featurizers/regex_featurizer.py (outdated, resolved)
rasa_nlu/featurizers/regex_featurizer.py (outdated, resolved)
rasa_nlu/featurizers/regex_featurizer.py (outdated, resolved)
rasa_nlu/training_data/formats/markdown.py (outdated, resolved)
Contributor

@akelad akelad left a comment

Just the one small thing about the error message, then we can merge!

try:
    f = io.open(lookup_elements, 'r')
except IOError:
    raise ValueError("Could not load lookup table {}".format(lookup_elements))
Contributor

could you change this to "Could not load lookup table {}. Make sure you've provided the correct path"
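
A sketch of the file-loading step with the suggested message applied. The wrapper `read_lookup_file` is a hypothetical name for this example; only the `io.open` call, the `IOError` handler, and the message text come from the thread.

```python
import io


def read_lookup_file(path):
    """Read a newline-separated lookup file, failing with a descriptive error."""
    try:
        f = io.open(path, "r", encoding="utf-8")
    except IOError:
        # Re-raise as ValueError so the caller sees which path failed.
        raise ValueError(
            "Could not load lookup table {}. "
            "Make sure you've provided the correct path".format(path)
        )
    with f:
        return [line.strip() for line in f if line.strip()]
```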

@twhughes twhughes merged commit 679c069 into master Sep 11, 2018
Member

tmbo commented Sep 11, 2018

Great work 🚀

@tmbo tmbo deleted the regex_phrase_matcher branch September 11, 2018 15:18
@tmbo tmbo mentioned this pull request Oct 19, 2018
4 tasks
7 participants