Regex phrase matcher #1312

twhughes · 2018-08-13T12:26:44Z

Proposed changes:

Lookup tables may now be specified in the training data. Lookup elements may be included directly as a list of strings. Alternatively, a lookup table may be supplied as an external file with one element per line.

For example:

{
    "rasa_nlu_data": {
        "lookup_tables": [
            {
                "name": "streets",
                "elements":  ["main street", "washington ave", "elm street", "rocky road"]
            }
        ]
    }
}

or in markdown format:

  ## lookup:streets
    - main street
    - washington ave
    - elm street
    - rocky road

External data files may be supplied as well. For example, data/lookup_tables/streets.txt may contain

main street
washington ave
elm street
rocky road

This file can then be loaded by giving its path in place of the list of elements:

{
    "rasa_nlu_data": {
        "lookup_tables": [
            {
                "name": "streets",
                "elements":  "data/lookup_tables/streets.txt"
            }
        ]
    }
}

or, equivalently, in markdown format as:

  ## lookup:streets
  data/lookup_tables/streets.txt
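
As a rough illustration of the two input forms described above, the sketch below resolves an "elements" value that is either an inline list or a path to a newline-separated file. The function name `load_lookup_elements` is hypothetical and is not taken from this PR; the actual loading code in the PR may differ.

```python
import os
import tempfile
from typing import List, Union


def load_lookup_elements(elements: Union[str, List[str]]) -> List[str]:
    """Return lookup phrases from an inline list or a newline-separated file.

    Hypothetical helper for illustration only, not the PR's implementation.
    """
    if isinstance(elements, list):
        return [e.strip() for e in elements if e.strip()]
    # Assumption: a string value is a path to a file with one phrase per line.
    with open(elements, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


# Usage: the same four streets, once from a file and once inline.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "streets.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("main street\nwashington ave\nelm street\nrocky road\n")
    from_file = load_lookup_elements(path)

inline = load_lookup_elements(
    ["main street", "washington ave", "elm street", "rocky road"]
)
```

Under these assumptions both calls yield the same list of phrases, so the rest of the pipeline does not need to know which form was used.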

When lookup tables are supplied in training data, the contents are combined into a large, case-insensitive regex pattern that looks for exact matches in the training examples. The regex will only match phrases that are surrounded by word boundaries, such as spaces, newlines, commas, periods, etc.
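
A minimal sketch of how such a pattern could be assembled with Python's standard `re` module follows. The helper name `build_lookup_regex` is an assumption made for illustration, and the PR's actual pattern construction may differ in its details.

```python
import re
from typing import List


def build_lookup_regex(elements: List[str]) -> "re.Pattern":
    """Combine lookup phrases into one case-insensitive, word-bounded pattern.

    Hypothetical helper for illustration only, not the PR's implementation.
    """
    # Try longer phrases first so "main street" wins over a shorter "main".
    ordered = sorted(elements, key=len, reverse=True)
    # Escape each phrase so characters such as "." are matched literally.
    alternatives = "|".join(re.escape(e) for e in ordered)
    # \b on both sides restricts matches to whole words or phrases.
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


pattern = build_lookup_regex(
    ["main street", "washington ave", "elm street", "rocky road"]
)
match = pattern.search("Meet me at 1223 Main Street, please")
no_match = pattern.search("the main streets are busy")
```

Here `match` covers "Main Street" despite the different casing, while `no_match` is `None` because "streets" does not end at a word boundary after "street".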

These regexes match over multiple tokens, so if "main street" was specified in the lookup table, this would match the tokens of "meet me at 1223 main street" as "0 0 0 0 1 1".
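
The per-token flags can be reproduced with a short sketch. It assumes plain whitespace tokenization and a hypothetical helper `token_match_flags`; the featurizer in the PR works on the pipeline's own tokens and may differ.

```python
import re
from typing import List


def token_match_flags(text: str, phrases: List[str]) -> List[int]:
    """Flag each whitespace-separated token that overlaps a lookup match.

    Hypothetical helper for illustration only, not the PR's implementation.
    """
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )
    # Character spans of every lookup match in the text.
    spans = [m.span() for m in pattern.finditer(text)]
    flags = []
    # Assumption: tokens are maximal runs of non-whitespace characters.
    for tok in re.finditer(r"\S+", text):
        overlaps = any(
            tok.start() < end and tok.end() > start for start, end in spans
        )
        flags.append(1 if overlaps else 0)
    return flags


flags = token_match_flags("meet me at 1223 main street", ["main street"])
# flags == [0, 0, 0, 0, 1, 1]
```

A single regex match spanning two tokens marks both of them, which is what allows multi-word phrases in a lookup table to contribute features.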

The generated regexes are processed identically to the regular regex patterns directly specified in the training data.

Status (please check what you already did):

made PR ready for code review
added some tests for the functionality
updated the documentation
updated the changelog


akelad

A few things, the main one being how we should structure the lookup table. In my opinion it's better to have one word/phrase per line, rather than comma and newline separation. @tmbo your input on this is welcome as well.

akelad · 2018-08-29T12:52:05Z

docs/dataformat.rst

@@ -37,13 +37,16 @@ Examples are grouped by intent, and entities are annotated as markdown links.
    ## regex:zipcode
    - [0-9]{5}

+    ## lookup:streets
+    - path/to/streets.txt


I think we should maybe have a relevant intent example here, otherwise this lookup table makes no sense

What do you mean by this? Like a lookup table that would actually be used in the banking example? I updated it to a lookup table named accounts with the path path/to/accounts.txt

So what I meant is adding one or two lines of intent examples further up, with sentences where a lookup table could be relevant. I'm not sure the lookup table is relevant to the current example. So maybe add a new intent that would make good use of a lookup table. Does that make sense?

Yes, I think it makes sense. I changed it to a lookup table of account names and added a comment



So now I think it is relevant without having to add more examples. For example the synonym pink pig in the example is also not directly relevant. Let me know if you agree.

akelad · 2018-08-29T12:54:48Z

docs/dataformat.rst

+-------------
+Lookup tables in the form of external files can also be specified in the training data.  The externally supplied lookup tables must be in a comma-separated format.  For example, ``data/lookup_tables/streets.txt`` may contain
+
+    main street, washington ave, elm street, ...


Rather than streets, maybe just include one of the test files (plates or drinks)?

akelad · 2018-08-29T12:55:00Z

docs/dataformat.rst

+
+    main street, washington ave, elm street, ...
+
+And can be loaded in along with ``data/lookup_tables/cities.txt`` as:


not sure we really need to show how to load two, one should be enough

ok, changed

akelad · 2018-08-29T12:56:27Z

docs/dataformat.rst

+        }
+    }
+
+When lookup tables are supplied in training data, the contents are combined into a large, case-insensitive regex pattern that looks for exact matches in the training examples.  These regexes match over multiple tokens, so ``main street`` would match ``meet me at 1223 main street at 5 pm`` as ``[0 0 0 0 1 1 0 0 0]``.  These regexes are processed identically to the regular regex patterns directly specified in the training data.  A few lookup tables for common entities are specified in ``rasa_nlu/data/lookups/``


wait are they? I don't see a rasa_nlu/data/lookups/ 😄

I'm still working on them.. may add later. For now will remove this comment.

Yeah I'm not sure we'll add them to the NLU repo, or host them elsewhere tbh

akelad · 2018-08-29T13:02:42Z