anysplit: Fix split to more than 2 parts #481

ampli · 2017-01-24T19:33:53Z

Main fixes:

General: Contraction dict validation should not be done in the tokenizer if anysplit is used.
Fix a problem in the MOR- regexes.
Issue a prefix when a word is split into more than 2 tokens.

The number of suffixes is still variable.
This can be fixed once there is a definition of how to perform the marking in
case of more than 3 tokens.
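To make the open question concrete, here is a minimal sketch of one possible marking scheme. It is an assumption for illustration, not the anysplit code: the function name `mark_parts` and the mark strings (`=` after a prefix, `.=` after a stem, `=` before a suffix) are hypothetical, and everything after the stem is simply treated as a suffix, which is why the number of suffixes varies with the number of tokens.

```python
def mark_parts(parts):
    """Mark the parts of one split word (hypothetical scheme, for illustration).

    1 part   -> the word itself, unmarked
    2 parts  -> stem, suffix
    3+ parts -> prefix, stem, then a variable number of suffixes
    """
    if len(parts) == 1:
        return list(parts)
    if len(parts) == 2:
        stem, suffix = parts
        return [stem + ".=", "=" + suffix]
    # More than 2 tokens: the first one is issued as a prefix.
    prefix, stem, *suffixes = parts
    return [prefix + "=", stem + ".="] + ["=" + s for s in suffixes]


if __name__ == "__main__":
    for split in (["word"], ["wo", "rd"], ["w", "or", "d"], ["w", "o", "r", "d"]):
        print(split, "->", mark_parts(split))
```

With 4 parts this scheme yields two suffixes; a fixed number of suffixes would need a different rule for the extra tokens, which is the definition that is still missing.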

I noticed a bad interaction between random splitting and the way sane-morphism is currently done:
A sentence may have many millions of parses, but only a few of them are displayed (sometimes as few as 1-3).

This happens due to a combination of several things:

Every word is broken into many alternatives.
The parser currently does "mix&match" of the alternatives.
The density of "good" linkages (w/o a mix of alternatives) is very low.
The linkage array (for the regular parser) gets filled before sane-morphism is enforced.

Due to that, only very few "sane" linkages remain.

Possible fix:
Perform the sane-morphism check on the fly when filling the linkage array. This will also simplify the program.
A lower !limit can then be used to prevent issuing the default 1000 linkages.
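The effect can be illustrated with a toy model. This is only a sketch under stated assumptions, not the parser's code: the constants, the function names, and the simplified notion of "sane" (all token slots of a word come from the same alternative) are hypothetical, and the enumeration is deterministic rather than random.

```python
from itertools import islice, product

N_WORDS = 3   # words in the toy sentence
N_ALTS = 4    # alternatives (splits) per word
N_PARTS = 2   # tokens per alternative
LIMIT = 100   # stand-in for the linkage limit


def all_linkages():
    """Enumerate every mix-and-match choice: one alternative index per token slot."""
    return product(range(N_ALTS), repeat=N_WORDS * N_PARTS)


def is_sane(linkage):
    """Sane here means: all token slots of a word use the same alternative."""
    for w in range(N_WORDS):
        slots = linkage[w * N_PARTS:(w + 1) * N_PARTS]
        if len(set(slots)) != 1:
            return False
    return True


def fill_then_filter(limit=LIMIT):
    """Fill the array first, then drop the insane linkages."""
    filled = list(islice(all_linkages(), limit))
    return [lk for lk in filled if is_sane(lk)]


def filter_while_filling(limit=LIMIT):
    """Check sanity on the fly, so only sane linkages occupy the array."""
    return list(islice(filter(is_sane, all_linkages()), limit))


if __name__ == "__main__":
    print("fill, then filter:   ", len(fill_then_filter()))
    print("filter while filling:", len(filter_while_filling()))
```

In this model there are 4096 mix-and-match combinations, of which 64 are sane. Filling 100 slots first and filtering afterwards leaves 8 sane linkages, while checking on the fly finds all 64.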

A similar problem currently also happens in English for some sentences with many linkages
(when a large portion of them are insane due to bogus unit-splitting attempts).

I can send a PR if this is a good fix.

Fixes:
MOR-PREF: It should not contain a wildcard before the '=', since this matches also a subscript mark.
MOR-STEM: It should not contain a '=' in its start.
MOR-SUFF: Add '=' at its start. Also add ^ and $.
N.B. MOR-SUFF could work as is (w/o '=' at its start), because nothing before it matches strings starting with '='. However, I think that for efficiency and readability it is better to start it with "^=".
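The reasoning behind these regex fixes can be sketched as follows. The patterns are hypothetical stand-ins written for Python's `re` module, not the ones in the dictionary files, and the token shapes (`pre=`, `stem.=`, `=suf`, with '.' as the subscript mark) are an assumption made for the illustration.

```python
import re

# Hypothetical character class for the letters of a token part.
WORD = r"[A-Za-z0-9']+"

# A prefix pattern with a wildcard before the '=' (the kind of pattern being fixed).
LOOSE_PREF = re.compile(r"^.+=$")

# Anchored patterns in the spirit of the fixes above (assumed, for illustration).
PATTERNS = [
    ("MOR-PREF", re.compile(rf"^{WORD}=$")),    # no wildcard before '='
    ("MOR-STEM", re.compile(rf"^{WORD}\.=$")),  # does not start with '='
    ("MOR-SUFF", re.compile(rf"^={WORD}$")),    # starts with '=', anchored by ^ and $
]


def classify(token):
    """Return the names of all patterns that match the token."""
    return [name for name, rx in PATTERNS if rx.match(token)]


if __name__ == "__main__":
    for token in ("pre=", "stem.=", "=suf"):
        loose = bool(LOOSE_PREF.match(token))
        print(token, classify(token), "loose prefix match:", loose)
```

With the wildcard, the prefix pattern also accepts `stem.=`, because the wildcard consumes the subscript mark. With the anchored patterns each token shape matches exactly one class.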


linas · 2017-01-25T19:33:18Z

thanks!

linas · 2017-01-25T20:18:59Z

this seems to mostly fix, but .. well, let me open a new issue..

linas · 2017-01-25T20:39:22Z

Issue #482 describes the problem

ampli added 4 commits January 24, 2017 21:30

anysplit: Skip contraction check

75158d0

anysplit: Use symbolic names for dbug levels

b788a56

anysplit: Issue a prefix in case of more than 2 tokens

089fb68

The number of suffixes is still variable. This can be fixed, given a definition of how to perform the marking in case of more than 3 tokens.

ampli force-pushed the utf8anysplit branch from 9ac9927 to 089fb68 Compare January 25, 2017 01:25

ampli changed the title ~~anysplit: Skip contraction check~~ anysplit: Fix split to more than 2 parts Jan 25, 2017

linas merged commit cbfc910 into opencog:master Jan 25, 2017

ampli mentioned this pull request Jan 25, 2017

anysplit issues. #482

Open

ampli deleted the utf8anysplit branch February 10, 2017 21:06
