anysplit: Fix split to more than 2 parts #481

Merged: 4 commits, Jan 25, 2017

@ampli ampli commented Jan 24, 2017

Main fixes:

  1. General: Contraction dict validation should not be done in the tokenizer if anysplit is used.
  2. Fix a problem in the MOR- regexes.
  3. Issue a prefix when a word is split into more than 2 tokens.

The number of suffixes is still variable.
This can be fixed once there is a definition of how to perform the marking in
case of more than 3 tokens.
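
To make the marking question concrete, here is a minimal sketch of one possible scheme. It is not the anysplit code: the function name `mark_split` and the token shapes (`x=` for a prefix, `x.=` for a stem, `=x` for a suffix) are assumptions made for illustration only.

```python
def mark_split(parts):
    """Mark the parts of one split word (hypothetical scheme, not anysplit's).

    Assumed shapes: prefix "x=", stem "x.=", suffix "=x".
    """
    if len(parts) == 1:
        return list(parts)                 # unsplit word, no marks
    if len(parts) == 2:
        stem, suffix = parts
        return [stem + ".=", "=" + suffix]  # stem + one suffix
    # More than 2 parts: the first becomes a prefix, the second the stem,
    # and every remaining part a suffix, so the suffix count is variable.
    prefix, stem, *suffixes = parts
    return [prefix + "=", stem + ".="] + ["=" + s for s in suffixes]


print(mark_split(["te", "st"]))
print(mark_split(["t", "es", "t"]))
print(mark_split(["t", "e", "s", "t"]))
```

In this sketch a 4-way split yields two suffixes, which is the open point: the marking for that case still needs a definition.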


I noticed a bad interaction between the random splitting and the current way sane-morphism is done:
A sentence may have many millions of parses, but only a few of them are displayed (sometimes as few as 1-3).

This happens due to a combination of several things:

  1. Every word is broken into many alternatives.
  2. The parser currently does a "mix&match" of the alternatives.
  3. The density of "good" linkages (ones that do not mix alternatives) is very low.
  4. The linkage array (for the regular parser) gets filled before sane-morphism is enforced.

Due to that, only very few "sane" linkages remain.

Possible fix:
Perform the sane-morphism check on the fly, while filling the linkage array. This will also simplify the program.
A lower !limit can then be used to avoid issuing the default 1000 linkages.
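
The difference between the two orderings can be sketched with a toy model. This is not the link-grammar parser: a "linkage" here is just a tuple recording, for each word, which alternative each of its token slots came from, and it is "sane" when no word mixes alternatives. All names and sizes are assumptions, and the candidates are enumerated in a fixed order rather than sampled as the real parser does.

```python
from itertools import islice, product


def is_sane(linkage):
    # Sane: within each word, every token slot uses the same alternative.
    return all(len(set(word)) == 1 for word in linkage)


def candidates(n_words, n_slots, n_alts):
    # Toy "mix&match": every slot of every word may pick any alternative.
    word_choices = list(product(range(n_alts), repeat=n_slots))
    return product(word_choices, repeat=n_words)


def fill_then_filter(cands, limit):
    # Current behaviour (as described above): fill the array, then check.
    array = list(islice(cands, limit))
    return [lk for lk in array if is_sane(lk)]


def filter_while_filling(cands, limit):
    # Proposed behaviour: only sane linkages ever enter the array.
    return list(islice((lk for lk in cands if is_sane(lk)), limit))


late = fill_then_filter(candidates(3, 2, 3), limit=50)
early = filter_while_filling(candidates(3, 2, 3), limit=50)
print(len(late), len(early))
```

With the check done while filling, the limit counts sane linkages only, so it can be set much lower without starving the output.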

A similar problem currently also happens in English for some sentences with many linkages
(when a large portion of them are insane due to bogus unit-splitting attempts).

I can send a PR if this is a good fix.

ampli added 4 commits January 24, 2017 21:30
Fixes:
MOR-PREF: It should not contain a wildcard before the '=', since
this also matches a subscript mark.
MOR-STEM: It should not contain a '=' at its start.
MOR-SUFF: Add '=' at its start.

Also add the ^ and $ anchors.

N.B.
MOR-SUFF could work as is (w/o '=' at its start), because nothing
before it matches strings starting with '='. However, I think that
for efficiency and readability it is better to start it with "^=".
The number of suffixes is still variable.
This can be fixed, given a definition of how to perform the marking in
case of more than 3 tokens.
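
The effect of the regex fixes can be illustrated with a small sketch. The patterns below are not the ones in the dictionary's regex file; they are hypothetical stand-ins that assume '=' is the infix mark, '.' is the subscript mark, and tokens look like `t=` (prefix), `es.=` (stem) and `=t` (suffix).

```python
import re

# Hypothetical patterns for illustration only (not the dictionary's regexes).
LOOSE_PREF = re.compile(r".*=$")            # wildcard before '=': too greedy

MOR_PREF = re.compile(r"^[^.=]+=$")         # no subscript mark before '='
MOR_STEM = re.compile(r"^[^=][^.]*\.=$")    # must not start with '='
MOR_SUFF = re.compile(r"^=[^=.]+$")         # explicitly starts with '='

PATTERNS = [("MOR-PREF", MOR_PREF), ("MOR-STEM", MOR_STEM), ("MOR-SUFF", MOR_SUFF)]


def classify(token):
    # Return the name of the first pattern that matches, or None.
    for name, pattern in PATTERNS:
        if pattern.search(token):
            return name
    return None


# The loose prefix pattern also accepts a stem carrying a subscript mark.
print(bool(LOOSE_PREF.search("es.=")))
for tok in ["t=", "es.=", "=t"]:
    print(tok, classify(tok))
```

Anchoring with ^ and $ makes each class reject tokens that merely contain a matching substring, so the three classes no longer overlap in this sketch.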
@ampli ampli changed the title anysplit: Skip contraction check anysplit: Fix split to more than 2 parts Jan 25, 2017
@linas linas merged commit cbfc910 into opencog:master Jan 25, 2017

linas commented Jan 25, 2017

thanks!


linas commented Jan 25, 2017

This seems to mostly fix it, but... well, let me open a new issue.


linas commented Jan 25, 2017

Issue #482 describes the problem

@ampli ampli mentioned this pull request Jan 25, 2017
@ampli ampli deleted the utf8anysplit branch February 10, 2017 21:06