Dict idioms #747

Open
ampli opened this issue Apr 21, 2018 · 7 comments
Comments

@ampli
Member

ampli commented Apr 21, 2018

Currently, subscripted idioms are forbidden, so a dict entry like
a_b.c: something;
is considered to be a definition for the word a_b (a word which includes an underbar).
This can be useful if one wants to introduce a word that includes an underbar (there is no other way right now).

But it seems to me that it is more useful to allow subscripted idioms.
I ran into this while checking whether an idiom usage can be "corrected" with the .# device (I still don't have a better name for this idea).
For example (illustration only; this particular correction may not be a good idea):
in_to.#into: [into]0.65;
An example of a possible subscripted idiom:
take_away.p: ...;
(BTW it is not currently in the dict.)

I found that only a minor fix is needed to allow subscripted idioms, and I can send a PR if removing this restriction looks like a good idea.

ampli changed the title from "Why idioms cannot have a subscript?" to "Dict idioms" on Apr 23, 2018
@ampli
Member Author

ampli commented Apr 23, 2018

After a few more changes, the following works fine:

rather_then.#rather_than: rather_than;

(It is an expanded form of the existing commented out entry % rather_then: rather_than;.)
The needed changes were:

  1. Insert idioms into the dict also in their original form.
  2. Don't look at subscripts for underbars.
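The second change can be pictured with a small Python sketch. This is only an illustration, not the link-grammar C code: the helper names `split_subscript` and `idiom_words` are hypothetical, and the rule used here for locating the subscript (the last `.` that is neither the first nor the last character) is an assumption.

```python
def split_subscript(entry: str) -> tuple[str, str]:
    """Split 'take_away.p' into ('take_away', 'p').

    Assumption: the subscript starts at the last '.' that is neither
    the first nor the last character of the entry.
    """
    dot = entry.rfind(".")
    if dot <= 0 or dot == len(entry) - 1:
        return entry, ""
    return entry[:dot], entry[dot + 1:]


def idiom_words(entry: str) -> list[str]:
    """Return the words of an idiom, or [] if the entry is not an idiom.

    Only the part before the subscript is searched for underbars, so an
    underbar inside a subscript such as '#rather_than' is ignored.
    """
    base, _ = split_subscript(entry)
    words = base.split("_")
    # An idiom needs a non-empty word on both sides of every underbar.
    if len(words) < 2 or not all(words):
        return []
    return words
```

With this rule, `rather_then.#rather_than` is an idiom made of `rather` and `then`, while `all.#all_of` is an ordinary word even though its subscript contains an underbar.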

As a bonus, this now works too:

linkparser> !!a_lot
Token "a_lot" matches:
    a_lot                            10  disjuncts
Token "a_lot" expressions:
    a_lot                      [[(({[@M+]0.400 or Mp+} & SJlp+) or ({[@M+]1.400 or [Mp+]} & SJrp-))]] or EC+ or MVa- or ((MVw- & OFw+)) or Wa-

linkparser> !!a_*
(All idioms starting with the word "a" are listed.)
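As a rough picture of what the wildcard lookup does, here is a Python sketch that treats the pattern as a glob over entry names. The entry list is toy data, the function name `lookup` is hypothetical, and the actual matching in link-parser may differ from `fnmatchcase` (which also gives `?` and `[...]` a special meaning).

```python
from fnmatch import fnmatchcase


def lookup(pattern: str, entries: list[str]) -> list[str]:
    """Return the entry names that match a glob-style pattern, sorted."""
    return sorted(name for name in entries if fnmatchcase(name, pattern))


# Toy entry names for illustration only; not taken from the real dict.
toy_entries = ["a_lot", "a_bit", "and_yet", "but_not"]
matches = lookup("a_*", toy_entries)
```

Here `matches` contains `a_bit` and `a_lot` but not `and_yet`, because the pattern requires the underbar right after the word "a".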

When I made change (1) above, I got numerous errors about duplicate idioms.
My guess is that many of them accumulated over time because there was no check for that.
Until they are fixed (if needed), I just allow them by default. They can be listed using:

link-parser --test=dup-idioms

Duplicate examples (total 43):

Ignoring word "and_yet", which has been multiply defined:
	 Line 12142, next tokens: ";" "..y" "*.j" "•" "⁂" 
link-grammar: Error: While parsing dictionary en/4.0.dict:
Ignoring word "but_not", which has been multiply defined:
	 Line 12142, next tokens: ";" "..y" "*.j" "•" "⁂" 

BTW, a change can be introduced to automatically report the line numbers in the dict m4 source (when applicable) if you feel this is more useful.
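The duplicate check itself amounts to grouping entries by name and reporting every name seen more than once. A minimal Python sketch, assuming the entries are available as (name, line number) pairs; the function name `find_duplicates` and the sample data are hypothetical and do not reflect the real dict contents or line numbers.

```python
from collections import defaultdict


def find_duplicates(entries):
    """Map each multiply-defined name to the list of lines defining it.

    `entries` is an iterable of (name, line_number) pairs.
    """
    seen = defaultdict(list)
    for name, line in entries:
        seen[name].append(line)
    return {name: lines for name, lines in seen.items() if len(lines) > 1}


# Toy data for illustration only; the line numbers are made up.
toy = [("and_yet", 10), ("but_not", 20), ("and_yet", 30), ("a_lot", 40)]
dups = find_duplicates(toy)
```

Keeping all line numbers per name, not just the last one, is what would make it possible to compare the definitions and decide whether they should be merged or given distinct subscripts.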

@ampli
Member Author

ampli commented Apr 23, 2018

Not supported yet (but of course can be):

  1. Dict words which contain underbars.
    For example, the following word cannot currently be supported directly: snake_case.
    (I guess it can still be supported for now through a regex.)
    This is not a trivial change, but it is not hard to implement either (the dict definition would use snake\_case).
  2. Correction definitions like these (currently commented out):

% all.#all_of: [all_of]0.65;
It does not work for now because there is no all_of idiom definition. However, it can be made to work nevertheless (I will try that).
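For item 1, the proposed `snake\_case` notation implies that only unescaped underbars separate idiom words. A minimal Python sketch of that splitting rule follows; the function name `split_idiom` is hypothetical, and this is not how the dictionary reader is currently written. It deliberately ignores the case of an escaped backslash before an underbar.

```python
import re


def split_idiom(name: str) -> list[str]:
    r"""Split on underbars not preceded by a backslash, then unescape '\_'."""
    parts = re.split(r"(?<!\\)_", name)
    return [part.replace(r"\_", "_") for part in parts]
```

Under this rule `take_away` yields two words, while `snake\_case` yields the single word `snake_case`.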

BTW, I returned to the idiom-related code after a long pause because I started to actually implement capitalization using the dict (issue #690). While thinking about that, it occurred to me that capitalized words can be seen as a special kind of idiom, and this hints at a possible implementation.

Since my current idiom-related changes seem useful to me, I will send a PR for them.

@linas
Member

linas commented Apr 23, 2018

I like the idiom-printing extension.

I don't understand why idiom subscripts are useful. Subscripting in general does not seem to be all that useful, except that it helps with the authoring of the dictionary, and some of the debugging of the dictionary; I don't think it's useful to end users.

Duplicate entries for idioms seem OK to me.

@ampli
Member Author

ampli commented Apr 24, 2018

> I don't understand why idiom subscripts are useful. Subscripting in general does not seem to be all that useful, except that it helps with the authoring of the dictionary, and some of the debugging of the dictionary; I don't think it's useful to end users.

I see several pros for it, and no cons:

  1. At the least, they are useful for "correction" entries.
  2. They may be useful for idioms that can serve as several POS, like take away.
  3. It removes an exception to the general ability to add a subscript.
  4. It is a trivial change that doesn't introduce any problem, and it can simply be left unused most of the time.
  5. It may be useful in cases that we haven't thought of yet.

> Duplicate entries for idioms seem OK to me.

It is not clear to me that, when one definition of an idiom gets fixed, all its other entries (there may be more than two) are checked to see whether they need a similar fix. In addition, an idiom can appear both in a word list and directly in the dict, and this doesn't seem intentional to me.

EDIT: Fix a typo.

@ampli
Member Author

ampli commented Apr 24, 2018

I just sent PR #751.
I couldn't add a ChangeLog line due to a possible conflict.
Here is the line:

  • Add idiom lookup possibility in link-parser's dict lookup command (!!idiom_here).

@linas
Member

linas commented Apr 24, 2018

I guess we could add subscripts to all 43 duplicate idioms. Could you provide that list, or show me how to get it?

@ampli
Member Author

ampli commented Apr 24, 2018

link-parser -test=dup-idioms
