Dialect support! #402
Comments
BTW, this is very similar to the "named disjuncts" idea in https://groups.google.com/forum/#!msg/link-grammar/fFUhgSO0oL4/6RPgcRfpBAAJ just using a different syntax. To really get named disjuncts, one should allow the syntax (A+ & B+){some-name} which means that (A+ & B+) is mandatory, and is given "some-name". We can still pretend that "some-name" is a dialect, and we can also give it a cost outside of the dictionary. By default, all names have a cost of zero. |
You propose
to select a disjunct by a dialect.
Similarly, a syntax for different costs by dialect seems desirable to me.
I.e., {DIALECT} / {-DIALECT} just select/deselect the item before them. There may be several other shortcuts needed, like letting one symbol represent several similar dialects that only slightly differ. |
I would rather not make any changes to the cost system. To specify two different costs, just repeat the disjunct twice. |
So for example:
or even
gives one cost for irish, and another for polish. This uses the currently-implemented square-bracket system for costs -- i.e. no changes are needed there. |
I had a format error in my proposal - I didn't intend to propose any change in the cost system...
But if you refer to dialects as "just strings", how do you know to pre-enable all the "not-*" ones when, for example, no dialect is enabled? |
Note that you cannot just merge together disjuncts of different words (without an additional mechanism to find that there was a problem, and what the fix is). For example, the bad sentence is: I ask that because I don't understand yet the fine details of your idea. |
{DIALECT} and {-DIALECT} work; I was only using {not-DIALECT} as a more verbose version. |
For their/there, we would need a mechanism for indicating alternatives, such as that in issue #404 |
I made an initial implementation. I am still not sure about how to enable dialects by default. What I mean is that it seems to me useful that in a given dict, some dialects will be enabled by default and some not. For example, suppose that Another question is how to implement the API for enabling/disabling dialects. Also, it seems useful to have an API to fetch the list of all dialects and their defaults. Examples:
|
Here are some implementation details: I added I still need to write In the post above I wrote:
It actually should be |
Rather than treating dialects as boolean on/off options, perhaps they should be treated as variable costs? So, to fully enable the Irish-English grammar, set the cost to 0.0. To disable it, set it to 3.0 or higher. So, if I think a text is Irish-American tinged, I could set it to maybe 1.0, thus preferring standard grammar, and falling back on an Irish interpretation if standard grammar is not possible. The costs would not be stored in the Should not store the dialect-enable flags (and/or dialect cost) in |
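The cost-based view above can be sketched in a few lines of C. This is only an illustration of the idea, not link-grammar code: the `DialectCost` table, the helper names, and the 2.7 cutoff are assumptions chosen so that a dialect cost of 3.0 or more excludes a disjunct, while 1.0 keeps it available as a less-preferred fallback.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical cutoff: disjuncts costing more than this are not used. */
#define MAX_DISJUNCT_COST 2.7

typedef struct {
    const char *name;  /* dialect tag, e.g. "irish" */
    double cost;       /* 0.0 = fully enabled; above the cutoff = disabled */
} DialectCost;

/* Return the cost assigned to `name`, or `fallback` if the tag is unknown. */
static double dialect_cost(const DialectCost *table, size_t n,
                           const char *name, double fallback)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(table[i].name, name) == 0) return table[i].cost;
    return fallback;
}

/* A disjunct is usable if its own cost plus its dialect cost stays
 * within the cutoff (costs are assumed to be additive here). */
static int disjunct_usable(double base_cost, double dcost)
{
    return base_cost + dcost <= MAX_DISJUNCT_COST;
}
```

With a table such as `{"irish", 1.0}`, an Irish-only disjunct stays usable but ranks behind a zero-cost standard one; raising the entry to 3.0 removes it from consideration.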
What should be the format? (I would rather not use a dict format; this seems awkward to me.)
|
Maybe the |
Yes, either format for Yes, I guess that the curly braces are not needed. For backwards-compatibility, square brackets without any number at all have a default cost of 1.0 -- therefore, we need to keep using square brackets, anyway. For the moment, this seems harmless. That suggests that .... We should support expressions like |
According to the current API style, the dialect setup functions need a library object to store their setting. The current way of changing parsing parameters is by using parse options. So we can add a parse option for it. Another thing to consider in the dialect API is that it may be a good idea to use preset settings that are defined in I first thought that a "dialect object" can be used (something like In any case, it is better to discuss this in detail before I continue my implementation (even though I don't mind experimenting and changing it later). Any initial API will be declared as "experimental" so we will be free to totally reimplement it. |
My basic implementation now supports expression definitions (and handling/displaying) such as Now I have to finish the implementation of
(Disregarding the old notation.) 4.0.dict:
4.0.dialect:
For the parse-options, I propose to add a cost vector:
struct Parse_Options_s
{
...
/* Options governing the parser internals operation */
...
double *dialect_cost; /* Cost associated with dialect tags (NULL=default costs). */
...
}
(My implementation enumerates the dialect tags with internal numbers.)
For manipulating dialects I propose the following API:
void *dialect; // dialect object
void *lg_dialect_create(Dictionary); // return a dialect object
lg_dialect_delete(void *dialect);
lg_dialect_set(void *dialect, const char *dialect_name, bool);
lg_dialect_cost(void *dialect, const char *dialect_name, double);
parse_options_set_dialect(Parse_Options, const void *dialect);
void *parse_options_get_dialect(Parse_Options); |
@linas, I need your input on the above proposal, so I can finish this implementation. |
I implemented LG-dialect as a proof-of-concept at the time of writing the last post here (June 2018). Now I would like to convert it to production code. Some other decisions should also be made. For example, when to convert the dialect costs to actual expression costs. I think the best place is in expression pruning, since by default most of the dialects are expected to be neutralized (represented by a high cost constant). I'm preparing a PR which is a complete implementation, to be regarded as a proposal. I will then be able to make changes and update this PR as needed before it is applied. |
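As a rough sketch of what "converting dialect costs to expression costs at pruning time" could look like, the C fragment below folds a per-dialect cost vector into a flat list of alternatives and discards those that end up above a cutoff. The `Alternative` struct, the `DISABLED_COST` constant, and the flat-array layout are all hypothetical simplifications; the real expression structures are trees and are not shown in this thread.

```c
#include <stddef.h>

/* Hypothetical "neutralized" cost: high enough that the alternative is pruned. */
#define DISABLED_COST 10000.0

typedef struct {
    double cost;      /* cost written in the dictionary */
    int dialect_tag;  /* index into the dialect cost vector, or -1 if untagged */
    int pruned;       /* set to 1 when the alternative is discarded */
} Alternative;

/* Add each alternative's dialect cost to its own cost, then mark as pruned
 * every alternative whose total exceeds max_cost.
 * Returns the number of alternatives that survive. */
static size_t apply_dialect_costs(Alternative *alts, size_t n,
                                  const double *dialect_cost, double max_cost)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (alts[i].dialect_tag >= 0)
            alts[i].cost += dialect_cost[alts[i].dialect_tag];
        alts[i].pruned = (alts[i].cost > max_cost);
        if (!alts[i].pruned) kept++;
    }
    return kept;
}
```

Doing this once, before disjuncts are built, means that alternatives belonging to disabled dialects never reach the later, more expensive stages.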
I need to add a pointer field in |
Looking; there are apparently comments from June that I missed. Sorry! |
Yes, float should be enough. |
This seems like overkill:
I cannot think of a good reason to have this. |
This API:
I assume this will store a vector of names+costs (or a map, name->cost)
what's the bool for? to enable/disable that specific dialect? In that case, something like this would work better:
so that removing it is the same as disabling it. |
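One way to read this suggestion is as a small name-to-cost map in which an absent entry means "disabled". The sketch below is an assumption about the shape of such a container, not the library's API: `DialectMap` and its three helpers are hypothetical names, and the fixed capacity is only there to keep the example short.

```c
#include <stddef.h>
#include <string.h>

#define MAX_DIALECTS 16

typedef struct {
    const char *name[MAX_DIALECTS];  /* borrowed pointers; caller keeps them alive */
    double cost[MAX_DIALECTS];
    size_t count;
} DialectMap;

/* Add or update an entry. Returns 0 on success, -1 if the map is full. */
static int dialect_map_set(DialectMap *m, const char *name, double cost)
{
    for (size_t i = 0; i < m->count; i++) {
        if (strcmp(m->name[i], name) == 0) { m->cost[i] = cost; return 0; }
    }
    if (m->count == MAX_DIALECTS) return -1;
    m->name[m->count] = name;
    m->cost[m->count] = cost;
    m->count++;
    return 0;
}

/* Remove an entry, which disables that dialect. Returns 1 if it was present. */
static int dialect_map_remove(DialectMap *m, const char *name)
{
    for (size_t i = 0; i < m->count; i++) {
        if (strcmp(m->name[i], name) == 0) {
            m->count--;
            m->name[i] = m->name[m->count];  /* move the last entry into the hole */
            m->cost[i] = m->cost[m->count];
            return 1;
        }
    }
    return 0;
}

/* Returns 1 and stores the cost if the dialect is enabled, else returns 0. */
static int dialect_map_lookup(const DialectMap *m, const char *name, double *cost)
{
    for (size_t i = 0; i < m->count; i++) {
        if (strcmp(m->name[i], name) == 0) { *cost = m->cost[i]; return 1; }
    }
    return 0;
}
```

With this shape no separate boolean is needed: setting a cost enables a dialect, and removing the entry disables it.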
Is it expected that all the disjuncts of a specific dialect will have the same cost? |
Oh! Ah! ... I see what you are doing:
...Is that the intent? |
The idea is to be able to select a dialect without specifying its cost at all, as the normal means of selecting a dialect - the cost(s) will be as defined in
Yes!!! |
OK. What's the |
Using your example:
Then to enable the To tune or enable a specific component: Or to turn it off completely: |
I have made about 2/3 of the progress on this project, and found that my proposed interface (and your original one) is very cumbersome to use, especially to support reasonable For now (after I have implemented the cumbersome functions) it seems to me my original proposal (to use a string API) is superior. My current API implementation is:
typedef struct Dialect_Option_s *Dialect_Option; /* Opaque handle. */
Dialect_Option parse_options_get_dialect(Parse_Options opts);
void parse_options_set_dialect(Parse_Options opts, Dialect_Option dopt);
bool lg_dialect_set(Dialect_Option dopt, const char *dialect_vector, bool useit);
#define DIALECT_COST_DISABLE 10000.0 /* A high cost setting to disable disjuncts */
#define DIALECT_COST_REMOVE 10001.0 /* Use the preset cost */
bool lg_dialect_cost(Dialect_Option dopt, const char *dialect_component, double cost);
What is missing but needed is an API to fetch the current settings, like:
const char *lg_dialect_get(Dialect_Option dopt, int i); /* Get the i'th name (NULL if no more) */
char *name, *cost;
void lg_dialect_cost_get(Dialect_Option dopt, int i, name, cost); /* Get a dialect component (NULL if no more) */
The What we need is UI to:
Instead of all of that, I propose this simple (both for use and implementation) way: This will enable dialect names Benefits of this proposal:
|
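To make the string-based approach concrete, here is a minimal parsing sketch. The exact string format is not shown in this thread, so the one used here (a comma-separated list of names, each optionally followed by `=cost`) is an assumption, as are the `DialectSetting` type and the `parse_dialect_string` function.

```c
#include <stdlib.h>
#include <string.h>

#define NAME_LEN 32

typedef struct {
    char name[NAME_LEN];
    double cost;
} DialectSetting;

/* Parse "name[=cost],name[=cost],..." into out[] (assumed format).
 * Names without "=cost" get default_cost.
 * Returns the number of entries, or -1 on a syntax error or overflow. */
static int parse_dialect_string(const char *s, double default_cost,
                                DialectSetting *out, int max)
{
    int n = 0;
    while (*s != '\0') {
        size_t len = strcspn(s, ",");          /* length of this item */
        const char *eq = memchr(s, '=', len);  /* optional "=cost" part */
        size_t name_len = eq ? (size_t)(eq - s) : len;

        if (name_len == 0 || name_len >= NAME_LEN || n == max) return -1;
        memcpy(out[n].name, s, name_len);
        out[n].name[name_len] = '\0';
        out[n].cost = default_cost;

        if (eq != NULL) {
            char num[32];
            char *end;
            size_t num_len = len - name_len - 1;
            if (num_len == 0 || num_len >= sizeof num) return -1;
            memcpy(num, eq + 1, num_len);
            num[num_len] = '\0';
            out[n].cost = strtod(num, &end);
            if (*end != '\0') return -1;       /* junk after the number */
        }
        n++;
        s += len;
        if (*s == ',') {
            s++;
            if (*s == '\0') return -1;         /* trailing comma */
        }
    }
    return n;
}
```

A single string keeps the user-facing surface small, at the price that mistakes in it can only be reported when the string is parsed.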
I forgot to add 2 additional user API calls that are used in the "many-user-API-functions" approach:
/* Create and delete a dialect object. */
Dialect_Option lg_dialect_create(void);
void lg_dialect_delete(Dialect_Option dopt);
I will just implement it in both approaches at once so we can decide which is better. |
The dialect-supporting version has passed my initial tests.
This approach has a slight problem: Errors in setting the This situation can be somewhat improved by making syntax checks (in This problem doesn't exist with the "many-user-API-functions" approach. However, it is complex. I already have a partial implementation of that under My test implementation includes only a minimal English "headline What I can do is to submit a PR of my current work as a request for review, so that you can actually test it and tell me about needed changes before it is applied. I would like to first merge with an updated master branch so I will need PR #1058 to be applied first. |
Another problem to be solved is Possible solutions:
|
I think that having |
I also added a
I used the component name Example from the dict: The current
|
I found it useful to work with a list of disjuncts. But it may be big, so I filter for the needed ones. |
sure. It'll take me a few days to stew over it.
Sure, why not. I assume |
I have a branch in which I applied the classic pruning results back into the expression pruning code. However, this code may serve to display compact expressions that don't produce disjuncts that got removed by BTW, a similar idea may also serve to display expressions with the connectors that participate in the linkage marked in them. |
Not sure I like that. One of the common dictionary debugging problems is answering the question "why didn't connector X attach to connector Y because it seems they should" and if these are already "silently" pruned away, that would be .. confusing. On the other hand, having much shorter disjunct lists would be nice. Perhaps two displays? |
I guess it is possible to show the disjuncts before and after |
If the expression is displayed after pruning you may easily note the case in which the answer is "because it (X or Y) got pruned away".
There is a need for some way in the
The place for the flags may be before a leading |
After a trailing slash seems better, from a usability standpoint: one tries !!word, and then consults the docs, then up-arrow and add the backslash -- less arrow-key usage to look at the multiple versions. Or even print all versions at once... |
The disjunct list is typically tens of thousands of lines... That is not very friendly if you only want the expression. |
I finished implementing |
This discussion includes:
|
I do not anticipate needing that. |
So, I guess this issue can be closed!? |
Oh, wait: it needs to be documented on the website ... |
Eventually, I need to add it to the man page (on the next man page overhaul). |
Added in 9a62ddc. |
Issue #1172 got opened for completing the dialect API. |
Below is a sketch of how to add dialect support, and why it's a good idea.
Currently, {} is used to indicate optional connectors. For example, A+ & {B- & C+} indicates that (B- & C+) is optional.
Let's give options names! These names will be names of dialects! So, for example: A+ & {B- & C+}{irish} means that (B- & C+) is optional, but only if the "irish" dialect is enabled; otherwise, it is never allowed.
In my imagination, this solves zillions of problems. These include:
A) the bad-spelling problem: create a kant-spel dialect that merges together the disjuncts for they're, there and their (and throws in thier, for good measure)
B) enhanced support for ... irish-english, black-american-english, australian-english, hillbilly-basilect, archaic 19th-century English, twitterese, newspaper-headlines
C) Automatic detection of dialects! So, for example, if a sentence does not parse normally, but does parse after enabling some dialect, we can guess that it must be that dialect.
D) post-parse parse-ranking. That is, parse a sentence with all dialects enabled, but then fiddle with the costs associated with each particular dialect. Thus, to turn off the kant-spel dialect, one simply gives those connectors a very high cost, and they would be ranked last.
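Item C (automatic detection) could be sketched as follows: try the standard grammar first, then each dialect in turn, and report the first one that makes the sentence parse. Everything here is hypothetical scaffolding: `ParseFn` stands in for a real parser call, and `toy_parses` is a fake parser that exists only so that the sketch can be exercised.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical parser hook: returns nonzero if `sentence` parses with only
 * `dialect` enabled (NULL means no dialect is enabled). */
typedef int (*ParseFn)(const char *sentence, const char *dialect);

/* Returns "" if the sentence parses with the standard grammar, the name of
 * the first dialect that makes it parse, or NULL if nothing works. */
static const char *guess_dialect(const char *sentence, ParseFn parses,
                                 const char *const *dialects, size_t n)
{
    if (parses(sentence, NULL)) return "";
    for (size_t i = 0; i < n; i++)
        if (parses(sentence, dialects[i])) return dialects[i];
    return NULL;
}

/* Toy stand-in for a real parser, for demonstration only: it accepts any
 * sentence in the standard grammar unless it contains "PM quits", which it
 * accepts only under the "headline" dialect. */
static int toy_parses(const char *sentence, const char *dialect)
{
    if (strstr(sentence, "PM quits") == NULL) return 1;
    return dialect != NULL && strcmp(dialect, "headline") == 0;
}
```

Trying dialects one at a time costs one extra parse per dialect, but only for sentences that fail under the standard grammar.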