Dialect support! #402

Closed
linas opened this issue Sep 16, 2016 · 50 comments

Comments

@linas
Member

linas commented Sep 16, 2016

Below is a sketch of how to add dialect support, and why it's a good idea.

Currently, {} is used to indicate optional connectors. For example, A+ & {B- & C+} indicates that (B- & C+) is optional.

Let's give options names! These names will be names of dialects! So, for example: A+ & {B- & C+}{irish} means that (B- & C+) is optional, but only if the "irish" dialect is enabled; otherwise, it is never allowed.

In my imagination, this solves zillions of problems. These include:

A) the bad-spelling problem: create a kant-spel dialect that merges together the disjuncts for they're, there and their (and throws in thier, for good measure)

B) enhanced support for ... irish-english, black-american-english, australian-english, hillbilly-basilect, archaic 19th-century English, twitterese, newspaper-headlines

C) Automatic detection of dialects! So, for example, if a sentence does not parse normally, but does parse after enabling some dialect, we can guess that it must be that dialect.

D) post-parse parse-ranking. That is, parse a sentence with all dialects enabled, but then fiddle with the costs associated with each particular dialect. Thus, to turn off the kant-spel dialect, one simply gives those connectors a very high cost, and they would be ranked last.
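
Point C above could be sketched roughly as follows. This is only an illustration under stated assumptions: parse_fn, toy_parse and guess_dialect are hypothetical names that do not exist in the library, and the toy parser merely stands in for "run a real parse with this dialect enabled".

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical parser callback: returns nonzero if `sentence` parses with
 * `dialect` enabled (NULL means that no dialect is enabled). */
typedef int (*parse_fn)(const char *sentence, const char *dialect);

/* Toy stand-in for a real parser: a sentence containing "thier" parses
 * only when the "kant-spel" dialect is enabled. */
int toy_parse(const char *sentence, const char *dialect)
{
    if (strstr(sentence, "thier") == NULL) return 1;
    return dialect != NULL && strcmp(dialect, "kant-spel") == 0;
}

/* Returns "" if the sentence parses normally, the name of the first
 * dialect that makes it parse, or NULL if no dialect helps. */
const char *guess_dialect(parse_fn parse, const char *sentence,
                          const char *const *dialects, size_t n)
{
    if (parse(sentence, NULL)) return "";
    for (size_t i = 0; i < n; i++)
        if (parse(sentence, dialects[i])) return dialects[i];
    return NULL;
}
```

Trying the dialects one at a time keeps the guess unambiguous; enabling several at once would need a ranking step on top of this.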

@linas
Member Author

linas commented Sep 16, 2016

BTW, this is very similar to the "named disjuncts" idea in https://groups.google.com/forum/#!msg/link-grammar/fFUhgSO0oL4/6RPgcRfpBAAJ just using a different syntax.

To really get named disjuncts, one should allow the syntax (A+ & B+){some-name} which means that (A+ & B+) is mandatory, and is given "some-name". We can still pretend that "some-name" is a dialect, and we can also give it a cost outside of the dictionary. By default, all names have a cost of zero.

@ampli
Member

ampli commented Sep 16, 2016

You propose

{DISJUNCT}{DIALECT}

to select a disjunct by a dialect.
But what if I want to deselect disjuncts? It can be very cumbersome if disjuncts can only be included by dialect.
Maybe:

{DISJUNCT}{-DIALECT}

Similarly, a syntax for different costs by dialect seems desirable to me.
Maybe:

{DISJUNCT}[default_cost][cost_for_DIALECT1]{DIALECT1}

I.e., {DIALECT} / {-DIALECT} just select/deselect the item before them.

There may be several other shortcuts needed, like letting one symbol represent several similar dialects that only slightly differ.

@linas
Member Author

linas commented Sep 17, 2016

Since DIALECT is just a string, -DIALECT would just be a naming convention. So one could say, for example, {F- & G+}{not-irish} and then just always disable not-irish whenever irish is enabled.

I would rather not make any changes to the cost system. To specify two different costs, just repeat the disjuncts twice.

@linas
Member Author

linas commented Sep 17, 2016

So for example:

([A+ & B+]0.5){irish} or ([A+ & B+]1.75){polish}

or even

<a-and-b>: A+ & B+;
([<a-and-b>]0.5){irish} or ([<a-and-b>]1.75){polish}

gives one cost for irish, and another for polish. This uses the currently-implemented square-bracket system for costs -- i.e. no changes are needed there.

@ampli
Member

ampli commented Sep 17, 2016

I had a format error in my proposal - I didn't intend to propose any change in the cost system...
I can repeat it with a proper format, but it seems to me it is best to see if such shortcuts are needed after actually using dialects widely.

Since DIALECT is just a string, -DIALECT would just be a naming convention. So one could say, for example, {F- & G+}{not-irish} and then just always disable not-irish whenever irish is enabled.

But if you refer to dialects as "just strings", how do you know to pre-enable all "not-*" when, for example, no dialect is enabled?

@ampli
Member

ampli commented Sep 17, 2016

A) the bad-spelling problem: create a kant-spel dialect that merges together the disjuncts for they're, there and their (and throws in thier, for good measure)

Note that you cannot just merge together disjuncts of different words (without an additional mechanism to find that there was a problem, and what the fix is).
If you define the word "their" also with the disjuncts of "there", you will get a "strange" linkage (e.g. the word "their" will appear with the disjuncts of "there").
How can you then find out whether a correction is needed and what the exact correction is?

For example, the bad sentence is:
*It is hot their.
What does the dictionary (for fixing "there" and "their" only) look like?
What does the linkage then look like?

I ask that because I don't understand yet the fine details of your idea.
I have my own proposal, as mentioned in my group posts, but I would like to understand yours.

@linas
Member Author

linas commented Sep 27, 2016

{DIALECT} and {-DIALECT} work; I was only using {not-DIALECT} as a more verbose version.

@linas
Member Author

linas commented Sep 27, 2016

For their/there, we would need a mechanism for indicating alternatives, such as that in issue #404

@ampli
Member

ampli commented May 19, 2018

I made an initial implementation.
In this implementation {tag} can name any sub-expression.
I called it "expression tag" and not "expression name" because "expression name" is already used by <name>.

I am still not sure about how to enable dialects by default. What I mean is that it seems useful to me that, in a given dict, some dialects will be enabled by default and some not. For example, suppose that {irish} is enabled by default, but {headline} is not. This means that there should be a definition in the dict that declares which of the dialects are enabled and which are disabled by default. I don't have an idea how to provide such a list, since there is currently no good way to define strings (using connectors for that is too cumbersome). Maybe we can add a #define directive (that will be used for version etc. too),
or use something like:
<default-dialect>: (){irish};
<default-dialect>: (){-headline}; % News writing style.

Another question is how to implement the API for enabling/disabling dialects.
One way is to make it a string like "dialect1,dialect2" to enable these dialects, or "-dialect3" to disable dialect3 if it is enabled by default. Another way is to use a NULL-terminated char ** argument.

Also, it seems useful to have an API to fetch the list of all dialects and their defaults.

Examples:

<a-and-b>: A+ & B+;
tt: (XXX+{test} & ({YYY+}{test1} or X1-){test2}) or
([<a-and-b>]0.5){irish} or ([<a-and-b>]1.75){polish} or {F- & G+}{testit};

<no-det-null>: [[[[()]]]]{headline} or (){-headline};

@ampli ampli self-assigned this May 20, 2018
@ampli
Member

ampli commented May 20, 2018

Here are some implementation details:

I added char *tag field to the Exp struct.
It is set when reading the expression from the dict.
(I also modified accordingly print_expression_parens() and fixed a bug in it regarding printing a costly null.)
At the start of expression_prune() a modified purge_Exp() is called to purge the expressions with disabled tags. This is done by making them a null expression. I hope this is the required semantics....

I still need to write dictionary_get_dialects() and parse_options_*_dialects(), but I need some input regarding my previous post.

In the post above I wrote:

<no-det-null>: [[[[()]]]]{headline} or (){-headline};

It actually should be <no-det-null>: [[[[()]]]]{-headline} or (){headline};

@linas
Member Author

linas commented May 21, 2018

Rather than treating dialects as boolean on/off options, perhaps they should be treated as variable costs? So, to fully enable the Irish-English grammar, set the cost to 0.0. To disable it, set it to 3.0 or higher. So, if I think a text is Irish-American tinged, I could set it to maybe 1.0, thus preferring standard grammar, and falling back on an Irish interpretation if standard grammar is not possible.

The costs would not be stored in the Exp struct, they would be external.

Should not store the dialect-enable flags (and/or dialect cost) in 4.0.dict -- that would be confusing. A distinct file would be better. For an API to over-ride this file contents, I think that something like lg_dialect(const char*, bool); or lg_dialect(const char*, double); would suffice, where the char string is just a single dialect name; no comma separators, no +/- characters in the string.
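
The variable-cost idea can be illustrated with a few lines of C. This is a minimal sketch, not the library's code: effective_cost, is_usable and the cutoff parameter are hypothetical names, and the cutoff value is whatever the caller chooses.

```c
#include <stdbool.h>

/* Effective cost of an expression carrying a dialect tag: the cost written
 * in the dictionary plus the externally stored cost of that dialect. */
double effective_cost(double dict_cost, double dialect_cost)
{
    return dict_cost + dialect_cost;
}

/* An expression stays usable only while its effective cost is within the
 * cutoff. The cutoff is an assumed parameter supplied by the caller. */
bool is_usable(double dict_cost, double dialect_cost, double cutoff)
{
    return effective_cost(dict_cost, dialect_cost) <= cutoff;
}
```

With a dialect cost of 0.0 the tagged expression competes on equal terms with standard grammar, with 1.0 it is only a fallback, and with a cost above the cutoff it is effectively disabled.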

@ampli
Member

ampli commented May 21, 2018

  • If dialect labels are used as variable costs, do we still need the {} syntax?
    I.e., can we use something like [()]headline instead of [()]{headline}. Or maybe the [] are also not needed?
  • I started to look at the direct implementation when I was thinking of a possible pseudo-morphology implementation. In order to test some ideas I needed a way to identify sub-expressions in order to manipulate (or avoid manipulating) them. I guess it will still be possible to use labels also for expression identification, something like (A+ & (B+ or C+))label (or (A+ & (B+ or C+)){label} if you think it is better to keep the {}).

A distinct file would be better.

4.0.dialect?

What should be the format? (I would rather not use a dict format; this seems awkward to me.)
A simple one can be something like:
headline: 4.0
But maybe we can have names for preset values:

[no-det]
default: 4
headline: 0.2

@ampli
Member

ampli commented May 21, 2018

Maybe the {} has a benefit in that a default cost can be used, like [A+]0.5{something}.

@linas
Member Author

linas commented May 21, 2018

Yes, either format for 4.0.dialect would be fine.

Yes, I guess that the curly braces are not needed.

For backwards-compatibility, square brackets without any number at all have a default cost of 1.0 -- therefore, we need to keep using square brackets, anyway. For the moment, this seems harmless. That suggests that ....

We should support expressions like [[A+]0.5]0.3 which would mean that A+ has a total cost of 0.8. Likewise, both [[A+]0.5]something and [[A+]something]0.5 should have a cost of something+0.5
I expect that these last two might get used a lot.
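
The additive rule for nested brackets can be sketched as below. This is an illustration only: Cost_Layer, Tag_Cost, tag_cost and total_cost are hypothetical names, and giving unknown tags a cost of 0 is an assumption (it matches the "all names have a cost of zero by default" remark earlier in the thread).

```c
#include <stddef.h>
#include <string.h>

#define BARE_BRACKET_COST 1.0   /* [X] with no number costs 1.0 */

/* One bracket layer: either a literal number or a dialect tag. */
typedef struct {
    const char *tag;   /* NULL for a numeric layer */
    double cost;       /* used only when tag == NULL */
} Cost_Layer;

/* Externally stored cost of one dialect tag. */
typedef struct {
    const char *tag;
    double cost;
} Tag_Cost;

/* Look up a tag; unknown tags cost 0 in this sketch (an assumption). */
double tag_cost(const char *tag, const Tag_Cost *table, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(table[i].tag, tag) == 0) return table[i].cost;
    return 0.0;
}

/* The total cost is the sum over all enclosing bracket layers, so the
 * order of numeric and tagged layers does not matter. */
double total_cost(const Cost_Layer *layers, size_t nlayers,
                  const Tag_Cost *table, size_t ntags)
{
    double sum = 0.0;
    for (size_t i = 0; i < nlayers; i++)
        sum += layers[i].tag ? tag_cost(layers[i].tag, table, ntags)
                             : layers[i].cost;
    return sum;
}
```

A bare [A+] would be passed as a single numeric layer with BARE_BRACKET_COST.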

@ampli ampli mentioned this issue Jun 3, 2018
@ampli
Member

ampli commented Jun 6, 2018

lg_dialect(const char*, bool); or lg_dialect(const char*, double);

According to the current API style, the dialect setup functions need a library object to store their setting.
In any case, their setting is dict-related, so they would need to get the dict as an argument.
However, I think it may not be a good idea to store the current dialect setup in the dict struct (e.g. to allow using the same dict from different threads).

The current way of changing parsing parameters is by using parse options. So we can add a parse option for it.

Another thing to consider in the dialect API is that it may be a good idea to use preset settings that are defined in 4.0.dialect, but there is also a need for programmatically set dialect cost values (for development and debug - it is very cumbersome to require that any tweaks will be done only by changing the dict/dialect files).

I first thought that a "dialect object" could be used (something like dialect = dialect_create()); this object would then be used by the dialect setting functions, and finally be provided to parse_options_set_dialect(). The parse_options_set_*() API currently accepts only simple types and not arbitrary objects, but I don't have any argument why we cannot supply it with a "complex" opaque type.

In any case, it is better to discuss this in detail before I continue my implementation (even though I don't mind experimenting and changing it later). Any initial API will be declared as "experimental" so we will be free to totally reimplement it.

@ampli
Member

ampli commented Jun 12, 2018

My basic implementation now supports expression definitions (and handling/displaying) such as [[A+]0.5]0.3, [[A+]0.5]something, [[A+]something]0.5 and even [[A+]something]more.

Now I have to finish the implementation of 4.0.dialect and its setting API.
I have specific proposals for the file format and the API, based on how I expect dialects will actually be defined and used. It may be that my expectations are not correct, or that changes/refinements are needed.
Hence we have to discuss it before my initial implementation (even though I will not have a problem changing it later as needed).

([A+ & B+]0.5){irish} or ([A+ & B+]1.75){polish}

(Disregarding the old notation.)
Suppose a support of "irish" dialect is added.
Of course it would not end in modifying costs of a single expression.
So we would have many expressions in the dict with costs which depend on this "irish" dialect.
It doesn't seem a good idea to force using a single cost addition in all of these expressions.
Instead, I propose to provide an infrastructure for a vector of costs, as follows (example):

4.0.dict:

<a>: X- & [[A+ & B+]0.5]irish_a or Y+;
<b>: Z- & ([A+]irish_b & B+);

4.0.dialect:

[default]
% default costs
no_headline

[irish]
irish_a: 0.8
irish_b: 2

[no_headline]
headline: 4

For the parse-options, I propose to add a cost vector:

struct Parse_Options_s
{
...
/* Options governing the parser internals operation */
...
	double *dialect_cost;  /* Cost associated with dialect tags (NULL=default costs). */
...
}

(My implementation enumerates the dialect tags with internal numbers.)

For manipulating dialects I propose the following API:

void *dialect; // dialect object
void *lg_dialect_create(Dictionary); // return a dialect object
lg_dialect_delete(void *dialect);

lg_dialect_set(void *dialect, const  char *dialect_name, bool);
lg_dialect_cost(void *dialect, const  char *dialect_name, double);

parse_options_set_dialect(Parse_Options, const void *dialect);
void *parse_options_get_dialect(Parse_Options);

@ampli
Member

ampli commented Jun 23, 2018

@linas, I need your input on the above proposal, so I can finish this implementation.

@ampli ampli mentioned this issue Nov 7, 2018
@ampli ampli mentioned this issue Jul 20, 2019
@ampli
Member

ampli commented Dec 4, 2019

I implemented LG-dialect as a proof-of-concept at the time of writing the last post here (June 2018).
It didn't include an API at all.

Now I would like to convert it to production code.
There is no "natural" API to use, so I intend to develop something that is both easy to use and to program. There is also a need for a link-parser UI.

Some other decisions should also be made. For example, when to convert the dialect costs to actual expression costs. I think the best place is in expression pruning, since by default most of the dialects are expected to be neutralized (represented by a high cost constant).

I'm preparing a PR which is a complete implementation, to be regarded as a proposal. I will then be able to make changes and update this PR as needed before it is applied.

@ampli
Member

ampli commented Dec 4, 2019

I need to add a pointer field in Exp_struct, which is now 32 bytes, but I would not like to make it bigger.
This is possible if I change cost from double to float.
Is there a special reason that costs are represented by double and not float?

@linas
Member Author

linas commented Dec 4, 2019

Looking; there are apparently comments from June that I missed. Sorry!

@linas
Member Author

linas commented Dec 4, 2019

float

Yes, float should be enough.

@linas
Member Author

linas commented Dec 4, 2019

This seems like overkill:

[irish]
irish_a: 0.8
irish_b: 2

I cannot think of a good reason to have this.

@linas
Member Author

linas commented Dec 4, 2019

This API:

void *dialect;

I assume this will store a vector of names+costs (or a map, name->cost)

lg_dialect_set(void *dialect, const  char *dialect_name, bool);

what's the bool for? to enable/disable that specific dialect? In that case, something like this would work better:

void lg_dialect_add(void *dialect, const  char *dialect_name, double cost);
void lg_dialect_remove(void *dialect, const  char *dialect_name);

so that removing it is the same as disabling it.

@ampli
Member

ampli commented Dec 4, 2019

[irish]
irish_a: 0.8
irish_b: 2

I cannot think of a good reason to have this.

Is it expected that all the disjuncts of a specific dialect will have the same cost?

@linas
Member Author

linas commented Dec 4, 2019

Is it expected that all the disjuncts of a specific dialect will have the same cost?

Oh! Ah! ... I see what you are doing:

[named-vector]
vector-component-a: 0.2
vector-component-b: 0.8

...Is that the intent?

@ampli
Member

ampli commented Dec 4, 2019

lg_dialect_set(void *dialect, const  char *dialect_name, bool);

The idea is to be able to select a dialect without specifying its cost at all, as the normal means of selecting a dialect - the cost(s) will be as defined in 4.0.dialect. The other function, lg_dialect_cost(), is for cases in which it is desired to play with the costs.

Oh! Ah! ... I see what you are doing:
...
...Is that the intent?

Yes!!!

@linas
Member Author

linas commented Dec 4, 2019

OK. What's the bool for?

@ampli
Member

ampli commented Dec 4, 2019

Using your example:

[named-vector]
vector-component-a: 0.2
vector-component-b: 0.8

Then to enable the named-vector dialect:
lg_dialect_set(dialect, "named-vector", true);

To tune or enable a specific component:
lg_dialect_cost(dialect, "vector-component-b", 1.2);

Or to turn it off completely:
lg_dialect_cost(dialect, "vector-component-b", 9999.0);
(I defined a constant DIALECT_DISABLED for that, maybe it should be renamed to DIALECT_COMPONENT_DISABLED.)

@ampli
Member

ampli commented Dec 7, 2019

I have made about 2/3 of the progress in this project, and found that my proposed interface (and your original one) is very cumbersome to use, especially for supporting a reasonable link-parser UI (which may be a problem for any user program that would like to use it).

For now (after I have implemented the cumbersome functions) it seems to me my original proposal (to use a string API) is superior.

My current API implementation is:

typedef struct Dialect_Option_s *Dialect_Option;  /* Opaque handle. */

Dialect_Option parse_options_get_dialect(Parse_Options opts);
void parse_options_set_dialect(Parse_Options opts, Dialect_Option dopt);
bool lg_dialect_set(Dialect_Option dopt, const char *dialect_vector, bool useit);

#define DIALECT_COST_DISABLE    10000.0   /* A high cost setting to disable disjuncts */
#define DIALECT_COST_REMOVE     10001.0   /* Use the preset cost */
bool lg_dialect_cost(Dialect_Option dopt, const char *dialect_component, double cost);

What is missing but needed is API to fetch the current settings, like:

const char *lg_dialect_get(Dialect_Option dopt, int i); /* Get the i'th name (NULL if no more) */
char *name, *cost;
void lg_dialect_cost_get(Dialect_Option dopt, int i, name, cost); /* Get a dialect component (NULL if no more) */

The link-parser dialect implementation (not done yet) is supposed to use these lg_dialect_*_get() API to display the current setting (so it doesn't need to just specially remember the dialect setting). This is very cumbersome but still maybe reasonable.
However, I couldn't find a reasonable link-parser UI that matches this API model - by reasonable I mean simple and easy to use. (A UI that doesn't directly use the underlying API would need a lot of ad-hoc code.)

What we need is UI to:

  • Add and delete a dialect name (vector name). The component costs then are taken from 4.0.dialect.
  • Define a cost for a dialect component (vector component name and cost), including a way to define "infinite cost" to disable it.
  • Delete a cost for a previously so defined dialect component (so its cost will be taken from 4.0.dialect).

Instead of all of that, I propose this simple (both for use and implementation) way:
!dialect=vector1,vector2,component1:0.2,component2:,component3:0.8

This will enable the dialect names vector1 and vector2 (i.e. their component costs will be used as defined in 4.0.dialect) and in addition set (or override) the components component1, component2 and component3, where component2 is set to "infinite cost" in order to disable it (say it is defined in 4.0.dialect as a component of vector1).
(BTW, this example is to illustrate the idea - of course the user is not expected to issue such a complex setting - a complex setting is mainly for debug, development and testing).

Benefits of this proposal:

  • The link-parser dialect API implementation is then absolutely trivial and minimal: The same code that now implements !debug and !test is just used with !dialect and that's all.
  • Trivial and minimal library implementation: The above setup string is just like a [] section in 4.0.dialect (when newlines are replaced by commas) so the same code that parses the file can be used and the resulted data structure is also the one that is needed for actually applying the costs.
    Also, only 2 API functions are needed (parse_options_set_dialect() and parse_options_get_dialect()), and in the library there is no need for the extra data structures and internal API functions that are needed to support the many-user-API-functions approach.
  • No Dialect_Option object is needed.
  • No additional calls to retrieve the current setting (vector/component names) are needed.
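
To make the proposed string format concrete, here is a minimal parsing sketch. It is not the library's parser: Dialect_Item and parse_dialect_string are hypothetical names, the 64-character name limit is arbitrary, and DIALECT_COST_DISABLE simply reuses the value from the earlier proposal to stand for "infinite cost".

```c
#include <stdlib.h>
#include <string.h>

#define DIALECT_COST_DISABLE 10000.0  /* value taken from the earlier proposal */
#define MAX_NAME 64                   /* arbitrary limit for this sketch */

typedef struct {
    char name[MAX_NAME];
    int is_component;   /* 0: dialect (vector) name, 1: component with a cost */
    double cost;        /* meaningful only when is_component != 0 */
} Dialect_Item;

/* Parse "v1,v2,c1:0.2,c2:" into items. Returns the number of items, or -1
 * on a syntax error or if more than `max` items are present. */
int parse_dialect_string(const char *s, Dialect_Item *items, int max)
{
    int n = 0;
    while (*s != '\0') {
        size_t len = strcspn(s, ",");              /* length of this item */
        const char *colon = memchr(s, ':', len);
        size_t name_len = colon ? (size_t)(colon - s) : len;

        if (n >= max || name_len == 0 || name_len >= MAX_NAME) return -1;
        memcpy(items[n].name, s, name_len);
        items[n].name[name_len] = '\0';
        items[n].is_component = (colon != NULL);
        items[n].cost = 0.0;

        if (colon != NULL) {
            if (colon + 1 == s + len) {
                items[n].cost = DIALECT_COST_DISABLE;  /* "name:" disables */
            } else {
                char *end;
                items[n].cost = strtod(colon + 1, &end);
                if (end != s + len) return -1;         /* junk after number */
            }
        }
        n++;
        s += len;
        if (*s == ',') {
            s++;
            if (*s == '\0') return -1;                 /* trailing comma */
        }
    }
    return n;
}
```

Checking the names against the dictionary is deliberately left out, since a function of this shape has no access to the dict.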

@ampli
Member

ampli commented Dec 7, 2019

I forgot to add 2 additional user API calls that are used in the "many-user-API-functions" approach:

/* Create and delete a dialect object. */
Dialect_Option lg_dialect_create(void);
void lg_dialect_delete(Dialect_Option dopt);

I will just implement it in both approaches at once so we can decide which is better.

@ampli
Member

ampli commented Dec 11, 2019

The dialect-supporting version has passed my initial tests.
However:

  1. The API I finished implementing consists only of the parse_options_set/get_dialect() calls and nothing more. The parse_options_set_dialect() function, like all the other parse_options_set_*() functions, gets only a Parse_Options object and one argument (char * in this case).
  2. The UI of link-parser is only !dialect=string_config.

This approach has a slight problem: Errors in setting the dialect user variable (like specifying a nonexistent dialect) are not detected upon issuing the parse_options_set_dialect() call, because this function has no access to the dict.
The dialect setting (cached on success) is done in sentence_split(), and if it fails due to bad setting in the dialect variable, an appropriate error message is issued and sentence_split() fails. If this is a batch run, the number of errors is then meaningless, which may be considered a problem.

This situation can be somewhat improved by making syntax checks (in parse_options_set_dialect()) on the !dialect variable content, but the parse_options_set_*() calls don't return an error indication (BTW, I think they should, and this will be a mostly compatible change). However, in order to make a perfect validation, the dictionary handle should somehow be provided to parse_options_set_dialect() (e.g. by an added argument), or alternatively by an additional API like check_dialect() (I guess trying to split a dummy sentence to that end is not a good option...).

This problem doesn't exist with the "many-user-API-functions" approach. However, it is complex. I already have a partial implementation of that under #ifdef DIALECT_OBJECT (undebugged) but it is extremely cumbersome to use. I'm not sure more efforts are needed in that direction, please advise. If my current implementation is fine with you, I will just remove this alternative implementation.

My test implementation includes only a minimal English headline dialect (dict and 4.0.dialect definitions) and it doesn't include test sentences or a test suite (to be added later).

What I can do is to submit a PR of my current work as a request for review, in order that you will be able to actually test it and tell me about needed changes before it is applied. I would like to first merge with an updated master branch, so I will need PR #1058 to be applied first.

@ampli
Member

ampli commented Dec 11, 2019

Another problem to be solved is !!word. It currently shows the expressions after dialect resolution of the 4.0.dialect definitions, and not as they are in the dict.

Possible solutions:

  1. Add a dialect - to denote "don't apply 4.0.dialect", to be issued as dialect=-. After that, !!word will show the expressions exactly as they are in the dict.
  2. Make !!word always show the exact expressions in the dict.
  3. Use !!word!flags, e.g. !!word!o (o for original), to show the exact expressions in the dict. (BTW other useful flags can be added, like d for listing the disjuncts).

@linas
Member Author

linas commented Dec 12, 2019

I think that having !!word show what is currently in effect makes the most sense. Recall, !!word is a debugging utility, and if it shows something other than what is currently being done, it would be confusing. On the other hand, it is a fairly useless debug tool -- most words have hundreds if not tens of thousands of disjuncts, and picking over these is .. too hard. It's easier to work directly with the dictionary.

@ampli
Member

ampli commented Dec 12, 2019

I also added a bad-spelling dialect, as you suggested in issue #404:

well, instead of sub-dictionaries, the problem would be solved by "dialect support" - #402 : turn off the "bad speling" dialect, and then these rules no longer apply.

I used the component name bad-spelling, and the dialect name no-bad-spelling in order to turn it off (by cost 4). If you would like to use other names I can change that before submitting the PR (which is ready, BTW).

Example from the dict:
then.#than: [[than.e]0.65]bad-spelling;

The current 4.0.dialect definitions:

[default]
no-headline
bad-spelling: 0

[no-headline]
headline: 4

[headline]
headline: 0

[no-bad-spelling]
bad-spelling: 4
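
One way to read these definitions is as named sections that either set a component cost or pull in another section. The sketch below resolves them that way. It is an assumption-laden illustration, not the actual implementation: Entry, Section, Setting, set_cost and apply_section are hypothetical names, unknown section names are silently ignored, there is no cycle check, and the caller must size the output array.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *name;   /* component name, or name of a section to include */
    int is_include;     /* 1: include another section, 0: component cost */
    double cost;
} Entry;

typedef struct { const char *name; const Entry *entries; size_t n; } Section;
typedef struct { const char *component; double cost; } Setting;

/* The four sections quoted above, written out as tables. */
const Entry sec_default[]  = { { "no-headline", 1, 0.0 },
                               { "bad-spelling", 0, 0.0 } };
const Entry sec_no_hl[]    = { { "headline", 0, 4.0 } };
const Entry sec_hl[]       = { { "headline", 0, 0.0 } };
const Entry sec_no_spell[] = { { "bad-spelling", 0, 4.0 } };
const Section sections[] = {
    { "default", sec_default, 2 },
    { "no-headline", sec_no_hl, 1 },
    { "headline", sec_hl, 1 },
    { "no-bad-spelling", sec_no_spell, 1 },
};

/* Set (or override) one component cost; returns the new setting count. */
size_t set_cost(Setting *out, size_t count, const char *component, double cost)
{
    for (size_t i = 0; i < count; i++) {
        if (strcmp(out[i].component, component) == 0) {
            out[i].cost = cost;
            return count;
        }
    }
    out[count].component = component;
    out[count].cost = cost;
    return count + 1;
}

/* Apply a named section on top of the current settings, following includes. */
size_t apply_section(const Section *secs, size_t nsecs, const char *name,
                     Setting *out, size_t count)
{
    for (size_t s = 0; s < nsecs; s++) {
        if (strcmp(secs[s].name, name) != 0) continue;
        for (size_t e = 0; e < secs[s].n; e++) {
            const Entry *en = &secs[s].entries[e];
            count = en->is_include
                ? apply_section(secs, nsecs, en->name, out, count)
                : set_cost(out, count, en->name, en->cost);
        }
    }
    return count;
}
```

Applying "default" first and then "headline" leaves headline at cost 0 and bad-spelling at cost 0, because later settings override earlier ones.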

@ampli
Member

ampli commented Dec 12, 2019

On the other hand, it is a fairly useless debug tool -- most words have hundreds if not tens of thousands of disjuncts, and picking over these is .. too hard. It's easier to work directly with the dictionary.

I found it useful to work with the list of disjuncts. But it may be big, so I filter the needed ones.
So maybe something like this would be useful:
!!word!d/filter_string/

@linas
Member Author

linas commented Dec 13, 2019

PR

sure. It'll take me a few days to stew over it.

!!word!d/filter_string/

Sure, why not. I assume filter_string is a regex, and so maybe that, but without the d. or something. since d normally means 'digits'.

@ampli
Member

ampli commented Dec 13, 2019

On the other hand, it is a fairly useless debug tool -- most words have hundreds if not tens of thousands of disjuncts, and picking over these is .. too hard.

I have a branch in which I applied the classic pruning results back into the expression pruning code.
The original intention was to speed up the SAT parser by supplying it with much smaller expressions, and also to prevent the duplicate linkages it has due to duplicate disjuncts in the raw expressions it currently uses. It works perfectly (fast and providing the exact same linkages).

However, this code may serve to display compact expressions that don't produce disjuncts that got removed by eliminate_duplicate_disjuncts() , power_prune() and pp_prune() (say using !!word!c).

BTW, a similar idea may also serve to display expressions with the connectors that participate in the linkage marked in them.

@ampli ampli mentioned this issue Dec 13, 2019
@linas
Member Author

linas commented Dec 13, 2019

However, this code may serve to display compact expressions that don't produce disjuncts that got removed

Not sure I like that. One of the common dictionary debugging problems is answering the question "why didn't connector X attach to connector Y because it seems they should" and if these are already "silently" pruned away, that would be .. confusing. On the other hand, having much shorter disjunct lists would be nice. Perhaps two displays?

@ampli
Member

ampli commented Dec 13, 2019

I guess it is possible to show the disjuncts before and after pp_and_power_prune(), and even to follow the bad option so pp_prune() is not done if specified. The question is what syntax to use for the link-parser command. You suggested !!word/regex/ without a flag, but then we cannot differentiate between the cases and have to show all possibilities while one of them would suffice.

@ampli
Member

ampli commented Dec 25, 2019

"why didn't connector X attach to connector Y because it seems they should"

If the expression is displayed after pruning you may easily note the case in which the answer is "because it (X or Y) got pruned away".

!!word!d/filter_string/

Sure, why not. I assume filter_string is a regex, and so maybe that, but without the d. or something. since d normally means 'digits'.

There is a need for some way in the !!word command to differentiate between:

  1. Original dict expression/disjuncts.
  2. After expression_prune().
  3. After pp_and_power_prune() (applying pp_prune() can be controlled by !bad, and/or another flag).

The place for the flags may be before a leading / or after a trailing one.
Maybe instead of letters use numbers?
!!word/re/1
!!word/re/2
etc.?

@linas
Member Author

linas commented Dec 25, 2019

After a trailing slash seems better, from a usability standpoint: one tries !!word, and then consults the docs, then up-arrow and add the slash -- less arrow-key usage to look at the multiple versions. Or even print all versions at once...

@ampli
Member

ampli commented Dec 25, 2019

Or even print all versions at once...

The disjunct list is typically tens of thousands of lines... It is not such a friendly thing if you only want the expression.

@ampli
Member

ampli commented Jan 6, 2020

I finished implementing !!word/regex. An empty regex means all disjuncts. Flags are supported too, as !!word/regex/flags, but none of them are useful for now.
Since there may be overlapping changes with the current pending PRs, I will send a rebased PR after they are applied.

@ampli
Member

ampli commented Jan 5, 2021

This discussion includes:

  1. Adding dialect support - done.
  2. Adding !!/word/ debug options. All done but one thing: Show disjunct after power pruning.
    Unless this seems to be useless, I will move that to another issue.

@linas
Member Author

linas commented Jan 5, 2021

Show disjunct after power pruning.

I do not anticipate needing that.

@linas
Member Author

linas commented Jan 5, 2021

So, I guess this issue can be closed!?

@linas
Member Author

linas commented Jan 5, 2021

Oh, wait: it needs to be documented on the website ...

@ampli
Member

ampli commented Jan 8, 2021

Eventually, I need to add it to the man page (on the next man page overhaul).

@ampli
Member

ampli commented Mar 23, 2021

Eventually, I need to add it to the man page (on the next man page overhaul).

Added in 9a62ddc.

@ampli
Member

ampli commented Mar 23, 2021

Issue #1172 got opened for completing the dialect API.
Closing this issue as "implemented".

@ampli ampli closed this as completed Mar 23, 2021