Connector enumeration + length-limit setup #673

Merged: 42 commits into opencog:master, Feb 15, 2018

Conversation

ampli (Member) commented Feb 14, 2018

I could not find the issue in which I originally proposed the connector enumeration idea (you named the idea "enumeration" there), so I cannot link to it here.

For length-limit setup, see issue #632.

Main changes in this PR:

  • Connector enumeration (significant speedup).
    While reading the dictionary, put the fixed properties of each connector in a central table, and
    precompute the properties whose values were previously computed for each parsed sentence.
    In the connector structure, use a pointer into this table (called the connector descriptor table
    in the code). This also enables using a connector's slot number in this table as a "perfect hash"
    (see the sketch after this list).
  • Compute an encoded value for the lower-case part of connectors, for faster matching.
  • Add a facility to define a specific length limit for connectors.
  • More efficient connector length-limit setup (also for UNLIMITED-CONNECTORS, and the
    short_length/all_short_connectors parse options).
  • Define a length limit for LL* in the Russian dict (a big speedup, especially for long sentences)
    and for some connectors in the English dict.
  • Add a length-limit check to the expression pruning.
    I didn't check whether it improves the result of the power pruning, but it reduces its CPU time.
  • Pack the connectors and disjuncts before parsing (a notable speedup).
  • Code infrastructure: add connector accessors.
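To illustrate the descriptor-table idea, here is a minimal sketch. The names cond_desc, connector_sk, and uc_match are illustrative only; the PR's actual condesc_t and Connector structs carry more fields and a different layout.

```c
/* Illustrative sketch only; the PR's condesc_t and Connector differ. */
typedef struct
{
    const char *string;    /* full connector name, e.g. "LLABC" */
    unsigned int uc_num;   /* ordinal of the UC part - the "perfect hash" */
    short length_limit;    /* 0 means use the short_length parse option */
} cond_desc;

typedef struct
{
    const cond_desc *desc; /* shared entry, filled once at dictionary-read time */
    /* ... per-instance fields ... */
} connector_sk;

/* Matching the upper-case parts becomes an integer compare instead of
 * a string compare: */
static int uc_match(const connector_sk *a, const connector_sk *b)
{
    return a->desc->uc_num == b->desc->uc_num;
}
```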

Some additional changes:

  • Improve the expression-pruning debug output.
  • Add a debug print of the parsing connector table at verbosity=102 (in preparation for
    reimplementing it - a WIP).
  • List all the connectors, with their enumeration and length limits, at verbosity=101.

Additional possible improvements using the new infrastructure:

  • Improve post-processing speed. Deep changes are needed to add an argument to many functions,
    unless a thread_local pointer to the connector enumeration table is added instead.
  • Improve the speed of the fast-matcher by sorting the connector lists according to their
    upper-case strings (to improve the match-cache hit ratio).
  • Reduce the size of the Connector struct (I intend to try even 4 bytes).
  • Reduce the size of a Table_connector element in count.c (made possible by packing the
    connectors - the connector index can be used instead of its address).

ampli added 30 commits February 14, 2018 18:06
Add printing pass and word numbers.
- The enumeration itself is not used yet.
- More shortcuts per connector descriptor are to be added.
- The SAT parser code doesn't compile with these changes - to be fixed.
It is zeroed by zero_connector_table().
New connectors are inserted in front of the list.
Move tableNext into it from the connector structure.
Also tune it to be 32 bytes.

uc_hash/uc_num and str_hash can be defined later with typedef,
to enable creating a version which supports more connector types.
Also, use actual connector descriptor table entries (less code, and
more efficient).
Define 3 new ones:
connector_get_lc_start()
connector_get_uc_start()
connector_uc_hash()
uc_num is the ordinal number of a UC connector.
Use this accessor when the code depends on the ordinal number and not a
cached hash value.
On the test sentence:
The problem is, or rather one of the problems, for there are many, a
sizeable proportion of which are continually clogging up the civil,
commercial, and criminal courts in all areas of the Galaxy, and
especially, where possible, the more corrupt ones, this.

it runs 2.5% faster when doing this packing.
... instead of for every connector.
This prevents too long lines.
No need for _get, as we don't have or need connector set functions.
This gains a few percent of performance on long sentences.
... at verbosity=102.

It may provide insights into how effective the implementation is and
how to improve it.
The string-hash uses the same definition just because the alignment
allows for that.
ampli added 12 commits February 14, 2018 19:18
Apply them in the order they appear in the dict, to allow more
specific definitions to override previous ones.
Also refactor the connector descriptor table code.
- Change a local link-parser alias "lg" to link-parser.
- Break some long commands with backslash/newline.
- Add info on -v=102 (print connectors).
- Update info on debug configuration.
- Fix markdown of intra-word underscores (with and without them there
  is a problem in some markdown readers...).
- Fix info rot in the example of sane_linkage_morphism().
The first linkage contains special.n instead of special.a.
The reason is that the expected linkage no longer appears in the first
100 linkages, which is the default in the LG Python binding (it
appears as number 108...).

Fix it by setting the limit to 300, in the hope that this is
future-proof enough.
```diff
 	} else {
 		for (l=e->u.l; l!=NULL; l=l->next) {
-			build_connector_set_from_expression(conset, l->e);
+			get_connectors_from_expression(conlist, cl_size, l->e);
 		}
```
linas (Member):

I think it would be slightly faster to return cl_size as a return value, rather than passing a pointer. When you increment a value at a location, the compiler has to assume a worst case of possible aliasing, and constantly write it out to memory (cache). By contrast, a return value will usually stay in registers, and maybe be written to the stack occasionally.
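For illustration only (hypothetical helpers, not the functions in this PR), the two styles compare like this:

```c
#include <stddef.h>

/* Out-parameter form: each "(*cl_size)++" may force a store, because
 * the compiler must assume *cl_size can alias other memory. */
static void count_v1(const int *items, size_t n, size_t *cl_size)
{
    for (size_t i = 0; i < n; i++)
        if (items[i] != 0) (*cl_size)++;
}

/* Return-value form: the running count stays in a register and is
 * written back only when the caller stores the result. */
static size_t count_v2(const int *items, size_t n, size_t cl_size)
{
    for (size_t i = 0; i < n; i++)
        if (items[i] != 0) cl_size++;
    return cl_size;
}
```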

ampli (Member, Author):

You are right. I will fix that.

```c
}

/* ======================================================== */

static bool connector_encode_lc(const char *lc_string, condesc_t *desc)
```
linas (Member):

Can you provide a description of what this does? It seems to be taking the lower-case part of a connector, normally an 8-bit quantity, and packing it into a 7-bit thing... that seems like a lot of work for not much savings. Since it is lower-case, you could, in principle, pack it down to 5 bits, for more savings... but is it really worth the effort? Or am I missing something?

linas (Member) added:

OK, never mind. I see what you are doing... you have to copy the LC chars into the u64 anyway, and you have to compute the mask anyway, so at almost no extra cost you can shrink to 7 bits. But then, at almost no extra cost, you can subtract 0x60 and pack in 5 bits. On the other hand, I do not believe we have more than maybe 5 or 6 lower-case letters... so this is not all that necessary...

ampli (Member, Author) commented Feb 15, 2018:

> But then, at almost no extra cost, you can subtract 0x60 and pack in 5 bits. On the other hand, I do not believe we have more than maybe 5 or 6 lower-case letters... so this is not all that necessary...

The LC part does not really contain only lower-case letters; it may also contain digits. So we have 36 possible characters, which can be packed into 6 bits. However, this is more complex than just removing the most significant bit, since there is a gap between the values of the lower-case letters and the digits.

Not doing this packing means a maximum of 8 characters in the LC part, and I agree that 9 is not a big improvement and may not even be worth the slight overhead of doing it once. The reason I included it is that I copied it from old code I wrote, which packed the UC part into 6 bits (the same packing code can be used). Packing the UC part is not needed now because the connector enumeration idea is better.

So I will remove this packing (or ifdef it out).

Also note that all[*] the connector-related computations, including LC packing, are now done only once, at dictionary-read time, instead of per sentence (or even per connector) as before.
[*] Except short_length/all_short_connectors, which could also be computed and cached only when changed, but I didn't implement that.
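As a rough sketch of the kind of encoding being discussed (assumed details; the PR's connector_encode_lc() may differ), each LC character can be copied into 7 bits of a 64-bit word, together with a mask that is zero at '*' wildcard positions, so 9 characters fit where unpacked bytes allow only 8:

```c
#include <stdint.h>
#include <stdbool.h>

#define LC_BITS 7  /* bits per lower-case character */

/* Pack the LC part into 'letters', with 'mask' zeroed at wildcards. */
static bool encode_lc_sketch(const char *lc, uint64_t *letters, uint64_t *mask)
{
    *letters = 0;
    *mask = 0;
    int shift = 0;
    for (const char *s = lc; *s != '\0'; s++, shift += LC_BITS)
    {
        if (shift + LC_BITS > 64) return false;  /* LC part too long */
        if (*s != '*')
        {
            *letters |= (uint64_t)(*s & 0x7F) << shift;
            *mask    |= (uint64_t)0x7F << shift;
        }
    }
    return true;
}

/* Two LC parts match when they agree wherever both masks are set
 * (a '*' or a missing tail has a zero mask, so it matches anything): */
static bool lc_match_sketch(uint64_t la, uint64_t ma, uint64_t lb, uint64_t mb)
{
    return ((la ^ lb) & ma & mb) == 0;
}
```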

linas (Member):

OK. Maybe it's not that important.

ampli (Member, Author):

> But then, at almost no extra cost, you can subtract 0x60 and pack in 5 bits. On the other hand, I do not believe we have more than maybe 5 or 6 lower-case letters... so this is not all that necessary...

With such a subtraction method (e.g. subtracting 64 and taking the absolute value) it is possible to pack into 6 bits instead of 7 (I was wrong in saying this is more complex due to the gap), and the longest LC part is also only 5 characters: the version (and domain) encoding. So with this packing, 32 bits can hold 5 characters. However, this would prevent versions like V5v4v10...

BTW, I once hacked the dict reading and added #define for dict definitions (a very small change; # is already a special character for the dict).

Another BTW, so as not to open an issue for it: I noticed that reading the id dict fails
with: Connectors beginning with "ID" are forbidden. The easiest fix is to forbid only connectors of the form IDX where X is a capital-letter string (as only those may clash with idiom connectors).

Is such a fix fine?

linas (Member):

ID: apparently it's needed, so that we can set the Indonesian locale. So yes.

We do need version strings like V5v4v10 ...

Note that the statistically-generated dictionaries currently use only upper-case letters, and have zero lower-case letters in them. And there are a lot of them: I haven't counted recently, but maybe hundreds of thousands (or more). The goal of learning is to shrink this number, but ...

```c
	 * If 0, short_length (a Parse_Option) is used. If
	 * all_short==true (a Parse_Option), length_limit
	 * is clipped to short_length. */
	char head_depended; /* 'h' for head, 'd' for depended, or '\0' if none */
```
linas (Member):

typo: "dependent", not "depended"

ampli (Member, Author):

I will fix it.

```c
                           const char *constring, int hash)
{
	*h = (condesc_t *)malloc(sizeof(condesc_t));
	memset(*h, 0, sizeof(condesc_t));
```
linas (Member):

I'm sort-of surprised by this malloc. Why not just change condesc_table_alloc so that it mallocs an array of condesc_t instead of an array of pointers to condesc_t ? That way, you can skip this smaller alloc entirely.

ampli (Member, Author):

I don't know in advance how many different connectors are in the dict, so this hash table can grow. I didn't want to pre-allocate an array that cannot grow.

However, I now see it can be done in another, more efficient way:
Make one big malloc for the initial guessed number of connectors, as set in dictionary_six_str().
If that is exceeded, realloc to double the space. Use it as a condesc_t pool. Optionally, at the end, realloc to the exact used amount, in order to free the extra space. Another possibility is to allocate in groups of condesc_t elements, say 256, and in that way almost all of the malloc overhead is saved.

The above realloc way is simpler. I can implement it.
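A sketch of that realloc-pool variant (illustrative names; the stand-in condesc_t size follows the 32 bytes mentioned later in this thread, and the initial capacity would come from the guess in dictionary_six_str()):

```c
#include <stdlib.h>

typedef struct { char data[32]; } condesc_t;  /* stand-in; ~32 bytes per the discussion */

typedef struct
{
    condesc_t *block;  /* one growing allocation, used as a pool */
    size_t used;
    size_t size;       /* current capacity, doubled when exceeded */
} condesc_pool;

/* Hand out the next pool element, doubling the block when it fills up. */
static condesc_t *pool_next(condesc_pool *p)
{
    if (p->used == p->size)
    {
        condesc_t *nb = realloc(p->block, 2 * p->size * sizeof(condesc_t));
        if (nb == NULL) return NULL;
        p->block = nb;
        p->size *= 2;
    }
    return &p->block[p->used++];
}
```

One caveat: realloc may move the block, so anything holding pointers into the pool (e.g. hash-table buckets) would need to store indices instead, or be re-pointed after growth.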

linas (Member):

Sorry, I did not realize it was a hash table. It seemed to be getting used as a sequential list. I suppose I misread the code.

linas (Member):

Um, hang on. condesc_t doesn't have a next pointer in it, so it cannot be chained, so it cannot be used in a hash table, unless you can guarantee zero collisions..... A hash table with zero-or-one items in each bucket ... Hmmm. Are you using some trick to spill to the next empty bucket? There was some old code that did that.

linas (Member):

OK, so you are saying that condesc_t is large enough that it is better to have an array of pointers to them, rather than an array of them?

ampli (Member, Author):

Drawbacks of not using an array of pointers:

  1. The size of condesc_t is 32 bytes, so a table of, say, 20K items (I think this is what it may
     end up being for "ru") would have a significant memory overhead. Since it is too large for the
     L2 CPU cache, searching in it may cause several times more CPU cache misses (with 8 pointers
     per cache line you normally have no more than 3 misses).
  2. The sorting, which is essential here, is more costly (the element contents need to be swapped
     during the sort).

However, I haven't made a benchmark that compares the different ways of doing it.

There is an additional way of doing it that may be worth exploring: use the dict-creation code.
The idea is just to put the connector strings into their own dict, and enumerate them by listing
that dict's words (similarly to what !!*command does).

linas (Member):

OK, that's OK. You can leave this alone; I was just reading through the code, got a little confused, and needed to ask a few questions. But this seems OK as is; no change needed.

On a different note, the tests/memleak unit test crashes with a memory double-free. That's important.


```diff
 	n = make_or_node(&dict->exp_list, plu, min);
 }
 else
 {
-	dict_error(dict, "Unknown connector direction type.");
+	dict_error(dict, "Unknown connector direction type '%c'.");
 	return NULL;
```
linas (Member) commented Feb 15, 2018:

The connector direction is never given.

ampli (Member, Author):

Indeed.
Since the second argument of dict_error2() is never used, I can change this call to dict_error2(), and change dict_error2() to print an additional value nicely.

A better alternative would be to change dict_error() into a printf-like function.

ampli (Member, Author) commented Feb 15, 2018:

> Um, hang on. condesc_t doesn't have a next pointer in it, so it cannot be chained, so it cannot be used in a hash table, unless you can guarantee zero collisions..... A hash table with zero-or-one items in each bucket ... Hmmm. Are you using some trick to spill to the next empty bucket? There was some old code that did that.

It is similar to the string_set code, in that it spills to the next empty bucket (but in my code I use a constant stride of 1, since it is more cache-friendly; see below). It is also how it is done in the fast-matcher's per-word connector lists.

This can be improved by using Robin Hood placement (which I use in a WIP to replace the memoizing table of do_count()), and I intend to try it here too.

The problem with using a next pointer for chaining is that it is very cache-unfriendly, due to random memory access (the chained element has a very low chance of being in a nearby memory location). On the other hand, searching in an "in place" hash table sequentially accesses a relatively small memory region (on average very few elements need to be inspected, and the worst case is small if the table is kept, e.g., under 50% full), and due to the sequential access the CPU knows to prefetch the next memory to be accessed.
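A minimal sketch of such stride-1 probing (not the actual string_set or condesc code):

```c
#include <string.h>

/* Open addressing with a constant stride of 1: on a collision, inspect
 * the next slot. The probe walks a small contiguous region, so the
 * hardware prefetcher helps; keeping the table under ~50% full keeps
 * the probe chains short. */
static unsigned int probe_slot(const char **table, unsigned int size, /* power of 2 */
                               const char *str, unsigned int hash)
{
    unsigned int i = hash & (size - 1);
    while (table[i] != NULL && strcmp(table[i], str) != 0)
        i = (i + 1) & (size - 1);  /* stride 1, with wraparound */
    return i;  /* the slot holding str, or the first empty slot */
}
```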


```c
#define CACHELINE 64
	size_t dsize = dcnt * sizeof(Disjunct);
	dsize = (dsize+CACHELINE)&~(CACHELINE-1); /* Align connector block. */
```
linas (Member):

Here, you align, but then below, you add csize of unknown alignment. I'm sort-of confused, because you then use what malloc provided, without actually aligning -- malloc might provide an address that is not 64-byte aligned.

ampli (Member, Author):

The memory arrangement in this allocation is:
DISJUNCT_ARRAY | ALIGNMENT_PAD | CONNECTOR_ARRAY
The Connector struct is 32 bytes (and can be shrunk to 16, 8, and even 4 bytes; I would like to test all of them).
In order not to split connectors over two cache lines, I want CONNECTOR_ARRAY to start at a 64-byte boundary. I get this here by adding an alignment pad at the end of the disjunct allocation.
(As far as I remember, the Disjunct struct size is something like 44 bytes, so there is no point in aligning its start, and it doesn't usually end at the desired boundary.)

So it seems fine to me.
If desired, I can add a description like the above to the function comment.

ampli (Member, Author) added:

> As far as I remember, the Disjunct struct size is something like 44 bytes

(It is actually 56.)
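For what it's worth, a sketch of that layout which also guarantees the base alignment questioned above, assuming POSIX posix_memalign is acceptable (the PR itself may resolve this differently):

```c
#include <stdlib.h>

#define CACHELINE 64

/* Layout: DISJUNCT_ARRAY | ALIGNMENT_PAD | CONNECTOR_ARRAY.
 * Rounding dsize up to a cache-line multiple puts the connector array
 * on a 64-byte boundary relative to the base; posix_memalign makes the
 * base itself 64-byte aligned, so connectors never straddle two lines. */
static void *pack_alloc_sketch(size_t dcnt, size_t dsz,  /* disjunct count/size */
                               size_t ccnt, size_t csz,  /* connector count/size */
                               void **connectors)
{
    size_t dsize = dcnt * dsz;
    dsize = (dsize + CACHELINE - 1) & ~((size_t)CACHELINE - 1);

    void *base;
    if (posix_memalign(&base, CACHELINE, dsize + ccnt * csz) != 0)
        return NULL;

    *connectors = (char *)base + dsize;
    return base;  /* the disjunct array starts at the aligned base */
}
```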

linas (Member):

OK. If the code looks right to you, then OK; it looked strange to me, but I did not try very hard to figure it out.

ampli (Member, Author):

I hope it will be clearer after I add comments to it.

linas merged commit 3e116eb into opencog:master on Feb 15, 2018
ampli added a commit to ampli/link-grammar that referenced this pull request Feb 17, 2018
linas added a commit that referenced this pull request Feb 18, 2018
Address problems discussed in #673
@ampli ampli mentioned this pull request Aug 8, 2019