Connector enumeration + length-limit setup #673

Merged: 42 commits into opencog:master, Feb 15, 2018

Conversation

ampli (Member) commented Feb 14, 2018

I could not find the issue in which I originally proposed the connector enumeration idea (you named the idea "enumeration" there), so I cannot link to it here.

For length-limit setup, see issue #632.

Main changes in this PR:

  • Connector enumeration (significant speedup).
    While reading the dictionary, put the fixed properties of each connector in a central table, and
    precompute the properties whose values were previously computed for each parsed sentence.
    In the connector structure, use a pointer into this table (called the connector descriptor table
    in the code). This also enables using a connector's slot number in this table as a "perfect hash"
    (see the sketch after this list).
  • Compute an encoded value for the lower-case part of connectors, for faster matching.
  • Add a facility to define a specific length limit for connectors.
  • More efficient connector length-limit setup (also for UNLIMITED-CONNECTORS, and the
    short_length/all_short_connectors parse options).
  • Define a length limit for LL* in the Russian dict (a big speedup, especially for long sentences)
    and for some connectors in the English dict.
  • Add a length-limit check to the expression pruning.
    I didn't check whether it improves the result of the power pruning, but it reduces its CPU time.
  • Pack the connectors and disjuncts before parsing (a notable speedup).
  • Code infrastructure: add connector accessors.
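To illustrate the descriptor-table idea, here is a minimal sketch. The names cond_desc, connector_sk, and uc_match are illustrative only; the PR's actual condesc_t and Connector structs carry more fields and a different layout.

```c
/* Illustrative sketch only; the PR's condesc_t and Connector differ. */
typedef struct
{
    const char *string;    /* full connector name, e.g. "LLABC" */
    unsigned int uc_num;   /* ordinal of the UC part - the "perfect hash" */
    short length_limit;    /* 0 means use the short_length parse option */
} cond_desc;

typedef struct
{
    const cond_desc *desc; /* shared entry, filled once at dictionary-read time */
    /* ... per-instance fields ... */
} connector_sk;

/* Matching the upper-case parts becomes an integer compare instead of
 * a string compare: */
static int uc_match(const connector_sk *a, const connector_sk *b)
{
    return a->desc->uc_num == b->desc->uc_num;
}
```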

Some additional changes:

  • Improve the expression-pruning debug output.
  • Add a debug print of the parsing connector table at verbosity=102 (in preparation for
    reimplementing it - a WIP).
  • List all the connectors, with their enumeration and length limits, at verbosity=101.

Additional possible improvements using the new infrastructure:

  • Improve post-processing speed. Deep changes are needed to add an argument to many functions,
    unless a thread_local pointer to the connector enumeration table is added instead.
  • Improve the speed of the fast-matcher by sorting the connector lists according to their
    upper-case strings (to improve the match-cache hit ratio).
  • Reduce the size of the Connector struct (I intend to try even 4 bytes).
  • Reduce the size of a Table_connector element in count.c (made possible by packing the
    connectors - the connector index can be used instead of its address).

ampli added 30 commits February 14, 2018 18:06
Add printing pass and word numbers.
- The enumeration itself is not used yet.
- More shortcuts per connector descriptor are to be added.
- The SAT parser code doesn't compile with these changes - to be fixed.
It is zeroed by zero_connector_table().
New connectors are inserted in front of the list.
Move tableNext into it from the connector structure.
Also tune it to be 32 bytes.

uc_hash/uc_num and str_hash can be defined later with typedef,
to enable creating a version which supports more connector types.
Also, use actual connector descriptor table entries (less code, and
more efficient).
Define 3 new ones:
connector_get_lc_start()
connector_get_uc_start()
connector_uc_hash()
uc_num is the ordinal number of a UC connector.
Use this accessor when the code depends on the ordinal number and not a
cached hash value.
On the test sentence:
The problem is, or rather one of the problems, for there are many, a
sizeable proportion of which are continually clogging up the civil,
commercial, and criminal courts in all areas of the Galaxy, and
especially, where possible, the more corrupt ones, this.

it runs 2.5% faster when doing this packing.
... instead of for every connector.
This prevents too long lines.
No need for _get, as we don't have or need connector set functions.
This gains a few percent of performance on long sentences.
... at verbosity=102.

It may provide insights into how effective the implementation is and
how to improve it.
The string-hash uses the same definition just because the alignment
allows for that.
ampli added 12 commits February 14, 2018 19:18
Apply them in the order they appear in the dict, to allow more
specific definitions to override previous ones.
Also refactor the connector descriptor table code.
- Change a local link-parser alias "lg" to link-parser.
- Break some long commands with backslash/newline.
- Add info on -v=102 (print connectors).
- Update info on debug configuration.
- Fix markdown of intra-word underscores (with and without them there
  is a problem in some markdown readers...).
- Fix info rot in the example of sane_linkage_morphism().
The first linkage contains special.n instead of special.a.
The reason is that the expected linkage no longer appears in the first
100 linkages, which is the default in the LG Python binding (it
appears as number 108...).

Fix it by setting the limit to 300, in the hope that this is
future-proof enough.
```diff
 	} else {
 		for (l=e->u.l; l!=NULL; l=l->next) {
-			build_connector_set_from_expression(conset, l->e);
+			get_connectors_from_expression(conlist, cl_size, l->e);
 		}
```
linas (Member):

I think it would be slightly faster to return cl_size as a return value, rather than passing a pointer. When you increment a value at a location, the compiler has to assume a worst case of possible aliasing, and constantly write it out to memory (cache). By contrast, a return value will usually stay in registers, and maybe be written to the stack occasionally.
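For illustration only (hypothetical helpers, not the functions in this PR), the two styles compare like this:

```c
#include <stddef.h>

/* Out-parameter form: each "(*cl_size)++" may force a store, because
 * the compiler must assume *cl_size can alias other memory. */
static void count_v1(const int *items, size_t n, size_t *cl_size)
{
    for (size_t i = 0; i < n; i++)
        if (items[i] != 0) (*cl_size)++;
}

/* Return-value form: the running count stays in a register and is
 * written back only when the caller stores the result. */
static size_t count_v2(const int *items, size_t n, size_t cl_size)
{
    for (size_t i = 0; i < n; i++)
        if (items[i] != 0) cl_size++;
    return cl_size;
}
```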

ampli (Member, Author):

You are right. I will fix that.

```c
}

/* ======================================================== */

static bool connector_encode_lc(const char *lc_string, condesc_t *desc)
```
linas (Member):

Can you provide a description of what this does? It seems to be taking the lower-case part of a connector, normally an 8-bit quantity, and packing it into a 7-bit thing... that seems like a lot of work for not much savings. Since it is lower-case, you could, in principle, pack it down to 5 bits, for more savings... but is it really worth the effort? Or am I missing something?

linas (Member) added:

OK, never mind. I see what you are doing... you have to copy the LC chars into the u64 anyway, and you have to compute the mask anyway, so at almost no extra cost you can shrink to 7 bits. But then, at almost no extra cost, you can subtract 0x60 and pack in 5 bits. On the other hand, I do not believe we have more than maybe 5 or 6 lower-case letters... so this is not all that necessary...

ampli (Member, Author) commented Feb 15, 2018:

> But then, at almost no extra cost, you can subtract 0x60 and pack in 5 bits. On the other hand, I do not believe we have more than maybe 5 or 6 lower-case letters... so this is not all that necessary...

The LC part does not really contain only lower-case letters; it may also contain digits. So we have 36 possible characters, which can be packed into 6 bits. However, this is more complex than just removing the most significant bit, since there is a gap between the values of the lower-case letters and the digits.

Not doing this packing means a maximum of 8 characters in the LC part, and I agree that 9 is not a big improvement and may not even be worth the slight overhead of doing it once. The reason I included it is that I copied it from old code I wrote, which packed the UC part into 6 bits (the same packing code can be used). Packing the UC part is not needed now because the connector enumeration idea is better.

So I will remove this packing (or ifdef it out).

Also note that all[*] the connector-related computations, including LC packing, are now done only once, at dictionary-read time, instead of per sentence (or even per connector) as before.
[*] Except short_length/all_short_connectors, which could also be computed and cached only when changed, but I didn't implement that.
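As a rough sketch of the kind of encoding being discussed (assumed details; the PR's connector_encode_lc() may differ), each LC character can be copied into 7 bits of a 64-bit word, together with a mask that is zero at '*' wildcard positions, so 9 characters fit where unpacked bytes allow only 8:

```c
#include <stdint.h>
#include <stdbool.h>

#define LC_BITS 7  /* bits per lower-case character */

/* Pack the LC part into 'letters', with 'mask' zeroed at wildcards. */
static bool encode_lc_sketch(const char *lc, uint64_t *letters, uint64_t *mask)
{
    *letters = 0;
    *mask = 0;
    int shift = 0;
    for (const char *s = lc; *s != '\0'; s++, shift += LC_BITS)
    {
        if (shift + LC_BITS > 64) return false;  /* LC part too long */
        if (*s != '*')
        {
            *letters |= (uint64_t)(*s & 0x7F) << shift;
            *mask    |= (uint64_t)0x7F << shift;
        }
    }
    return true;
}

/* Two LC parts match when they agree wherever both masks are set
 * (a '*' or a missing tail has a zero mask, so it matches anything): */
static bool lc_match_sketch(uint64_t la, uint64_t ma, uint64_t lb, uint64_t mb)
{
    return ((la ^ lb) & ma & mb) == 0;
}
```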

linas (Member):

OK. Maybe it's not that important.

ampli (Member, Author):

> But then, at almost no extra cost, you can subtract 0x60 and pack in 5 bits. On the other hand, I do not believe we have more than maybe 5 or 6 lower-case letters... so this is not all that necessary...

With such a subtraction method (e.g. subtracting 64 and taking the absolute value) it is possible to pack into 6 bits instead of 7 (I was wrong in saying this is more complex due to the gap), and the longest LC part is also only 5 characters: the version (and domain) encoding. So with this packing, 32 bits can hold 5 characters. However, this would prevent versions like V5v4v10...

BTW, I once hacked the dict reading and added #define for dict definitions (a very small change; # is already a special character for the dict).

Another BTW, so as not to open an issue for it: I noticed that reading the id dict fails
with: Connectors beginning with "ID" are forbidden. The easiest fix is to forbid only connectors of the form IDX where X is a capital-letter string (as only those may clash with idiom connectors).

Is such a fix fine?

linas (Member):

ID: apparently it's needed, so that we can set the Indonesian locale. So yes.

We do need version strings like V5v4v10 ...

Note that the statistically-generated dictionaries currently use only upper-case letters, and have zero lower-case letters in them. And there are a lot of them: I haven't counted recently, but maybe hundreds of thousands (or more). The goal of learning is to shrink this number, but ...

```c
	 * If 0, short_length (a Parse_Option) is used. If
	 * all_short==true (a Parse_Option), length_limit
	 * is clipped to short_length. */
	char head_depended; /* 'h' for head, 'd' for depended, or '\0' if none */
```
linas (Member):

typo: "dependent", not "depended"

ampli (Member, Author):

I will fix it.

```c
                           const char *constring, int hash)
{
	*h = (condesc_t *)malloc(sizeof(condesc_t));
	memset(*h, 0, sizeof(condesc_t));
```
linas (Member):

I'm sort-of surprised by this malloc. Why not just change condesc_table_alloc so that it mallocs an array of condesc_t instead of an array of pointers to condesc_t ? That way, you can skip this smaller alloc entirely.

ampli (Member, Author):

I don't know in advance how many different connectors are in the dict, so this hash table can grow. I didn't want to pre-allocate an array that cannot grow.

However, I now see it can be done in another, more efficient way:
Make one big malloc for the initial guessed number of connectors, as set in dictionary_six_str().
If that is exceeded, realloc to double the space. Use it as a condesc_t pool. Optionally, at the end, realloc to the exact used amount, in order to free the extra space. Another possibility is to allocate in groups of condesc_t elements, say 256, and in that way almost all of the malloc overhead is saved.

The above realloc way is simpler. I can implement it.
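A sketch of that realloc-pool variant (illustrative names; the stand-in condesc_t size follows the 32 bytes mentioned later in this thread, and the initial capacity would come from the guess in dictionary_six_str()):

```c
#include <stdlib.h>

typedef struct { char data[32]; } condesc_t;  /* stand-in; ~32 bytes per the discussion */

typedef struct
{
    condesc_t *block;  /* one growing allocation, used as a pool */
    size_t used;
    size_t size;       /* current capacity, doubled when exceeded */
} condesc_pool;

/* Hand out the next pool element, doubling the block when it fills up. */
static condesc_t *pool_next(condesc_pool *p)
{
    if (p->used == p->size)
    {
        condesc_t *nb = realloc(p->block, 2 * p->size * sizeof(condesc_t));
        if (nb == NULL) return NULL;
        p->block = nb;
        p->size *= 2;
    }
    return &p->block[p->used++];
}
```

One caveat: realloc may move the block, so anything holding pointers into the pool (e.g. hash-table buckets) would need to store indices instead, or be re-pointed after growth.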

linas (Member):

Sorry, I did not realize it was a hash table. It seemed to be getting used as a sequential list. I suppose I misread the code.

linas (Member):

Um, hang on. condesc_t doesn't have a next pointer in it, so it cannot be chained, so it cannot be used in a hash table, unless you can guarantee zero collisions..... A hash table with zero-or-one items in each bucket ... Hmmm. Are you using some trick to spill to the next empty bucket? There was some old code that did that.

linas (Member):

OK, so you are saying that condesc_t is large enough that it is better to have an array of pointers to them, rather than an array of them?

ampli (Member, Author):

Drawbacks of not using an array of pointers:

  1. The size of condesc_t is 32 bytes, so a table of, say, 20K items (I think this is what it may
     end up being for "ru") would have a significant memory overhead. Since it is too large for the
     L2 CPU cache, searching in it may cause several times more CPU cache misses (with 8 pointers
     per cache line you normally have no more than 3 misses).
  2. The sorting, which is essential here, is more costly (the element contents need to be swapped
     during the sort).

However, I haven't made a benchmark that compares the different ways of doing it.

There is an additional way of doing it that may be worth exploring: use the dict-creation code.
The idea is just to put the connector strings into their own dict, and enumerate them by listing
that dict's words (similarly to what !!*command does).

linas (Member):

OK, that's OK. You can leave this alone; I was just reading through the code, got a little confused, and needed to ask a few questions. But this seems OK as is; no change needed.

On a different note, the tests/memleak unit test crashes with a memory double-free. That's important.


```diff
 	n = make_or_node(&dict->exp_list, plu, min);
 }
 else
 {
-	dict_error(dict, "Unknown connector direction type.");
+	dict_error(dict, "Unknown connector direction type '%c'.");
 	return NULL;
```
linas (Member) commented Feb 15, 2018:

The connector direction is never given.

ampli (Member, Author):

Indeed.
Since the second argument of dict_error2() is never used, I can change this call to dict_error2(), and change dict_error2() to print an additional value nicely.

A better alternative would be to change dict_error() into a printf-like function.

ampli (Member, Author) commented Feb 15, 2018:

> Um, hang on. condesc_t doesn't have a next pointer in it, so it cannot be chained, so it cannot be used in a hash table, unless you can guarantee zero collisions..... A hash table with zero-or-one items in each bucket ... Hmmm. Are you using some trick to spill to the next empty bucket? There was some old code that did that.

It is similar to the string_set code, in that it spills to the next empty bucket (but in my code I use a constant stride of 1, since it is more cache-friendly; see below). It is also how it is done in the fast-matcher's per-word connector lists.

This can be improved by using Robin Hood placement (which I use in a WIP to replace the memoizing table of do_count()), and I intend to try it here too.

The problem with using a next pointer for chaining is that it is very cache-unfriendly, due to random memory access (the chained element has a very low chance of being in a nearby memory location). On the other hand, searching in an "in place" hash table sequentially accesses a relatively small memory region (on average very few elements need to be inspected, and the worst case is small if the table is kept, e.g., under 50% full), and due to the sequential access the CPU knows to prefetch the next memory to be accessed.
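A minimal sketch of such stride-1 probing (not the actual string_set or condesc code):

```c
#include <string.h>

/* Open addressing with a constant stride of 1: on a collision, inspect
 * the next slot. The probe walks a small contiguous region, so the
 * hardware prefetcher helps; keeping the table under ~50% full keeps
 * the probe chains short. */
static unsigned int probe_slot(const char **table, unsigned int size, /* power of 2 */
                               const char *str, unsigned int hash)
{
    unsigned int i = hash & (size - 1);
    while (table[i] != NULL && strcmp(table[i], str) != 0)
        i = (i + 1) & (size - 1);  /* stride 1, with wraparound */
    return i;  /* the slot holding str, or the first empty slot */
}
```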


```c
#define CACHELINE 64
	size_t dsize = dcnt * sizeof(Disjunct);
	dsize = (dsize+CACHELINE)&~(CACHELINE-1); /* Align connector block. */
```
linas (Member):

Here, you align, but then below, you add csize of unknown alignment. I'm sort-of confused, because you then use what malloc provided, without actually aligning -- malloc might provide an address that is not 64-byte aligned.

ampli (Member, Author):

The memory arrangement in this allocation is:
DISJUNCT_ARRAY | ALIGNMENT_PAD | CONNECTOR_ARRAY
The Connector struct is 32 bytes (and can be shrunk to 16, 8, and even 4 bytes; I would like to test all of them).
In order not to split connectors over two cache lines, I want CONNECTOR_ARRAY to start at a 64-byte boundary. I get this here by adding an alignment pad at the end of the disjunct allocation.
(As far as I remember, the Disjunct struct size is something like 44 bytes, so there is no point in aligning its start, and it doesn't usually end at the desired boundary.)

So it seems fine to me.
If desired, I can add a description like the above to the function comment.

ampli (Member, Author) added:

> As far as I remember, the Disjunct struct size is something like 44 bytes

(It is actually 56.)
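For what it's worth, a sketch of that layout which also guarantees the base alignment questioned above, assuming POSIX posix_memalign is acceptable (the PR itself may resolve this differently):

```c
#include <stdlib.h>

#define CACHELINE 64

/* Layout: DISJUNCT_ARRAY | ALIGNMENT_PAD | CONNECTOR_ARRAY.
 * Rounding dsize up to a cache-line multiple puts the connector array
 * on a 64-byte boundary relative to the base; posix_memalign makes the
 * base itself 64-byte aligned, so connectors never straddle two lines. */
static void *pack_alloc_sketch(size_t dcnt, size_t dsz,  /* disjunct count/size */
                               size_t ccnt, size_t csz,  /* connector count/size */
                               void **connectors)
{
    size_t dsize = dcnt * dsz;
    dsize = (dsize + CACHELINE - 1) & ~((size_t)CACHELINE - 1);

    void *base;
    if (posix_memalign(&base, CACHELINE, dsize + ccnt * csz) != 0)
        return NULL;

    *connectors = (char *)base + dsize;
    return base;  /* the disjunct array starts at the aligned base */
}
```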

linas (Member):

OK. If the code looks right to you, then OK; it looked strange to me, but I did not try very hard to figure it out.

ampli (Member, Author):

I hope it will be clearer after I add comments to it.

linas merged commit 3e116eb into opencog:master on Feb 15, 2018
ampli added a commit to ampli/link-grammar that referenced this pull request Feb 17, 2018
linas added a commit that referenced this pull request Feb 18, 2018
Address problems discussed in #673
@ampli ampli mentioned this pull request Aug 8, 2019