
Making connector handling much faster #198

Closed
ampli opened this issue Oct 28, 2015 · 11 comments

ampli commented Oct 28, 2015

Hi Linas,
I did profiling and found several "hot spots" in the regular parser (I will discuss the profiling of the SAT parser in another post).

Among them are connector comparisons.
post_process_match() is now the no. 1 CPU consumer.
easy_match() is no. 2.
(I neglect for now malloc/realloc/free, which take about 35% of the CPU...).

It turned out that it is possible to speed up these functions considerably.
With these modifications, the batch benchmarks run much faster (~15% speed-up on fixes.batch).

The only drawback is that the new code is maybe less straightforward (but adding appropriate comments may mitigate that).
The speed-up is based on these ideas:

  1. Assume that the connector syntax is valid. Thus, it is possible to assume that after the optional head/dependent marking, there is at least one uppercase letter. Similarly, it is possible to assume that the uppercase letters, if there are more than one, are consecutive.
  2. Use all the info we have at each point, in order to have the "shortest path" from entry to exit in each function.
  3. Use islower()/isupper() as few times as possible, since they are more costly than regular comparisons.
  4. Use !isupper() instead of islower(), to prevent cache thrashing by an additional memory reference (yields a slight but observable speed-up).

Here is how the faster post_process_match() looks (still subject to change):

bool post_process_match(const char *s, const char *t)
{
    if (NULL == t) return false;
    if (!isupper((int)*t)) t++; /* Skip head-dependent indicator */

    if (*s != *t) return false;
    while (*++s == *++t)
    {
        if (*s == '\0') return true;
    }
    /* Ensure all the uppercase letters have matched. */
    if (*s == '\0') return !isupper((int)*t);
    if (isupper((int)*s)) return false;
    if ((*s == '#') && isupper((int)*t)) return false;

    do
    {
        if ((*s != '#') && (*s != ((*t == '\0') ? '*' : *t))) return false;
        if (*t != '\0') t++;
    } while (*++s != '\0');
    return true;
}

Here is a benchmark for the new post_process_match():
Average of 25 runs, each 5G iterations (interleaving runs of current and modified code).
(The times include the C program invocation and the loop overhead.)

| s | t | current | modified |
|---|---|---|---|
| A | A | 27.2 | 15.8 |
| A | B | 19.9 | 10.5 |
| AX | BY | 20.0 | 10.5 |
| ABc | ABd | 33.6 | 22.1 |
| hA | dA | 20.6 | 9.93 |
| hA | hA | 19.5 | 9.46 |
| AB# | ABcv*b | 35.7 | 25.2 |
| Pa##j | Pabcj | 41.0 | 33.3 |
| C# | C* | 33.8 | 22.2 |

Here is a similar benchmark for the new easy_match():
Average of 12 runs, each 5G iterations (interleaving runs of current and modified code).
(The times include the C program invocation and the loop overhead.)

| s | t | current | modified |
|---|---|---|---|
| A | A | 11.6 | 7.59 |
| A | B | 6.32 | 5.05 |
| AX | BY | 6.32 | 5.06 |
| ABc | ABd | 30.2 | 12.9 |
| ABx | ABx | 31.5 | 11.6 |
| hA | dA | 11.6 | 10.1 |
| hA | hA | 3.79 | 3.78 |
| ABC*b | ABCab | 41.8 | 31.7 |
| Aab | A*b | 31.7 | 18.3 |
| A_b | A_bc | 31.8 | 11.4 |
| BA | B* | 8.86 | 12.6 |
| B* | BA | 10.2 | 11.4 |
| LLCMC | LLCMD | 31.5 | 17.9 |
| ABc | AB | 14.8 | 10.2 |
| AB | ABc | 14.8 | 9.02 |
| ABc | ABce | 31.7 | 11.5 |
| hA | dAB | 11.7 | 8.86 |

I have an idea for how to further speed up these functions, and also speed up the connector hash calculation: when reading the dict, calculate the following for each connector: uppercase start offset, uppercase end offset, and maybe also total length and whether there is a wildcard. I intend to check whether this can provide an additional significant speed-up.

Meanwhile, if this way of speeding up easy_match() and post_process_match() seems fine to you, I can send a pull request.

I also made numerous other speed-up modifications, whose accumulated effect is still unknown. Maybe I will include them in the same pull request (as separate commits, of course).


linas commented Nov 4, 2015

The speed-up is based on these ideas:

  1. Assume that the connector syntax is valid. Thus, it is possible to
    assume that after the optional head/dependent marking, there is at least
    one uppercase letter. Similarly, it is possible to assume that the
    uppercase letters, if there are more than one, are consecutive.
  2. Use all the info we have at each point, in order to have the "shortest
    path" from entry to exit in each function.
  3. Use islower()/isupper() as few times as possible, since they are more
    costly than regular comparisons.
  4. Use !isupper() instead of islower(), to prevent cache thrashing by an
    additional memory reference (yields a slight but observable
    speed-up).

These seem to be reasonable assumptions.

Here is how the faster post_process_match() looks (still subject to
change):

bool post_process_match(const char *s, const char *t)
{
    if (NULL == t) return false;
    if (!isupper((int)*t)) t++; /* Skip head-dependent indicator */

    if (*s != *t) return false;
    while (*++s == *++t)
    {
        if (*s == '\0') return true;

this looks like a bug ... upper-case letters in t might not be matched.
for example, s=AA t=AAAA

    }
    /* Ensure all the uppercase letters have matched. */
    if (*s == '\0') return !isupper((int)*t);
    if (isupper((int)*s)) return false;
    if ((*s == '#') && isupper((int)*t)) return false;

I don't think we use hash-marks anywhere in the dicts, right? There are
still hash-marks in the post-processing rules, but I don't think that uses
this code ..?


ampli commented Nov 4, 2015

this looks like a bug ... upper-case letters in t might not be matched.
for example, s=AA t=AAAA

I'm not sure I understand what you mean: if s=AA and t=AAAA, there should be no match.
Note that if (*s == '\0'), then also (*t == '\0') because it is inside the while() and this ensures that they are equal at that point. It also means everything before that was equal.

I don't think we use hash-marks anywhere in the dicts, right? There are
still hash-marks in the post-processing rules, but I don't think that uses
this code ..?

There are indeed no "#" marks in the dict.
However, the post-processing rules do use this code!
To check that, I added this to the current code:

bool post_process_match(const char *s, const char *t)
{
    ...
    if (*s != '#')
    {
        char c;
        if (*t == '\0') c = '*'; else c = *t;
        if (*s != c) return false;
    }
    else printf("POST_PROCESS_MATCH: *s=%s, *t=%s\n", s, t);

and got this:

POST_PROCESS_MATCH: *s=#c, *t=*b
POST_PROCESS_MATCH: *s=##c, *t=s*b
POST_PROCESS_MATCH: *s=#c, *t=*b
POST_PROCESS_MATCH: *s=#c, *t=*b
POST_PROCESS_MATCH: *s=##c, *t=s*b
POST_PROCESS_MATCH: *s=#c, *t=*b
POST_PROCESS_MATCH: *s=#c, *t=
POST_PROCESS_MATCH: *s=##c, *t=s
POST_PROCESS_MATCH: *s=#c, *t=
POST_PROCESS_MATCH: *s=#i, *t=*t


ampli commented Nov 4, 2015

Meanwhile I found another method that performs connector comparisons in negligible CPU time compared with the current code. This method has the problem (which may not actually be a real problem) that it assumes the connectors consist of ASCII only, and that each contains a sequence of no more than 12 capital letters followed by a sequence of no more than 9 additional letters (in addition to the optional h/d at the start). But that seems enough.
The said method needs pre-compiling of the connector strings (similar to the idea of pre-compiling regexes).
But an infrastructure for that (when reading the dict) may allow more pre-computations, ones that are now done for each sentence separately and take significant CPU time; for example, for match_in_connector_set().

Here is my new easy_match():

static inline bool easy_match(struct connector *s, struct connector *t)
{
    if (s->uc != t->uc) return false;
    if ((s->lc ^ t->lc) & s->mask&t->mask) return false;
    if (s->hd && (s->hd == t->hd)) return false;
    return true;
}

The struct members uc, lc and mask are uint64_t, and hd is int.
The uppercase and lowercase (actually anything after the uppercase) letters are packed into uc and lc, respectively. The mask member contains 0 bits in the positions of '*', and 1 bits elsewhere.

For most of the comparisons, the only work of this function is to return false in its first statement.
I don't know yet exactly how much total CPU this idea can save when implemented in all the relevant places (in addition to easy_match()), but I think it has the potential for considerable savings.

If the idea of pre-computation for connectors (at dict read time) seems fine to you, I will test it.


linas commented Nov 11, 2015

On Wed, Nov 4, 2015 at 3:01 PM, Amir Plivatsky [email protected]
wrote:

this looks like a bug ... upper-case letters in t might not be matched.
for example, s=AA t=AAAA

I'm not sure I understand what you say: If s=AA and t=AAAA, there should
be no match.

right.

Note that if (*s == '\0'), then also (*t == '\0') because it is inside
the while() and this ensures that they are equal at that point.

OH, OK, right. I misread the code.

There are indeed no "#" marks in the dict.

Yes, right, sorry. I forgot that this was the post-process match and not the
regular match. Late-night email.


linas commented Nov 11, 2015

Precompiling for easy-match seems like an excellent idea. Limiting to ASCII is OK, and so is limiting to 12 cap letters. I understand where the 12 comes from; I don't see why lower-case is only 9, though. Can't the mask handle it all? Whatever. 9 is an OK limit too.

These are OK limits for the hand-built dicts, and are OK for the machine-learned dicts, too. Go for it!


ampli commented Nov 11, 2015

For the "lowercase letters" part, I packed each of the letters as 7 bits. However, it can be done as 6 bits too, still allowing all the useful characters besides lowercase letters (e.g. numbers). This allows for an additional character.

Meanwhile I found a better encoding for the uppercase part (than 5-bit packing):

  1. Pre-sort all connectors according to their uppercase part.
  2. Encode the uppercase part of each connector as its ordinal number in (1).

This allows packing up to 64K different unlimited-length uppercase parts into a uint16_t, which is also much more cache-friendly than a uint64_t.

Pre-handling all the connectors at the dict-creation step may have yet another major benefit: their hash can be pre-computed, instead of being computed per sentence as it is now (which is one of the major overhead areas). In theory, one of the many free perfect-hash functions could be used. They can easily handle millions of keys, so I hope handling several thousand keys will be light enough for a one-time dict-handling step. This of course needs a check.

In addition, each connector could be pre-marked to indicate whether it belongs to the UNLIMITED-CONNECTORS set, another thing that is now checked (using a hash) per sentence (#11 in the CPU profiling).


linas commented Nov 12, 2015

Yes, these are all excellent ideas. The only observation I have to offer is that it was "hard" to hash the connectors, as they were so short, there was little variety. Perhaps enumerating them can help. I tried several different hashes, and the shortest, simplest ones seemed to work best.


linas commented Feb 27, 2018

Recent work, mostly in #673 but also in #677 and #678 appears to have addressed this issue -- is this issue still alive, or can it be closed?


ampli commented Feb 27, 2018

Quoting from my opening post:

post_process_match() is now the no. 1 CPU consumer.

It still consumes a considerable percentage.
The following have the potential for a considerable speedup:

  1. Use connector enumeration, so that it can use the new fast easy_match().
    This has an implementation difficulty: a link to the connector table is needed in the lower-level functions. It means changing a large number of functions just to pass this pointer (which is not directly related to them) to the functions they call. However, such a change is easy and straightforward. Alternatively, this pointer could be made a global TLS variable, but I am not convinced this is better, and a direct implementation of that prevents using two open dicts in the same thread (a thing that seems useful to me).

    Please advise.

BTW, the post processing is also a major bottleneck in the SAT parser code, and I once tried investigating its encoding in SAT (only the postprocessing pruning is currently encoded).

  2. Use memory pools (TBD).

easy_match() is no. 2.

By now it is far down the list.
Recently I found how to encode the h/d marker in the remaining unused bit of the encoding (only 9*7 bits are currently used).
This has two benefits:

  1. Getting rid of the head_dependent field (in the ConTable struct), in an attempt to trim it (after removing more fields that are not needed during do_count()) to 16 bytes (from 32).
  2. Hopefully producing a speedup: it compiles to much less assembly code, without even a single branch.

Here is the code I tried:
In connector_encode_lc():

        desc->lc_mask = (lc_mask << 1) + (desc->head_dependent != '\0');
        desc->lc_letters = (lc_value << 1) + (desc->head_dependent == 'h');

In lc_easy_match():

return (((c1->lc_letters ^ c2->lc_letters) & c1->lc_mask & c2->lc_mask) ==
                (c1->lc_mask & c2->lc_mask & 1));

However, I couldn't prove it makes the total run faster (still a mystery). So I will use it only if it is the last obstacle to the said trimming to 16 bytes.

So these two items are open, and I am leaving this issue open.

BTW, the major CPU hog is now the do_count() memoizing table. (The CPU stalls most of the time while accessing it, because it cannot be cached and cannot be accurately sized.) I am trying to: 1. Shrink its elements. 2. Totally reimplement it with an extendible hash table. 3. Add another memoization that will hopefully make it possible to dynamically remove hash entries that are no longer needed.
For long sentences it takes most of the CPU (>50%). This is a current WIP.


linas commented Feb 28, 2018

A link to the connector table is needed in the lower-level functions. It means changing a large number of functions just to pass this pointer

I would prefer that over TLS. Note that passing extra arguments through functions adds overhead as well, especially if the order of the arguments has to be altered. But this overhead is minor compared to cache misses.


ampli commented Nov 8, 2019

Updates:

  • The h/d encoding for easy_match() is now used, but the connector struct is still 32 bytes.
  • Access to the sentence structure from functions that handle connectors/disjuncts may provide more room in the connector/disjunct structures. This is still being investigated, but there is no need for an open issue for it.
  • Optimizing the do_count() cache is a WIP.

So I'm closing this issue.

@ampli ampli closed this as completed Nov 8, 2019