Invent "rainbow" arcs within the regex engine.
Some regular expression constructs, most notably the "." match-anything
metacharacter, produce a sheaf of parallel NFA arcs covering all
possible colors (that is, character equivalence classes).  We can make
a noticeable improvement in the space and time needed to process large
regexes by replacing such cases with a single arc bearing the special
color code "RAINBOW".  This requires only minor additional complication
in places such as pull() and push().
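
As a rough illustration of the idea, the sketch below contrasts the two ways of emitting arcs for a match-anything atom. It is a standalone toy, not the engine's code: the `sketch_` names, the flat output array, and the numeric values chosen for the two special codes are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative values only; the real definitions live in the regex engine. */
#define SKETCH_COLORLESS (-1)	/* "no color is excluded" */
#define SKETCH_RAINBOW   (-2)	/* stands for every ordinary color */

typedef int sketch_color;

/*
 * Store in out[] the colors of the arcs needed for a match-anything atom
 * and return how many arcs that is.  ncolors is the number of live colors;
 * "but" is a color to leave out, or SKETCH_COLORLESS if there is none.
 */
static size_t
sketch_rainbow(sketch_color *out, size_t outcap, int ncolors, sketch_color but)
{
	size_t		n = 0;
	sketch_color co;

	if (but == SKETCH_COLORLESS)
	{
		/* Nothing excluded: a single special arc covers all colors. */
		assert(outcap >= 1);
		out[n++] = SKETCH_RAINBOW;
		return n;
	}

	/* Otherwise do it the hard way: one arc per color except "but". */
	for (co = 0; co < ncolors; co++)
	{
		if (co == but)
			continue;
		assert(n < outcap);
		out[n++] = co;
	}
	return n;
}
```

With five live colors and nothing excluded, the function yields one arc instead of five. When one color is excluded (as for "." when newline must not match), it falls back to one arc per remaining color.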

Callers of pg_reg_getoutarcs() must now be prepared for the possibility
of seeing a RAINBOW arc.  For the one known user, contrib/pg_trgm,
that's a net benefit since it cuts the number of arcs to be dealt with,
and the handling isn't any different than for other colors that contain
too many characters to be dealt with individually.
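
For a caller of pg_reg_getoutarcs(), the change amounts to checking for a negative color before indexing any per-color table. The following is a minimal sketch of that check, loosely modeled on the pg_trgm change in this commit. The struct and enum names are hypothetical stand-ins, and the value used for RAINBOW is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_RAINBOW (-2)		/* assumed value, for illustration only */

/* Hypothetical stand-in for an out-arc: a color and a target state. */
typedef struct
{
	int			co;
	int			to;
} sketch_arc;

/* Hypothetical per-color summary kept by the caller. */
typedef struct
{
	bool		expandable;
} sketch_color_info;

typedef enum
{
	SKETCH_EXPAND,				/* enumerate the color's characters */
	SKETCH_UNKNOWN				/* treat as "some unknown character" */
} sketch_action;

/*
 * Decide how to treat one arc.  Ordinary colors are looked up in info[];
 * a RAINBOW arc is negative, so it must never be used as an array index
 * and is handled like a color with too many characters to expand.
 */
static sketch_action
sketch_classify_arc(const sketch_arc *arc,
					const sketch_color_info *info, int ncolors)
{
	if (arc->co >= 0)
	{
		assert(arc->co < ncolors);
		return info[arc->co].expandable ? SKETCH_EXPAND : SKETCH_UNKNOWN;
	}

	/* RAINBOW: same handling as an unexpandable color */
	return SKETCH_UNKNOWN;
}
```

The design point is that the caller needs no new code path for the result: a RAINBOW arc lands in the same bucket that oversized colors already used.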

This is part of a patch series that in total reduces the regex engine's
runtime by about a factor of four on a large corpus of real-world regexes.

Patch by me, reviewed by Joel Jacobson

Discussion: https://postgr.es/m/[email protected]
tglsfdc committed Feb 20, 2021
1 parent 1766118 commit 08c0d6a
Showing 10 changed files with 177 additions and 37 deletions.
27 changes: 18 additions & 9 deletions contrib/pg_trgm/trgm_regexp.c
@@ -282,8 +282,8 @@ typedef struct
typedef int TrgmColor;

/* We assume that colors returned by the regexp engine cannot be these: */
-#define COLOR_UNKNOWN (-1)
-#define COLOR_BLANK (-2)
+#define COLOR_UNKNOWN (-3)
+#define COLOR_BLANK (-4)

typedef struct
{
@@ -780,7 +780,8 @@ getColorInfo(regex_t *regex, TrgmNFA *trgmNFA)
palloc0(colorsCount * sizeof(TrgmColorInfo));

/*
-* Loop over colors, filling TrgmColorInfo about each.
+* Loop over colors, filling TrgmColorInfo about each. Note we include
+* WHITE (0) even though we know it'll be reported as non-expandable.
*/
for (i = 0; i < colorsCount; i++)
{
@@ -1098,9 +1099,9 @@ addKey(TrgmNFA *trgmNFA, TrgmState *state, TrgmStateKey *key)
/* Add enter key to this state */
addKeyToQueue(trgmNFA, &destKey);
}
-else
+else if (arc->co >= 0)
{
-/* Regular color */
+/* Regular color (including WHITE) */
TrgmColorInfo *colorInfo = &trgmNFA->colorInfo[arc->co];

if (colorInfo->expandable)
@@ -1156,6 +1157,14 @@ addKey(TrgmNFA *trgmNFA, TrgmState *state, TrgmStateKey *key)
addKeyToQueue(trgmNFA, &destKey);
}
}
+else
+{
+/* RAINBOW: treat as unexpandable color */
+destKey.prefix.colors[0] = COLOR_UNKNOWN;
+destKey.prefix.colors[1] = COLOR_UNKNOWN;
+destKey.nstate = arc->to;
+addKeyToQueue(trgmNFA, &destKey);
+}
}

pfree(arcs);
@@ -1216,10 +1225,10 @@ addArcs(TrgmNFA *trgmNFA, TrgmState *state)
/*
* Ignore non-expandable colors; addKey already handled the case.
*
-* We need no special check for begin/end pseudocolors here. We
-* don't need to do any processing for them, and they will be
-* marked non-expandable since the regex engine will have reported
-* them that way.
+* We need no special check for WHITE or begin/end pseudocolors
+* here. We don't need to do any processing for them, and they
+* will be marked non-expandable since the regex engine will have
+* reported them that way.
*/
if (!colorInfo->expandable)
continue;
36 changes: 28 additions & 8 deletions src/backend/regex/README
@@ -261,6 +261,18 @@ and the NFA has these arcs:
states 4 -> 5 on color 2 ("x" only)
which can be seen to be a correct representation of the regex.

+There is one more complexity, which is how to handle ".", that is a
+match-anything atom. We used to do that by generating a "rainbow"
+of arcs of all live colors between the two NFA states before and after
+the dot. That's expensive in itself when there are lots of colors,
+and it also typically adds lots of follow-on arc-splitting work for the
+color splitting logic. Now we handle this case by generating a single arc
+labeled with the special color RAINBOW, meaning all colors. Such arcs
+never need to be split, so they help keep NFAs small in this common case.
+(Note: this optimization doesn't help in REG_NLSTOP mode, where "." is
+not supposed to match newline. In that case we still handle "." by
+generating an almost-rainbow of all colors except newline's color.)
+
Given this summary, we can see we need the following operations for
colors:

@@ -349,18 +361,20 @@ The possible arc types are:

PLAIN arcs, which specify matching of any character of a given "color"
(see above). These are dumped as "[color_number]->to_state".
+In addition there can be "rainbow" PLAIN arcs, which are dumped as
+"[*]->to_state".

EMPTY arcs, which specify a no-op transition to another state. These
are dumped as "->to_state".

AHEAD constraints, which represent a "next character must be of this
color" constraint. AHEAD differs from a PLAIN arc in that the input
character is not consumed when crossing the arc. These are dumped as
-">color_number>->to_state".
+">color_number>->to_state", or possibly ">*>->to_state".

BEHIND constraints, which represent a "previous character must be of
this color" constraint, which likewise consumes no input. These are
-dumped as "<color_number<->to_state".
+dumped as "<color_number<->to_state", or possibly "<*<->to_state".

'^' arcs, which specify a beginning-of-input constraint. These are
dumped as "^0->to_state" or "^1->to_state" for beginning-of-string and
@@ -396,14 +410,20 @@ substring, or an imaginary following EOS character if the substring is at
the end of the input.
3. If the NFA is (or can be) in the goal state at this point, it matches.

+This definition is necessary to support regexes that begin or end with
+constraints such as \m and \M, which imply requirements on the adjacent
+character if any. The executor implements that by checking if the
+adjacent character (or BOS/BOL/EOS/EOL pseudo-character) is of the
+right color, and it does that in the same loop that checks characters
+within the match.
+
So one can mentally execute an untransformed NFA by taking ^ and $ as
ordinary constraints that match at start and end of input; but plain
arcs out of the start state should be taken as matches for the character
before the target substring, and similarly, plain arcs leading to the
post state are matches for the character after the target substring.
-This definition is necessary to support regexes that begin or end with
-constraints such as \m and \M, which imply requirements on the adjacent
-character if any. NFAs for simple unanchored patterns will usually have
-pre-state outarcs for all possible character colors as well as BOS and
-BOL, and post-state inarcs for all possible character colors as well as
-EOS and EOL, so that the executor's behavior will work.
+After the optimize() transformation, there are explicit arcs mentioning
+BOS/BOL/EOS/EOL adjacent to the pre-state and post-state. So a finished
+NFA for a pattern without anchors or adjacent-character constraints will
+have pre-state outarcs for RAINBOW (all possible character colors) as well
+as BOS and BOL, and likewise post-state inarcs for RAINBOW, EOS, and EOL.
22 changes: 21 additions & 1 deletion src/backend/regex/regc_color.c
@@ -977,6 +977,7 @@ colorchain(struct colormap *cm,
{
struct colordesc *cd = &cm->cd[a->co];

+assert(a->co >= 0);
if (cd->arcs != NULL)
cd->arcs->colorchainRev = a;
a->colorchain = cd->arcs;
@@ -994,6 +995,7 @@ uncolorchain(struct colormap *cm,
struct colordesc *cd = &cm->cd[a->co];
struct arc *aa = a->colorchainRev;

+assert(a->co >= 0);
if (aa == NULL)
{
assert(cd->arcs == a);
@@ -1012,6 +1014,9 @@ uncolorchain(struct colormap *cm,

/*
* rainbow - add arcs of all full colors (but one) between specified states
+*
+* If there isn't an exception color, we now generate just a single arc
+* labeled RAINBOW, saving lots of arc-munging later on.
*/
static void
rainbow(struct nfa *nfa,
@@ -1025,6 +1030,13 @@ rainbow(struct nfa *nfa,
struct colordesc *end = CDEND(cm);
color co;

+if (but == COLORLESS)
+{
+newarc(nfa, type, RAINBOW, from, to);
+return;
+}
+
+/* Gotta do it the hard way. Skip subcolors, pseudocolors, and "but" */
for (cd = cm->cd, co = 0; cd < end && !CISERR(); cd++, co++)
if (!UNUSEDCOLOR(cd) && cd->sub != co && co != but &&
!(cd->flags & PSEUDO))
@@ -1034,13 +1046,16 @@ rainbow(struct nfa *nfa,
/*
* colorcomplement - add arcs of complementary colors
*
+* We add arcs of all colors that are not pseudocolors and do not match
+* any of the "of" state's PLAIN outarcs.
+*
* The calling sequence ought to be reconciled with cloneouts().
*/
static void
colorcomplement(struct nfa *nfa,
struct colormap *cm,
int type,
-struct state *of, /* complements of this guy's PLAIN outarcs */
+struct state *of,
struct state *from,
struct state *to)
{
@@ -1049,6 +1064,11 @@ colorcomplement(struct nfa *nfa,
color co;

assert(of != from);

+/* A RAINBOW arc matches all colors, making the complement empty */
+if (findarc(of, PLAIN, RAINBOW) != NULL)
+return;
+
for (cd = cm->cd, co = 0; cd < end && !CISERR(); cd++, co++)
if (!UNUSEDCOLOR(cd) && !(cd->flags & PSEUDO))
if (findarc(of, PLAIN, co) == NULL)
82 changes: 74 additions & 8 deletions src/backend/regex/regc_nfa.c
@@ -271,6 +271,11 @@ destroystate(struct nfa *nfa,
*
* This function checks to make sure that no duplicate arcs are created.
* In general we never want duplicates.
+*
+* However: in principle, a RAINBOW arc is redundant with any plain arc
+* (unless that arc is for a pseudocolor). But we don't try to recognize
+* that redundancy, either here or in allied operations such as moveins().
+* The pseudocolor consideration makes that more costly than it seems worth.
*/
static void
newarc(struct nfa *nfa,
@@ -1170,6 +1175,9 @@ copyouts(struct nfa *nfa,

/*
* cloneouts - copy out arcs of a state to another state pair, modifying type
+*
+* This is only used to convert PLAIN arcs to AHEAD/BEHIND arcs, which share
+* the same interpretation of "co". It wouldn't be sensible with LACONs.
*/
static void
cloneouts(struct nfa *nfa,
@@ -1181,9 +1189,13 @@ cloneouts(struct nfa *nfa,
struct arc *a;

assert(old != from);
+assert(type == AHEAD || type == BEHIND);

for (a = old->outs; a != NULL; a = a->outchain)
+{
+assert(a->type == PLAIN);
newarc(nfa, type, a->co, from, to);
+}
}

/*
@@ -1597,7 +1609,7 @@ pull(struct nfa *nfa,
for (a = from->ins; a != NULL && !NISERR(); a = nexta)
{
nexta = a->inchain;
-switch (combine(con, a))
+switch (combine(nfa, con, a))
{
case INCOMPATIBLE: /* destroy the arc */
freearc(nfa, a);
@@ -1624,6 +1636,10 @@ pull(struct nfa *nfa,
cparc(nfa, a, s, to);
freearc(nfa, a);
break;
+case REPLACEARC: /* replace arc's color */
+newarc(nfa, a->type, con->co, a->from, to);
+freearc(nfa, a);
+break;
default:
assert(NOTREACHED);
break;
@@ -1764,7 +1780,7 @@ push(struct nfa *nfa,
for (a = to->outs; a != NULL && !NISERR(); a = nexta)
{
nexta = a->outchain;
-switch (combine(con, a))
+switch (combine(nfa, con, a))
{
case INCOMPATIBLE: /* destroy the arc */
freearc(nfa, a);
@@ -1791,6 +1807,10 @@ push(struct nfa *nfa,
cparc(nfa, a, from, s);
freearc(nfa, a);
break;
+case REPLACEARC: /* replace arc's color */
+newarc(nfa, a->type, con->co, from, a->to);
+freearc(nfa, a);
+break;
default:
assert(NOTREACHED);
break;
@@ -1810,9 +1830,11 @@ push(struct nfa *nfa,
* #def INCOMPATIBLE 1 // destroys arc
* #def SATISFIED 2 // constraint satisfied
* #def COMPATIBLE 3 // compatible but not satisfied yet
+* #def REPLACEARC 4 // replace arc's color with constraint color
*/
static int
-combine(struct arc *con,
+combine(struct nfa *nfa,
+struct arc *con,
struct arc *a)
{
#define CA(ct,at) (((ct)<<CHAR_BIT) | (at))
@@ -1827,14 +1849,46 @@ combine(struct arc *con,
case CA(BEHIND, PLAIN):
if (con->co == a->co)
return SATISFIED;
+if (con->co == RAINBOW)
+{
+/* con is satisfied unless arc's color is a pseudocolor */
+if (!(nfa->cm->cd[a->co].flags & PSEUDO))
+return SATISFIED;
+}
+else if (a->co == RAINBOW)
+{
+/* con is incompatible if it's for a pseudocolor */
+if (nfa->cm->cd[con->co].flags & PSEUDO)
+return INCOMPATIBLE;
+/* otherwise, constraint constrains arc to be only its color */
+return REPLACEARC;
+}
return INCOMPATIBLE;
break;
case CA('^', '^'): /* collision, similar constraints */
case CA('$', '$'):
-case CA(AHEAD, AHEAD):
+if (con->co == a->co) /* true duplication */
+return SATISFIED;
+return INCOMPATIBLE;
+break;
+case CA(AHEAD, AHEAD): /* collision, similar constraints */
case CA(BEHIND, BEHIND):
if (con->co == a->co) /* true duplication */
return SATISFIED;
+if (con->co == RAINBOW)
+{
+/* con is satisfied unless arc's color is a pseudocolor */
+if (!(nfa->cm->cd[a->co].flags & PSEUDO))
+return SATISFIED;
+}
+else if (a->co == RAINBOW)
+{
+/* con is incompatible if it's for a pseudocolor */
+if (nfa->cm->cd[con->co].flags & PSEUDO)
+return INCOMPATIBLE;
+/* otherwise, constraint constrains arc to be only its color */
+return REPLACEARC;
+}
return INCOMPATIBLE;
break;
case CA('^', BEHIND): /* collision, dissimilar constraints */
@@ -2895,6 +2949,7 @@ compact(struct nfa *nfa,
break;
case LACON:
assert(s->no != cnfa->pre);
+assert(a->co >= 0);
ca->co = (color) (cnfa->ncolors + a->co);
ca->to = a->to->no;
ca++;
@@ -3068,13 +3123,22 @@ dumparc(struct arc *a,
switch (a->type)
{
case PLAIN:
-fprintf(f, "[%ld]", (long) a->co);
+if (a->co == RAINBOW)
+fprintf(f, "[*]");
+else
+fprintf(f, "[%ld]", (long) a->co);
break;
case AHEAD:
-fprintf(f, ">%ld>", (long) a->co);
+if (a->co == RAINBOW)
+fprintf(f, ">*>");
+else
+fprintf(f, ">%ld>", (long) a->co);
break;
case BEHIND:
-fprintf(f, "<%ld<", (long) a->co);
+if (a->co == RAINBOW)
+fprintf(f, "<*<");
+else
+fprintf(f, "<%ld<", (long) a->co);
break;
case LACON:
fprintf(f, ":%ld:", (long) a->co);
@@ -3161,7 +3225,9 @@ dumpcstate(int st,
pos = 1;
for (ca = cnfa->states[st]; ca->co != COLORLESS; ca++)
{
-if (ca->co < cnfa->ncolors)
+if (ca->co == RAINBOW)
+fprintf(f, "\t[*]->%d", ca->to);
+else if (ca->co < cnfa->ncolors)
fprintf(f, "\t[%ld]->%d", (long) ca->co, ca->to);
else
fprintf(f, "\t:%ld:->%d", (long) (ca->co - cnfa->ncolors), ca->to);