\input tikz.tex
\centerline{\titlefont Crit-bit Trees}
\vskip 15pt
\centerline{Adam Langley ({\tt agl@@imperialviolet.org})}
\centerline{(Version {\tt 20080926})}
@ Introduction
This code is taken from Dan Bernstein's {\tt qhasm} and implements a binary
crit-bit (also known as PATRICIA) tree for |NUL| terminated strings. Crit-bit
trees are underused and it's this author's hope that a good example will aid
their adoption.
Internal nodes in a crit-bit store a position in the input and two children. The
position is the next location in which two members differ (the {\it crit}ical
{\it bit}). For a given set of elements there is a unique crit-bit tree representing
that set, thus a crit-bit tree does not need complex balancing algorithms. The
depth of a crit-bit tree is bounded by the length of the longest element,
rather than the number of elements (as with an unbalanced tree). Thus, if the
crit-bit tree is defined on a finite domain (say, the set of 32-bit integers)
then the maximum depth is 32, since no two 32-bit integers can differ in the
$33^{rd}$ bit.
Crit-bit trees also support the usual tree operations quickly:
membership-testing, insertion, removal, predecessor, successor and easy
iteration. For |NUL| terminated strings they are especially helpful since they
don't require an expensive string comparison at each step.
This code, like Prof.~Bernstein's original code, is released into the public
domain. It can be found at {\tt http://\-github.com/\-agl/\-critbit}.
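@ Example: the critical bit of two integers
The depth bound from the introduction can be illustrated with a small sketch.
It is not part of the crit-bit code: the helper |first_differing_bit| is a
hypothetical name introduced here. It returns the position, counting from the
most significant bit and starting at 1, of the first bit in which two 32-bit
integers differ. The result never exceeds 32, which is why a tree over such
integers cannot be deeper than 32.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of critbit.w.
   Returns the 1-based position (counted from the most significant bit)
   of the first bit in which a and b differ, or 0 if a == b. */
static int
first_differing_bit(uint32_t a, uint32_t b) {
  uint32_t x = a ^ b;          /* set bits mark the positions that differ */
  int position = 1;
  if (x == 0) return 0;        /* identical values have no critical bit */
  while (!(x & UINT32_C(0x80000000))) {
    x <<= 1;                   /* move the next bit into the top position */
    ++position;
  }
  assert(position >= 1 && position <= 32);
  return position;
}
```

For instance, 0 and 1 differ only in the last bit, so the helper reports
position 32 for that pair.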
@ Structures
We start with the structures used in the crit-bit tree. We'll cover the
semantics of each of the members of these structures as the need arises.
@d _POSIX_C_SOURCE 200112
@d uint8 uint8_t
@d uint32 uint32_t
@f critbit0_node int
@f critbit0_tree int
@f uint8 int
@c
#include <stdint.h>
#include <string.h>
#include <stdlib.h>
#include <sys/types.h>
#include <errno.h>
typedef struct {
  void *child[2];
  uint32 byte;
  uint8 otherbits;
} critbit0_node;
typedef struct {
  void *root;
} critbit0_tree;
@* Membership testing.
The first function that we'll deal with will be membership testing. The
following function takes a tree, |t|, and a |NUL| terminated string, |u|, and
returns non-zero iff |u| $\in$ |t|.
@c
int
critbit0_contains(critbit0_tree *t, const char *u) {
  const uint8 *ubytes = (void *) u;
  const size_t ulen = strlen(u);
  uint8 *p = t->root;
  @<Test for empty tree@>@;
  @<Walk tree for best member@>@;
  @<Check for successful membership@>@;
}
@ An empty tree
An empty tree is simply one where the root pointer is |NULL| (that's
{\tt NULL} for those who are new to {\tt CWEB}).
@<Test for empty tree@>=
if (!p) return 0;
@ Searching the tree
Once we have established that the tree is not empty, we know that it has one or
more members. Now we need to distinguish between internal and external nodes.
Internal nodes are |critbit0_node| structures. They record that the tree
diverges at some point. External nodes are allocated strings. Thus, a tree with
a single member is one where the root pointer points at an allocated string.
However, we need to be able to test a given pointer to know if it points at an
internal or external node. Several possibilities present themselves: a common
structure as a prefix to both the internal and external nodes, tags associated
with every pointer, {\it etc}. In this case, we include the tag in the pointer
itself as the least-significant bit. We assume that both types of nodes are
aligned to, at least, two bytes and thus the LSB is free to be used as a tag bit.
Internal nodes are tagged with a |1| and external nodes are tagged with a
|0|.
When walking the tree we want to stop as soon as we reach an external
node. Thus we use a |while| loop that continues only while the current node
pointer points at an internal node.
@<Walk tree for best member@>=
while (1 & (intptr_t) p) {
  critbit0_node *q = (void *) (p - 1);
  @<Calculate direction@>@;
  p = q->child[direction];
}
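@ Example: tagging a pointer
The tagging scheme described above can be sketched in isolation. The helper
names |tag_internal|, |is_internal| and |untag| are hypothetical and do not
appear in the crit-bit code, which writes the same arithmetic inline. The
sketch assumes, as the text does, that nodes are aligned to at least two bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers, not part of critbit.w. */

/* Mark a pointer as referring to an internal node by setting its LSB. */
static void *
tag_internal(void *node) {
  assert(!(1 & (intptr_t) node));   /* needs at least two-byte alignment */
  return (void *) (1 + (char *) node);
}

/* Non-zero iff the pointer carries the internal-node tag. */
static int
is_internal(const void *p) {
  return (int) (1 & (intptr_t) p);
}

/* Recover the real address of an internal node from a tagged pointer. */
static void *
untag(void *p) {
  return (void *) ((char *) p - 1);
}
```

External nodes are stored untagged, so a pointer for which |is_internal|
returns zero can be used as a string directly.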
@ Encoding a location
Recall that a crit-bit tree works by encoding the bit-number that differs at
each branch in the tree. The obvious way to do this would either be with a
single number (the number of bits from the beginning of the string), or with a
(byte number, bit number $\in [0..7]$) pair.
However, for reasons that should become clear later, here we encode it as a
byte number and a single byte where all the bits {\it except} the critical bit
are true. By performing a bitwise OR with the correct byte there are only two
results: If the byte did not have the critical bit set, the result is the same
as the mask. If it did, the result is all ones. The latter case is the unique
8-bit value where adding one and right-shifting 8 places results in a 1. We use
this to obtain the direction.
Note also that our strings are treated as if they had an infinitely long suffix
of |NUL| bytes following them. Thus, if the critical bit is beyond the end of
our string, we treat it as if it had a zero bit there.
@<Calculate direction@>=
uint8 c = 0;
if (q->byte < ulen) c = ubytes[q->byte];
const int direction = (1 + (q->otherbits | c)) >> 8;
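@ Example: computing a direction
As a worked instance of the encoding above, the following sketch wraps the
same arithmetic in a function. The name |direction_for| and its parameter list
are assumptions made for illustration; the crit-bit code performs this
calculation inline.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wrapper, not part of critbit.w.
   bytes/len describe the string, byte is the index of the critical byte and
   otherbits has every bit set except the critical bit. */
static int
direction_for(const uint8_t *bytes, size_t len,
              uint32_t byte, uint8_t otherbits) {
  uint8_t c = 0;                      /* bytes past the end read as NUL */
  assert(otherbits != 0xFF);          /* a valid mask has one clear bit */
  if (byte < len) c = bytes[byte];
  /* otherbits | c is 0xFF only when the critical bit is set in c;
     adding one then carries into bit 8. */
  return (1 + (otherbits | c)) >> 8;
}
```

With the string {\tt ab}, whose first byte is {\tt 0x61}, the mask {\tt 0xFE}
(critical bit {\tt 0x01}) gives direction 1 and the mask {\tt 0xFD} (critical
bit {\tt 0x02}) gives direction 0. Any byte index at or beyond the length of
the string gives direction 0.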
@ The final test
Once we have reached an external node we can only conclude that certain
bits of the string are shared with a string in the tree. We still need to
test the best match to make sure that it's correct. If the test fails, however,
we can conclude that the string is not in the tree.
Note that the pointer cannot be |NULL|. We tested that the root pointer was not
|NULL| at the start of the function and, if an internal node had a |NULL| pointer
then the tree would be invalid - that internal node should be removed.
@<Check for successful me...@>=
return 0 == strcmp(u, (const char *) p);
@* Inserting into the tree.
This is a more complex function. It takes a tree, |t|, and possibly mutates it
such that a |NUL| terminated string, |u|, is a member on exit. It returns:
$\cases{ 0 &if {\rm out of memory} \cr
1 &if {\it u} {\rm was already a member} \cr
2 &if {\it t} {\rm was mutated successfully}}$
Note that the section for walking the tree is the same as before and is not
covered again.
@c
int critbit0_insert(critbit0_tree *t, const char *u)
{
  const uint8 *const ubytes = (void *) u;
  const size_t ulen = strlen(u);
  uint8 *p = t->root;
  @<Deal with inserting into an empty tree@>@;
  @<Walk tree for best member@>@;
  @<Find the critical bit@>@;
  @<Insert new string@>@;
  return 2;
}
@ Inserting into an empty tree
Recall that an empty tree has a |NULL| root pointer. A singleton tree, the
result of inserting into the empty tree, has the root pointing at an external
node - the sole member of the tree.
We require the ability to malloc a buffer with alignment 2 and so use
|posix_memalign| to allocate memory.
@<Deal with inser...@>=
if (!p) {
  char *x;
  int a = posix_memalign((void **) &x, sizeof(void *), ulen + 1);
  if (a) return 0;
  memcpy(x, u, ulen + 1);
  t->root = x;
  return 2;
}
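@ Example: an aligned copy of a string
The allocation pattern above can be sketched on its own. The function name
|aligned_strdup| is hypothetical; it copies a string into memory obtained from
|posix_memalign| and checks that the least-significant bit of the address is
clear, which is the property that the tagging scheme relies on.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of critbit.w.
   Returns a copy of u aligned to sizeof(void *), or NULL if the
   allocation fails. The caller releases it with free(). */
static char *
aligned_strdup(const char *u) {
  const size_t ulen = strlen(u);
  void *x;
  if (posix_memalign(&x, sizeof(void *), ulen + 1)) return NULL;
  memcpy(x, u, ulen + 1);             /* copy the terminating NUL too */
  assert(!(1 & (intptr_t) x));        /* LSB is free for use as a tag */
  return x;
}
```

Memory obtained from |posix_memalign| is released with |free|, which is what
the deletion and clearing code later in this document does.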
@ Finding the critical bit
@<Find the critical bit@>=
@<Find differing byte@>@;
@<Find differing bit@>@;
@ Finding the differing byte
Now that we have found the best match for the new element in the tree we need
to check to see where the new element differs from that element. If it doesn't
differ, of course, then the new element already exists in the tree and we can
return 1. Remember that we treat strings as if they had an infinite number of
|NUL|s following them and that the best match string might be longer than |u|.
While calculating the differing byte we also calculate |newotherbits|, the XOR
of the two differing bytes. Its purpose will become clear in the next section.
@<Find differing byte@>=
uint32 newbyte;
uint32 newotherbits;
for (newbyte = 0; newbyte < ulen; ++newbyte) {
  if (p[newbyte] != ubytes[newbyte]) {
    newotherbits = p[newbyte] ^ ubytes[newbyte];
    goto different_byte_found;
  }
}
if (p[newbyte] != 0) {
  newotherbits = p[newbyte];
  goto different_byte_found;
}
return 1;
different_byte_found:
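@ Example: locating the differing byte
The search above can be restated as a function with explicit results instead of
a |goto|. This is a sketch under stated assumptions: the name
|find_differing_byte| and its output parameters are hypothetical and are not
part of the crit-bit code. Here |p_str| plays the role of the best match found
in the tree and |u_str| the string being inserted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of critbit.w.
   Returns 1 and fills byte_out/xor_out if the strings differ,
   or 0 if they are equal. */
static int
find_differing_byte(const char *p_str, const char *u_str,
                    uint32_t *byte_out, uint8_t *xor_out) {
  const uint8_t *p = (const uint8_t *) p_str;
  const uint8_t *u = (const uint8_t *) u_str;
  const size_t ulen = strlen(u_str);
  uint32_t i;
  assert(byte_out != NULL && xor_out != NULL);
  /* If p is shorter than u, its NUL differs from u[i], so the loop
     stops there and never reads past the end of p. */
  for (i = 0; i < ulen; ++i) {
    if (p[i] != u[i]) {
      *byte_out = i;
      *xor_out = (uint8_t) (p[i] ^ u[i]);
      return 1;
    }
  }
  /* u has ended; p differs only if it continues past this point. */
  if (p[i] != 0) {
    *byte_out = i;
    *xor_out = p[i];
    return 1;
  }
  return 0;
}
```

Comparing {\tt abc} with {\tt abd} reports byte 2 with an XOR of {\tt 0x07},
while comparing {\tt abcd} with {\tt abc} reports byte 3 with the value of the
extra character, since the shorter string is treated as being padded with
|NUL| bytes.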
@ Finding the differing bit
Once we have the XOR of the first differing byte in |newotherbits| we need to find
the most significant differing bit. We could do this with a simple for loop,
testing bits 7..0, but instead we use the following trick:
We repeatedly fold the upper bits into the lower bits to yield a byte |x| in which
the most significant set bit and every bit below it are set. Then |x & ~(x >> 1)|
yields the most significant bit alone.
Once we have this value, we invert all the bits resulting in a value suitable
for our |otherbits| member.
@<Find differing bit@>=
newotherbits |= newotherbits >> 1;
newotherbits |= newotherbits >> 2;
newotherbits |= newotherbits >> 4;
newotherbits = (newotherbits & ~(newotherbits >> 1)) ^ 255;
uint8 c = p[newbyte];
int newdirection = (1 + (newotherbits | c)) >> 8;
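@ Example: from an XOR to a mask
The folding trick can be checked on its own with a small sketch. The function
name |otherbits_from_xor| is hypothetical; the body repeats the four
statements above on a local variable.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of critbit.w.
   diff is the XOR of two differing bytes and must be non-zero.
   Returns a mask with every bit set except the most significant
   set bit of diff. */
static uint8_t
otherbits_from_xor(uint8_t diff) {
  uint32_t x = diff;
  assert(diff != 0);
  x |= x >> 1;                 /* smear the top set bit downwards ... */
  x |= x >> 2;
  x |= x >> 4;                 /* ... until all lower bits are set */
  x &= ~(x >> 1);              /* keep only the top set bit */
  return (uint8_t) (x ^ 255);  /* invert to obtain the mask */
}
```

For example, an XOR of {\tt 0x07} folds to {\tt 0x07}, is reduced to
{\tt 0x04} and is inverted to the mask {\tt 0xFB}.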
@ Inserting the new node
@<Insert new string@>=
@<Allocate new node structure@>@;
@<Insert new node@>@;
@ Allocating a new node
This is obviously fairly pedestrian code. Again, we use |posix_memalign| to
make sure that our node structures have an alignment of at least two. We store
the new copy of the string into the correct |child| pointer and save the other
for when we have worked out where to insert the new node.
@<Allocate new ...@>=
critbit0_node *newnode;
if (posix_memalign((void **) &newnode, sizeof(void *), sizeof(critbit0_node))) return 0;
char *x;
if (posix_memalign((void **) &x, sizeof(void *), ulen + 1)) {
  free(newnode);
  return 0;
}
memcpy(x, ubytes, ulen + 1);
newnode->byte = newbyte;
newnode->otherbits = newotherbits;
newnode->child[1 - newdirection] = x;
@ Inserting a new node in the tree
Here we must recall that, for a given set of elements, there is a unique
crit-bit tree representing them. This statement needs a little bit of
qualification because it also requires that we define a total ordering of
crit-bits.
Consider the set of bitstrings $\{{\tt 000}, {\tt 001}, {\tt 101}\}$, inserted
into a crit-bit tree in that order. One could imagine the resulting tree
looking like this:
\centerline{
\tikzpicture
\usetikzlibrary{trees}
\colorlet{lightgray}{black!25}
[edge from parent fork down]
\node{root}
child {node [fill=lightgray, rounded corners] {$3^{rd}$}
child {node {{\tt 000}}}
child {node [fill=lightgray, rounded corners] {$1^{st}$}
child {node {{\tt 001}}}
child {node {{\tt 101}}}}};
\endtikzpicture
}
(Where internal nodes are shaded light gray and contain the critical bit,
counting from the left.)
That would be a valid tree for searching as far as our searching algorithm
goes, but it does make a mess of predecessor and successor operations when the
forks might not test the bits in any special order.
So, in short, we need the order of the crit-bits to match the lexicographical
order that we expect the predecessor and successor operations to follow. Thus,
inserting the new node in the tree involves walking the tree from the root to
find the correct position to insert at.
We keep track of the pointer to be updated (to point to the new internal node)
and, once the walk has finished, we can update that pointer.
@<Insert new node@>=
void **wherep = &t->root;
for (;;) {
  uint8 *p = *wherep;
  if (!(1 & (intptr_t) p)) break;
  critbit0_node *q = (void *) (p - 1);
  if (q->byte > newbyte) break;
  if (q->byte == newbyte && q->otherbits > newotherbits) break;
  uint8 c = 0;
  if (q->byte < ulen) c = ubytes[q->byte];
  const int direction = (1 + (q->otherbits | c)) >> 8;
  wherep = q->child + direction;
}
newnode->child[newdirection] = *wherep;
*wherep = (void *) (1 + (char *) newnode);
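@ Example: ordering of critical bits
The two |break| tests in the walk above amount to a comparison between the
critical bit of an existing node and that of the new node. The sketch below
isolates that comparison. The name |node_tests_later_bit| is hypothetical.
Within one byte a more significant critical bit has a smaller mask
({\tt 0x7F} for the top bit, {\tt 0xFE} for the bottom bit), so a larger
|otherbits| value means a later bit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of critbit.w.
   Returns 1 if the existing node (qbyte, qotherbits) tests a bit that
   comes after the new critical bit (newbyte, newotherbits), in which
   case the new node is inserted above it. Returns 0 otherwise. */
static int
node_tests_later_bit(uint32_t qbyte, uint8_t qotherbits,
                     uint32_t newbyte, uint8_t newotherbits) {
  if (qbyte > newbyte) return 1;      /* later byte */
  if (qbyte == newbyte && qotherbits > newotherbits)
    return 1;                         /* same byte, less significant bit */
  return 0;
}
```

The walk keeps descending while this comparison is 0 and the current pointer
is an internal node, which keeps the critical bits sorted from the root down.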
@* Deleting elements.
This function takes a tree, |t|, and a |NUL| terminated string, |u|, and
possibly mutates the tree such that $u \notin t$. It returns 1 if the tree was
mutated, 0 otherwise.
@c
int critbit0_delete(critbit0_tree *t, const char *u) {
  const uint8 *ubytes = (void *) u;
  const size_t ulen = strlen(u);
  uint8 *p = t->root;
  void **wherep = &t->root;
  void **whereq = 0;
  critbit0_node *q = 0;
  int direction = 0;
  @<Deal with deleting from an empty tree@>@;
  @<Walk the tree for the best match@>@;
  @<Check the best match@>@;
  @<Remove the element and/or node@>@;
  return 1;
}
@ Deleting from the empty tree
Since no element is a member of the empty tree, this is a very easy case: we
can just return 0.
@<Deal with deleting from an...@>=
if (!p) return 0;
@ Finding the best candidate to delete
Walking the tree to find the best match for a given element is almost the same
as the two previous versions that we've seen. The only exception is that we
keep track of the last jump to an internal node in |whereq|. Actually, we keep
track of a pointer to the last pointer that got us to an internal node.
To see why, consider the typical case:
\centerline{
\tikzpicture
\usetikzlibrary{trees}
\colorlet{lightgray}{black!25}
[edge from parent fork down]
\node{root}
child {node(parent) [fill=lightgray, rounded corners] {$x$}
child {node {$\ldots$}}
child {node [fill=lightgray, rounded corners] {$y$}
child {node {$\ldots$}}
child {node {1100}}}};
\node[shift=(parent.center),xshift=2cm] (l) {whereq};
\draw[<-] (parent) -- (l);
\endtikzpicture
}
Here we wish to remove {\tt 1100}, however if we left its parent with a single
child pointer, that would make the parent nothing more than a bump in the road - it
should also be removed. Thus we need a pointer to the grandparent in order to
remove both the string and the internal node that pointed to it.
@<Walk the tree for the best match@>=
while (1 & (intptr_t) p) {
  whereq = wherep;
  q = (void *) (p - 1);
  uint8 c = 0;
  if (q->byte < ulen) c = ubytes[q->byte];
  direction = (1 + (q->otherbits | c)) >> 8;
  wherep = q->child + direction;
  p = *wherep;
}
@ Checking that we have the right element
As usual, we have now found the best match, an external node, but we still need
to compare the strings to check that we actually have a match. If we don't,
then the element cannot be in the tree and we can return 0. Otherwise, the
external node is no longer useful and can be freed.
@<Check the best match@>=
if (0 != strcmp(u, (const char *) p)) return 0;
free(p);
@ Removing the node
We now have to deal with two cases. The simple case is as outlined in the
diagram above: we remove the parent node and point the grandparent to the other
child of the parent.
We also have to keep in mind that there might not {\it be} a grandparent node.
This is the case when the tree only has one element. In this case, we remove
that element and set the root pointer to |NULL|.
@<Remove the element...@>=
if (!whereq) {
  t->root = 0;
  return 1;
}
*whereq = q->child[1 - direction];
free(q);
@* Clearing a tree.
Clearing a tree (freeing all members) brings us our first code for walking the
whole tree rather than just tracing a path through it.
So, the |critbit0_clear| function takes a tree, |t|, and frees every member of
it, mutating the tree such that it is empty on exit.
@c
static void
traverse(void *top) {
  @<Recursively free current node@>@;
}
void critbit0_clear(critbit0_tree *t)
{
  if (t->root) traverse(t->root);
  t->root = NULL;
}
@ Recursively clearing the tree
Each pointer in the tree has to be tested to see if it's a pointer to an
internal node (a |critbit0_node|) or to a malloced string. If it's a node, we
need to recursively free its children.
@<Recursively free...@>=
uint8 *p = top;
if (1 & (intptr_t) p) {
  critbit0_node *q = (void *) (p - 1);
  traverse(q->child[0]);
  traverse(q->child[1]);
  free(q);
} else {
  free(p);
}
@* Fetching elements with a given prefix.
One of the operations which crit-bit trees can perform efficiently that hash
tables cannot is the extraction of the subset of elements with a given prefix.
The following function takes a tree, |t|, and a |NUL| terminated string,
|prefix|. Let $S \subseteq t$ where $x \in S$ iff |prefix| is a prefix of |x|,
then $\forall x : S.$ |handle| is called with arguments |x| and |arg|. It
returns:
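$\cases{ 0 &if {\it handle} {\rm returned 0} \cr
1 &if {\rm successful} \cr
{\rm other} &if {\it handle} {\rm returned a value} $\notin [0,1]$}$
(Note that, if |handle| returns 0, the iteration is aborted. As the code below
is written, a value outside $[0,1]$ also aborts the iteration: it is returned
unchanged when the matching subtree is a single external node and is reported
as $-1$ otherwise.)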
@c
static int
allprefixed_traverse(uint8 *top,
                     int (*handle) (const char *, void *), void *arg) {
  @<Deal with an internal node@>@;
  @<Deal with an external node@>@;
}
int
critbit0_allprefixed(critbit0_tree *t, const char *prefix,
                     int (*handle) (const char *, void *), void *arg) {
  const uint8 *ubytes = (void *) prefix;
  const size_t ulen = strlen(prefix);
  uint8 *p = t->root;
  uint8 *top = p;
  if (!p) return 1; /* S = $\emptyset$ */
  @<Walk tree, maintaining top pointer@>@;
  @<Check prefix@>@;
  return allprefixed_traverse(top, handle, arg);
}
@ Maintaining the |top| pointer
The |top| pointer points to the internal node at the top of the subtree which
contains exactly the subset of elements matching the given prefix. Since our
critbit values are sorted as we descend the tree, this subtree exists (if the
subset is non-empty) and can be detected by checking for the critbit advancing
beyond the length of the prefix.
@<Walk tree, maint...@>=
while (1 & (intptr_t) p) {
  critbit0_node *q = (void *) (p - 1);
  uint8 c = 0;
  if (q->byte < ulen) c = ubytes[q->byte];
  const int direction = (1 + (q->otherbits | c)) >> 8;
  p = q->child[direction];
  if (q->byte < ulen) top = p;
}
@ Checking that the prefix exists
As with our other functions, it's possible that the given prefix doesn't
actually exist in the tree at this point. We need to check the actual contents
of the external node that we have arrived at.
@<Check prefix@>=
for (size_t i = 0; i < ulen; ++i) {
  if (p[i] != ubytes[i]) return 1;
}
@ Dealing with an internal node while recursing
The |allprefixed_traverse| function is called with the root of a subtree as the
|top| argument. We need to test the LSB of this pointer to see if it's an
internal node. If so, we recursively walk down the subtree and return. Otherwise
we fall through into the code from the section below for handling an external
node.
@<Deal with an internal node@>=
if (1 & (intptr_t) top) {
  critbit0_node *q = (void *) (top - 1);
  for (int direction = 0; direction < 2; ++direction)
    switch(allprefixed_traverse(q->child[direction], handle, arg)) {
      case 1: break;
      case 0: return 0;
      default: return -1;
    }
  return 1;
}
@ Dealing with an external node while recursing
An external node is a malloced string that matches the given prefix. Thus we
call the callback and we're done.
@<Deal with an external node@>=
return handle((const char *) top, arg);
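@ Example: a callback for prefix iteration
To show the shape of a |handle| callback, here is a minimal sketch that
collects matching elements into a fixed-size list. The names |match_list|,
|MAX_MATCHES| and |collect_match| are hypothetical and are not part of the
crit-bit code. The callback returns 1 to continue and 0 to abort the
iteration once the list is full, following the return-value convention
described for |critbit0_allprefixed|.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_MATCHES 4   /* hypothetical capacity chosen for this sketch */

/* Hypothetical accumulator, not part of critbit.w. */
struct match_list {
  const char *items[MAX_MATCHES];
  size_t count;
};

/* Callback with the signature expected for the handle argument.
   Stores the element and returns 1 to continue, or returns 0 to
   abort the iteration when there is no room left. */
static int
collect_match(const char *elem, void *arg) {
  struct match_list *list = arg;
  assert(list != NULL);
  if (list->count == MAX_MATCHES) return 0;
  list->items[list->count++] = elem;
  return 1;
}
```

A call might then look like
|critbit0_allprefixed(&tree, "ab", collect_match, &list)|, with |list.count|
set to zero beforehand. The stored pointers refer to the strings held by the
tree itself, so they should not be used after those elements are deleted or
the tree is cleared.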