Unicode strings with zero-byte values inside the key cannot be inserted #6

kellydunn · 2015-09-18T06:40:56Z

As discovered by @nick-codes, it looks like inserting a unicode key into a new ArtTree panics with an index out of range error.

Here's some code to reproduce the error:

// Tests a Unicode insertion.
func TestArtTreeInsertWithInternalZeroByte(t *testing.T) {
    tree := NewArtTree()

    // ‘a' followed by unicode accent character.
    accent := []byte{0x61, 0x00, 0x60}
    tree.Insert([]byte("a"), "a")
    tree.Insert(accent, string(accent))

    res := tree.Search(accent)
    if res == nil {
        t.Errorf("Could not find Leaf Node with expected key: '%s'", accent)
    } else {
        if !bytes.Equal(res.([]byte), accent) {
            t.Error("Unexpected search result.")
        }
    }
}

Acceptance critieria:

A user can insert a unicode string with an empty byte value inside of it.

cvermilion · 2017-02-21T15:25:16Z

Just ran into this one in my project. Unfortunately I do not see an easy solution.

The issue isn't zero bytes precisely, but the fact that the library uses a null terminator to ensure that no key is a full prefix of another key. That's the case that needs to be avoided, and what this test produces. If the null character is allowed in the input, I don't see a simple way to ensure that no key is a prefix of another one, though perhaps an adaptive approach could be used.

Another approach (I think I've seen this in other radix tree implementations) would be for internal nodes to have a special leaf node pointer, and use child node pointers exclusively for pointing to other inner nodes. This would require tweaking the algorithm somewhat, and I haven't thought through the performance implications.

For UTF-8 encoded strings, this shouldn't come up unless you allow null characters in the strings. For UTF-16 or UTF-32, though, you can have zero bytes easily. For a particular encoding, the problem can be solved by using a null terminator appropriate to the encoding (eg, "\x00\x00" for UTF-16), but there isn't a general solution.

I'll probably just add null-char checking on my end for now, but I wanted to leave my thoughts here in case anyone else wanted to tackle this (perhaps me in the future).

cvermilion · 2017-02-21T15:26:39Z

(Apparently it's also an undocumented limitation of the C version that you can't have full-prefix keys, and null-terminators are similarly advised there.)

kellydunn added the bug label Sep 18, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unicode strings with zero-byte values inside the key cannot be inserted #6

Unicode strings with zero-byte values inside the key cannot be inserted #6

kellydunn commented Sep 18, 2015

cvermilion commented Feb 21, 2017

cvermilion commented Feb 21, 2017

Unicode strings with zero-byte values inside the key cannot be inserted #6

Unicode strings with zero-byte values inside the key cannot be inserted #6

Comments

kellydunn commented Sep 18, 2015

cvermilion commented Feb 21, 2017

cvermilion commented Feb 21, 2017