Making Unicode things fast in Go
By “Unicode things” I mean programs that process UTF-8. Some examples (shameless plugs)
include iterating over words & graphemes, or determining the display width of strings.
The basic logic of such programs is to:
- Decode the runes (codepoints) from a source UTF-8 string
- Look up their categories
- Do something with that information
It looks something like this:
for _, r := range myString { // "DecodeRune" is called under the hood
	if unicode.Is(category1, r) || unicode.Is(category3, r) { // each call binary-searches a range table
		// do something based on category
	}
}
It’s all fine. But! Runes are a deserialization we only use temporarily, and every category test is a separate lookup. Can we do better?
Yes, and the answer is a prefix trie, a type of KV (key-value) data structure. The K is a slice of UTF-8 bytes, and the V is a bitmap of Unicode categories. It looks like this:
type category uint8

const (
	category1 category = 1 << iota
	category2
	category3
	category4
)

func (this category) is(that category) bool {
	return (this & that) != 0
}
position := 0
for position < len(myString) {
	cat, size := trie.lookup(myString[position:]) // no decoding
	if cat.is(category1 | category3) {            // bitwise "is"
		// do something based on category
	}
	if size == 0 {
		break // e.g. an incomplete encoding at the end of input
	}
	position += size // advance by the bytes the lookup consumed
}
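
To make the trie itself concrete, here is a minimal hand-rolled sketch of the idea. This is illustrative only, not the generated code: node and its fields are my names, and the real, generated trie uses compact arrays rather than a pointer-chasing structure like this.

// Sketch only: each node maps one byte of the key to the next node.
type node struct {
	children map[byte]*node
	cat      category // set on nodes where a complete UTF-8 encoding ends
	leaf     bool     // true if a key ends at this node
}

// lookup walks the trie byte-by-byte from the start of s and returns the
// category bitmap of the longest key it matches, plus that key's length
// in bytes. No runes are decoded along the way.
func (n *node) lookup(s []byte) (cat category, size int) {
	for i := 0; i < len(s); i++ {
		n = n.children[s[i]]
		if n == nil {
			break
		}
		if n.leaf {
			cat, size = n.cat, i+1
		}
	}
	return cat, size
}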
This turns out to be 30%-50% faster in my experience. No allocations, no runes, one lookup retrieves all categories. Here are some comparative benchmarks.
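
If you want to reproduce that kind of comparison, a harness along these lines should do it. The sample text and the category choices are placeholders (unicode.L and unicode.N stand in for whatever your program tests), and trie is assumed to be the lookup from the previous section.

var sampleStr = "some representative mixed-script text" // placeholder
var sampleBytes = []byte(sampleStr)

func BenchmarkRangeTable(b *testing.B) {
	for i := 0; i < b.N; i++ {
		for _, r := range sampleStr {
			if unicode.Is(unicode.L, r) || unicode.Is(unicode.N, r) {
				// do something based on category
			}
		}
	}
}

func BenchmarkTrie(b *testing.B) {
	for i := 0; i < b.N; i++ {
		position := 0
		for position < len(sampleBytes) {
			cat, size := trie.lookup(sampleBytes[position:])
			if cat.is(category1 | category3) {
				// do something based on category
			}
			if size == 0 {
				break
			}
			position += size
		}
	}
}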
The trie is pretty compact, since each unique byte is only stored once, and multiple categories can be bitwise |’d into a single integer.
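
Note that is, as defined above, really means “is any of”: it reports whether the two bitmaps share at least one bit. That’s what lets the loop test two categories in a single branch:

v := category1 | category3 // one byte holds both memberships

fmt.Println(v.is(category1))             // true
fmt.Println(v.is(category2))             // false
fmt.Println(v.is(category2 | category3)) // true: one bit in common is enough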
We have to code-gen the trie! Happily, there is a package deep in the x/text module, called triegen. It’s made for exactly this, and is used in x/text, so of course my idea is not original.
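
In outline, generation looks something like the following. This is a sketch, not my actual generator: the category assignments are placeholders, and note that triegen lives under an internal path (golang.org/x/text/internal/triegen), so using it outside x/text generally means copying it into your own repo.

package main

import (
	"os"
	"unicode"

	"golang.org/x/text/internal/triegen" // internal package: vendor a copy in practice
)

func main() {
	t := triegen.NewTrie("categories")

	// Visit every codepoint and insert its category bitmap. unicode.L and
	// unicode.N are placeholders for the categories your program cares about.
	for r := rune(0); r <= unicode.MaxRune; r++ {
		var cat uint64
		if unicode.Is(unicode.L, r) {
			cat |= 1 // category1
		}
		if unicode.Is(unicode.N, r) {
			cat |= 4 // category3
		}
		if cat != 0 {
			t.Insert(r, cat)
		}
	}

	// Writes Go source for the trie tables, plus lookup methods that return
	// the stored value and the size in bytes of the encoding they matched.
	if _, err := t.Gen(os.Stdout); err != nil {
		panic(err)
	}
}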
Here is how I use it.