Making Unicode things fast in Go
By “Unicode things” I mean programs that process UTF-8. Some examples (shameless plugs)
include iterating over words & graphemes, or determining the display width of strings.
The basic logic of such programs is to:
- Decode the runes (codepoints) from a source UTF-8 string
- Look up their categories
- Do something with that information
It looks something like this:
for _, r := range myString { // "DecodeRune" is called under the hood
	if unicode.Is(category1, r) || unicode.Is(category3, r) { // each call binary-searches a range table
		// do something based on category
	}
}
It’s all fine. But! Runes are a deserialization we only use temporarily, and every category test is a separate lookup. Can we do better?
Yes, and the answer is a prefix trie, a type of KV (key-value) data structure. The K is a slice of UTF-8 bytes, and the V is a bitmap of Unicode categories. It looks like this:
type category uint8

const (
	category1 category = 1 << iota
	category2
	category3
	category4
)

func (this category) is(that category) bool {
	return (this & that) != 0
}
position := 0
for position < len(myString) {
	cat, size := trie.lookup(myString[position:]) // no decoding
	if cat.is(category1 | category3) {            // bitwise "is"
		// do something based on category
	}
	if size == 0 {
		break // e.g. an incomplete encoding at the end of input
	}
	position += size // advance by the bytes the lookup consumed
}
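
To make the trie itself concrete, here is a minimal hand-rolled sketch of the idea. This is illustrative only, not the generated code: node and its fields are my names, and the real, generated trie uses compact arrays rather than a pointer-chasing structure like this.

// Sketch only: each node maps one byte of the key to the next node.
type node struct {
	children map[byte]*node
	cat      category // set on nodes where a complete UTF-8 encoding ends
	leaf     bool     // true if a key ends at this node
}

// lookup walks the trie byte-by-byte from the start of s and returns the
// category bitmap of the longest key it matches, plus that key's length
// in bytes. No runes are decoded along the way.
func (n *node) lookup(s []byte) (cat category, size int) {
	for i := 0; i < len(s); i++ {
		n = n.children[s[i]]
		if n == nil {
			break
		}
		if n.leaf {
			cat, size = n.cat, i+1
		}
	}
	return cat, size
}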
This turns out to be 30%-50% faster in my experience. No allocations, no runes, one lookup retrieves all categories. Here are some comparative benchmarks.
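
If you want to reproduce that kind of comparison, a harness along these lines should do it. The sample text and the category choices are placeholders (unicode.L and unicode.N stand in for whatever your program tests), and trie is assumed to be the lookup from the previous section.

var sampleStr = "some representative mixed-script text" // placeholder
var sampleBytes = []byte(sampleStr)

func BenchmarkRangeTable(b *testing.B) {
	for i := 0; i < b.N; i++ {
		for _, r := range sampleStr {
			if unicode.Is(unicode.L, r) || unicode.Is(unicode.N, r) {
				// do something based on category
			}
		}
	}
}

func BenchmarkTrie(b *testing.B) {
	for i := 0; i < b.N; i++ {
		position := 0
		for position < len(sampleBytes) {
			cat, size := trie.lookup(sampleBytes[position:])
			if cat.is(category1 | category3) {
				// do something based on category
			}
			if size == 0 {
				break
			}
			position += size
		}
	}
}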
The trie is pretty compact, since each unique byte is only stored once, and multiple categories can be bitwise |’d into a single integer.
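
Note that is, as defined above, really means “is any of”: it reports whether the two bitmaps share at least one bit. That’s what lets the loop test two categories in a single branch:

v := category1 | category3 // one byte holds both memberships

fmt.Println(v.is(category1))             // true
fmt.Println(v.is(category2))             // false
fmt.Println(v.is(category2 | category3)) // true: one bit in common is enough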
We have to code-gen the trie! Happily, there is a package deep in the x/text module, called triegen. It’s made for exactly this, and is used in x/text, so of course my idea is not original.
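
In outline, generation looks something like the following. This is a sketch, not my actual generator: the category assignments are placeholders, and note that triegen lives under an internal path (golang.org/x/text/internal/triegen), so using it outside x/text generally means copying it into your own repo.

package main

import (
	"os"
	"unicode"

	"golang.org/x/text/internal/triegen" // internal package: vendor a copy in practice
)

func main() {
	t := triegen.NewTrie("categories")

	// Visit every codepoint and insert its category bitmap. unicode.L and
	// unicode.N are placeholders for the categories your program cares about.
	for r := rune(0); r <= unicode.MaxRune; r++ {
		var cat uint64
		if unicode.Is(unicode.L, r) {
			cat |= 1 // category1
		}
		if unicode.Is(unicode.N, r) {
			cat |= 4 // category3
		}
		if cat != 0 {
			t.Insert(r, cat)
		}
	}

	// Writes Go source for the trie tables, plus lookup methods that return
	// the stored value and the size in bytes of the encoding they matched.
	if _, err := t.Gen(os.Stdout); err != nil {
		panic(err)
	}
}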
Here is how I use it.