Design Docs‎ > ‎Data Structures‎ > ‎

ICU Code Point Tries

Fast lookup in arrays

For fast lookup by code point, we store data in arrays. It costs too much space to use a single array indexed directly by code points: There are about 1.1M of them (max 0x10ffff, about 20.1 bits), and about 90% are unassigned or private use code points. For some uses, there are non-default values only for a few hundred characters.

We use a form of "trie" adapted to single code points. The bits in the code point integer are divided into two or more parts. The first part is used as an array offset, the value there is used as a start offset into another array. The next code point bit field is used as an additional offset into that array, to fetch another value. The final part yields the data for the code point. Non-final arrays are called index arrays or tables.

See also ICU String Tries.

For lookup of arbitrary code points, we need at least three successive arrays, so that the first index table is not too large.

For all but the first index table, different blocks of code points with the same values can overlap. A special block contains only default values and is shared among all blocks of code points that map there.

Block sharing works better, and thus leads to smaller data structures, the smaller the blocks are, that is, the fewer bits in the code point bit fields used as intra-block offsets.

On the other hand, shorter bit fields require more bit fields and more successive arrays and lookups, which adds code size and makes lookups slower.

(Until about 2001, all ICU data structures only handled BMP code points. "Compact arrays" split 16-bit code points into fields of 9 and 7 bits.)

We tend to make compromises including additional index tables for smaller parts of the Unicode code space, for simpler, faster lookup there.

For a general-purpose structure, we want to be able to be able to store a unique value for every character. This determines the number of bits needed in the last index table. With 136,690 characters assigned in Unicode 10, we need at least 18 bits. We allocate data values in blocks aligned at multiples of 4, and we use 16-bit index words shifted left by 2 bits. This leads to a small loss in how densely the data table can be used, and how well it can be compacted, but not nearly as much as if we were using 32-bit index words.

Character conversion

The ICU conversion code uses several variants of code point tries with data values of 1, 2, 3, or 4 bytes corresponding to the number of bytes in the output encoding.

UTrie

The original "UTrie" structure was developed for Unicode Normalization for all of Unicode. It was then generalized for collation, character properties, and eventually almost every Unicode data lookup. Values are 16 or 32 bits wide.

It was designed for fast UTF-16 lookup with a special, complicated structure for supplementary code points using custom values for lead surrogate units. This custom data and code made this structure relatively hard to use.

11:5 bits for the BMP and effectively 5:5:5:5 bits for supplementary code points provide for good compaction. The BMP index table is always 211 uint16_t = 4kB. Small index blocks for the supplementary range are added as needed.

The structure stores different values for lead surrogate code units (for fast moving through UTF-16_ vs. code points (for lookup by code point).

The first 256 data table entries are a fixed-size, linear table for Latin-1 (up to U+00FF).

UTrie2

The "UTrie2" structure, developed in 2008, was designed to enable fast lookup from UTF-8 without always having to assemble whole code points and to split them again into the trie bit fields.

It retains separate lookups for lead surrogate code units vs. code points.

It retains the same 11:5 lookup for BMP code points, for good compaction and good performance.

There is a special small index for lead bytes of two-byte UTF-8 sequences (up to U+07FF), for 5:6 lookup. These index values are not shifted left by 2.

Lookup for three-byte UTF-8 uses the BMP index, which is clumsy.

Lookup for supplementary code points is much simpler than with UTrie, without custom data values or code. Two index tables are used for 9:6:5 code point bits. The first index table omits the BMP part. The structure stores the a code point after which every one maps to the default value, and the first index is truncated to below that.

With the fixed BMP index table and other required structures, an empty UTrie2 is about 5kB large.

The UTF-8 lookup was also designed for the original handling of ill-formed UTF-8: The first 192 data table entries are a linear table for ASCII plus the 64 trail bytes, to look up "single" bytes 0..BF without further checking, with error values for the trail bytes. Lookup of two-byte non-shortest forms (C0 80..C1 BF) also yields error values. These error values became unused in 2017 when ICU 60 changed to handling ill-formed UTF-8 compatible with the W3C Encoding standard (substituting maximal subparts of valid sequences). C0 and C1 are no longer recognized as lead bytes, requiring full byte sequence validation separate from the data lookup.

Ideas

Possible goals: Simpler code, smaller data especially for sparse tries, maybe faster UTF-8, not much slower UTF-16.

We should try to store only one set of values for surrogates. Unicode property APIs use only by-code point lookup without special lead surrogate values. Collation uses special lead surrogate data but does not use code point lookup. Normalization does both, but the per-code point lookup could test for surrogate code points first and return trivial values for all of them. UTF-16 string lookup should map unpaired surrogates to the error value.

We should remove the special data for old handling of ill-formed UTF-8, the error values for trail bytes and two-byte non-shortest forms.

If we use 6 bits for the last code point bit field, then we can use the same index table for code point/UTF-16 lookup as well as UTF-8 lookup. Compaction will be less effective, so data will grow some. This would be somewhat compensated by the smaller BMP index table.

If we also continue to use 6 bits for the second-to-last table, that is, 8:6:6 bits, then we can simplify the code for three- and four-byte UTF-8.

If we always include the BMP in the first index table, then we can also simplify enumeration code a bit, and use smaller code for code point lookups where code size is more important than maximum speed.

Alternatively, we could improve compaction and speed for the BMP by using no index shift-left for BMP indexes (and keep omitting the BMP part of the first index table). In order to ensure that BMP data can be indexed directly with 16-bit index values, the builder would probably have to copy at least the BMP data into a new array for compaction, before adding data for supplementary code points. When some of the indexes are not shifted, and their data is compacted to arbitrary offsets, then that data cannot also be addressed with uniform double-index lookup. We may or may not store unused first-index entries. If not the whole BMP is indexed differently, then UTF-16 and three-byte UTF-8 lookups need another code branch. (Size vs. simplicity & speed.)

The more tries we use, the higher the total cost of the size overhead. (For example, many of the 100 or so collation tailorings carry a UTrie2.) The less overhead, the more we could use separate tries where we currently combine them or avoid them. Smaller overhead would make it more attractive to offer a public code point map structure.

Going to 10:6 bits for the BMP cuts the fixed-size index in half, to 2kB.

We could reduce the fixed-size index table much further by using two-index lookup for some or most of the BMP, trading off data size for speed and simplicity. The index must be at least 32 uint16_t's for two-byte UTF-8, for up to U+07FF including Cyrillic and Arabic. We could experiment with length 64 for U+0FFF including Indic scripts and Thai, 208 entries for U+33FF (most small scripts and Kana), or back to 1024 entries for the whole BMP. We could configure a different value at build time for different services (optimizing for speed vs. size). If we use the faster lookup for three-byte UTF-8, then the boundaries should be multiples of 0x1000 (up to U+3FFF instead of U+33FF).

Comments