UCA Weight Allocation

We have been reviewing the primary weight allocation that we use for the UCA. The review was prompted by two issues: (1) we need to do some work for Script Reordering anyway, and (2) the allocation of weights has not kept up with the growth of Unicode, which degrades performance and increases sortkey size.

Background

The particular problems are:

See also:

Here is a rough plan for what to do.

Primary Weights

Emit data for first bytes that shouldn’t be compressed. That is, instead of the primary compression ranges being hard-coded, they will be read out of data incorporated into the Fractional UCA table.
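As a minimal sketch of what "read out of data" could look like, the following Python assumes a made-up line format, `[compressible XX]`, where XX is a lead byte in hex. The line format and the function names are hypothetical and are not taken from the actual Fractional UCA table; the point is only that the set of compressible lead bytes comes from parsed data rather than from constants in the code.

```python
import re

# Hypothetical line format, e.g. "[compressible 5A]". Not the real table syntax.
_COMPRESSIBLE_LINE = re.compile(r"^\[compressible\s+([0-9A-Fa-f]{2})\s*\]$")


def read_compressible_lead_bytes(lines):
    """Collect the lead bytes that the data marks as compressible."""
    lead_bytes = set()
    for line in lines:
        match = _COMPRESSIBLE_LINE.match(line.strip())
        if match:
            lead_bytes.add(int(match.group(1), 16))
    return lead_bytes


def is_compressible(primary, compressible):
    """True if the primary weight (a bytes object) starts with a compressible lead byte."""
    return primary[0] in compressible


sample = [
    "# comment lines and other data are ignored",
    "[compressible 5A]",
    "[compressible 5B]",
]
compressible = read_compressible_lead_bytes(sample)
```

With this in place, `is_compressible(bytes([0x5A, 0x10]), compressible)` is true and a weight with lead byte 0x60 is not compressed, without either value appearing in the code.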

For script reordering and tailoring to work together, when a character is tailored before the first character in a script, or after the last character in a script, the first byte for that character's CE must still be in the script. 
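The constraint can be illustrated with a small Python sketch. It assumes that primary weights are byte strings, that a script owns a contiguous range of lead bytes, and that trail bytes have a minimum usable value (MIN_TRAIL, an assumed number). The function names are hypothetical. The sketch only shows why the first character of a script must not sit at the very start of its lead byte: otherwise there is no room to tailor before it without leaving the script.

```python
MIN_TRAIL = 0x03  # assumed smallest usable trail byte; illustrative only


def stays_in_script(primary, lead_range):
    """True if the weight's lead byte belongs to the script's lead-byte range."""
    return primary[0] in lead_range


def weight_before(primary):
    """Return a weight that sorts before `primary` and keeps its lead byte.

    Only trail bytes are decremented; the lead byte is never changed.
    Returns None when every trail byte is already at MIN_TRAIL, i.e. there
    is no room left before this weight under the same lead byte.
    """
    for i in range(len(primary) - 1, 0, -1):
        if primary[i] > MIN_TRAIL:
            return primary[:i] + bytes([primary[i] - 1])
    return None


script_leads = range(0x5A, 0x5C)            # hypothetical script using lead bytes 5A..5B
first_in_script = bytes([0x5A, 0x04])       # allocated with a gap before it
tailored = weight_before(first_in_script)   # bytes([0x5A, 0x03])
```

Here `tailored` sorts before `first_in_script` and still passes `stays_in_script`. Had the first character been allocated at `bytes([0x5A, 0x03])`, `weight_before` would return None, which is the situation the allocation needs to avoid.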

Variables

Non-Variables

Assign first bytes as follows:

This is mostly for script reordering - it might cost a byte or two.

For the remaining scripts (see Script-Reordering-Chart)

* Any sequence of characters normally has a gap between each pair, in 2-byte space. So if we have <C U U C>, that turns into <C g U g U g C> (where C = common, U = uncommon, g = gap). When the uncommon characters are turned into 3-byters, we don't need a 2-byte gap around each of them. So we will get <C g UU C> instead. There will be a 3-byte gap between the U's, and between the last U and the following C. This assumes that we rarely insert characters before others, because such an inserted character would turn into a 3-byter.
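The layout described in this bullet can be sketched in Python as follows. All of the numeric values (lead byte, starting second byte, gap sizes, third-byte step) are assumptions chosen for illustration, and the function name is hypothetical. Weights are modelled as tuples of byte values, which compare the same way as the byte strings would. The sketch does not handle running out of second-byte values.

```python
def layout(kinds, lead=0x5A, first2=0x04, gap2=1, first3=0x10, step3=0x10):
    """Assign weights to a sequence of 'C' (common) and 'U' (uncommon) characters.

    Common characters get 2-byte weights. A run of uncommon characters shares
    one 2-byte prefix and is told apart by a third byte, so no 2-byte gap is
    spent inside the run or between the run and the next common character.
    """
    weights = []
    b2 = first2
    b3 = first3
    prev = None
    for kind in kinds:
        if kind == "C":
            if prev == "C":
                b2 += gap2 + 1      # 2-byte gap between two common characters
            elif prev == "U":
                b2 += 1             # no 2-byte gap; unused third bytes act as the gap
            weights.append((lead, b2))
        else:
            if prev != "U":
                if prev is not None:
                    b2 += gap2 + 1  # 2-byte gap before the uncommon run starts
                b3 = first3
            else:
                b3 += step3         # 3-byte gap between uncommon characters
                if b3 > 0xFF:       # third byte exhausted: move to the next prefix
                    b2 += 1
                    b3 = first3
            weights.append((lead, b2, b3))
        prev = kind
    return weights


example = layout("CUUC")
```

For `"CUUC"` this yields `(0x5A, 0x04)`, `(0x5A, 0x06, 0x10)`, `(0x5A, 0x06, 0x20)`, `(0x5A, 0x07)`: one 2-byte gap after the first C, the two U's packed under a single second byte, and the last C taking the very next second byte.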

Implicits & Specials

Break

Trail range

Specials

Details (from Markus)

ICU collation uses the last 32 lead bytes as fixed values:

We could easily squeeze this down to fewer lead bytes. For example, from 32 to 8:

The following specials are currently defined but are unused and need not be encoded at all: CJK_IMPLICIT_TAG, CHARSET_TAG, THAI_TAG.

For the LEAD_SURROGATE_TAG we currently need 10 bits of data, but if we change to using UTrie2 at the same time [or earlier], then we don't need data for that any more.

Generation

The generator for FractionalUCA (WriteCollationData) currently has a dumb algorithm for allocation. That is, given 20 characters in a script, it just increments the weights by a fixed amount, leaving a big gap at the end. If we wanted to, we could change the algorithm to spread the gap more evenly. Probably a low priority.
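To make the difference concrete, here is a minimal Python sketch of the two strategies. It is not the code in WriteCollationData; the function names, the integer weight range, and the step size are assumptions for illustration. Weights are plain integers in an inclusive range [lo, hi].

```python
def allocate_fixed(count, lo, hi, step):
    """Fixed-increment allocation: all leftover space ends up after the last weight."""
    weights = [lo + i * step for i in range(count)]
    if weights and weights[-1] > hi:
        raise ValueError("range too small for this step")
    return weights


def allocate_even(count, lo, hi):
    """Spread the free space evenly before, between, and after the weights."""
    free = (hi - lo + 1) - count
    if free < 0:
        raise ValueError("range too small for this many weights")
    base, extra = divmod(free, count + 1)   # count + 1 gaps in total
    weights = []
    pos = lo
    for i in range(count):
        pos += base + (1 if i < extra else 0)  # the first `extra` gaps get one more
        weights.append(pos)
        pos += 1
    return weights


fixed = allocate_fixed(20, 0, 255, 4)
even = allocate_even(20, 0, 255)
```

With 20 characters in a 256-value range, the fixed scheme with a step of 4 ends at 76 and leaves 179 unused values after the last character, while the even scheme leaves 11 or 12 unused values in every gap, including the ones before the first and after the last character.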

Testing

We need to test the tailoring of characters before the first of each reorder-type (script, Nd, IMPLICIT, TRAIL-WEIGHT,...) and after the last of each reorder-type, to make sure that they stay in the same reorder-type.