Special Byte Values

See UCA Weight Allocation

Byte values used in Collation Elements (CEs) and sort keys: (See the bottom of the spreadsheet for version information.)

Notes:

  • This page, with its tables and notes, is descriptive for certain versions of ICU. Most of the details here are subject to change. See the User Guide section on “Sort Key Features”.
  • All-zero leading weights are ignored. There cannot be all-zero trailing weights, except for a completely-ignorable character where all weights are zero. (UCA WF1)
  • Within a non-zero weight, trailing bytes are zero when the weight is short. There cannot be leading zero bytes.
  • In a sort key, the last byte (the terminator), and only that byte, must be 00, so that strcmp() can be used.
  • In a sort key, 01 must only be used as a level separator, so that ucol_mergeSortkeys() works. Therefore, 01 cannot be used in any weight bytes. (May change in the future.)
  • ICU 52 and earlier have a fixed range of lead bytes for implicit primary weights (Han & unassigned).
  • ICU 53 and later compute a Han lead byte range as needed for the number of Unified_Ideograph characters. One lead byte is used for unassigned-implicit primaries.
  • Tertiary weights are stored in 6 bits per byte. With default options, sort key generation moves CE bytes 06..3F to C6..FF to make room for compression values 06..C5. The case-handling options modify the exact range of byte values.
  • The quaternary level contains shifted primary weights (if alternate=shifted) and quaternary weights, usually for the Japanese tailoring to sort Hiragana then Katakana.
    • ICU 52 and earlier generated the Hiragana/Katakana distinction purely at runtime.
    • ICU 53 and later use explicit "<<<<" tailoring which modifies 2 quaternary CE bits. These map to the "common" compression range 1C..FC and the byte values FD..FF.
    • For shifted primaries ≥1B, ICU 53 inserts a sort key lead byte 1B before appending the shifted primary weight. (variableTop/maxVariable should be low enough so that this cannot occur.)
  • A final, "identical" sort key level can be generated at runtime (as a tie-breaker). The "BOCSU" encoding uses the byte values 03..FF.
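
The 00 terminator and 01 level-separator rules, and the default tertiary byte remapping, can be sketched as below. This is an illustrative sketch only, not ICU code: the helper names split_levels and remap_tertiary_byte are hypothetical, the example sort key bytes are made up, and the remapping assumes default options (no case-handling options), where 06..3F moves to C6..FF, that is, an offset of C0.

```python
# Illustrative sketch only; these helper names are hypothetical, not ICU API.

TERMINATOR = 0x00       # allowed only as the final sort key byte
LEVEL_SEPARATOR = 0x01  # allowed only between levels, never inside a weight


def split_levels(sort_key):
    """Check the 00/01 rules from the notes and return the bytes of each level."""
    if not sort_key or sort_key[-1] != TERMINATOR:
        raise ValueError("sort key must end with the 00 terminator")
    body = sort_key[:-1]
    if TERMINATOR in body:
        raise ValueError("00 may appear only as the last byte")
    # Because 01 never occurs inside a weight, splitting on it is unambiguous.
    return body.split(bytes([LEVEL_SEPARATOR]))


def remap_tertiary_byte(ce_byte):
    """Move a tertiary CE byte from 06..3F to C6..FF (default options assumed)."""
    if not 0x06 <= ce_byte <= 0x3F:
        raise ValueError("expected a tertiary CE byte in 06..3F")
    return ce_byte + 0xC0


# Made-up sort key: two primary bytes, one secondary byte, one tertiary byte.
example_key = bytes([0x2A, 0x2C, 0x01, 0x86, 0x01, 0xC7, 0x00])
levels = split_levels(example_key)
```

Because no byte before the terminator is 00, two such keys can be compared with a plain byte-wise comparison, which is what the strcmp() note relies on. Splitting on 01 recovers the levels one by one, which is the property a level-wise merge such as ucol_mergeSortkeys() depends on.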


