GoalsICU 4.2 offers standard Unicode normalization (all standard forms) for the latest Unicode version, as well as for Unicode 3.2 (needed for IDNA [IDNA2003]). Sometimes, normalization is part of a specification that also includes a mapping step and a validation step. For example, NFKC may be combined with case folding, and IDNA/NamePrep/StringPrep combines a custom mapping table with Unicode 3.2 NFKC and simple (forbidden code points) as well as context-sensitive validation. IDNA 2008 and the related UTS #46 add further mapping+normalization combinations in common use. MacOS X uses custom normalization in file names, and other systems need to be able to match that for compatibility (e.g., for file synchronization [rsync]). In the past, we have done some of this by
We want to be able to combine a custom mapping with Unicode normalization into one API plus a loadable data file, for better combined performance and for providing a build tool for such data. To minimize additional code and to get good test coverage, standard normalization code should be rerouted to the new code. We would like
The new data file might also carry optional UnicodeSets, for example:
StatusThis has been implemented in ICU 4.4, with the Normalizer2 API (in Java, C++ and C) replacing almost all of the old API. See the User Guide Normalization chapter for the new documentation. This design doc remains, with motivations, decisions and details of the new data structures not documented elsewhere. ICU 49 modifies the file format slightly (formatVersion 2) in support of getRawDecomposition(c) and composePair(a, b) (ticket #8804), and adding a small amount of data for FCD (ticket #8942). ICU 60 modifies the file format (formatVersion 3) to put more information directly into the per-character "norm16" trie lookup value. NFC/NFKC/... boundary detection is much simpler, and many mappings are possible without entering the slow path. The current implementation does not support storing UnicodeSet data in normalization data files. New APIclass Normalizer2: unmodifiable/immutable instances with all-virtual public methods See the Normalizer2 API (in Java, C++ and C) implemented in ICU 4.4 and above. Not currently implemented is a getUnicodeVersion(UVersionInfo v) const; function. The data is available, but there was no obvious use case since the normalizer's Unicode version is normally the same as the character property Unicode version. (This might not be true for custom data.) Filtered NormalizationWe need to support Unicode 3.2 normalization, and it may be useful to restrict normalization to other sets of characters. While building this into the core normalization code (as in ICU 4.2 and below) is possible and allows single-pass operation, it significantly complicates the code and probably slows it down slightly for normal, unrestricted operation. Instead, we implement this via UnicodeSet::span(): We (logically) split the
input into spans of text covered by the set and spans of text outside
the set. Each in-the-set span is normalized, the others are simply
appended. We could provide a wrapper class for this, either a simple one that only supports filtered normalization itself, or one that implements all of the new API. If we needed better performance, then we could create a dedicated data file that combines the Unicode 3.2 normalization data (including corrigenda) with the NamePrep mappings. (ICU 4.4 provides the FilteredNormalizer2 class and implements the old API's UNICODE_3_2 option via this class and a UnicodeSet("[:age=3.2:]").)Old and New Implementation Details
|
NFC_QC | NFD_QC | ccc | has comp boundary before |
combines forward |
comments | sample char NFKC_CF | 16-bit value |
any | any | any | any | any | all values | any | bit 0 is set if the character has a composition boundary after it; mapping or composition indexes are in bits 15..1 |
Yes |
Yes |
≠0 |
no | no |
Most combining marks |
◌् | fe02..fffe bits 8..1: ccc≠0 |
Maybe |
Yes |
0 |
no | no |
Jamo V & T |
ᅡ | fe00 |
Maybe |
Yes |
any |
no | no |
Combining marks that combine backward |
◌̀ | fc00..fdfe bits 8..1: ccc |
Maybe |
Yes |
0 |
no | yes |
Both combine-back & combine-fwd: strange but allowed |
none | ≥minMaybeYes which is 8-aligned index into composition table |
No |
No |
0 |
yes | no |
Has 1-way mapping to a nearby character with NFC_QC=Yes and ccc=0, ASCII maps only to ASCII; size optimization for case mappings |
A | ≥minNoNoDelta=minMaybeYes-((2*maxDelta+1)<<3) delta=0 is at minMaybeYes-((maxDelta-1)<<3); it must not be used bits 2..1: tccc=0 or 1 or >1 |
No |
No |
any |
any | no |
Gap/unused values; could be used for other algorithmic mappings |
≥limitNoNo builder must test for collision of adjacent ranges: check for minNoNoDelta≥limitNoNo |
|
No |
No |
any |
no | no |
Has 1-way mapping to an empty string, with explicit lccc=1 and tccc=255 values | U+00AD Soft Hyphen | ≥minNoNoEmpty index into decomposition table (raw mappings might differ) |
No |
No |
any |
no | no |
Has 1-way mapping which has no composition boundary before it |
◌̈́ | ≥minNoNoCompNoMaybeCC index into decomposition table |
No |
No |
any |
yes | no |
Has 1-way mapping which has a composition boundary before it (starts with a starter that does not combine backward) |
Ÿ | ≥minNoNoCompBoundaryBefore index into decomposition table |
No |
No |
any |
yes | no |
Has 1-way mapping which is composition-normalized (has a composition boundary before, does not recompose by itself) |
ß | ≥minNoNo index into decomposition table |
Yes |
No |
0 |
yes | no |
Has 2-way mapping, does not combine forward |
à | >minYesNoMappingsOnly index into decomposition table |
Yes |
No |
0 |
yes | no |
Hangul LVT (does not combine forward) |
각 | =minYesNoMappingsOnly|1 |
Yes |
No |
0 |
yes | yes |
Has 2-way mapping, combines forward |
â | >minYesNo index into decomp+comp table |
Yes |
No |
0 |
yes | yes |
Hangul LV (combines forward) |
가 | =minYesNo |
Yes |
Yes |
0 |
yes | yes |
Starter, combines forward |
a | ≥4 index into composition table |
Yes |
Yes |
0 |
yes | yes |
Jamo L |
ᄀ | 2 |
Yes |
Yes |
0 |
yes | no |
Normalization-inert |
! | 1 |
Per-character lookup values, .nrm formatVersion 1 & 2
Version 1: ICU 4.4..4.8
Version 2: ICU 49..59
Possible combinations and their encoding:
The rows of the table, from bottom to top, are encoded with increasing 16-bit "norm16" values as noted in the last column. Per-row and per-row-group properties are determined via norm16 range checks.
The minYesNoMappingsOnly distinction was added in ICU 49, .nrm formatVersion 2.0. In formatVersion 1.0, the data for either type of yesNo characters (combines-forward or not) were mixed, and the mapping's firstUnit bit 6 was the MAPPING_PLUS_COMPOSITION_LIST flag. See the firstUnit documentation below.
NFC_QC | NFD_QC |
ccc |
combines-fwd |
comments |
16-bit value |
Yes |
Yes |
!=0 |
no |
Most combining marks |
ff01..ffff lower 8 bits: ccc |
Maybe |
Yes |
0 |
no |
Jamo V & T |
ff00 |
Maybe |
Yes |
any |
no |
Combining marks that combine-back |
fe00..feff lower 8 bits: ccc |
Maybe |
Yes |
0 |
yes |
Both combine-back & combine-fwd: strange but allowed |
minMaybeYes..fdff index into composition table |
No |
No |
0 |
no |
Has 1-way mapping to a nearby character; size optimization for case mappings |
minMaybeYes-(2*maxDelta+1)..minMaybeYes-1 (delta=0 is at minMaybeYes-maxDelta-1; it must not be used) |
No |
No |
any |
no |
Gap; could be used for other algorithmic mappings |
unused values builder must test for collision of adjacent ranges |
No |
No |
any |
no |
Has 1-way mapping |
minNoNo..maxNoNo index into decomposition table |
Yes |
No |
0 |
no |
Has 2-way mapping, does not combine forward |
minYesNoMappingsOnly..minNoNo-1 index into decomposition table |
Yes |
No |
0 |
yes |
Has 2-way mapping, combines forward |
minYesNo+1..minYesNoMappingsOnly-1 index into decomp+comp table |
Yes |
No |
0 |
any |
Hangul LV (combines forward) & LVT (does not combine forward) |
minYesNo |
Yes |
Yes |
0 |
yes |
Starter, combines forward |
2..minYesNo-1 index into composition table |
Yes |
Yes |
0 |
yes |
Jamo L |
1 |
Yes |
Yes |
0 |
no |
Normalization-inert |
0 |
Additional data indexed by the trie value
(Implemented in ICU 4.4, .nrm formatVersion 1.0. Modified in ICU 49, .nrm formatVersion 2.0 and in ICU 60, .nrm formatVersion 3.0.)
"Extra data" per code point, if it has a mapping or if it combines-forward, is stored in 16-bit-unit arrays. The character's lookup value is an index into one of these arrays. It is probably handy to have two arrays, so that indexes can be allocated independently for the two ranges of 16-bit lookup values that are indexes into extra data.
- One array with composition lists for MaybeYes characters which don't also have a mapping.
- Usually, MaybeYes characters don't have composition lists, so this array will usually be empty.
- One array with
- Composition lists for YesYes characters which don't also have a mapping
- Mappings and optional composition lists for YesNo characters which do have a mapping
Threshold values like minYesNo depend on the mapping data.
The mapping string is stored together with the trailing ccc (tccc) value, and also with the ccc and lccc values if they are not 0.
Mapping to an empty string is encoded as a regular mapping with length 0.
- formatVersion 3 stores explicit worst-case values lccc=1 and tccc=255.
- formatVersion 1 & 2 store ccc=lccc=tccc=0, and the worst-case values are computed at runtime.
If both a mapping and a composition list are stored for a character (only possible for YesNo), the mapping comes first.
- In formatVersion 2+, the trie value thresholds indicate whether there is a composition list.
- In formatVersion 1, a bit in the first word indicates that there is a composition list.
Optional mapping
- First unit
- Bits 15..8: tccc
- Bit 7: Has another data word with ccc & lccc (otherwise they are 0) [MAPPING_HAS_CCC_LCCC_WORD]
- Bit 6:
- formatVersion 2+: Has a raw mapping that differs from the recursively-decomposed mapping [MAPPING_HAS_RAW_MAPPING]
- formatVersion 1: Has composition list [MAPPING_PLUS_COMPOSITION_LIST]
- Not actually needed in recomposition because the compositeAndFwd result includes this flag.
- The only use for this flag was in Normalizer2Impl::hasCompBoundaryAfter().
- formatVersion 2 sets bit 5 [MAPPING_NO_COMP_BOUNDARY_AFTER] when the original character has a composition list (combines-forward), freeing up this bit for MAPPING_HAS_RAW_MAPPING.
- Bit 5: Reserved (zero) since formatVersion 3.
- formatVersion 1 & 2: Does not have a composition boundary after its original character. [MAPPING_NO_COMP_BOUNDARY_AFTER]
- This bit is set if the mapping has no starters or if the last starter in the mapping can combine-forward with another character following the mapping.
- formatVersion 2: This bit is also set if the mapping length is 0 (the original character is deleted) or if the original character combines-forward (i.e., there is a composition list), so that those conditions need not be tested separately at runtime.
- Bits 4..0: Length of the mapping, 0..31
- Optional ccc/lccc word, only stored if bit 7 of the first unit is set [MAPPING_HAS_CCC_LCCC_WORD]
- Bits 15..8: lccc
- Bits 7..0: ccc
- formatVersion 1: The ccc/lccc word was stored as a "second unit" between the "first unit" and the mapping string.
- formatVersion 2+: The ccc/lccc word is stored immediately before the "first unit". The mapping string always follows immediately after the first unit.
- <length> UChars with the mapping string
- formatVersion 2+ also stores a raw mapping if bit 6 of the first unit is set [MAPPING_HAS_RAW_MAPPING]
- The raw mapping is stored immediately before the ccc/lccc word (if present) or the first unit (if there is no ccc/lccc word).
- It is stored as either one 16-bit unit "rm0" (which is 0x20..0xffff) or as the raw mapping string followed by a 16-bit unit containing its length (1..0x1f).
- If "rm0" is stored, then the raw mapping is the character rm0 (raw mapping index 0) followed by all but the first two code units in the regular mapping string.
Optional composition list
- Logically, a list of pairs (combining-back code point, compositeAndFwd) where compositeAndFwd is a combination of the composite (bits 21..1) and bit 0 which is set if the indicates composite combines-forward.
- Physically, a list of pairs or triples of 16-bit units, sorted in ascending order of combining-back code points.
- In the first unit, bit 15 is set in the last tuple. Bit 0 is set if it's a triple, otherwise it's a pair.
- Combining-back code point 0..33ff, composite 0..7fff: pair
- First unit bits 14..1 contain the combining-back code point
- Second unit contains the compositeAndFwd
- Combining-back code point 0..33ff, composite 8000..10ffff: triple
- First unit: same as for the previous case
- Second unit bits 15..6 are 0, bits 5..0 contain the compositeAndFwd's bits 21..16
- Third unit contains the compositeAndFwd's bits 15..0
- Combining-back code point 3400..10ffff, composite 0..10ffff: triple
- First unit bits 14..1 contain 3400 plus the combining-back code point's bits 20..10 (total value 3400..383f)
- Second unit bits 15..6 contain the combining-back code point's bits 9..0
- The remaining second/third unit bits are the same as for the previous case
In the ICU implementation, it is ok to not store the ccc value directly in the lookup value for NoNo characters. When the quick check fails with YesNo, NoNo or MaybeYes, the surrounding sequence is decomposed, which does not use the original characters' ccc values. Composition then sees only YesYes and MaybeYes characters which do have their ccc values in the lookup value.
A composite that combines-forward has quick check flags YesNo, has a mapping, has ccc=0 (it's a starter) and lccc=0 (it composes from a starter plus another character) and has a composition list (it combines-forward).
Old vs. new: The old composition data uses combine-forward and combine-back indexes stored in the extra data next to the mapping. In the new data structure, the combine-forward index is replaced by appending the composition list after the mapping, and the combine-back index is replaced by searching in the list for the back-combining code point itself.
Multiple characters may share data when they have the same mappings. This is especially useful for mappings to empty strings and case and other foldings.
Raw decomposition mappings
It is sometimes useful to have access to the raw/original decomposition mappings, without recursive decomposition.
- It would be nice if we could build Normalizer2 data at runtime; for that, we would want to start from the raw/original mappings. (ticket #7743)
- We would also want API to find out if a mapping round-trips; best like getCompositionQuickCheck(c).
- Note: For standard Unicode data, we already have properties and getters for NFC_QC and NFKC_QC. We might not need public API for getCompositionQuickCheck(c) for custom data.
- The raw/original mappings might also be useful for text layout engines. (ticket #8804)
- The raw mapping from an NFKC instance corresponds to the Unicode Decomposition_Mapping (dm) property.
- The raw mapping from an NFC instance corresponds to the Unicode Decomposition_Mapping (dm) property for characters with Decomposition_Type (dt) = Canonical (Can).
Some of this data is computable from the current structure, by mapping and then recomposing until the original character would be written. However, this would be expensive and incomplete.
.nrm formatVersion 2 adds compressed data for raw mappings:
- An algorithmic mapping is itself the raw mapping.
- In most other cases, the normal mapping has not been recursively decomposed and is the same as the raw mapping.
- If the normal mapping has been recursively decomposed, then the MAPPING_HAS_RAW_MAPPING is set in the extra data, and the raw mapping is stored there as well. Some of those mappings are further compressed; for details see the previous section.
This required an incompatible data format change because in the formatVersion 1 "extra data" for each code point there was not one unused bit. However, the first unit's bit 6 [MAPPING_PLUS_COMPOSITION_LIST] was underutilized. It was easy to subsume its actual use into bit 5 [MAPPING_NO_COMP_BOUNDARY_AFTER], freeing bit 6 for its new use as the MAPPING_HAS_RAW_MAPPING flag. As a result, access to the raw mapping is only slightly slower (due to the "compressed" format) than to the normal mapping.
Size vs. Speed
Minimal Mappings vs. Recursively Decomposed
The old data always stores decomposition mappings that cannot be further composed. The builder (gennorm) recursively decomposes each "raw" input mapping with each other applicable mapping. As a result, the decomposition is very fast.
For a smaller data file, it would be possible to store only the "raw" input mappings, without having recursively decomposed them. For example, the Angstrom character could have just the mapping to A-ring, and only A-ring would have the mapping to A+ring. For reasonable performance, one more bit could be stored with each mapping, indicating whether the mapping can be decomposed further. (Code optimization: 1:1 mappings could use a loop, not recursion.) There should also be a flag in the data file for whether all mappings are recursively resolved, or not.
This would especially help with the data size when compatibility mappings and case foldings are included. Many of those are one-way mappings to a single code point each (1:1 mappings).
Note: Extracting FCD data when the mappings are not already recursively decomposed would require full decomposition because the mapping's lccc and tccc values may not be those of the full decomposition.
Not-fully-decomposed 1:n mappings would significantly complicate the runtime code.
Even if the builder does not store recursively decomposed mappings, it should still compute them to check against cycles, and maybe to check against large recursion depths.
(Implemented in ICU 4.4, formatVersion 1.0: Mappings are recursively decomposed, favoring performance. ICU 49, formatVersion 2.0, adds additional, compressed data to recover the raw mappings efficiently and with little size overhead.)
Optimized 1:1 Mappings
The subset of NoNo characters with 1:1 mappings are the most attractive for compact storage because there is relatively a lot of overhead in storing them like sequences. Ideas:
- In particular, the (usually large) gap between NoNo and MaybeYes mapping indexes into extra data can be used for algorithmic 1:1 mappings.
- Store a small delta directly in the 16-bit lookup value. Deltas would be at least -0x20..+0x20 to cover common case mappings. An additional bit for "decomposes further" could be included.
- Store a BMP code point directly in the 16-bit lookup value. For example, ASCII or Latin characters and Katakana/Han/Hangul characters that are often the target for case foldings and compatibility mappings.
- Store an index into a more compact data structure.
- Use impossible bit combinations in the extra data for alternate encodings.
- A NoNo character cannot combine-forward. If its extra data has the combine-forward bit set, the data structure could be different. One possibility: Store a one-character mapping in that first unit. Could store 0..0x33ff and the "decomposes further" bit; could store larger code points in two units, with the upper 7 bits added to 0x3400.
- A YesNo character cannot have non-zero ccc and lccc. If its extra data has the has-ccc/lccc bit set, the data structure could be different.
- formatVersion 3: A delta mapping may only lead to a YesNo character with ccc=0, so there is a composition boundary before it, and the runtime code never needs to loop. An ASCII character only maps to another ASCII character.
- formatVersion 1 & 2: A delta mapping may lead to another NoNo or YesNo character, so the runtime code must loop.
- All other mappings are stored in the extraData and are fully decomposed.
We should measure how much of the data size we can save, compared with the additional code complexity and lower performance to handle more variants. (Measured: With Unicode 5.2 NFKC + case folding, a simple, fairly small delta of ±0x40 reduced the data file size by 8.8%. Smaller deltas yielded smaller savings. A much larger delta of ±0x1000 only got to 9.3%, so ±0x40 was chosen for formatVersion 1.0.)
Mapping Table Text Files
Normalization data is built from text files. Basic syntax is supported starting in ICU 4.4. The supported syntax is documented in the User Guide Normalization chapter.
Not implemented yet
The following syntax elements might be useful but are not yet supported.
It might be useful to be able to set the override mode, and to start a new "phase" within one source file where it is allowed to override mappings from previous phases:
* override previous # none|any|previous
It might be useful to add syntax for a "filter" set and a validity set (optional). The syntax needs to indicate type, start and end of the set and allow multiple lines. Line breaks will be removed before passing the set pattern into the UnicodeSet constructor. If a filter set is defined, then the generator tool would ignore any data for code points outside that set.
* filter [:age=3.2:]
* filter
[:^
*+ Hani
*+ :]
Some syntax might be useful for the Unicode version (optional). Multiple version specifications are ok only if they have the same value. The Unicode version is otherwise settable via a command-line option.
* unicode 5.2
It might be useful to add syntax for whether Hangul syllables and conjoining Jamo behave as defined in the Unicode standard. By default, they do so, and data for them is not allowed.
* hangul off
New gennorm2 Tool
gennorm generates text files (nfc.txt and nfkc.txt) from UCD files, for standard Unicode normalization. The output text files are checked into icu/source/data/unidata/norm2. gennorm has been moved to the tools repository tree.
gennorm2 is a new, installable (ICU 4.4) end-user tool in the icu repository tree, parsing the .txt files and generating .nrm files.
It should be easy to include the standard Unicode normalization ccc and composition data. One way is to always include the standard ccc data and have syntax for a flag to request standard compositions. (Rejected.)
Another, simpler way is for gennorm2 to take a list of mapping table files, and to provide standard files like ccc.txt, compose.txt, nfd.txt, nfkd.txt and casefold.txt that could be combined (with or without additional custom tables) in various combinations into one binary data file. This would also allow for a character to have different mappings in different files, and the later mapping would override the earlier one. gennorm2 should be able to also output a .txt file with all of the combined data, except without recursively resolved mappings, to keep two-way mappings in the file valid for input. (Done in ICU 4.4. Modification: The NFKC mappings cannot simply add to the NFC mappings because some characters with two-way NFC mappings have one-way NFKC mappings. Therefore, there are separate files that specify each normalization form's mappings.)
We should make it easy to move StringPrep mappings from the .spp files into normalization .txt/.nrm files. (Not done [yet].)