It is possible to do fast searching that takes into account linguistic equivalents, using a modified Boyer-Moore algorithm. For more information on such matching, see Searching and Matching in UAX#10: Unicode Collation Algorithym. In order to do this, it is important to determine a canonically equivalent shortest form (CESF). This is usually, but not always, the NFC form. ICU has a CanonicalIterator which can be used to get all the canonically equivalent variants. One could use it to iterate through and select the shortest form. However, it was not designed for this particular task -- more efficient code can be designed that uses a modified NFC algorithm. This document presents some thoughts on how to do this. In an annex, it also presents a Quick&Dirty Approach using the CanonicalIterator. Note that here when we say "decomposition" or "equivalence", we always mean canonical. Counts and other information are as of Unicode 5.1. BackgroundWe want uniqueness for the CESF, and also for it to be the same as NFC where possible. Note that CESF does not have the same stability constraints as Unicode Normalization Forms - it is intended purely for internal processing.We start by looking at the two kinds of cases where NFC is not shortest, as discussed on nfc-faq.
ExampleAn example of interference is the sequenceU+1EA5 ( ấ ) LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE + U+0328 ( ̨ ) COMBINING OGONEK. What happens is that the U+0105 ( ą ) LATIN SMALL LETTER A WITH OGONEK forms first, and prevents the formation of the shorter composite. (See below for a detailed example.)SpecificationUse the basic NFC algorithm, with the following changes.
ExampleIn the following case, each row represents looking at that character in the sequence. The highlighted cells are where a new composition is produced.
There is only one shortest form (the final answer in column 2), so it is the result. If the result in column #2 had not been present, then column #1 would have been the first shortest value (and also NFC), so it would have been chosen as the result. ImplementationThe above lends itself to many optimizations. Although generating a tree of results seems scary, in practice there can only be a few branches. Moreover, one can easily prune the results along the way. One only needs to recurse, for example, if there is a possibility that there is an interfering character. So a small list of such characters can be tested for to avoid excess processing. Where data size is an issue, the majority of the data and can be shared with the NFC implementation; only the breakage cases need to have a special flag.Implementation can also be done by postprocessing a string in NFC:
Make a "problem list" of characters that can occur in a case of breakage or interference. This list will need to be generated for each version of Unicode, but can be done at build time. It includes all of the characters in the breakage list, like a nukta character, plus all the characters that are formed in the case of interference (like
U+0105 ( ą ) LATIN SMALL LETTER A WITH OGONEK plus circumflex plus grave, from the above case).After the string is in NFC, take another pass through, looking for any starter sequence where the starter and at least one non-starter are in the problem list. Because the non-starters are quite rare in practice (see nfc-faq) a quick scan can be done for them, backing up to check the preceding starter if one is found. In addition, in each sequence of non-starters, check for the sequence:
All of these problem starter sequences should be relatively rare in practice, allowing for a very quick scan. If we find a candidate, find its enclosing segment (see below) and do the above processing. See also the Quick&Dirty Approach. It won't give precisely the same results, but should be sufficient. Annex: Types of CharactersFor background information, here are some relevant types of characters. Because Hangul is unproblematic, we can exclude that from the counts below for simplicity.
We can logically break up any NFD string into a set of segments, where each segment starts with a lead-only starter or stable character, and has zero or more others characters afterward. There is also the first segment, which may be "headless". Each segment does not interact with any characters before it or after it, in terms of canonical equivalence. Thus if each segment is mapped to the CESF, then the string as a whole will be in CESF. So then the problem devolves to mapping the segments. Luckily, we can do this with the same mechanism as in NFC, looking at sequences that are starter + non-starter. [TBD: show why this is sufficient.] Quick&Dirty ApproachHere is a relatively easy way to get a shortest form that would probably suffice for collation in terms of function and performance: Function, because it appears that we don't actually need uniqueness, just the length of the shortest form; Performance, just because the bad cases are so rare.
|