RBBI Rule Enhancements

Motivations

    • The RBBI rules as of ICU 4.6 are unable to express the UAX-14 line breaking behavior of Unicode 6.0. Some extensions are needed. The problem is with the reverse direction rule for UAX rule LB8.

    • A number of other rules could be expressed more easily if there were more fine grained control over rule chaining. It is currently either on or off for a complete set of rules.

    • Some of the existing rule syntax is extremely error prone.

    • Plain old bugs.

ICU Tickets

    • 2783, #comments in rules fail with multi-line sets. May not make sense, in which case return the bug.

    • 3058, Empty unicode set should not be an error. It turns out that there are uses for this. The contents of the set may come from a $Variable defined elsewhere, and, depending on options or whatever, a set may be empty.

    • #3640, \p{unicode property} syntax is not recognized in rules, only in sets.

    • #3769, make rule chaining optional per rule set. (This will be subsumed by #4441)

    • #4441, Rule Chaining Enhancements

      • Replache !!LBCMNoChain with something more general.

      • !!LookAheadHardBreak, remove this as an option, make it default. (Look-ahead breaks without this option are never used, behavior is not well defined and completely untested. They exist in a half-way attempt to maintain compatibility with the original Rich Gilliam engine.

    • #????, Look-ahead breaks, allow more than one to be in-flight at once. Needed for the UAX14 fixes. Requires changes to engine and to state tables. Probably a vector of length = number of states, vec[state] = input position when at a state corresponding to a '/', and side table for accepting states that complete a look-ahead, indicating which vector position(s) (states) have the break position.

    • #4444, Bugs with look-ahead breaks. Already fixed? Invesitigate.

    • #5451, 64 bit text indexes. UText does them.

  • Many More.