
Preparsed UCD

What

A text file with preparsed UCD (Unicode Character Database) data.

Ticket "simplify Unicode tools" http://bugs.icu-project.org/trac/ticket/8972

Status

Done for ICU 49 under ticket #8972.

Why not UCD .txt files?

See UAX #44 "Unicode Character Database"

Nontrivial parsing:

  • The UCD has grown from a couple of semicolon-delimited files plus an informative "Property dump" (early PropList.txt) to a collection of dozens of files with a variety of (now more regular) formats.
  • Related properties are scattered over several files.
  • Full information for Numeric_Value and Numeric_Type requires parsing two files.
  • Default values are "hidden" in comments.
  • The UCD folder structure (which file where) has changed over time.
  • UCD filenames change during each Unicode beta period. (A detailed version number is inserted into each filename.)
  • Many files are bloated with comments that show the General Category and name of each character or range start/end; if the data were combined into a single file, then all properties for a character or range would be listed together, without need for such comments.
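To make the parsing burden concrete, here is a minimal sketch of reading one UCD-style .txt file. It assumes the common layout of a code point or range, semicolon-separated fields, trailing "#" comments, and a default value given in an "@missing" comment. The sample lines are illustrative only, and the helper names (parse_range, parse_ucd_lines) are made up for this example. Real files differ in field count and meaning, so each file still needs its own handling.

```python
import re

# Illustrative lines in the general style of UCD .txt files.
SAMPLE = """\
# @missing: 0000..10FFFF; Unknown
0000..001F    ; Common # Cc  [32] <control>..<control>
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
00AA          ; Latin # Lo       FEMININE ORDINAL INDICATOR
"""

# The default value is "hidden" in a comment line of this shape.
MISSING_RE = re.compile(r"#\s*@missing:\s*(.*)")


def parse_range(field):
    """Turns 'XXXX' or 'XXXX..YYYY' into an inclusive (start, end) pair."""
    start, _, end = field.partition("..")
    return int(start, 16), int(end or start, 16)


def parse_ucd_lines(text):
    """Returns (defaults, records) parsed from UCD-style text."""
    defaults, records = [], []
    for line in text.splitlines():
        match = MISSING_RE.match(line)
        if match:
            fields = [f.strip() for f in match.group(1).split(";")]
            defaults.append((parse_range(fields[0]), fields[1:]))
            continue
        # Drop the trailing comment, then skip lines with no data left.
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        records.append((parse_range(fields[0]), fields[1:]))
    return defaults, records


defaults, records = parse_ucd_lines(SAMPLE)
```

Even this small sketch has to treat comments in two different ways, once as noise and once as the carrier of default values.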

Nontrivial patching: Adding characters (e.g., PUA or proposed/draft) requires adding data in many of the UCD files.

ICU already preprocesses some of the UCD .txt files. We strip comments from some files (because they are huge) and in some files merge adjacent same-property code points into ranges.
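The range merging step can be sketched roughly as follows. This is only an illustration of the idea, not the code of the existing scripts; the function names and the output format are assumptions made for the example.

```python
def merge_into_ranges(pairs):
    """Merges (code_point, value) pairs into [start, end, value] ranges.

    Adjacent code points with equal values collapse into one range.
    """
    ranges = []
    for cp, value in sorted(pairs):
        if ranges and ranges[-1][1] + 1 == cp and ranges[-1][2] == value:
            ranges[-1][1] = cp  # Extend the current range by one code point.
        else:
            ranges.append([cp, cp, value])
    return ranges


def format_range(start, end, value):
    """Formats a range as 'XXXX;value' or 'XXXX..YYYY;value'."""
    if start == end:
        return "%04X;%s" % (start, value)
    return "%04X..%04X;%s" % (start, end, value)


pairs = [(0x41, "L"), (0x42, "L"), (0x43, "L"), (0x45, "L"), (0x46, "N")]
merged = merge_into_ranges(pairs)
lines = [format_range(*r) for r in merged]
```

Here U+0041..U+0043 collapse into a single line, while U+0045 and U+0046 stay separate because of the gap at U+0044 and the differing value.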

Some changes are manual, such as updating and adding ranges of algorithmic character names.

Then we run several tools, most of them twice, to parse different sets of .txt files and write several output files. We use several Python and shell scripts, and a "log" (unidata/changes.txt) with details of what was changed and run in each Unicode version upgrade.

Markus has done ICU Unicode updates since about 2002. Someone else might have a hard time picking this up for maintenance and future Unicode version updates.

Why not UCD XML files?

See UAX #42 "Unicode Character Database in XML"

Good: The UCD XML file format stores all properties in a single file with a relatively simple structure, with property values as XML attributes.

Issues:

  • Missing data which is needed for ICU
    • Name_Alias added in UCD 5.0 but missing in UCD XML as of UCD 6.1 beta.
    • Script_Extensions added in UCD 6.0 but not "blessed" as a Unicode property as of UCD 6.1. Useful, used in ICU, but not available in UCD XML.
    • Adopting UCD XML would require either still parsing some UCD .txt files or writing another tool to merge more data into the XML.
  • Dependency on third party
    • Lag time between UCD .txt vs. XML availability during beta.
    • Unable to fix/update/extend XML generator tools.
    • For new properties, need to wait for standardization (UAX #42), tool update, and XML publication.
    • Will not support custom/nonstandard data.
  • Could be simpler: Parsing XML is easy in Java, Python, etc. and doable in C++ (we have a "poor man's" XML parser), but not as easy as line.split(";").
    • There is no need for complex structure for the UCD.
  • Could be easier to read for humans: Defaults for all of Unicode are not stored in one place. Instead, each <group> carries them, which makes it hard to see which values are specific to each group. "Fluffy" XML makes for longer text lines and more horizontal scrolling.
  • Hard to diff: The XML format can be used in different ways, and Unicode publishes different forms of the same data. Also, the precise XML text depends on the XML formatting code used.
    • For diffing, a special tool needs to be run, parse old & new XML data, compare values and generate a diff report. Unicode publishes some of those too.
  • Some data still requires nontrivial parsing.
    • For algorithmic character names, the range needs to be determined by collecting a contiguous sequence of elements with a shared name pattern. There is not even any special notation for the algorithmic names for Hangul syllables.
  • Minor: Unnecessary data (for ICU)
    • Precomputed Hangul syllable names
    • Irrelevant contributory properties like "Other_Xyz"
    • Properties not used by ICU
  • Minor, just awkward: Blocks are treated as auxiliary data, rather than as a core means to organize and store the data. On the other hand, the "grouped" XML files also use them as the basis for the <group> elements and associated compaction. (The "flat" files don't.)
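On the Hangul syllable names mentioned above: they can be derived arithmetically from the code point, which is why precomputed names are unnecessary for ICU. The following sketch follows the Hangul syllable name generation described in the Unicode Standard, with the jamo short names as listed in the UCD's Jamo.txt. The function name is made up for this example.

```python
S_BASE = 0xAC00
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT  # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT  # 11172 syllables in total

# Jamo short names for leading consonants, vowels, and trailing consonants.
JAMO_L = ["G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S",
          "SS", "", "J", "JJ", "C", "K", "T", "P", "H"]
JAMO_V = ["A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA", "WAE",
          "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"]
JAMO_T = ["", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM", "LB",
          "LS", "LT", "LP", "LH", "M", "B", "BS", "S", "SS", "NG", "J", "C",
          "K", "T", "P", "H"]


def hangul_syllable_name(cp):
    """Computes the character name of a precomposed Hangul syllable."""
    s_index = cp - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError("not a Hangul syllable: U+%04X" % cp)
    l_index = s_index // N_COUNT
    v_index = (s_index % N_COUNT) // T_COUNT
    t_index = s_index % T_COUNT
    return ("HANGUL SYLLABLE " +
            JAMO_L[l_index] + JAMO_V[v_index] + JAMO_T[t_index])


first_name = hangul_syllable_name(0xAC00)
example_name = hangul_syllable_name(0xD4DB)
```

A data file therefore only needs to record the range U+AC00..U+D7A3 and the fact that its names are algorithmic.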

Goals

  • Single file with all data relevant for ICU.
  • Very easy to parse and use the data in C/C++ tools.
  • Easily human readable.
  • Easy-to-read diffs from standard diff tools.
  • Compact file format.
  • Conversion tool easy to write, maintain, extend.
  • Convert from UCD .txt files because those are maintained directly by the UTC & editorial committee. No waiting for third party to convert the files.
  • Able to extend for new kinds of data.
  • Easy format for manual data fixes/additions (e.g., PUA or proposed/draft).
  • Move much of the parsing from scattered C code into one Python script.

Details

  • All-Unicode defaults in one place, but only list non-null default values. (blk=No_Block, cf=<code point>, ...)
  • Line-oriented, always semicolon-separated, with type-of-line in the first field.
  • Block properties override defaults, but only for the few properties where the code points in a block share a common, non-default value.
    • Effective because blocks represent actual allocation & organization of Unicode. Maintained by UTC.
  • Code point/range properties override default+block properties.
  • Algorithmic names stored as ranges with type & shared name prefixes (for CJK).
  • No gratuitous white space or syntax characters.
  • Mostly key=value, simpler format for binary properties. Easy to read.
  • Comment lines with headings from NamesList.txt further improve readability. (There are few of them, so no significant size bloat.)
  • Simple, stable file generation allows diffing.
    • E.g., list properties in sorted order of property names.
  • No need to implement/store properties that are not used in ICU. (But format & tool are easy to extend.)
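The layering described above (defaults, then block, then code point/range) can be sketched as follows. The sample data is hypothetical. The line type names ("defaults", "block", "cp") and the "-Prop" spelling for a false binary property are assumptions made for this illustration, so consult ppucd.txt and preparseucd.py for the actual syntax. The sketch also assumes that every line carries a range in its second field, which need not hold for all line types in the real file.

```python
# Hypothetical sample in the spirit of the format described above.
SAMPLE = """\
# illustrative data only
defaults;0000..10FFFF;bc=L;blk=No_Block;gc=Cn
block;0000..007F;blk=ASCII
cp;0030..0039;AHex;bc=EN;gc=Nd
cp;0041;AHex;Alpha;gc=Lu
"""


def parse_range(field):
    """Turns 'XXXX' or 'XXXX..YYYY' into an inclusive (start, end) pair."""
    start, _, end = field.partition("..")
    return int(start, 16), int(end or start, 16)


def parse_props(fields):
    """Parses key=value fields; a bare name is a true binary property."""
    props = {}
    for field in fields:
        key, sep, value = field.partition("=")
        if sep:
            props[key] = value
        elif key.startswith("-"):
            props[key[1:]] = False  # assumed notation for "false"
        else:
            props[key] = True
    return props


def parse_lines(text):
    """Returns a list of (line_type, (start, end), props) records."""
    records = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split(";")
        records.append(
            (fields[0], parse_range(fields[1]), parse_props(fields[2:])))
    return records


def lookup(records, cp):
    """Layers defaults, then block values, then code point/range values."""
    result = {}
    for layer in ("defaults", "block", "cp"):
        for line_type, (start, end), props in records:
            if line_type == layer and start <= cp <= end:
                result.update(props)
    return result


def format_props(props):
    """Emits properties sorted by name so that generated files diff well."""
    fields = []
    for key in sorted(props):
        value = props[key]
        if value is True:
            fields.append(key)
        elif value is False:
            fields.append("-" + key)
        else:
            fields.append("%s=%s" % (key, value))
    return ";".join(fields)


records = parse_lines(SAMPLE)
props_a = lookup(records, 0x41)
line_a = format_props(props_a)
```

In this sketch, U+0041 inherits bc=L from the defaults, gets blk=ASCII from its block, and takes gc=Lu plus the binary properties from its own line. The whole parser is little more than line.split(";"), which is the simplicity the goals ask for.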

Plan

  • (done) Write Python tool to preparse UCD .txt files and generate one output ppucd.txt file.
  • (done) Subsume existing ucdcopy.py.
  • (done) Write toolutil C++ parser for ppucd.txt, add ppucd.txt to the unidata folder.
  • (done) Merge genbidi, gencase, gennames, gennorm into genprops
    • Replace scattered many-.txt parsers with calls to the toolutil ppucd.txt parser.
    • Generate all output files in one genprops invocation.
    • Update makeprops.sh (delete half of it) & changes.txt.
  • (done) Make preparseucd.py also parse uchar.h & uscript.h and write the property names data header file. (was: Change genpname/preparse.pl to read ppucd.txt rather than Property[Value]Aliases.txt.)
  • (done) Consider changing pnames_data.h so that minor changes don't change most of the file contents.
  • (done) Write wiki/Markus/ReviewTicket8972 with diff links.
  • Move UCD tests from cintltst to intltest, change to use the toolutil ppucd.txt parser. (ticket #9041)
  • Change Java UCD tests to parse & use ppucd.txt. (ticket #9041)
  • (partially done) Change Python preparser to not copy input UCD .txt files any more, delete them from unidata & Java. (ticket #9041)

Other tool improvements

Bad: Up to and including ICU 4.8, the process was

build & install ICU -> build Unicode tools -> run genpname -> build & install ICU (now with updated property names) -> build Unicode tools -> run UCD parsers -> build & install ICU (now also with case properties & normalization etc.) -> build Unicode tools -> run genuca -> build & install ICU

It should be possible to

  1. merge the Unicode tools into one binary
  2. parameterize the relevant properties code (property name lookup, case & some other properties, NFC)
  3. inject newly built data into the common library for the next part of the merged Unicode tool's processing.

ICU 49:

build & install ICU -> build Unicode tools -> run genprops -> build & install ICU (now with updated properties) -> build Unicode tools -> run genuca -> build & install ICU

genprops builds the property (value) names data and injects it into the live ppucd.txt parser for further processing.

Goal:

build & install ICU -> build Unicode tool -> run it -> build & install ICU (now with all updated Unicode data)

Requires ticket #9040, could be "hard".

Attachments

  • ppucd.txt (1733k), Markus Scherer, Dec 4, 2011, 8:31 AM
  • preparseucd.py (47k), Markus Scherer, Dec 8, 2011, 4:56 PM
  • testsetprops.py (0k), Markus Scherer, Nov 8, 2011, 10:52 AM