Preparsed UCD

What

A text file with preparsed UCD (Unicode Character Database) data.

Ticket: ICU-8972 "simplify Unicode tools"

Status

Done for ICU 49 under ticket #8972.

Syntax

# Preparsed UCD generated by ICU preparseucd.py

Comments are whole lines starting with #. Inline comments are not supported.

ucd;10.0.0

Data lines start with a type keyword. Data fields are semicolon-separated. The number of fields per line is highly variable.

The ucd line should be the first data line. It provides the Unicode version number.
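As a minimal sketch of the line structure described above, the following Python function splits one line into its type keyword and remaining fields, and skips whole-line comments. The function name parse_line is made up for illustration and is not part of preparseucd.py or of the ICU parser.

```python
def parse_line(line):
    """Return (type_keyword, fields) for a data line, or None for a comment or blank line.

    Hypothetical helper: a sketch of the syntax only, not the ICU parser.
    """
    line = line.rstrip("\r\n")
    # Only whole-line comments exist, so a leading '#' marks the entire line.
    if not line.strip() or line.startswith("#"):
        return None
    # Fields are semicolon-separated; the first one is the type keyword.
    fields = line.split(";")
    return fields[0], fields[1:]


if __name__ == "__main__":
    print(parse_line("ucd;10.0.0"))
    print(parse_line("property;Binary;Alpha;Alphabetic"))
```

Because the number of fields varies per line type, the sketch returns the fields as a list and leaves their interpretation to the caller.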

property;Binary;Alpha;Alphabetic

property;Enumerated;bc;Bidi_Class

Property lines define properties with a type and two or more aliases.

binary;N;No;F;False

binary;Y;Yes;T;True

value;bc;ON;Other_Neutral

Property value lines define the values of enumerated and catalog properties, with the property short name and two or more aliases for each value.

There is only one shared definition of the values and aliases for binary properties.
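One possible way to collect the property, binary, and value lines into alias tables is sketched below. The function name read_aliases and the table layout are assumptions for illustration. The sketch also assumes that the first alias on each line is the short name, which matches the examples above but is not stated as a rule here.

```python
def read_aliases(lines):
    """Build alias tables from property, binary, and value lines (illustrative sketch)."""
    properties = {}     # property short name -> (type, [aliases])
    binary_values = {}  # shared binary value short name -> [aliases]
    values = {}         # property short name -> {value short name -> [aliases]}
    for line in lines:
        fields = line.rstrip("\r\n").split(";")
        kind = fields[0]
        if kind == "property":
            # property;<type>;<alias>;<alias>...
            properties[fields[2]] = (fields[1], fields[2:])
        elif kind == "binary":
            # binary;<alias>;<alias>...  (one shared definition for all binary properties)
            binary_values[fields[1]] = fields[1:]
        elif kind == "value":
            # value;<property short name>;<alias>;<alias>...
            values.setdefault(fields[1], {})[fields[2]] = fields[2:]
    return properties, binary_values, values


if __name__ == "__main__":
    sample = [
        "property;Binary;Alpha;Alphabetic",
        "property;Enumerated;bc;Bidi_Class",
        "binary;N;No;F;False",
        "binary;Y;Yes;T;True",
        "value;bc;ON;Other_Neutral",
    ]
    print(read_aliases(sample))
```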

defaults;0000..10FFFF;age=NA;bc=L;blk=NB;bpt=n;cf=<code point>;dm=<code point>;dt=None;ea=N;FC_NFKC=<code point>;gc=Cn;GCB=XX;gcm=Cn;hst=NA;InPC=NA;InSC=Other;jg=No_Joining_Group;jt=U;lb=XX;lc=<code point>;NFC_QC=Y;NFD_QC=Y;NFKC_CF=<code point>;NFKC_QC=Y;NFKD_QC=Y;nt=None;SB=XX;sc=Zzzz;scf=<code point>;scx=<script>;slc=<code point>;stc=<code point>;suc=<code point>;tc=<code point>;uc=<code point>;vo=R;WB=XX

After the version, property, and property value lines, and before other data lines, the defaults line defines default values for all code points (corresponding to @missing data in the UCD). Any properties not mentioned here default to null values according to their type, such as False or the empty string.

The general syntax of this line is the same as for the following data lines:

block;20000..2A6DF;age=3.1;Alpha;blk=CJK_Ext_B;ea=W;gc=Lo;Gr_Base;IDC;Ideo;IDS;lb=ID;SB=LE;sc=Hani;UIdeo;vo=U;XIDC;XIDS

# 20000..2A6D6 CJK Unified Ideographs Extension B

algnamesrange;20000..2A6D6;han;CJK UNIFIED IDEOGRAPH-

cp;20001;nt=Nu;nv=7

cp;20064;nt=Nu;nv=4

unassigned;2A6D7..2A6DF;ea=W;lb=ID;vo=U

# No block

unassigned;2A6E0..2A6FF;ea=W;lb=ID;vo=U

algnamesrange;AC00..D7A3;hangul

Block lines specify a Unicode Block and provide an opportunity for compact data lines for ranges inside the block, by listing common property values once for the whole block. Block properties override the defaults for cp and unassigned lines with code point ranges inside the block. The file syntax and parser do not require the presence of block lines.
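The fields of a block, cp, or unassigned line can be read roughly as follows. This is a sketch under the assumption, taken from the example lines above, that a field is either name=value or a bare binary property name meaning True. The helper names parse_range and parse_props are hypothetical.

```python
def parse_range(text):
    """Parse 'XXXX' or 'XXXX..YYYY' (hex) into an inclusive (start, end) pair."""
    start, sep, end = text.partition("..")
    return int(start, 16), int(end if sep else start, 16)


def parse_props(fields):
    """Turn property fields into a dict.

    'gc=Lo' becomes {'gc': 'Lo'}; a bare name such as 'Alpha' becomes {'Alpha': True}.
    """
    props = {}
    for field in fields:
        name, sep, value = field.partition("=")
        props[name] = value if sep else True
    return props


if __name__ == "__main__":
    line = "block;20000..2A6DF;age=3.1;Alpha;blk=CJK_Ext_B;ea=W;gc=Lo"
    fields = line.split(";")
    print(parse_range(fields[1]), parse_props(fields[2:]))
```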

cp lines provide the data for a code point or range. They override the default+block properties. Properties that are not mentioned fall back to the block, then to the defaults.

Unassigned lines (new in ICU 60 for Unicode 10) provide the data for an unassigned code point or range (gc=Cn). They override only the defaults, not the block properties, and properties that are not mentioned fall back to the defaults. The one exception is blk=Block: if the range is inside a block, the block's blk value applies to unassigned lines as well.

A range is considered inside a block if it is fully inside the range of the last defined block. Otherwise it is considered outside a block and falls back only to the defaults. This is the case even if the range is inside an earlier block, to simplify parsing and processing (such data lines should be avoided).

A range inside the block for which there is no data line inherits all of the default+block properties (see Han blocks). Note that this is very different from the behavior of an unassigned line, in particular since such blocks typically default to gc!=Cn.
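The fallback rules above can be sketched as a single function that computes the effective properties for one cp or unassigned line. The function name resolve and its parameters are hypothetical, and the property dicts are assumed to be already parsed. Only the last defined block is tracked, as the text specifies.

```python
def resolve(kind, line_range, props, defaults, block_range, block_props):
    """Return the effective properties for a 'cp' or 'unassigned' line (sketch).

    block_range and block_props describe the last defined block, or are None
    if no block line has been seen yet.
    """
    start, end = line_range
    # "Inside" means fully inside the last defined block.
    inside = (block_range is not None
              and block_range[0] <= start and end <= block_range[1])
    result = dict(defaults)
    if inside:
        if kind == "cp":
            # cp lines fall back to the block, then to the defaults.
            result.update(block_props)
        elif kind == "unassigned" and "blk" in block_props:
            # unassigned lines take only blk from the block.
            result["blk"] = block_props["blk"]
    # Properties on the line itself override everything else.
    result.update(props)
    return result


if __name__ == "__main__":
    defaults = {"blk": "NB", "gc": "Cn", "ea": "N", "lb": "XX"}
    block_range = (0x20000, 0x2A6DF)
    block_props = {"blk": "CJK_Ext_B", "gc": "Lo", "ea": "W", "lb": "ID"}
    print(resolve("cp", (0x20001, 0x20001), {"nt": "Nu", "nv": "7"},
                  defaults, block_range, block_props))
    print(resolve("unassigned", (0x2A6D7, 0x2A6DF), {"ea": "W", "lb": "ID"},
                  defaults, block_range, block_props))
```

A range inside the block that has no data line at all is not handled by this function. Per the rule above, it simply keeps the default+block properties.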

Non-default properties for unassigned ranges inside and outside of blocks are typically for complex defaults and for noncharacters.

ppucd.txt data lines are in code point order, although this should not be strictly required.

Assigned characters normally have their unique na=Name property value. For Hangul syllables with their algorithmically computed names, the entire range is covered by the line "algnamesrange;AC00..D7A3;hangul". For ranges of ideographic characters, a line like "algnamesrange;20000..2A6D6;han;CJK UNIFIED IDEOGRAPH-" provides a Name prefix which is to be followed by the code point (in hex like %04lX).
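For the han type, building a name from an algnamesrange line could look like the sketch below. The function name algorithmic_name is hypothetical. The Hangul syllable name algorithm is deliberately left out, so the sketch returns None for hangul ranges and for code points outside the range.

```python
def algorithmic_name(line, code_point):
    """Return the algorithmic name for code_point from one algnamesrange line, or None.

    Illustrative sketch: only the 'han' type (prefix + hex code point) is handled.
    """
    fields = line.rstrip("\r\n").split(";")
    if fields[0] != "algnamesrange":
        return None
    start, _, end = fields[1].partition("..")
    if not int(start, 16) <= code_point <= int(end or start, 16):
        return None
    if fields[2] == "han":
        # Name prefix followed by the code point in uppercase hex, at least 4 digits.
        return "%s%04X" % (fields[3], code_point)
    return None  # e.g. hangul: not implemented in this sketch


if __name__ == "__main__":
    line = "algnamesrange;20000..2A6D6;han;CJK UNIFIED IDEOGRAPH-"
    print(algorithmic_name(line, 0x20001))
```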

Why not UCD .txt files?

See UAX #44 "Unicode Character Database"

Nontrivial parsing:

Nontrivial patching: Adding characters (e.g., PUA or proposed/draft) requires adding data in many of the UCD files.

ICU already preprocesses some of the UCD .txt files. We strip comments from some files (because they are huge) and in some files merge adjacent same-property code points into ranges.

Some changes are manual, such as updating and adding ranges of algorithmic character names.

Then we run several tools, most of them twice, to parse different sets of .txt files and write several output files. We use several Python and shell scripts, and a "log" (unidata/changes.txt) with details of what was changed and run in each Unicode version upgrade.

Markus has done ICU Unicode updates since about 2002. Someone else might have a hard time picking this up for maintenance and future Unicode version updates.

Why not UCD XML files?

See UAX #42 "Unicode Character Database in XML"

Good: The UCD XML file format stores all properties in a single file with a relatively simple structure, with property values as XML attributes.

Issues:

Goals

Details

Plan

Other tool improvements

Bad: Through ICU 4.8, the process was:

build & install ICU -> build Unicode tools -> run genpname -> build & install ICU (now with updated property names) -> build Unicode tools -> run UCD parsers -> build & install ICU (now also with case properties & normalization etc.) -> build Unicode tools -> run genuca -> build & install ICU

It should be possible to

ICU 49:

build & install ICU -> build Unicode tools -> run genprops -> build & install ICU (now with updated properties) -> build Unicode tools -> run genuca -> build & install ICU

genprops builds the property (value) names data and injects it into the live ppucd.txt parser for further processing.

Goal:

build & install ICU -> build Unicode tool -> run it -> build & install ICU (now with all updated Unicode data)

Requires ticket #9040, which could be "hard".