Character Set Mapping Tables

ICU provides character set conversion with mapping tables for a number of important codepages. The default tables are a subset of IBM's CDRA conversion table repository. Our Converter Explorer shows aliases and codepage charts for the default tables that are built into a standard ICU distribution.

Conversions for most codepages are implemented differently on different platforms. We provide mapping tables from many different sources here so that ICU users and others can get the same conversion behavior as on the original platforms.

The mapping tables, and some of the source code for the tools that collected them, are checked into the https://github.com/unicode-org/icu-data repository. The table files are provided in two formats:

- .ucm files are ready for use with ICU. Their format is similar to that of UPMAP/UXMAP files from the CDRA repository. It is described in the Conversion Data chapter of the ICU User Guide.
- .xml files are in the format described by Unicode Technical Report #22 (UTR #22).
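To give a feel for the .ucm format, here is a minimal Python sketch that reads the header and the mapping lines of a simple table. It is not the ICU parser: it assumes one code point per mapping line, ignores state-table details, and the sample table (`example-table` and its three mappings) is made up for illustration. The trailing `|0`, `|1` and `|3` flags mark roundtrip, fallback and reverse fallback mappings respectively.

```python
import re

# Matches one mapping line such as:  <U0041> \x41 |0
_MAPPING = re.compile(
    r"^<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)\s*(?:\|([0-4]))?"
)


def parse_ucm(text):
    """Return (header, mappings) for the text of a simple .ucm file.

    header   -- dict of <key> value pairs found before CHARMAP
    mappings -- list of (code_point, byte_sequence, precision_flag)
    """
    header, mappings, in_charmap = {}, [], False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        if line == "CHARMAP":
            in_charmap = True
        elif line == "END CHARMAP":
            in_charmap = False
        elif in_charmap:
            m = _MAPPING.match(line)
            if m:  # lines this sketch does not understand are skipped
                code_point = int(m.group(1), 16)
                data = bytes.fromhex(m.group(2).replace("\\x", ""))
                flag = int(m.group(3) or 0)  # a missing flag means roundtrip
                mappings.append((code_point, data, flag))
        elif line.startswith("<"):
            key, _, value = line[1:].partition(">")
            header[key] = value.strip().strip('"')
    return header, mappings


# Hypothetical table, for illustration only.
SAMPLE = r"""
<code_set_name> "example-table"
<mb_cur_max> 1
<mb_cur_min> 1
<uconv_class> "SBCS"
<subchar> \x3F
CHARMAP
<U0041> \x41 |0
<U00C4> \x8E |0
<U212B> \x8F |1
END CHARMAP
"""

header, mappings = parse_ucm(SAMPLE)
```

For the real grammar, including multi-code-point mappings and state tables, the Conversion Data chapter of the ICU User Guide remains the authority.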

If you would like to add one of these tables to your copy of ICU, see the Data Customization section of the ICU User Guide.

Some analysis of the data

The following analyses of the collected charset mapping data were generated by some of our tools. Some of the aliases for the UTR #22 names can be found in our ICU Converter Explorer and in the ICU alias table.

- Similar conversion tables: tables that have identical roundtrip mappings.
- Identical conversion tables: tables that have identical roundtrip, fallback and reverse fallback mappings.
- A detailed comparison of all the conversion tables. This is a very large HTML file.
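The distinction between similar and identical tables can be sketched in a few lines of Python. This is only an illustration of the definitions above, not the tool that produced the reports. It assumes each table is a list of hypothetical `(code_point, byte_sequence, flag)` tuples, where flag 0 is a roundtrip mapping, 1 a fallback and 3 a reverse fallback, as in .ucm files.

```python
def mappings_with_flags(table, flags):
    """Return the set of mappings in `table` whose flag is in `flags`."""
    return frozenset(entry for entry in table if entry[2] in flags)


def compare_tables(a, b):
    """Classify two tables as 'identical', 'similar' or 'different'."""
    if mappings_with_flags(a, {0}) != mappings_with_flags(b, {0}):
        return "different"      # roundtrip mappings disagree
    if mappings_with_flags(a, {1, 3}) != mappings_with_flags(b, {1, 3}):
        return "similar"        # same roundtrips, different fallbacks
    return "identical"          # roundtrips and both fallback kinds agree


# Made-up tables for illustration.
base = [(0x41, b"\x41", 0), (0xC4, b"\x8e", 0)]
with_fallback = base + [(0x212B, b"\x8f", 1)]
other = [(0x41, b"\x41", 0), (0xC4, b"\x5b", 0)]

results = (
    compare_tables(base, list(base)),
    compare_tables(base, with_fallback),
    compare_tables(base, other),
)
```

Using sets makes the comparison independent of the order in which mappings appear in the files.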

How we collected the data

- IBM CDRA mapping tables are converted from the CDRA RPMAP/TPMAP/UPMAP (or RXMAP etc.) files into .ucm files.
- .ucm files for other platforms are generated by using the conversion services of those platforms.
- .xml files are generated from .ucm files.
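As a rough illustration of the second step, the sketch below asks a platform's own converter for every single-byte mapping and prints .ucm-style lines. Python's built-in codecs stand in for the "platform" here; the actual collection tools are in the icu-data repository and work differently in detail. The function name is hypothetical, and the sketch only probes the codepage-to-Unicode direction, so fallbacks from Unicode are not discovered.

```python
def ucm_lines_for_single_byte_codec(codec_name):
    """Probe a single-byte codec and return .ucm-style mapping lines."""
    lines = []
    for value in range(256):
        source = bytes([value])
        try:
            text = source.decode(codec_name)
        except UnicodeDecodeError:
            continue  # the platform leaves this byte unassigned
        if len(text) != 1:
            continue  # not a one-to-one mapping; out of scope for this sketch
        try:
            roundtrip = text.encode(codec_name) == source
        except UnicodeEncodeError:
            roundtrip = False
        # |0 = roundtrip, |3 = reverse fallback (codepage to Unicode only)
        flag = 0 if roundtrip else 3
        lines.append(f"<U{ord(text):04X}> \\x{value:02X} |{flag}")
    return lines


latin1_lines = ucm_lines_for_single_byte_codec("latin-1")
```

A complete collector would also walk the Unicode code points and encode each one, which is how fallback mappings are found.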

Notes on mapping tables for stateful encodings

There are two main types of stateful encodings:

1. Simple bi-state encodings that typically change states with the ISO control codes SI and SO (Shift In and Shift Out). These are used especially with EBCDIC multi-byte codepages.
2. Complex encodings following mostly the ISO 2022 model of changing states with escape sequences, SI/SO controls, and single-shift codes.

The .ucm file format supports simple SI/SO-stateful encodings by specifying the codepage structure. Codepage byte sequences for the two states differ in length: single-byte codes in the initial state and double-byte codes in the other state.
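The following Python sketch shows how a byte stream in such a bi-state encoding breaks down into code units. It only performs the state tracking; looking each unit up in a mapping table is left out. The function name is hypothetical, and the sketch assumes the usual control code values SO = 0x0E and SI = 0x0F.

```python
SO = 0x0E  # Shift Out: switch to the double-byte state
SI = 0x0F  # Shift In: return to the initial single-byte state


def split_si_so(data):
    """Split SI/SO-stateful bytes into single- and double-byte code units."""
    units = []
    double_byte = False  # streams start in the single-byte state
    i = 0
    while i < len(data):
        byte = data[i]
        if byte == SO:
            double_byte = True
            i += 1
        elif byte == SI:
            double_byte = False
            i += 1
        elif double_byte:
            if i + 1 >= len(data):
                raise ValueError("truncated double-byte code at end of input")
            units.append(data[i:i + 2])
            i += 2
        else:
            units.append(data[i:i + 1])
            i += 1
    return units


units = split_si_so(bytes([0xC1, SO, 0x42, 0x43, 0x44, 0x45, SI, 0xC2]))
```

Here the first and last bytes are single-byte codes, and the four bytes between SO and SI form two double-byte codes.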

We do not currently have mapping tables for more complex stateful encodings. We plan to provide them with one mapping table file per state, plus a file that lists the states with their invoking sequences and per-state mapping table names.

The current .xml file format does not support any stateful encodings.

Note on GB 18030

Of the 1.1 million mappings defined by the GB 18030 standard, only about 31,000 are listed explicitly. The .xml file contains the remaining mappings in <range> elements. The .ucm file leaves the affected characters unassigned and relies on the ICU converter (release 1.7 and up) to perform these mappings algorithmically.
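To illustrate what "algorithmically" means here, the Python sketch below handles the largest such range, the supplementary code points U+10000..U+10FFFF, which GB 18030 maps linearly onto four-byte sequences starting at 90 30 81 30. It is a sketch of the arithmetic only, not ICU's implementation, and the function names are hypothetical. Four-byte sequences count like a mixed-radix number: bytes 1 and 3 range over 0x81..0xFE (126 values), bytes 2 and 4 over 0x30..0x39 (10 values).

```python
def four_byte_to_linear(seq):
    """Convert a GB 18030 four-byte sequence to its linear index."""
    b1, b2, b3, b4 = seq
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


def linear_to_four_byte(index):
    """Convert a linear index back to a GB 18030 four-byte sequence."""
    index, d4 = divmod(index, 10)
    index, d3 = divmod(index, 126)
    d1, d2 = divmod(index, 10)
    return bytes([0x81 + d1, 0x30 + d2, 0x81 + d3, 0x30 + d4])


# Linear index of the sequence assigned to U+10000.
SUPPLEMENTARY_BASE = four_byte_to_linear(b"\x90\x30\x81\x30")


def encode_supplementary(code_point):
    """Map U+10000..U+10FFFF to GB 18030 bytes without a table."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    return linear_to_four_byte(SUPPLEMENTARY_BASE + code_point - 0x10000)


def decode_supplementary(seq):
    """Inverse of encode_supplementary."""
    return 0x10000 + four_byte_to_linear(seq) - SUPPLEMENTARY_BASE


first = encode_supplementary(0x10000)
last = encode_supplementary(0x10FFFF)
```

The ranges inside the Basic Multilingual Plane follow the same arithmetic, each with its own starting code point and starting byte sequence.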

Feedback

To send feedback, comments, or issues related to this collection of mapping tables, please email us via Mailing Lists/Contacts.