ICU provides character set conversion with mapping tables for a number of important codepages. The default tables are a subset of IBM's CDRA conversion table repository. Our Converter Explorer shows aliases and codepage charts for the default tables that are built into a standard ICU distribution.
Conversions for most codepages are implemented differently on different platforms. This collection provides mapping tables from many different sources so that ICU users and others can get the same conversion behavior as on the original platforms. The mapping tables, and some of the source code for the tools that collected them, are checked into the https://github.com/unicode-org/icu-data repo. The table files are provided in two formats:
If you would like to add one of these tables to your copy of ICU, see the Data Customization section of ICU's User's Guide.
Some analysis of the data
Here is the analysis of the collected charset mapping information, created by some of our tools. Some of the aliases for the UTR #22 names can be found in our ICU Converter Explorer and in the ICU alias table.
How we collected the data
Notes on mapping tables for stateful encodings
There are two main types of stateful encodings:
The .ucm file format supports simple SI/SO-stateful encodings by specifying the codepage structure. Codepage byte sequences for the two states differ in length: single-byte codes in the initial state and double-byte codes in the other state. We do not currently have mapping tables for more complex stateful encodings. We plan to provide them with one mapping table file per state, plus a file that lists the states with their invoking sequences and per-state mapping table names. The current .xml file format does not support any stateful encodings.
Note on GB 18030
Of the large number (1.1 million) of mappings defined by the GB 18030 standard, only about 31000 mappings are listed explicitly. The .xml file contains the remaining mappings in <range> elements. The .ucm file leaves the affected characters unassigned and relies on the ICU converter (release 1.7 and up) to perform these mappings algorithmically.
Feedback
For feedback, comments, or issues related to this collection of mapping tables, please send email via Mailing Lists/Contacts.
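To illustrate what "algorithmically" means in the GB 18030 note, the following is a minimal sketch, not ICU's implementation. It covers only the supplementary code points U+10000..U+10FFFF, which account for 1,048,576 of the roughly 1.1 million mappings, and which GB 18030 assigns to consecutive four-byte sequences starting at 90 30 81 30. The function name gb18030_supplementary is hypothetical. The sketch cross-checks its results against the gb18030 codec in Python's standard library.

```python
def gb18030_supplementary(code_point: int) -> bytes:
    """Compute the four-byte GB 18030 sequence for a supplementary code point.

    Illustrative sketch only: the function name is hypothetical and this is
    not how ICU structures its converter.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("expected a code point in U+10000..U+10FFFF")
    # Linear offset from U+10000, which GB 18030 maps to 90 30 81 30.
    offset = code_point - 0x10000
    # Bytes 2 and 4 each take 10 values (0x30..0x39);
    # bytes 1 and 3 each take 126 values (0x81..0xFE).
    offset, b4 = divmod(offset, 10)
    offset, b3 = divmod(offset, 126)
    b1, b2 = divmod(offset, 10)
    return bytes([0x90 + b1, 0x30 + b2, 0x81 + b3, 0x30 + b4])


if __name__ == "__main__":
    for cp in (0x10000, 0x1F600, 0x10FFFF):
        computed = gb18030_supplementary(cp)
        # Cross-check against the codec shipped with Python.
        assert computed == chr(cp).encode("gb18030")
        print(f"U+{cp:04X} -> {computed.hex(' ')}")
```

Because the mapping is a plain mixed-radix conversion of a linear offset, a table file has no need to list these code points one by one, which is why a range notation or a converter-side algorithm is sufficient.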
