This is currently a proposal for .res formatVersion 2, intended for ICU 4.4. The motivation for the new, incompatible resource bundle file format is a significant ICU data size reduction.
The prototype in http://bugs.icu-project.org/trac/browser/icu/branches/markus/smallres is complete for Windows and passes all tests there. It still needs some work for non-Windows, big-endian and EBCDIC platforms, as well as code cleanup.
The ICU 4.2 .dat package data file is 16,012,800 bytes (default Windows build).
In the prototype branch, the .dat package is 13,632,048 bytes. This is a reduction of 2,380,752 bytes, or 14.9% of the .dat file size.
The .zip-compressed version of the .dat file is reduced from 5.63 MB (5,912,310 bytes) to 4.97 MB (5,217,935 bytes), that is, by about 11.7%.
Other changes are planned beyond the resource bundle file format changes here, for further ICU data size reductions. See the ICU data size reduction pages for ideas.
genrb can write resource bundles with either formatVersion 1.3 (new minor version) or 2.0 (new major version). The runtime code can use resource bundles with all formatVersions 1.x and 2.x.
All changes are invisible to users of the public API. New resource types are entirely internal, and are mapped to existing public resource types when queried via the API. (This was already done with URES_TABLE32.)
All resources still take advantage of memory mapping. There is no significant increase in heap memory usage. (Only some structs get some additional pointers and scalar values.)
These changes are compatible with the runtime code. However, because duplicate string values are eliminated, several resource items can refer to the same stored string. This requires a new version of the swapper code (compiled into the icupkg tool) so that the same resource is not swapped more than once.
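The double-swapping problem can be illustrated with a minimal sketch. It assumes a bitset with one flag per 16-bit unit index, used to remember which resources were already byte-swapped. The names mark_visited and swap16_once are hypothetical and are not taken from the ICU swapper.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: set the flag for `index` and report whether
 * this was the first visit (1) or a repeat visit (0). */
int mark_visited(uint8_t *flags, size_t index) {
    uint8_t mask = (uint8_t)(1u << (index & 7u));
    if (flags[index >> 3] & mask) {
        return 0;               /* already swapped earlier */
    }
    flags[index >> 3] |= mask;
    return 1;
}

/* Hypothetical helper: byte-swap `count` 16-bit units starting at
 * `start`, but only the first time this start index is seen. */
void swap16_once(uint16_t *units, size_t start, size_t count,
                 uint8_t *flags) {
    if (!mark_visited(flags, start)) {
        return;                 /* shared string: do not swap again */
    }
    for (size_t i = start; i < start + count; ++i) {
        units[i] = (uint16_t)((units[i] << 8) | (units[i] >> 8));
    }
}
```

Without the flag check, two table items that point at the same string would cause it to be swapped twice, which restores the original byte order and leaves the output corrupt.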
Within a bundle, there is a new, optional array of 16-bit units, for resource values that can be stored entirely as sequences of 16-bit units. One benefit is that none of the resources stored there need per-resource padding to 4-byte boundaries.
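To see why the shared 16-bit array helps, the following sketch compares the total size of a few resources when each one is padded to a 4-byte boundary with the size when they are packed back to back and padded only once at the end. The function names and the simplified layout (no length prefixes or terminators) are assumptions for illustration, not the actual .res layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Round a byte count up to the next multiple of 4. */
size_t pad_to_4(size_t bytes) {
    return (bytes + 3u) & ~(size_t)3u;
}

/* Total bytes if every resource (given as a count of 16-bit units)
 * is padded to a 4-byte boundary on its own. */
size_t total_padded_individually(const size_t *unit_counts, size_t n) {
    size_t total = 0;
    for (size_t i = 0; i < n; ++i) {
        total += pad_to_4(unit_counts[i] * sizeof(uint16_t));
    }
    return total;
}

/* Total bytes if all resources share one array of 16-bit units and
 * only the end of the array is padded. */
size_t total_packed(const size_t *unit_counts, size_t n) {
    size_t bytes = 0;
    for (size_t i = 0; i < n; ++i) {
        bytes += unit_counts[i] * sizeof(uint16_t);
    }
    return pad_to_4(bytes);
}
```

For resources of 3, 1, 5 and 2 units, individual padding takes 8 + 4 + 12 + 4 = 28 bytes, while the packed form takes 22 bytes rounded up to 24.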
Across bundles, a set of resource bundles may optionally use a common pool.res bundle which may contain some or all of the key strings of that set of bundles. For each bundle using the pool, the key strings are omitted if they exist in the pool bundle, and the key string offsets indicate whether they are for a local key string or a pool key string. At runtime, the pool bundle must be available when another bundle is loaded that depends on it.
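One way a key string offset could indicate whether it is local or from the pool is sketched below. It assumes that 16-bit offsets below a per-bundle limit address the bundle's own key strings and that offsets at or above the limit address the pool's key strings. The struct, field names and this particular encoding are hypothetical; the data format description is authoritative for the real scheme.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of the key string storage visible to one bundle. */
typedef struct {
    const char *local_keys;      /* this bundle's own key strings */
    uint16_t    local_key_limit; /* offsets below this are local */
    const char *pool_keys;       /* pool bundle key strings, or NULL */
} BundleKeys;

/* Resolve a 16-bit key offset to a NUL-terminated key string.
 * Returns NULL if a pool key is requested but no pool is loaded. */
const char *resolve_key16(const BundleKeys *b, uint16_t offset) {
    if (offset < b->local_key_limit) {
        return b->local_keys + offset;
    }
    if (b->pool_keys == NULL) {
        return NULL;             /* the pool bundle must be available */
    }
    return b->pool_keys + (offset - b->local_key_limit);
}
```

The NULL return mirrors the requirement that the pool bundle be available at runtime before a dependent bundle can be used.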
The current prototype uses this feature for the 330 locale data bundles. (Not for collation, break iterator, miscellaneous bundles, etc.)
As a consequence of the pool bundle being inaccessible during bundle swapping, the sort order of table items is fixed for all platforms. In sorting and lookup, key strings must be compared as ASCII strings even if they are not stored as ASCII. As a side benefit, iterating through a table will return its resources in the same order on all platforms, and swapping is simplified. (In formatVersion 1, the order is different on EBCDIC platforms and the swapper has to re-sort tables.)
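The platform-independent ordering can be sketched as a comparison that maps each stored byte to its ASCII value before comparing. The function name and the caller-supplied 256-entry mapping table are assumptions for illustration. On an ASCII platform the table would be the identity, and on an EBCDIC platform it would map EBCDIC code points to ASCII. The table must map the NUL byte to 0.

```c
#include <stdint.h>

/* Compare two NUL-terminated key strings by the ASCII values of their
 * characters. `to_ascii` maps each stored byte to its ASCII value and
 * must map 0 to 0. Returns <0, 0 or >0 like strcmp(). */
int compare_keys_as_ascii(const char *a, const char *b,
                          const uint8_t to_ascii[256]) {
    for (;;) {
        uint8_t ca = to_ascii[(uint8_t)*a++];
        uint8_t cb = to_ascii[(uint8_t)*b++];
        if (ca != cb) {
            return (int)ca - (int)cb;
        }
        if (ca == 0) {
            return 0;            /* both strings ended together */
        }
    }
}
```

For example, EBCDIC stores 'a' as 0x81 and 'A' as 0xC1, so a plain byte comparison puts "a" before "A". With the mapping applied, "A" sorts before "a", as it does on ASCII platforms.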
For a complete summary of the data format changes, see the updated data format description at http://bugs.icu-project.org/trac/changeset?new=icu/branches/markus/smallres/source/common/uresdata.h@HEAD&old=icu/branches/markus/smallres/source/common/uresdata.h@26012