The number of bytes we use to store keys is actually a significant part of the data, as you can see by the information on the right. So it would pay us to slim those down. Markus and I brainstormed about this, and have the following thoughts. Status: Much of this is implemented in ICU 4.4. Ideas for Size ReductionCommon Key StorageWe create a common KEY res file with the most common keys. ✓We could introduce a new TABLE type, or simply increment the formatVersion and use the top bit in each of the two existing table types to distinguish between local and pooled keys. (It would be good to be able to use 16-bit key offsets as much as possible; tests show that the size of this offset makes a significant difference.) ✓ (4.4: threshold value to distinguish local vs. pool keys) The key offsets in this table type can either come from the file or from a common KEY res file. If the offset is less than MAX_KEY, then it comes out of the KEY file; if not, then offset-MAX_KEY is used to get the key from the current file. MAX_KEY is found in the KEY file. ✓ Run genrb twice. The first time we supply a special parameter to build the KEY file. We gather all the key data, and build the KEY res file, containing just unique keys. The KEY res file also contains MAX_KEY value. ✓ (pool.res)
The second time we run genrb, we supply the KEY res file as an input parameter. Whenever a key in a TABLE occurs in that KEY file, we use the new TABLE type. ✓ The KEY res file could be either root.res itself or a separate, new file. With root.res, we would not need to load yet another file, but one could not update the root bundle without updating all of the other bundles in the locale hierarchy. ✓ (pool.res separate from root.res) Note that the KEY file is required to be part of ICU whenever any other res file is optimized to use it. Old res files, or res files that are not optimized using it, work as before. Judging from a prototype, this will save about 420kB (on top of suffix-sharing, see below) when using two pool.res files, one for locale data and one for collation data. IssueWhile swapping resource bundle data between ASCII and EBCDIC, the KEY file must be present because the keys in a table are sorted in binary string order, which changes between ASCII and EBCDIC. This might be a problem. One way out would be to always sort strings in ASCII order. On an EBCDIC platform, the lookup could swap the key into a stack buffer before doing the binary search. For pathologically long keys, there are a few options:
✓ (4.4: ASCII order, on EBCDIC with per-character swapping during string comparison) Theoretically, we could impose a charset-independent key sort order for all platforms, but it's probably not worth using an order different from ASCII. More theoretically, we could require the runtime code to read all keys in a table and copy-and-sort or build a hashtable — and not sort the keys in the binary data at all. That would always require some overhead and heap allocation similar to the string value compression. Using the same sort order (or none) for all platforms would greatly simplify swapping of resource bundles by avoiding the resorting of tables. ✓ OptionsThe key file should already be so small (17Kish - see right column) that it isn't worth going to great efforts to reduce it. But here are some ideas in case we want to think about them.
Single File Key StorageAlternatively, try to compress within a single file.
In the packed form, we can have a-zA-Z0-9 and two punctuation characters. We should encode colon ':' as 0 because the leading or trailing character must not have a 0 in its 6-bit code (otherwise we would have to encode the length explicitly), and the colon does not occur as a key-initial nor key-final character. We can encode underscore '_' as 1 followed by the letters and digits. Note: The ICU API can return key string pointers (although it is more common to use a key only for lookup). For packed keys, we will have to write a cache very similar to the string value cache discussed on the Strings page. Statistics for non-alnum key characters, from genrb processing all 660 resource bundles in an almost-ICU-4.2 build: *** non-alnum key chars (anywhere in the key): ':' = 0x3a: 12047 '_' = 0x5f: 4595 '-' = 0x2d: 543 '%' = 0x25: 248 ' ' = 0x20: 208 '/' = 0x2f: 25 '+' = 0x2b: 24 '.' = 0x2e: 9 ')' = 0x29: 3 '(' = 0x28: 3 *** non-alnum key chars as first key chars: '_' = 0x5f: 154 '-' = 0x2d: 134 '%' = 0x25: 124 *** non-alnum key chars as last key chars: '_' = 0x5f: 154 ')' = 0x29: 3 | Background InformationThe Keys are used for the TABLE type. There is a 16-bit offset to the
actual value, which is null terminated. They are not currently shared.The data is significant. In ICU4J, for example, a quick test shows the following:
Total Count: 2,409
Total Size: 728,764 bytes Total Unique Size: 17,187 bytes That is with a terminating null in each case. Here are the top contenders: Count Key 13780 0 13648 1 6910 ec 5551 other 4507 one 2698 ls 1676 ld 1324 2 1280 3 1251 few 953 4 951 5 951 6 921 many 871 lg 732 M 695 7 695 8 693 9 693 10 692 11 563 dn 539 cu 531 d 508 Version 471 y 399 wide 385 abbreviated 356 ss 317 format 313 calendar 294 sd 263 gregorian Here are the frequencies of characters in keys: 1 43655 e 2 40497 a 3 28254 r 4 25287 o 5 24928 n 6 23643 t 7 21948 i 8 21939 1 9 19295 c 10 18541 s 11 17717 0 12 16004 m 13 15452 l 14 13024 h 15 12617 A 16 11120 M 17 10813 d 18 10215 : !Alphanum 19 10140 u 20 7780 S 21 7401 y 22 7161 g 23 6720 E 24 6498 P 25 6273 D 26 6170 B 27 6105 R 28 6104 C 29 5717 k 30 5662 T 31 5418 b 32 5374 L 33 5284 N 34 5201 G 35 4908 p 36 4891 2 37 4827 f 38 4638 K 39 4007 w 40 3997 3 41 3712 U 42 3495 I 43 3428 5 44 3383 H 45 3298 4 46 3285 F 47 3223 _ !Alphanum 48 3128 O 49 2853 6 50 2801 V 51 2702 Z 52 2682 9 53 2534 v 54 2467 7 55 2350 Y 56 2318 8 57 1891 W 58 1876 z 59 1817 X 60 1519 J 61 1197 Q 62 1126 j 63 978 x 64 700 q 65 421 - !Alphanum 66 164 % !Alphanum |
Design Docs > Size Reduction >