ICU 4.8 Time Zone Names

Note: ICU4J TimeZoneNames/TimeZoneFormat were included in ICU4J 4.8 as a technology preview.

Background

ICU implements the time zone display name support defined in the LDML specification, which is rich and powerful compared to other implementations. At the same time, the algorithm in the LDML specification is quite complicated, so the ICU implementation is heavy and hard to maintain. Since we moved to the current LDML model, certain types of date format operations do not perform well, and it is almost impossible for users to customize the behavior even when the names returned by ICU do not suit them.

Requirements

There are several requirements related to time zone display names.

  1. Performance

      • The current implementation gathers all possible display names (many are generated algorithmically, not fetched from resources) for parsing. Initializing the name trie is extremely heavy and requires a huge memory footprint. We want to reduce the size of the trie and defer its initialization until it is really required.

      • All available names for a single time zone are initialized as a whole. Even when the format method does not use a certain type of name, the current implementation requires all of them to be initialized at the same time (and initializing certain types of names requires more than simple locale data retrieval).

  2. Direct access to plain name resources

      • The ICU resource contains plain time zone names in a relatively complex structure, and a lot of code and logic is involved in displaying a time zone name. For now, ICU does not expose direct access to these plain names; users can only access the names via a format object (TimeZone#getDisplayName also internally calls a part of the format code). However, some ICU users want to access these names without the other overhead. For example, Harmony/Android (or ICU's own Java TimeZoneNameProvider implementation) just wants to get long/short specific names (such as Eastern Standard Time, PDT). These names are stored in the ICU resources as is, and using them does not require any complicated logic.

      • Other users want to get names stored in the ICU resource but cannot get them as is, because they are always used as a part of a format. For example, ICU imports exemplar city name data from CLDR. Even if someone wants to compose a zone name using them, for example "GMT-05:00 New York, Detroit, Montreal", there is no way to collect localized exemplar city names with public APIs.

  3. Customization

      1. Before the LDML "metazone" model was introduced, ICU users could get/set time zone names through the DateFormatSymbols object, which represents time zone names as a plain two-dimensional array. After metazones were introduced, time zone names can no longer be represented in this form, because metazones require historic name mappings and some logic to distinguish one zone from another zone mapped to the same metazone. For now, the time zone array in DateFormatSymbols is not used and is not useful at all. We need public APIs that match the LDML time zone display name algorithm and data structure, so that ICU users can access and customize display names. [MD] This is especially important for CLDR work as well, so that we can override any of the ICU data.

      2. The GMT format (such as "GMT+03:00") is not customizable through public APIs. There is demand for composing time zone strings from different patterns.
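The second customization requirement can be illustrated with a minimal sketch of composing a localized GMT string from two separate patterns. The class GmtFormatSketch and its methods are hypothetical and are not ICU API; the pattern shapes ("GMT{0}" and "+HH:mm;-HH:mm") are assumed to follow the gmtFormat and hourFormat elements of LDML, and only the HH and mm fields are handled.

```java
import java.util.Locale;

/** Hypothetical sketch (not ICU API): builds a localized GMT string from two patterns. */
class GmtFormatSketch {
    private final String gmtPattern;      // e.g. "GMT{0}", where {0} is the offset part
    private final String positivePattern; // e.g. "+HH:mm"
    private final String negativePattern; // e.g. "-HH:mm"

    GmtFormatSketch(String gmtPattern, String hourPattern) {
        String[] parts = hourPattern.split(";", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected <positive>;<negative>: " + hourPattern);
        }
        this.gmtPattern = gmtPattern;
        this.positivePattern = parts[0];
        this.negativePattern = parts[1];
    }

    /** Formats a UTC offset given in milliseconds; seconds and milliseconds are dropped. */
    String format(int offsetMillis) {
        int totalMinutes = Math.abs(offsetMillis) / 60_000;
        String hh = String.format(Locale.ROOT, "%02d", totalMinutes / 60);
        String mm = String.format(Locale.ROOT, "%02d", totalMinutes % 60);
        String pattern = offsetMillis < 0 ? negativePattern : positivePattern;
        String offset = pattern.replace("HH", hh).replace("mm", mm);
        return gmtPattern.replace("{0}", offset);
    }

    public static void main(String[] args) {
        GmtFormatSketch gmt = new GmtFormatSketch("GMT{0}", "+HH:mm;-HH:mm");
        System.out.println(gmt.format(3 * 3_600_000));  // three hours east of GMT
        System.out.println(gmt.format(-5 * 3_600_000)); // five hours west of GMT
    }
}
```

Keeping the two patterns as independent inputs is the point of the sketch: a caller could swap either one (for example "UTC{0}" or "+HHmm;-HHmm") without touching the other.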

UTS#35 Time Zone Display Name Basics

UTS#35 LDML defines several styles of time zone display names and fallbacks. Certain styles of names are not stored in CLDR directly, and some algorithms are involved in getting the final display name. Also, each name style has its own goal and round-trip requirements. To achieve these goals, the display name for a certain type may use a variant form. The table below illustrates these styles, algorithms and variations.

Strategies

    • Create a new class for representing the CLDR time zone display name data model

      • Make the class an API class, so ICU users can fully customize the time zone display name data used by ICU. This lets CLDR ST populate time zone names on the fly, allows us to write verification code for detecting name collisions, etc.

      • We can deprecate the use of the legacy 2-dimensional zone strings in DateFormatSymbols for customizing the zone display names in DateFormat, which no longer works fully with the current algorithm.

      • The new API class allows direct access to raw time zone display names, including "exemplar city" names, which currently can be found only as a part of the generic location format.

      • Software requiring only specific long/short names (such as Android/Harmony, Java LocaleSPI TimeZoneNameProvider) can access the names directly through the API without dragging in the various overheads of time zone generic name handling.

      • In ICU4J, the isolation of time zone display name data allows us to make the time zone display name data provider a separate component.

    • Create a new public API - time zone format class interacting with the time zone display name data model class above

      • Add a setter/getter for the time zone format class in SimpleDateFormat. This allows ICU users to customize time zone display names at the data level and also to fully customize the time zone display name format used by ICU date format.

      • All pattern-driven time zone name formatting and fallback logic belongs in this class implementation.

    • Optimized for most common use cases

      • The default time patterns used by DateFormat do not contain generic time zone name types (location / non-location). These generic time zone names are not simple data; a lot of fallback/construction logic is involved, as the table in the previous section shows. The current ICU implementation loads and initializes generic names together with the other types of names used by the default time patterns. The new implementation defers the initialization of generic names until they are requested.

      • The current ICU time zone parser requires all names, including generic names, to be collected and stored in a trie. By imposing a limitation (do not parse names that are never generated by the date format instance), the new implementation can skip initializing all generic names and putting them into the trie.
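The deferred initialization described in the last strategy could look roughly like the following minimal sketch. The class LazyZoneNamesSketch, its methods, and the sample data are hypothetical and do not reflect the actual ICU implementation; the load counter exists only to make the deferral observable.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical sketch (not ICU code): generic names are loaded only on first request. */
class LazyZoneNamesSketch {
    private final Map<String, String> specificNames;
    private final Supplier<Map<String, String>> genericLoader;
    private Map<String, String> genericNames; // stays null until a generic name is requested
    private int genericLoadCount;             // instrumentation for the demo only

    LazyZoneNamesSketch(Map<String, String> specificNames,
                        Supplier<Map<String, String>> genericLoader) {
        this.specificNames = new HashMap<>(specificNames);
        this.genericLoader = genericLoader;
    }

    /** Cheap path: never touches generic data. */
    String getSpecificName(String tzID) {
        return specificNames.get(tzID);
    }

    /** Expensive path: runs the loader once, then serves from the cached map. */
    synchronized String getGenericName(String tzID) {
        if (genericNames == null) {
            genericNames = genericLoader.get();
            genericLoadCount++;
        }
        return genericNames.get(tzID);
    }

    synchronized int genericLoadCount() {
        return genericLoadCount;
    }

    /** Builds an instance with illustrative sample data. */
    static LazyZoneNamesSketch sample() {
        Map<String, String> specific = new HashMap<>();
        specific.put("America/New_York", "Eastern Standard Time");
        return new LazyZoneNamesSketch(specific, () -> {
            Map<String, String> generic = new HashMap<>();
            generic.put("America/New_York", "Eastern Time");
            return generic;
        });
    }

    public static void main(String[] args) {
        LazyZoneNamesSketch names = sample();
        System.out.println(names.getSpecificName("America/New_York"));
        System.out.println("generic loads so far: " + names.genericLoadCount());
        System.out.println(names.getGenericName("America/New_York"));
        System.out.println("generic loads so far: " + names.genericLoadCount());
    }
}
```

A formatter that only ever asks for specific names never pays for the generic loader, which is the behavior the strategy aims for with the default time patterns.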

Proposed APIs

[ICU4J]

Technology preview version of APIs in ICU4J 4.8:

[ICU4C]

<TBD>

Design Note

Q. Meta zone to/from TZID mapping is locale independent. Why are these methods (getAvailableMetaZoneIDs(String tzID) / getMetaZoneID(String tzID, long date) / getReferenceZoneID(String mzID, String region)) in TimeZoneNames class as instance methods?

A. It is true that the mapping is locale independent in CLDR. Initially, we thought the mapping could differ by locale, but we decided to use a single set of mapping data for all locales because it looked impossible to maintain such data in a locale-dependent manner. However, logically, the mapping is an integral part of time zone display name data, and we should not stop ICU users from defining different mappings (or no mappings, i.e., no meta zones). Of course, the ICU implementation shares the same mapping for all locales.
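As a rough illustration of why instance methods leave room for custom mappings, the sketch below keeps date-dependent zone-to-meta-zone periods per instance. The class MetaZoneMappingSketch is hypothetical, only the getMetaZoneID(tzID, date) shape is borrowed from the question above, and the zone IDs, meta zone IDs and switch date are made up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch (not ICU code): per-instance, date-dependent meta zone mappings. */
class MetaZoneMappingSketch {
    /** One mapping period covering [from, to) in milliseconds since the epoch. */
    private static final class Period {
        final String mzID;
        final long from;
        final long to;

        Period(String mzID, long from, long to) {
            this.mzID = mzID;
            this.from = from;
            this.to = to;
        }
    }

    private final Map<String, List<Period>> periodsByZone = new HashMap<>();

    void addMapping(String tzID, String mzID, long from, long to) {
        periodsByZone.computeIfAbsent(tzID, k -> new ArrayList<>())
                     .add(new Period(mzID, from, to));
    }

    /** Returns the meta zone in effect at the given date, or null if there is none. */
    String getMetaZoneID(String tzID, long date) {
        List<Period> periods = periodsByZone.get(tzID);
        if (periods == null) {
            return null;
        }
        for (Period p : periods) {
            if (date >= p.from && date < p.to) {
                return p.mzID;
            }
        }
        return null;
    }

    /** Builds an instance with made-up IDs and an arbitrary switch date. */
    static MetaZoneMappingSketch sample() {
        MetaZoneMappingSketch m = new MetaZoneMappingSketch();
        m.addMapping("Example/City", "Meta_A", Long.MIN_VALUE, 1_000_000L);
        m.addMapping("Example/City", "Meta_B", 1_000_000L, Long.MAX_VALUE);
        return m;
    }

    public static void main(String[] args) {
        MetaZoneMappingSketch m = sample();
        System.out.println(m.getMetaZoneID("Example/City", 0L));
        System.out.println(m.getMetaZoneID("Example/City", 2_000_000L));
    }
}
```

An instance created without any addMapping calls models the "no meta zones" case mentioned in the answer: every lookup returns null.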

Q. Do we need both TimeZoneNames#getAvailableMetaZoneIDs() and TimeZoneNames#getAvailableMetaZoneIDs(String tzID)?

A. Strictly speaking, no. The former can be implemented if the latter is available. The ICU implementation needs to access all available meta zone IDs, but generating the set of meta zone IDs from the latter version is a little expensive. We could make the former (no-arg version) non-abstract, implementing it by iterating through all time zones, and have ICU's TimeZoneNames class override the method to return the set with a more efficient implementation.
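The option discussed in this answer can be sketched as follows: a base class derives the no-arg result by iterating through all zones, and a subclass overrides it with a precomputed set. All class names, the getZoneIDs() helper, and the data are hypothetical; this is an assumption about how such a split might look, not the ICU code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch (not ICU code): the no-arg method defaults to iterating all zones. */
abstract class ZoneNamesSketch {
    /** All zone IDs known to this instance (stand-in for the system zone list). */
    abstract List<String> getZoneIDs();

    /** Meta zone IDs ever used by a single zone. */
    abstract Set<String> getAvailableMetaZoneIDs(String tzID);

    /** Default: correct, but walks every zone on each call. */
    Set<String> getAvailableMetaZoneIDs() {
        Set<String> all = new TreeSet<>();
        for (String tzID : getZoneIDs()) {
            all.addAll(getAvailableMetaZoneIDs(tzID));
        }
        return all;
    }
}

/** Relies on the inherited, iterating no-arg implementation. */
class IteratingNamesSketch extends ZoneNamesSketch {
    @Override
    List<String> getZoneIDs() {
        return Arrays.asList("Example/One", "Example/Two");
    }

    @Override
    Set<String> getAvailableMetaZoneIDs(String tzID) {
        if (tzID.equals("Example/One")) {
            return new TreeSet<>(Arrays.asList("Meta_A", "Meta_B"));
        }
        if (tzID.equals("Example/Two")) {
            return Collections.singleton("Meta_B");
        }
        return Collections.emptySet();
    }
}

/** Overrides the no-arg method to answer from a precomputed set. */
class PrecomputedNamesSketch extends IteratingNamesSketch {
    private static final Set<String> ALL =
            Collections.unmodifiableSet(new TreeSet<>(Arrays.asList("Meta_A", "Meta_B")));

    @Override
    Set<String> getAvailableMetaZoneIDs() {
        return ALL;
    }
}
```

Both classes return the same set; the override only changes the cost of producing it.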

Q. Do we need TimeZoneFormat.Style.SPECIFIC_SHORT_COMMONLY_USED?

A. After CLDR 2.0, we tentatively agreed to deprecate "commonly used". When the decision becomes final, we should remove the style before promoting it to public (@draft) API.

Incompatible Changes

This proposal introduces some incompatible changes. I assume the impact of these changes is acceptable.

    • SimpleDateFormat parse no longer looks up all possible time zone names. By default, it only supports RFC822 format, Localized GMT format and all names that are available for the time zone format pattern used by the SimpleDateFormat instance.

      • If you really need to look up all possible names for a locale, you can:

        • Create a TimeZoneFormat instance and use one of the parse methods that does not take a "style" parameter.

        • Create a TimeZoneFormat instance and call setParseAllStyles(true). You can set the instance on SimpleDateFormat too.

    • DateFormatSymbols.getZoneStrings() used to return several more name types that were never documented before. The API doc has been updated to clarify the contents of the result array and no longer includes name types other than these 4 types (long/short * standard/daylight).

    • DateFormatSymbols.setZoneStrings() still works, but it no longer affects the behavior of a SimpleDateFormat using that DateFormatSymbols instance. To provide custom time zone display name localizations, you need to create a custom instance (subclass) of TimeZoneNames.
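The first incompatible change, restricting name lookup to the style used by the pattern unless all styles are enabled, can be pictured with the minimal sketch below. The class StyleLimitedParserSketch, its Style enum and its data are hypothetical and only borrow the setParseAllStyles name from the text above; RFC822 and localized GMT parsing are left out.

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch (not ICU code): name lookup limited to the pattern's style. */
class StyleLimitedParserSketch {
    enum Style { SPECIFIC_LONG, SPECIFIC_SHORT, GENERIC_LONG }

    private final Map<Style, Map<String, String>> idsByStyle = new EnumMap<>(Style.class);
    private boolean parseAllStyles; // false by default

    void addName(Style style, String name, String tzID) {
        idsByStyle.computeIfAbsent(style, s -> new HashMap<>()).put(name, tzID);
    }

    void setParseAllStyles(boolean parseAllStyles) {
        this.parseAllStyles = parseAllStyles;
    }

    /** Returns the zone ID for the name, or null when no permitted style matches. */
    String parse(String text, Style patternStyle) {
        String tzID = lookup(patternStyle, text);
        if (tzID == null && parseAllStyles) {
            for (Style style : Style.values()) {
                tzID = lookup(style, text);
                if (tzID != null) {
                    break;
                }
            }
        }
        return tzID;
    }

    private String lookup(Style style, String name) {
        Map<String, String> names = idsByStyle.get(style);
        return names == null ? null : names.get(name);
    }

    /** Builds a parser with illustrative sample names. */
    static StyleLimitedParserSketch sample() {
        StyleLimitedParserSketch p = new StyleLimitedParserSketch();
        p.addName(Style.SPECIFIC_LONG, "Pacific Standard Time", "America/Los_Angeles");
        p.addName(Style.GENERIC_LONG, "Pacific Time", "America/Los_Angeles");
        return p;
    }
}
```

With the default setting, a generic name is rejected when the pattern style is SPECIFIC_LONG; after setParseAllStyles(true) the same input resolves to a zone ID.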

Performance Number Comparison

A. ICU4J DateFormatSymbols - getZoneStrings()

    long t0 = System.currentTimeMillis();
    DateFormatSymbols dfs = new DateFormatSymbols();
    long t1 = System.currentTimeMillis();
    String[][] zones = dfs.getZoneStrings();
    long t2 = System.currentTimeMillis();
    System.out.println("DateFormatSymbols(init): " + (t1 - t0));
    System.out.println("getZoneStrings() : " + (t2 - t1));

Per 10 iterations

B. ICU4J SimpleDateFormat GMT parse (ticket#7081)

    long t0 = System.currentTimeMillis();
    SimpleDateFormat sdf = new SimpleDateFormat("dd-MMM-yy-HH:mm:ss-z");
    long t1 = System.currentTimeMillis();
    try {
        Date d = sdf.parse("01-Jan-2015-00:00:20-GMT");
    } catch (Throwable t) {
        // Ignored: only the elapsed time matters here.
    }
    long t2 = System.currentTimeMillis();
    System.out.println("SimpleDateFormat(init): " + (t1 - t0));
    System.out.println("Parse : " + (t2 - t1));
    System.out.println("Total : " + (t2 - t0));

Per 10 iterations

C. ICU4J Full Date/Time Format Roundtrip

    Date d = new Date();
    long t0 = System.currentTimeMillis();
    DateFormat df = DateFormat.getDateTimeInstance(DateFormat.FULL, DateFormat.FULL);
    long t1 = System.currentTimeMillis();
    String s = df.format(d);
    long t2 = System.currentTimeMillis();
    try {
        Date d1 = df.parse(s);
    } catch (Throwable t) {
        // Ignored: only the elapsed time matters here.
    }
    long t3 = System.currentTimeMillis();
    System.out.println("DateFormat getInstance: " + (t1 - t0));
    System.out.println("format : " + (t2 - t1));
    System.out.println("parse : " + (t3 - t2));
    System.out.println("total : " + (t3 - t0));

Per 10 iterations

Note: The new implementation takes more time in getInstance() compared to the trunk because of TimeZoneFormat initialization. The difference for the second invocation is almost none, because the TimeZoneFormat instance is cached. We could defer the TimeZoneFormat initialization until it is really used (for example, a format pattern that does not have any time zone field does not need TimeZoneFormat...).
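One way to decide whether a pattern needs TimeZoneFormat at all is to scan it for time zone pattern letters outside quoted literals. The sketch below assumes the time zone letters are z, Z, v and V and that single quotes delimit literal text (with '' as an escaped quote); the class ZonePatternScanSketch is hypothetical and is not part of ICU.

```java
/** Hypothetical sketch (not ICU code): checks whether a pattern has a time zone field. */
class ZonePatternScanSketch {
    // Assumption: these are the time zone pattern letters to look for.
    private static final String ZONE_LETTERS = "zZvV";

    static boolean hasTimeZoneField(String pattern) {
        boolean inQuote = false;
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c == '\'') {
                // A doubled quote toggles twice, so the quoting state is unchanged.
                inQuote = !inQuote;
            } else if (!inQuote && ZONE_LETTERS.indexOf(c) >= 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasTimeZoneField("dd-MMM-yy-HH:mm:ss-z")); // has a zone field
        System.out.println(hasTimeZoneField("yyyy-MM-dd HH:mm"));     // no zone field
    }
}
```

A date format could run such a check once when the pattern is applied and create the TimeZoneFormat instance only when the result is true.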