SC22/WG20 N895

From: Kenneth Whistler [kenw@sybase.com]
Sent: Wednesday, November 21, 2001 7:33 PM
Subject: (SC22WG20.3582) 14652: Amended U.S. comments for the Annex D Issues list

Keld,

As per agreement at the last WG20 meeting, I am providing amended and extended U.S. comments for the Annex D Issues list for the next DTR 14652. The material below is intended as a complete replacement for the content of D.2 in the current draft. It represents a partial rewrite and update of the comments numbered 1, 2, and 3 in the current draft, and a replacement of comments 4-6 and the last paragraph with a new list of comments 4-8.

Please let me know if you have any questions about this.

Regards,

--Ken

***********************************************************************

D.2 Comments from the U.S. member body

The U.S. National Body continues to be extremely disappointed with the contents of this Technical Report. Among the serious technical problems we see in this document are:

1. As an extension of the POSIX locale syntax (cf. ISO/IEC 9945-2), this document maintains the drawbacks of POSIX as a "specification method for cultural conventions" per se. In fact, it exacerbates the weaknesses of POSIX in this regard by conflating more, poorly justified LC_XXX formal definitions into a monolithic FDCC-set construct. This was clearly done with a particular implementation model in mind, but does not follow, nor even seem to be particularly informed by, best current practice in the internationalization of software.

2. In an attempt to extend the POSIX LC_CTYPE specification to cover the repertoire of ISO/IEC 10646-1, this document blunders badly in asserting the cultural contextualization of character properties for the UCS. The treatment of LC_CTYPE as part of locales, i.e., as part of cultural adaptability, is an artifact of POSIX architecture and results from the need to have a place to put localized differences for case mapping.
But by cloning other character properties having nothing to do with case mapping into LC_CTYPE, the net effect is to create a second source for specification of UCS character properties, with attendant dangers of divergence and errors, and with inevitable difficulties of maintenance and versioning. The clear intent is to influence other ISO standards to obtain their character property definitions from this document, instead of by reference to the widely implemented UCS property tables published by the Unicode Consortium. This will lead to confusion and interoperability problems for character properties. It has demonstrably already been a problem for the maintenance of the COBOL standard.

3. Each of the categories in the FDCC-set description has unaddressed problems and limitations. Rather than being resolved during the development of this document, many of these limitations were simply asserted to be "requirements". It appears to us that those are limitations of a particular envisioned implementation, engendered by legacy compatibility issues with POSIX, rather than requirements following from the legitimate needs for specification of cultural conventions. Because of this, implementers attempting to make use of the FDCC-set categories are immediately faced with an unexplained host of problems and mismatches to the actual cultural adaptability which they are trying to specify and implement to meet customer needs for information technology.

4. The repertoire map and LC_CTYPE sections deal with the repertoire of ISO/IEC 10646 as it was in 1998, but nearly 55,000 more characters have been added to ISO/IEC 10646-1:2000 and ISO/IEC 10646-2:2001. It would be a serious mistake for a technical report to be published in 2002 that uses an obsolete repertoire of characters.

Even for the characters which are in the repertoire, there are problems in the LC_CTYPE section.
The classes to which characters are assigned -- or in which they do not appear -- often differ from comparable property lists in the Unicode Standard without any reasonable rationale being given. Since many implementations currently base their character properties on the data files in the Unicode Standard, arbitrary departure from those values is a recipe for interoperability problems.

For example, the punct class includes many currency symbols, but for no apparent reason omits such currency symbols as the drachma, dong, and kip signs. The digit class includes a large group of digits from many cultures, but does not include Myanmar, Ethiopic, fullwidth, and others that are included in the comparable Unicode class. Furthermore, the print and graph classes in LC_CTYPE do not include any Han ideographs, even though thousands of ideographs have been in ISO/IEC 10646-1 since 1993. And the tolower/toupper classes do not include the fullwidth Latin character pairs, even though Japanese national standards do include such characters, and implementations must support case mappings of the fullwidth Latin letters.

5. The repertoire map itself is a completely unnecessary addition to this document. It is intended to document and promulgate a particularly bad collection of character mnemonic short strings. The U.S. views these "mnemonics" as confusing and irrelevant to the supposed scope of the TR. The need for short identifiers for characters can be met much better by the standard short UCS identifiers spelled out in ISO/IEC 10646, which are in widespread use.

6. The LC_MONETARY section attempts to add support for multiple currencies, but does so incorrectly. The idea was to cover the time period when many European countries would be using individual national currencies and also the euro. However, the definition allows users to create multiple names for currencies, implying that the names are synonyms of each other. This is incorrect.
Deutschmarks and euros are not synonyms; they are two different currencies that could be used within one country at the same time. Similarly, French francs and euros also are not synonyms, but parts of LC_MONETARY are written as if two currencies like these are the same thing.

Besides the fact that the LC_MONETARY support for dual currencies is incorrect, it also is moot. By February 28, 2002, all 12 European Union member states that are adopting the euro will have retired their national currencies and adopted the euro for all transactions. The functionality described in this technical report will be moot before the TR is even finalized.

7. The LC_TIME section includes some changes that are incompatible with POSIX.2. Some week definitions that have depended on Sunday being considered the first day of the week are changed in this TR to use Monday as the first day of the week. This would break existing implementations.

Also in the LC_TIME section, timezone information has been added. The U.S. National Body objects strongly to this because such information already is separately defined via the TIMEZONE environment variable and does not belong in a locale or FDCC-set. Many countries span multiple time zones, and including timezone information makes it impossible to write a locale or FDCC-set to support such countries.

8. The new LC_XLITERATE section for character transliteration is significantly incomplete. It also does not belong in a locale or FDCC-set. Such functionality, where defined, should be similar to code set conversion -- users should be able to pick any source and target, rather than having some limited set of transliterations hard-coded in an FDCC-set.

Even if one believes transliteration should be in an FDCC-set, the support in this TR is inadequate for international needs. The syntax provided here will not work for many Asian languages (and some others), and cannot be expanded in a compatible way in the future to support such languages.
The limited string conversion functionality defined here is inadequate to the general problem of transliteration and is inappropriate for inclusion in this TR.
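
Illustration for comment 4: the character-property examples cited there can be checked against an implementation that takes its data from the Unicode Character Database. The following is a minimal sketch, not part of the formal comments, using Python's standard unicodedata module. The helper names (currency_categories, digit_values, fullwidth_case_pair) are hypothetical, and the module reflects whatever Unicode version ships with the interpreter, which is an assumption relative to the data current in 2001.

```python
import unicodedata

# Currency signs that comment 4 reports as missing from the punct class.
CURRENCY_SIGNS = {
    "DRACHMA SIGN": "\u20af",
    "DONG SIGN": "\u20ab",
    "KIP SIGN": "\u20ad",
}

# One sample digit from each script that comment 4 reports as missing
# from the digit class.
DIGIT_SAMPLES = {
    "MYANMAR DIGIT ONE": "\u1041",
    "ETHIOPIC DIGIT ONE": "\u1369",
    "FULLWIDTH DIGIT ONE": "\uff11",
}


def currency_categories():
    """Map each currency sign name to its Unicode general category."""
    return {name: unicodedata.category(ch) for name, ch in CURRENCY_SIGNS.items()}


def digit_values():
    """Map each sample digit name to the digit value recorded by Unicode."""
    return {name: unicodedata.digit(ch) for name, ch in DIGIT_SAMPLES.items()}


def fullwidth_case_pair():
    """Round-trip FULLWIDTH LATIN CAPITAL LETTER A through lower and upper."""
    upper = "\uff21"
    lower = upper.lower()
    return lower, lower.upper()


if __name__ == "__main__":
    print(currency_categories())
    print(digit_values())
    print(fullwidth_case_pair())
```

In this data all three currency signs carry the general category Sc (currency symbol), each sample digit has a digit value, and the fullwidth Latin letters have case mappings. These are the kinds of values from which an independently maintained LC_CTYPE table can drift.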