SC22/WG20 N975

From: Kenneth Whistler [kenw@sybase.com]
Sent: Tuesday, August 27, 2002 7:31 PM
To: tplum@plumhall.com
Cc: Winkler, Arnold F; jb@benito.com; nwallace@us.ibm.com; frank@farance.com; John.Hill@eng.sun.com; rex@RexJaeschke.com; nobuyoshi.mori@sap.com; Don.Schricker@microfocus.com; keld@dkuug.dk; willemw@ace.nl; kenw@sybase.com; asmusf@ix.netcom.com
Subject: RE: Agenda for Character set ad-hoc - 26th August

Tom Plum wrote:

> I took an action item to compare the extended-id character list
> of C++ (ISO/IEC 14882:1998) versus the latest PDTR 10176.
>
> Here is that comparison:
>
> 10176 added 42011 additional codes
>
> 10176 deleted these codes:
>
>      1  01BB
>      2  0384..0385
>      2  05F3..05F4
>      1  0640
>      8  064B..0652
>      1  0670
>      1  0E46
>     12  0E4F..0E5B
>      4  309B..309E
>      4  30FB..30FE
>    250  1100..11F9
>  ~1750  F900..FFDC
>
> I.e. 10176 deleted 12 ranges which contain about 2000 code points,
> compared to the old C++ list (from the original 10176 list)
>
> If anybody finds errors, please post them to this list.

Unfortunately, there are *many* errors in this accounting.

I don't have a copy of ISO/IEC 14882:1998 to hand, so I cannot
compare the listing of extended-id characters there to the
10176 listing, but I *do* have copies of all the relevant 10176
documents.

Judging from what Tom observes as major deletions in 10176,
what he calls "the original 10176 list" can only be the
*D*TR 10176, WG20 N477, dated 1996-12-31. That document
underwent a major overhaul as a result of its balloting,
based on the proposed disposition of comments (WG20 N531,
dated 1997-09-26), revised as the final disposition of
comments (WG20 N532R, dated 1997-12-14). The final result,
which was published as the *TR* 10176, 2nd edition, can be
seen in WG20 N533, dated 1998-02-15, which was the last
WG20 document before the 2nd edition was published.

The deletions between the DTR (an unpublished document) and
the published TR 10176 2nd edition were as follows:

     1  0384
     2  05F3..05F4
     2  309D..309E
     2  30FD..30FE
   240  1100..1159, 1161..11A2, 11A8..11F9 (note: 240, not 250)
  1240  F900..FFDC (lots of gaps; note: 1240, not ~1750)

(There were numerous additions as well as the deletions.)

Incidentally, the U.S. national body (and the UTC) requested
only the deletion of 0384 (which was an error for the intended
0386, which was added), and 05F3..05F4 (which are Hebrew
punctuation, not letters). The other deletions were blanket
removals at the behest of the then-Japanese editor of 10176
and because of the decision by WG20 to omit all combining
marks in 10646 Annex B.2 ("List of combining and other
characters not allowed in implementation level 2") -- which
accounts for the omission of the conjoining jamos for Korean.
The *Unicode* recommendations for extended identifiers
contain all 1484 of those omissions, as documented at the
end of Annex A in TR 10176, 4th edition.

Next in the accounting trail, consider the differences between
the published TR 10176, 2nd edition and the published
TR 10176, 3rd edition. The differences can be found in the
Amendment 1 text, WG20 N699, dated 1999-10-22, which was the
last committee document before the publication of the
3rd edition.

The deletions between the TR 10176 2nd edition and the
TR 10176 3rd edition were as follows:

     1  00B7
     1  06D4
     1  0E4F
     2  0E5A..0E5B
    10  0F2A..0F33
     2  0F3E..0F3F
     2  309B..309C

The rationales for these deletions are:

  00B7, 06D4, 0E4F, 0E5A..0E5B are all punctuation.
  (00B7 is the notorious MIDDLE DOT, and has to be
  special-cased, like LOW LINE)

  0F2A..0F33 are Tibetan half-digit symbols -- not the
  regular Tibetan digits

  309B..309C are *spacing* diacritics, comparable to the
  other spacing accents which have always been omitted
  from the recommended list for 10176.

  0F3E..0F3F are combining marks -- their omission was the
  result of a clerical error in carrying out the committee
  mandate to separate the list of non-combining marks and
  combining marks in the listing for Annex A. The *Unicode*
  recommendation is to include them, as also documented
  at the end of Annex A in TR 10176, 4th edition.

TR 10176, 4th edition has not recommended the deletion of
any more characters from the list. It *did* make large
extensions to account for all the additions to 10646-1:2000
2nd edition (= Unicode 3.0). However, the additions are
nowhere near "42011 additional codes", since the entire
repertoire consists of 49,194 graphic characters, most of
which were already accounted for in the recommendations
for identifiers in TR 10176, 2nd edition. The major
additions were 6582 new CJK characters, 1165 Yi characters,
and somewhere around 2000 for other new scripts (Ethiopic,
Canadian syllabics, Khmer, Myanmar, Mongolian, etc.).

The other characters on Tom's list of deletions are errors
in his accounting. 0385 was *never* in the 10176 list.
01BB, 0640, 064B..0652, 0670, 0E46, 0E50..0E59, and
30FB..30FC were *all* in the original DTR 10176 list, and
have *stayed* in the published 2nd, 3rd, and 4th editions.

O.k., so with all of that out of the way, let me summarize
the picture.

Unlike the somewhat scary picture of instability suggested
by Tom's conclusion, to wit:

  "10176 deleted 12 ranges which contain about 2000 code points"

the actual state of affairs is that since the *publication*
of the 2nd edition of 10176, with its Annex A, 10176 has
deleted a total of 19 code points -- and two of those were
the result of a clerical error. The others all have good
reasons for being omitted, and only one of them, U+00B7
MIDDLE DOT, can be considered a common-use character.
Since the publication of the 3rd edition, *none* have been
deleted.
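
The count of 19 can be re-derived mechanically from the list of
deletions between the 2nd and 3rd editions given earlier. The
following is a minimal Python sketch of that arithmetic; the helper
name count_code_points is hypothetical, and each range is assumed to
be inclusive at both ends, as in the listings.

```python
def count_code_points(ranges):
    """Count code points across inclusive (first, last) ranges."""
    return sum(last - first + 1 for first, last in ranges)


# Deletions between TR 10176 2nd edition and 3rd edition,
# transcribed from the listing in this message.
DELETED_2ND_TO_3RD = [
    (0x00B7, 0x00B7),  # MIDDLE DOT (punctuation)
    (0x06D4, 0x06D4),  # punctuation
    (0x0E4F, 0x0E4F),  # punctuation
    (0x0E5A, 0x0E5B),  # punctuation
    (0x0F2A, 0x0F33),  # Tibetan half-digit symbols
    (0x0F3E, 0x0F3F),  # combining marks dropped by clerical error
    (0x309B, 0x309C),  # spacing diacritics
]

total = count_code_points(DELETED_2ND_TO_3RD)
clerical = count_code_points([(0x0F3E, 0x0F3F)])
print(total, clerical)  # 19 2
```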

Now, if the extended-id character list in ISO/IEC 14882:1998
was based on the DTR for 10176 2nd edition, rather than the
published TR, as appears to have been the case, then the
situation does, indeed, involve a rather more serious mismatch.
There must have been a serious breakdown in committee liaison
involved, since it seems rather questionable to base a language
standard on a DTR list still undergoing substantial revision
in another committee.

However, it seems to me, the road forward would not consist
of attempting to *remove* large numbers of characters from
C++ identifiers, based on what happened in the DTR balloting
of TR 10176 2nd edition, but rather to consider the more
benign consequences of trying to harmonize with the current
Unicode recommendations, instead.

Assuming that ISO/IEC 14882:1998 was, indeed, based on the
DTR 10176 2nd edition text, here are the consequences of the
two approaches:

Deletions required to synch with TR 10176, 4th edition
Annex A recommendations:

     1  0384
     2  05F3..05F4
     2  309D..309E
     2  30FD..30FE
   240  1100..1159, 1161..11A2, 11A8..11F9
  1240  F900..FFDC (lots of gaps; note: 1240, not ~1750)
     1  00B7
     1  06D4
     1  0E4F
     2  0E5A..0E5B
    10  0F2A..0F33
     2  0F3E..0F3F
     2  309B..309C

  Total: 1506 deletions

Deletions required to synch with the Unicode recommendations
(see the recommendations appended at the end of Annex A in
the 4th edition of TR 10176):

     1  0384        [spacing diacritic]
     2  05F3..05F4  [punctuation]
     1  00B7        [punctuation: special-case for Catalan]
     1  06D4        [punctuation]
     1  0E4F        [punctuation]
     2  0E5A..0E5B  [punctuation]
    10  0F2A..0F33  [Tibetan half-digit symbols]
     2  309B..309C  [spacing diacritics]

  Total: 20 deletions (19 - 2 + 3)

I think the second list would be a whole lot easier to document
and justify to your standard's constituents.

Regards,

--Ken Whistler

P.S.
If anyone else wants to churn through the statistics here,
I have soft copies of the lists from DTR 10176 2nd edition,
the list resulting from the application of the proposed
disposition of comments to DTR 10176, the list resulting from
the application of the *final* disposition of comments to
DTR 10176, the actual list from the published TR 10176
2nd edition, the list from Amd 1 (the basis for TR 10176
3rd edition), and the list from TR 10176 4th edition. So you
are welcome to check my work, if you'd like.

A nagging notion made me go back and double-check, and I was
wrong. There were two more deletions for the 4th edition:

     1  2118  SCRIPT CAPITAL P [a misidentified character]
     1  212E  ESTIMATED SYMBOL

So the revised summary would be:

Deletions required to synch with TR 10176, 4th edition
Annex A recommendations:

     1  0384
     2  05F3..05F4
     2  309D..309E
     2  30FD..30FE
   240  1100..1159, 1161..11A2, 11A8..11F9
  1240  F900..FFDC (lots of gaps; note: 1240, not ~1750)
     1  00B7
     1  06D4
     1  0E4F
     2  0E5A..0E5B
    10  0F2A..0F33
     2  0F3E..0F3F
     2  309B..309C
     1  2118
     1  212E

  Total: 1508 deletions

Deletions required to synch with the Unicode recommendations
(see the recommendations appended at the end of Annex A in
the 4th edition of TR 10176):

     1  0384        [spacing diacritic]
     2  05F3..05F4  [punctuation]
     1  00B7        [punctuation: special-case for Catalan]
     1  06D4        [punctuation]
     1  0E4F        [punctuation]
     2  0E5A..0E5B  [punctuation]
    10  0F2A..0F33  [Tibetan half-digit symbols]
     2  309B..309C  [spacing diacritics]
     1  2118        [misidentified as letterlike symbol]
     1  212E        [symbol, not letterlike]

  Total: 22 deletions (21 - 2 + 3)

And note that there is one other fly in the ointment. Java
*allows* 2118, 212E, and 309B..309C in identifiers. So for
identifier stability and interoperability, those four
characters might also need to be special-cased.

Regards,

--Ken Whistler
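
The "(21 - 2 + 3)" in the revised total can be read as set
arithmetic: the 21 code points deleted since the published 2nd
edition, minus the 2 combining marks that the Unicode recommendation
keeps, plus the 3 code points (0384, 05F3..05F4) removed at the DTR
stage. The following Python sketch works through that reading; the
helper expand and all variable names are hypothetical, and the data
is transcribed from the listings in this message.

```python
def expand(ranges):
    """Expand inclusive (first, last) ranges into a set of code points."""
    return {cp for first, last in ranges for cp in range(first, last + 1)}


# Deleted since the published 2nd edition (3rd and 4th editions).
since_2nd_edition = expand([
    (0x00B7, 0x00B7), (0x06D4, 0x06D4), (0x0E4F, 0x0E4F),
    (0x0E5A, 0x0E5B), (0x0F2A, 0x0F33), (0x0F3E, 0x0F3F),
    (0x309B, 0x309C), (0x2118, 0x2118), (0x212E, 0x212E),
])

# Dropped by clerical error; the Unicode recommendation includes them.
clerical_error = expand([(0x0F3E, 0x0F3F)])

# Removed between the DTR and the published 2nd edition, and also
# excluded by the Unicode recommendation.
dtr_stage = expand([(0x0384, 0x0384), (0x05F3, 0x05F4)])

unicode_deletions = (since_2nd_edition - clerical_error) | dtr_stage

# Code points the message says Java allows in identifiers.
java_allowed = expand([(0x2118, 0x2118), (0x212E, 0x212E), (0x309B, 0x309C)])

print(len(since_2nd_edition), len(unicode_deletions))  # 21 22
print(sorted(f"{cp:04X}" for cp in java_allowed & unicode_deletions))
```

The last line lists the overlap between the Java-permitted characters
and the deletion set, which is where any special-casing would apply.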