SC22/WG20 N783
L2/00-355

From: Markus Kuhn [Markus.Kuhn@cl.cam.ac.uk]
Sent: Thursday, October 12, 2000 6:45 PM
Subject: (SC22WG20.3075) Comments on ISO PDTR 14652

Keld Simonsen wrote on 2000-10-12 17:42 UTC:
> The current status of the draft is that it exists a PDTR draft available
> via http://www.dkuug.dk/jtc1/sc22/WG20/docs/projects

Thanks!

Some comments on ISO PDTR 14652 from

  http://anubis.dkuug.dk/jtc1/sc22/WG20/docs/n690.pdf

a) In line 1603, please make clear that the repertoire map is optional. In
   practical implementations such as glibc 2.2, no repertoire maps
   will be used any more. All characters will be defined exclusively in
   the form <Uxxxx>. Repertoire maps are an archaic and obsolete pre-UCS
   concept that should never lead to mandatory elements of the syntax
   anywhere. Strings in locales should either be specified in <Uxxxx>
   notation for maximum portability, or in UTF-8 for maximum readability.
   Repertoire maps have nothing practically useful to add to these two
   options.

b) In section 4.3.2.3, the description of the semantics of keywords
   "default_missing" and "translit_ignore" is incomplete, ambiguous
   and confusing. I haven't understood what "translit_ignore" is good
   for. Please don't explain it to me, instead rewrite the document such
   that there can be no doubt for me how I have to implement this.

c) In section 4.3.2, there is at the moment no description of a proper
   step-by-step algorithm for how transliteration has to be performed
   according to the data supplied in these keywords (especially
   "default_missing" and "translit_ignore"). With the current formulation,
   each implementor will come up with something very different. What does
   "ignore" mean for example? Substitution with the empty string? Is
   there any difference between ignoring a character and not providing
   a transliteration statement for it? (I can suggest one plausible
   transliteration algorithm, but I'd first like to read what you had
   in mind originally.)

d) Can included transliteration statements redefine previous ones?
   This is one of the many questions about the unspecified transliteration
   algorithm that the spec currently does not answer.

e) What is "combining" and "combining_level3" good for? These sets seem
   to be only meaningful in one single coded character set, namely UCS,
   and there they are hardwired into the respective latest edition of
   the ISO 10646 standard. There is no cultural dependency at all here,
   so "combining" and "combining_level3" clearly have no place in a cultural
   convention specification. They are just fixed properties of a single
   standard.

f) wcwidth() and wcswidth() depend on cultural conventions and
   transliteration but I haven't seen any provisions for the necessary
   tables. These would be much more important than "combining" and
   "combining_level3".

g) In section 4.3.2.1, I have great worries about the idea that the
   <transliteration_source> string can be more than one character long.
   This leads to an endless series of implementation problems and would
   definitely be better dropped. For example, the C99 standard requires
   all wide-character to multi-byte conversion (which is where the
   transliteration would have to be hooked into the C library) to behave
   as if done by calls to wcrtomb(). However, wcrtomb() is required to
   swallow a wide character immediately and spit out the corresponding
   multi-byte sequence (ISO C99, section 7.24.6.3.3). There is no room for
   buffering wide characters until it becomes clear what the longest
   <transliteration_source> string is at the current position in the
   wide character stream. The mbstate_t value only keeps state in the
   sequence of multi-byte characters, not in the sequence of wide characters.
   Otherwise, the semantics of the file positioning functions would be
   messed up completely. Please please remove the option of transliterating
   strings into strings. It sounds neat at first, but clearly wasn't
   carefully thought through and obviously is not based on implementation
   experience. Single-character to string transliteration, however, is no
   problem at all, because it is very similar to wide-character to
   multibyte-character conversion, and C99 therefore already has all the
   necessary infrastructure in place.
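
To illustrate the last point in g), here is a minimal sketch of how
single-character to string transliteration could sit on top of wcrtomb().
It is not taken from the PDTR or from any existing C library: the names
translit_entry, translit_table, translit_lookup and translit_wcrtomb are
hypothetical, the table contents are made-up sample data, and the fallback
rule (transliterate only when wcrtomb() rejects the character) is just one
plausible choice, since the draft does not define an algorithm.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical table entry: exactly ONE source character, one replacement. */
struct translit_entry {
    wchar_t source;
    const wchar_t *replacement;
};

/* Made-up sample data, not taken from any locale definition. */
static const struct translit_entry translit_table[] = {
    { 0x00E4, L"ae"  },  /* LATIN SMALL LETTER A WITH DIAERESIS */
    { 0x20AC, L"EUR" },  /* EURO SIGN */
};

/* Return the replacement string for wc, or NULL if there is none. */
static const wchar_t *translit_lookup(wchar_t wc)
{
    size_t count = sizeof translit_table / sizeof translit_table[0];
    for (size_t i = 0; i < count; i++)
        if (translit_table[i].source == wc)
            return translit_table[i].replacement;
    return NULL;
}

/*
 * Like wcrtomb(), but if wc cannot be encoded in the current locale, the
 * replacement string from the table is encoded instead. At most `size`
 * bytes are written to s. Returns the number of bytes written, or
 * (size_t)-1 with errno set to EILSEQ (no encoding, no replacement) or
 * ERANGE (output buffer too small; an assumption of this sketch).
 * Every call consumes exactly one wide character; none are buffered.
 */
size_t translit_wcrtomb(char *s, size_t size, wchar_t wc, mbstate_t *ps)
{
    assert(s != NULL && ps != NULL);

    char buf[MB_LEN_MAX];
    const mbstate_t saved = *ps;  /* state is unspecified after an error */
    size_t n = wcrtomb(buf, wc, ps);

    if (n != (size_t)-1) {        /* directly encodable: plain wcrtomb() */
        if (n > size) {
            *ps = saved;
            errno = ERANGE;
            return (size_t)-1;
        }
        memcpy(s, buf, n);
        return n;
    }

    *ps = saved;
    const wchar_t *r = translit_lookup(wc);
    if (r == NULL) {
        errno = EILSEQ;
        return (size_t)-1;
    }

    size_t total = 0;
    for (; *r != L'\0'; r++) {    /* encode the replacement char by char */
        n = wcrtomb(buf, *r, ps);
        if (n == (size_t)-1 || n > size - total) {
            if (n != (size_t)-1)
                errno = ERANGE;
            *ps = saved;
            return (size_t)-1;
        }
        memcpy(s + total, buf, n);
        total += n;
    }
    return total;
}
```

The only state carried between calls is the mbstate_t of the output
multi-byte sequence, which is why this variant fits the wcrtomb() model.
A multi-character <transliteration_source> would instead require the
function to hold back wide characters it has already been given.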

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>