Title ISO/IEC CD 14651 - International String Ordering - Method for comparing Character Strings and Description of a Default Tailorable Ordering
[ISO/CEI CD 14651 - Classement international de chaînes de caractères - Méthode de comparaison de chaînes de caractères et description d'un ordre de classement implicite adaptable]
Status: Committee Draft for Registration and CD Letter Ballot
Date: 1996-04-12
Project: 22.30.02.02
Editor: Alain LaBonté
Gouvernement du Québec
Secrétariat du Conseil du trésor
Service de la prospective
875, Grande-Allée Est, 4C
Québec, QC G1R 5R8
Canada
GUIDE SHARE Europe
SCHINDLER Information AG
CH-6030 Ebikon (Bern)
Switzerland
Email: [email protected]
Table of Contents:
FOREWORD
INTRODUCTION
Tutorial on problems solved by this standard
1 Scope
2 Normative References
3 Definitions
4 Symbols and abbreviations
5 Requirements
5.1 Prehandling phase (external to the comparison operation engine)
5.1.1 Prehandling of the symbolic table data
5.1.2 Prehandling of character strings provided to the comparison operation engine
5.2 Comparison operation engine
5.2.1 Multi-field key comparison
5.2.1.1 API 1 - Comparison done directly on character strings (COMPCAR)
5.2.1.2 API 2 - Comparison done on predigested processable bit strings (COMPBIN)
5.2.1.3 API 3 - Conversion of a character string to a comparable bit string (CARABIN)
5.3 Multilevel key building
5.3.1 Preliminary considerations
5.3.1.0 Assumptions
5.3.1.1 Table sections and processing properties
5.3.2 Key composition
5.3.2.1 Formation of properties vector
5.3.2.2 Formation of subkey level 1 through m minus 1 (level i; m=4 in the default)
5.3.2.3 Formation of subkey level m (m=4 in the default table)
5.3.2.4 Formation of subkey level 5
5.3.2.5 Posthandling
5.4 Table formation
5.5 Default table
6 Conformance
7 Data specification
7.1 Data specification
7.2 Tailoring Mechanism
Normative annexes
Annex 1 (normative) International Default Table
Annex 2 (normative) Benchmark
Informative annexes
Annex A (informative) - Criteria used initially to prepare the standard
Annex B (informative)
Description of the prehandling phase
Description of the Posthandling Phase
Annex C Sources for methods and data gathering
Annex D (informative) Preliminary principles of table assignments
Annex E (informative) - Principles of the comparison engine
Annex F. Revised (if necessary) - From a requirement to its implementation - Compare, Sort, Search
Annex G. Discussion on the number of levels for each script and their harmonization
Annex H. Example of national ordering standards and how they can be harmonized to the international standard
This is one of the major flaws that affect portability between countries and between applications. (Traditionally, different programs make different ordering corrections.) Therefore, it has been considered feasible to design a Default Tailorable Ordering Mechanism (a method and a unique table). This mechanism will constitute an acceptable tool that will make sense for most users of the different scripts. Also, most simple applications will be able to use the mechanism without modification. These applications use ordering dependencies that are not dependent on any context.
Naturally, a modification mechanism is embedded in the model. The mechanism will accommodate particular languages with a minimum of changes. Let us look at Latin Script as an example. The Spanish and Scandinavian languages will have the order of a few letters changed compared to the order acceptable in most other European languages that use the Latin script. Also, a whole script order change could be desired relative to another one (for example, Thai before Latin, and so on).
Furthermore, there might be specific linguistic requirements that cannot be fulfilled without knowing the context. For example, Japanese names expressed in Kanji cannot be deduced solely in phonetic ordering. Instead, Japanese names need hidden multiple fields. Generally, in Japanese databases, a given Kanji proper name is associated with a hidden phonetic representation in a different field. This association allows correct ordering; otherwise a replication of items might be necessary for human searching of Kanji proper names in a list in the absence of other fields.
More generally, specific requirements exist for complex telephone-book type transformation or for phonetic transformation. This is particularly true in multi-lingual countries or organisations. As an example, the item "4" could sometimes be phonetically classified (transformed) in such lists to accomplish ordering. This transformation requires that the item be reproduced several times. Each replicated item is hence transformed for phonetic ordering (for example, as "QUATRE", "FOUR" and "VIER" in French, English, and German respectively). In this way, a user can immediately retrieve the item "4" in a list under "Q", "F" and "V" depending on the individual user requirements.
To achieve these requirements, the comparison and ordering mechanism on which focus is directed here is included in a more general model. The general model is also described in this international standard. The general model allows multiple-field ordering and prehandling and posthandling transformation phases. The ordering mechanism assumes this higher-level scheme.
Specifically, the prehandling and posthandling phases could be null processes. Also, in the simplest applications, typically only one field will be ordered. In such cases, a straightforward order could be achieved and would be reasonably valid for the majority of users who do not require further specialised transformation. The typical lexical dictionary order in a given natural language is an example of this type. It is assumed that lexical order is the minimal culturally acceptable order for a list so that the general public, and even specialists, can use it without error.
To simplify matters, the Default Tailorable Ordering Mechanism will describe a method to order text data independently of context. The method will be culturally acceptable to a majority of world-wide environments (with provisions to accommodate more local contexts).
It is obvious that ordering is not limited to a sorting program. Ordering requires that string comparison be consistently redefined with a new comparison engine. This engine will be used by processes which compare, sort, search, mix, and merge graphic character data. This engine will be described in this international standard.
The design of this international standard keeps in mind that old systems could also integrate culturally valid ordering with minimal changes. Therefore, the basic engine will not work directly on a text string of graphic characters. Instead, the first phase of the process reduces the text string to a single bit string that is suitable for direct and mechanical numeric comparisons.
Numeric data has two general kinds of representation. One type of representation is external and uses human readable graphic characters. The other type of representation is internal and is directly suitable for high-speed processing. For this reason, programming languages define data types for suitable processing of numbers (in general more than one type). In this way, programmers do not need to parse graphic characters before performing numeric processing. This parsing would be very prone to errors, add to programming complexity, and would not achieve general consistency among different applications.
Character comparisons are of a more complex nature. Therefore, having the programmers involved in parsing is no more desirable than it is for numbers. Nevertheless, this was the prevailing situation before the present international standard was designed.
The consistent text data comparison engine described in this international standard works on an internal structure that is the result of parsing an original string for comparison. Parsing is done according to a formal description of cultural ordering conventions. The definition of such an engine makes it highly desirable that future versions of programming language standards define new data types. In each language, it is desirable that at least one data type manage graphic character string comparisons that are not limited to absolute equality. The programming language can define these data types as formal containers. These containers represent strings of text that can be processed internally, in a way that is very straightforward and independent of coded graphic characters.
In this way, the programmer is freed from parsing processes. Also, the probability of achieving application portability between different countries using different cultures would be increased because applications can be designed in a generic way.
Furthermore, the pre-digested structure materialising such a data type can be stored and reused in a given cultural environment to increase performance and to preserve past applications with minimal changes. Reusing the structure would require no further parsing by external, even ancient, hard-wired engines that have the capability to do straightforward binary comparisons (such as a hardware disk search engine, or an access method designed decades ago that developers do not want to redesign because of its high efficiency). This feature is a non-negligible economic by-product of this international standard: once a string has been parsed for an environment, its processing does not require re-parsing. In fact, as for numbers, the standard graphic character representation need not be used until data is presented again to the user. This calls for reversibility of the process. The present standard makes that reversibility a possibility, in addition to guaranteeing the full predictability of the comparison operation. If two equivalent strings are not absolutely equal, then the tie must be broken. Consequently, a sort program, the simplest application, can always sort data in the same way.
Ex.: Sorting the list "august", "August", "container", "coop", "co-op", "Vice-president", "Vice versa" gives the following order, if ISO 646 coding is used and a simple sort following binary order is done:
August
Vice versa
Vice-president
august
co-op
container
coop
which is obviously wrong.
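The result above can be reproduced with a minimal Python sketch. It assumes that Python's default string comparison (by code point) stands in for a binary sort under ISO 646 coding, which holds for these unaccented strings; the variable names are illustrative only.

```python
# Plain code-point sort, standing in for a binary sort on ISO 646 data.
words = ["august", "August", "container", "coop", "co-op",
         "Vice-president", "Vice versa"]

binary_sorted = sorted(words)

for word in binary_sorted:
    print(word)
```

Upper-case letters, the space and the hyphen all have lower code values than the lower-case letters, which explains the culturally unacceptable order.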
ii. Translating lower case to upper case and removing special characters gives a sorted list acceptable to users, but also unpredictable results.
Ex.: Sorting the list "August", "august", "coop", "co-op" gives the following order:
August
august
coop
co-op
Sorting the same list with a different initial order, say, "august", "August", "co-op", "coop" gives a different order with this method:
august
August
co-op
coop
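The unpredictability described in step ii can be illustrated with a short sketch. The function name naive_key is hypothetical; the sketch assumes a stable sort (Python's sorted is stable), so strings whose reduced keys are equal simply keep their input order.

```python
def naive_key(text):
    """Upper-case the string and drop every non-alphanumeric character."""
    return "".join(ch for ch in text.upper() if ch.isalnum())

# Same four strings, two different initial orders.
first = sorted(["August", "august", "coop", "co-op"], key=naive_key)
second = sorted(["august", "August", "co-op", "coop"], key=naive_key)

print(first)
print(second)
```

Both runs are "sorted" according to the reduced key, yet the two outputs differ, because nothing breaks the ties between "August" and "august" or between "coop" and "co-op".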
iii. If accented characters are introduced using for example ISO 8859-1 code, the problems encountered in steps i and ii above are amplified but they share the same causes.
iv. If tables are reorganized to make all related characters contiguous, one might think it would permit a simplified single-character sort, but this does not work either. Take upper and lower case unaccented letters as an example. If code point 01 is assigned to "a", code point 02 assigned to "A", code point 03 to "b", code point 04 to "B" and so on, let's see what happens in a list sorted directly by these rearranged values:
Sorted List    Internal Values
aaaa           01010101
abbb           01030303
Aaaa           02010101
Abbb           02030303
This is predictable also, but obviously wrong in any country from a cultural point of view.
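The rearranged coding of step iv can be sketched as follows. The table CODE and the function internal_value are hypothetical names; the code values follow the assignment given in the text (01 for "a", 02 for "A", 03 for "b", and so on).

```python
import string

# Hypothetical rearranged code: a=01, A=02, b=03, B=04, ...
CODE = {}
for index, letter in enumerate(string.ascii_lowercase):
    CODE[letter] = 2 * index + 1
    CODE[letter.upper()] = 2 * index + 2

def internal_value(text):
    """Concatenate the two-digit rearranged code of each character."""
    return "".join(f"{CODE[ch]:02d}" for ch in text)

ordered = sorted(["Abbb", "aaaa", "Aaaa", "abbb"], key=internal_value)
for word in ordered:
    print(word, internal_value(word))
```

Because case is decided at the first character, "abbb" ends up before "Aaaa", which no dictionary would accept.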
v. The only path of solution is to decompose the initial data in a way that will respect traditional lexical order, and at the same time ensure absolute predictability. For the Latin script, this necessitates at least four levels:
1. The first decomposition renders information to be sorted case insensitive and diacritical mark insensitive, and removes all special characters which have no preestablished order in any human culture:
An example using English:
"résumé" (an English word derived from French but with a very different meaning in French) becomes "resume", without any accent.
An example using French:
"Vice-légation" becomes "vicelegation", with no accent, no upper case and no dash.
An example using German:
"groß" becomes "gross", with the sharp-s being converted to double-s to render it case insensitive.
In Spanish or Scandinavian languages, some extra letters are added to the 26 fixed letters of the English, French and German alphabet, which are not ordered according to the expectations of this group of languages. This calls for adaptability.
2. The second decomposition breaks ties on quasi-homographs, strings that differ only because they have different diacritical marks. In the English example above, "resumé" and "résumé" are quasi-homographs. Traditional lexical order requires that "resume" always come before "résumé" (which sorting using only the first level would not guarantee). In this case, tradition does not say if "resumé" (another spelling) should come before "résumé", which would seem logical: English and German dictionaries only state that unaccented words precede the accented words.
Here another characteristic is introduced. In French, because of the large number of multiple quasi-homograph groups formed of more than 2 instances, main dictionaries follow a rule that is the following: accents are generally not taken into account for sorting, but in case of homographic ties, the last difference in the word determines the correct order between two given words, a priority order being then assigned to each type of accent. For example, "coté" should be sorted after "côte" but before "côté". This is easy to implement: a number is assigned to each character of original data to be sorted, representing either an accent or no accent at all, but these numbers are stacked instead of being added to a linear list: in other words, the resulting string is made starting from the last character of the original data and backward.
Example: to obtain the following order respecting this rule: "cote", "côte", "coté", "côté", numbers could be assigned indicating respectively "****", "**c*", "a***", "a*c*", where "*" means no accent, "a" means acute accent, and "c" means circumflex accent. Here this scheme is sufficient to break the tie correctly at this second level.
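The backward scan of accents can be sketched as below. This is only an illustration: the function accent_subkey and the table ACCENT_SYMBOL are hypothetical, only the two accents of the example are covered, and the relative order of "a" and "c" is an assumption that the example itself does not exercise. Unicode decomposition from Python's unicodedata module is used to separate a letter from its accent.

```python
import unicodedata

# Hypothetical symbols, following the example: "a" = acute, "c" = circumflex.
ACCENT_SYMBOL = {"\u0301": "a", "\u0302": "c"}

def accent_subkey(word):
    """Build the second-level subkey, scanning the word from its last character."""
    symbols = []
    for ch in unicodedata.normalize("NFC", word):
        decomposed = unicodedata.normalize("NFD", ch)
        marks = [c for c in decomposed if unicodedata.combining(c)]
        # "*" means no accent; "?" marks an accent this sketch does not know.
        symbols.append(ACCENT_SYMBOL.get(marks[0], "?") if marks else "*")
    return "".join(reversed(symbols))

words = ["côté", "coté", "côte", "cote"]
ordered = sorted(words, key=accent_subkey)
print([(w, accent_subkey(w)) for w in ordered])
```

The four words share the same first-level key, so sorting on this subkey alone is enough here; in a complete key it would only be consulted when the first level ties.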
3. The third decomposition breaks ties for quasi-homographs different only because upper-case and lower-case characters are used. This time, the tradition is well established in English and German dictionaries, where lower case always precedes upper case in homographs, while the tradition is not well established in French dictionaries, which generally use only accented capital letters for common word entries. In known French dictionaries where upper and lower case letters are mixed, the capitals generally come first, but this is not an established and stated rule, because there are numerous exceptions. So for a default template it is advisable to use English and German traditions, if one wants to group the largest possible number of languages together. Let's note here by the way that in Denmark, upper case comes before lower case, a different but well established rule. This is a second fact calling for adaptability in the model used in this standard.
Example: to have the following order: "august", "August", numbers could be assigned indicating respectively "llllll", "ulllll", where "l" means lower case and "u" upper case.
4. The fourth decomposition breaks the final tie that does not correspond to any tradition, the tie due to quasi-homographs that differ only because they contain special characters. Breaking this tie is essential to ensure the absolute predictability of sorts and also to be able to sort strings composed only of special characters. Since the traces of special characters were removed from the original data to form the first three orders of decomposition, simply putting them in a row in the fourth order of decomposition would mean that their position would be lost. These positions are quite important for solving the remaining ties, and consequently the original positions of these special characters must be retained here: two quasi-homographs could each contain a common special character in different positions and thus be strictly different (ex.: "ab*cd" is still different from "a*bcd" even though they share one and only one common special character).
Example: to have the following order: "coop", "co-op", "coop-", numbers could be assigned respectively according to the following pattern: "d", "d3-" and "d5-", where "d" is an always-present delimiter that separates this decomposition from the first three in case all four decompositions are to be concatenated to form a single sorting key based on numeric values (see discussion in the next paragraph). "3-" means a dash in position 3 of the original string. "5-" means a dash in position 5, and so on.
These four decompositions can be structured using a four level key, concatenating the subkeys from the highest significance to the lowest. If coded assignment of numbers is done properly, instead of necessitating a cumbersome exception process for dealing with homographs, all decompositions may be made at once and resulting strings concatenated and passed through a standard sort program sorting in numeric order. To attain this result, it is sufficient that numbers chosen for the first decomposition code set be greater than numbers chosen for the second one, the second one's greater than the third one's, and that the delimiter chosen for the fourth decomposition be less than the lowest possible number coded elsewhere for the sort (delimiter called logical zero), in which case no restriction applies to the content of the fourth decomposition. An easier implementation might just choose to put the lowest value possible as a delimiter between each subkey, in which case no restriction ever applies.
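The four decompositions can be combined in a minimal sketch. Everything named here is hypothetical: four_level_key, the ACCENT_WEIGHT values, and the use of a Python tuple of lists in place of a concatenated key with delimiters (tuple comparison proceeds level by level, which has the same effect as putting the lowest possible value between subkeys). The accent subkey is built backward, as in the French rule described above, and lower case is placed before upper case.

```python
import unicodedata

# Hypothetical accent priorities; 0 means "no accent".
ACCENT_WEIGHT = {"": 0, "\u0301": 1, "\u0300": 2, "\u0302": 3, "\u0308": 4}

def four_level_key(text):
    """Return (letters, accents, case, specials) subkeys for one string."""
    level1, level2, level3, level4 = [], [], [], []
    normalized = unicodedata.normalize("NFC", text)
    for position, ch in enumerate(normalized, start=1):
        if ch.isalnum():
            decomposed = unicodedata.normalize("NFD", ch)
            # casefold() turns sharp-s into "ss", as in the German example.
            level1.extend(decomposed[0].casefold())
            level2.append(ACCENT_WEIGHT.get(decomposed[1:], 9))
            level3.append(1 if ch.isupper() else 0)
        else:
            # Special characters keep their original position.
            level4.append((position, ch))
    level2.reverse()  # backward scan of accents
    return (level1, level2, level3, level4)

words = ["Vice versa", "coop", "August", "co-op",
         "Vice-president", "container", "august"]
ordered = sorted(words, key=four_level_key)
print(ordered)
```

Whatever the initial order of the list, the result is the same, since the fourth level breaks the last possible tie.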
This method has been fully described with tables for the first time in Règles du classement alphabétique en langue française et procédure informatisée pour le tri, Alain LaBonté, Ministère des Communications du Québec, 19 août 1988, ISBN 2-550-19046-7.
Reduction techniques have been designed to considerably shorten space requirements. As no implementation is required to use specific numbers for weights, nor to apply reduction or compression, this issue is outside the scope of this standard, but it is interesting to note that implementations can be optimized. These techniques have been improved over time and are highly feasible.
A public-domain reduction technique is described in detail (with ample examples) in Technique de réduction - Tris informatiques à quatre clés, Alain LaBonté, Ministère des Communications du Québec, June 1989 (ISBN 2-550-19965-0).
vi. For a certain number of languages, the default presented in this standard will need to be adapted, both in the table values for the four orders of keys and in the potential context analysis processing necessary to achieve culturally correct results for users of these languages. To illustrate this, examples of dictionary sequences are given here for two languages whose native order is not in the default table:
Traditional Spanish (note "ch" greater than "cu" and "ña" greater than "no"):
cuneo<cúneo<chapeo<nodo<ñaco
(Comparative French/English/German sort:
chapeo<cuneo<cúneo<ñaco<nodo)
Danish (note "a" less than "c", "cz" less than "cæ" and "cø", and "aa" equivalent to "å" greater than "z"):
Alzheimer<czar<cæsium<cølibat<Aalborg<Århus
(Comparative French/English/German sort:
Aalborg<Alzheimer<Århus<cæsium<cølibat<czar)
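The traditional Spanish sequence above can be reproduced with a small tailoring sketch. It is an illustration only: the names SPANISH_ELEMENTS, RANK and spanish_key are hypothetical, only the two differences noted in the text ("ch" after "c", "ñ" after "n") are modelled, only two levels are built, and strings are assumed to contain letters only.

```python
import string
import unicodedata

N_TILDE = "\u00f1"

# Hypothetical tailored list of collating elements: "ch" after "c", n-tilde after "n".
SPANISH_ELEMENTS = []
for letter in string.ascii_lowercase:
    SPANISH_ELEMENTS.append(letter)
    if letter == "c":
        SPANISH_ELEMENTS.append("ch")
    if letter == "n":
        SPANISH_ELEMENTS.append(N_TILDE)
RANK = {element: i for i, element in enumerate(SPANISH_ELEMENTS)}

def spanish_key(word):
    """Two-level key: collating-element ranks, then accent flags."""
    text = unicodedata.normalize("NFC", word.lower())
    primary, secondary = [], []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in RANK:
            # The digraph collates as a single element.
            primary.append(RANK[pair])
            secondary.append(0)
            i += 2
            continue
        ch = text[i]
        base = ch if ch == N_TILDE else unicodedata.normalize("NFD", ch)[0]
        primary.append(RANK[base])
        secondary.append(0 if base == ch else 1)
        i += 1
    return (primary, secondary)

words = ["nodo", "chapeo", "\u00f1aco", "c\u00faneo", "cuneo"]
print(sorted(words, key=spanish_key))
```

The same approach, with "aa" declared as a collating element placed after "z", would be one way to approach the Danish sequence; that case is not worked out here.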
vii. It is important that in all coding environments, and in all programming environments, the order be consistent so that sort programs can give reliable results reusable in programs; conversely, comparisons of two character strings where an order is expected should be in line with results given by sort programs. Hence it is advisable that all processes which expect a given order use the same comparison engine. This standard has built on this requirement, which was not respected before.
Furthermore it should be possible to have access, externally, to the ultimate binary strings on which real comparison is made. This will allow old processes which can not be changed easily but which are able to sort raw binary data, to sort in a consistent way with new processes. This standard allows this.
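A sketch of such an externally visible binary string is given below. It is deliberately reduced and rests on several assumptions: the function binary_key is hypothetical, only unaccented ASCII letters are treated as letters (so the accent level is omitted), every other character is treated as a special character, strings are shorter than 256 characters, and a zero byte plays the role of the lowest-valued delimiter between subkeys.

```python
def binary_key(text):
    """Return a bytes key that can be ordered by plain binary comparison."""
    letters, case, specials = bytearray(), bytearray(), bytearray()
    for position, ch in enumerate(text, start=1):
        if ch.isascii() and ch.isalpha():
            letters.append(ord(ch.lower()) - ord("a") + 1)  # 1..26, never 0
            case.append(2 if ch.isupper() else 1)            # lower before upper
        else:
            specials += bytes([position, ord(ch)])           # (position, character)
    # The zero byte is lower than any weight used above.
    return bytes(letters) + b"\x00" + bytes(case) + b"\x00" + bytes(specials)

words = ["coop-", "co-op", "coop", "August", "august"]
ordered = sorted(words, key=binary_key)
print(ordered)
print(binary_key("co-op").hex(" "))
```

Once such keys are stored next to the data, a process that only knows how to compare raw bytes gives the same order as the comparison engine that produced them.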
This international standard defines:
- a method for doing deterministic and internationalized character string comparisons. The method is applicable on strings that exploit the full repertoire of ISO/IEC 10646 (independently of coding) or subsets, so that these comparisons be applicable for subrepertoires such as those of ISO 8859 variants, and in a given set of languages for each script
- a specific default order description using the preceding specification for the ISO/IEC 10646 characters; this default is based, for each given script, on an order which is culturally acceptable to a maximum of users of that script.
It is to be considered normal practice that this default order be modified with a minimum of efforts to suit the needs of a local environment. The main benefit, worldwide, is that for other scripts, no modification will be required and that the order will remain consistent and predictable from an international point of view.
ISO/IEC CD 14652 Cultural Conventions Specification
ISO/IEC 10646-1 Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane
For the purposes of this International Standard, the following definitions apply:
API for the purposes of this International Standard, an API, or Application Program Interface, is an application process described along with its input and output, used as an interface to a specific service common to an application environment
canonical form the coding of a UCS character in 4 octet binary form according to ISO/IEC 10646-1
character string A data type defined by the concatenation of a series of characters in logical sequence
collation ordering of elements
collating symbol a symbol used to specify weights assigned to a character in a symbolic fashion rather than absolutely
collating element the smallest entity used to determine the logical ordering of strings. It normally consists of either a single character, or two or more characters collating as a single entity
concatenation logical operation which consists of adding an element at the end of a string and considering the result as a new string, longer than the first one (it is like adding a new wagon to the tail of a train)
engine a set of APIs
field for the purposes of this International Standard, a single character string or any other data type which may be ordered alone or in conjunction with other fields of a record, each field of a record being compared to the same field of another record; in case of absolute equality of two equivalent fields, other fields of the records will have to be compared to eventually solve a tie.
first order token an absolute number used as a comparison element, obtained out of tables for the first level that describes a character; note that some characters, such as ligatures, may lead to more than one token for a character at one given level
fourth order token level an absolute number used as a comparison element, either obtained out of tables for the fourth level that describes a special character, or specifying the position of the special character represented in the original string; tokens of the fourth order level are always in pairs, the first token being a position, the second one being a weight for the character represented
graphic character a character, other than a control function, that has a visual representation normally handwritten, printed, or displayed
level degree of precision of a comparison; normally a weight is assigned to each character of a character string which must be compared to another one at a given level of precision; when comparison does not break ties at this level, then another weight is assigned to each character of the string at the next level of precision (see actual example in 5.2.1.1)
numeric relative value the relative value of a given weight, or token, compared to other ones, in its final numeric and processable form
ordering a process in which a set of fields composing a record are assigned a given order relative to any other set of fields composing other records of a file
ordering key a series of bits, the numerical value of which determines its order; to a character string may be allocated a series of tokens, which correspond, level by level, to weights assigned to characters
ordering subkey a sub-series of bits in an ordering key - an ordering subkey corresponds to the set of tokens corresponding to the weights of a character string assigned to a given level of precision
posthandling a process in which an ordering subkey is processed internally after the straightforward comparisons done according to the APIs defined in this standard
prehandling a process in which character strings are modified internally to lead to straightforward comparisons according to the APIs defined in this standard
record the exhaustive structured set of fields that form a monolithic block in a file, this set belonging together according to specific application-defined requirements
reference string in a comparison operation, the string which serves as a base reference for comparison; the string to which another string is compared
referenced string in a comparison operation, the string which is being compared to a reference string
second order token an absolute number used as a comparison element, obtained out of tables for the second level that describes a character; note that some characters, such as ligatures, may lead to more than one token for a character at one given level
string a series of individual elements which form a whole, when they are concatenated, i.e. linked together like wagons in a train; a character string is a series of characters, a bit string is a series of bits
symbolic relative value the relative value of a given weight, or token, compared to other ones, in its symbolic, human-readable text format
telephone-book-type transformation a specific type of transformation in which fields are rebuilt internally before a straightforward ordering can be done; this process may involve replication of fields in many different forms for the purposes of multiple indexing
transformation An operation done prior to comparison or ordering, the outcome of which, for the purposes of this International Standard, could lead to producing many character strings or a modified character string, out of an original one to be sorted, indexed or compared; it is modifying the data in a kind of one or many explicit classes, new explicit formats that may be different from original character strings (ex.: given the number "5", the outcome of transformation as defined here may result in triplicating the original into the string "cinq" in French, the string "fünf" in German and the string "five" in English, the three strings then being ready for ordering)
third order token an absolute number used as a comparison element, obtained out of tables for the third level that describes a character; note that some characters, such as ligatures, may lead to more than one token for a character at one given level
If a character outside of the standard repertoire of ISO/IEC 10646 is to be used in tailored ordering tables, it is recommended that the code-independent symbol identifying this character use the form <U8xxxxxxx> for documentary purposes indicating its nonstandard nature. The binding of these symbols referring to nonstandard characters to actual coding is done through a repertoiremap as defined in ISO/IEC 14652. If, for example, actual UCS coding is used, then private zones of this character set will normally be used, and binding then is normally specified in this way.
Whenever possible, in the default ordering table, glyphs are used in comments alongside with character ordering definitions. This gives a more accurate understanding of characters in question. It is understood that these glyphs may be removed in machine-readable files.
The collating-symbol statements will include declarations of symbols used as intermediary values for:
- possible collating elements that are composed of sequences of graphic characters. An example is tailoring the default to Danish. The digraph "aa" is composed of a sequence of two graphic characters which, in Danish, are considered as a single letter of the alphabet and require a single ordering definition.
- possible collating elements that require an intermediate definition for other reasons
For easy cross-referencing of the various weights, numeric relative values (informative) will be shown in the table as comments. A system of short mnemonics intended to replace glyphs when it is not possible to transmit them will also be used in tables alongside glyphs representing characters, whenever possible.
This international standard can be implemented with different levels of increasing complexity and refinement corresponding to its different conformance levels. Hence the first level is limited to the comparison engine with a fixed equivalence precision (equivalences are limited to absolute equality), the second level requires the possibility to invoke a tailored ordering table using the default table as a template, the third level introduces the prehandling and posthandling of data to be compared, the fourth level allows the possibility to parametrically use different tables, and the fifth level requires the parametric processing of equivalences with different degrees of precision. Levels of conformance applicable to each requirement are specifically indicated in the following clauses.
This requirement shall be met for conformance levels greater than 1.
It is recommended that tailoring be done starting with the default table described in annex 1. This tailoring shall be done according to the specification of ISO/IEC 14652.
The symbolic table, as provided in annex 1 or as modified in a tailoring process, shall be presented in a numeric form to the comparison operation engine described hereafter. The table handled by the comparison operation engine shall consist of a matrix of n lines by m columns, n being the number of characters in the character set used and m being the number of levels provided in the symbolic data, each element of the matrix being a numerical token indicating a relative weight. The exact values used for the weights and how these numbers are represented internally is implementation-defined. However the values used shall respect the order specified in the symbolic table data.
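The conversion from symbolic to numeric form can be sketched as follows. The symbol names, the data layout and the function build_numeric_table are all hypothetical and do not reflect the syntax of ISO/IEC 14652 or the content of annex 1; the sketch uses m = 3 levels and n = 5 characters. The only property it aims to show is that the numeric tokens respect the order of the symbols at each level.

```python
# Hypothetical symbolic data: for each level, symbols in ascending order.
SYMBOL_ORDER = [
    ["<a>", "<b>", "<c>"],       # level 1: base letters
    ["<none>", "<acute>"],       # level 2: diacritics
    ["<lower>", "<upper>"],      # level 3: case
]

# Hypothetical symbolic table: one line per character, one column per level.
SYMBOLIC_TABLE = {
    "a": ["<a>", "<none>", "<lower>"],
    "A": ["<a>", "<none>", "<upper>"],
    "\u00e1": ["<a>", "<acute>", "<lower>"],
    "b": ["<b>", "<none>", "<lower>"],
    "c": ["<c>", "<none>", "<lower>"],
}

def build_numeric_table(symbolic_table, symbol_order):
    """Replace each symbol by its rank (starting at 1) within its level."""
    ranks = [{symbol: index + 1 for index, symbol in enumerate(level)}
             for level in symbol_order]
    return {ch: [ranks[level][symbol] for level, symbol in enumerate(row)]
            for ch, row in symbolic_table.items()}

NUMERIC_TABLE = build_numeric_table(SYMBOLIC_TABLE, SYMBOL_ORDER)
for ch, row in NUMERIC_TABLE.items():
    print(ch, row)
```

Any other assignment of numbers would be equally valid, provided that the relative order of the tokens within each column is unchanged.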
As prehandling can be done on many different tables in a given application environment, each one shall be identified by a name in this environment. Conformance level 4 requires that a parameter in API 1 and 3 be used to invoke the name of the table to be used.
This requirement shall be met for conformance levels greater than 2.
It may be necessary to transform a field before the actual ordering process can begin. This process is called prehandling. The implementor is responsible for ensuring that prehandling has been done prior to the ordering process. For examples of how applications can take advantage of prehandling, see Annex B. This is a global operation that may involve exploding records before ordering them. Therefore, the prehandling phase, unlike its posthandling counterpart, shall be done on a whole set of input records before any comparison is made. Thus, prehandling is not part of the comparison operation engine. The comparison operation engine will not contain any default method related to prehandling. However prehandling functionality shall be provided to the user by the application developer for allowing the use of this international standard in higher layers of the application.
The prehandling phase shall, as a minimum, transform the actual coded characters used on input into a coding consistent with the internal tables used by the comparison operation. If the actual coding in use corresponds to the coding assumed in the internal tables of the comparison engine, and no other prehandling is required, then the prehandling phase can correspond to an empty process.
This requirement shall be met for all conformance levels. For conformance level 1 only, API 1 can be implemented without providing external visibility for the combined group of underlying APIs 2 and 3, and vice versa: API 1 gives results which are functionally equivalent to the combined APIs 2 and 3 if one limits the requirements to the character string comparison operation. For all conformance levels less than 5, certain limits, explicitly identified, are applicable.
This operation atomically consists of three APIs, which are described in what follows.
Note: In the following descriptions, numbers (integers) are used. These numbers are not mandatory values to be used in actual implementations. The developer may choose the values that fit the application best.
The API names proposed in this standard shall be used for binding these APIs with actual implementations.
This API shall operate as follows:
0. Name of this API for binding purposes: COMPCAR
1. Parameters
Set of parameters a (input parameters):
string 1 (a1): referenced coded character string
string 2 (a2): reference coded character string
Parameter b (input parameter): level of precision (value fixed to 0 for any conformance level smaller than 5):
a value between 0 and n; this value determines after which level the binary comparison should stop if equivalence (approximate equality) is going to be detected, in case the two parameters of set a are not absolutely equal. In all cases of inequality, the operation is performed on all available levels to determine if the referenced character string 1 (a1) comes before or after the reference character string 2 (a2).
If the specified level of precision is zero or greater than the last available level, each of the missing levels is implicitly considered to bear the smallest value possible for the latter levels for comparison purposes. The meaning of values less than zero for this parameter is reserved in this standard for future use.
As an example, consider the words "alpha" and "ALPHA". These words are equivalent at level 1 (alphabetic) and level 2 (diacritical marks). However, the words are different at level 3, where case is taken into account. If comparisons are requested up to level 2, then approximate equality will result. If level 3 or greater is required, then the two character strings will be considered different, and unequivalent.
Parameter c (output parameter):
A number is returned whose sign determines an order even in case of equality. It is to be noted that the values mentioned here are to be considered as logical values. Implementations are free to use other values. However functionality shall be the same as described here.
The sign has the following meaning:
negative sign: referenced character string comes before reference.
positive sign: referenced character string comes after reference.
Note: In the case of absolute equality, a negative sign is returned by convention. In the case of equivalence, either a negative or a positive sign is possible, because character strings are also unequal in addition to being equivalent.
The absolute value of the number determines the possible following cases:
case 1: absolute equality (even if case equivalence has been detected; this goes beyond equivalence)
case 2: equivalence (at precision level required by parameter b)
case 3: values compared significantly unequal or unequivalent
Set of parameters d (output parameters):
(this parameter is not required for conformance level 1)
two bit strings to be returned. Each of these bit strings shall be coded in such a way that in the hardware and software environment where it is used, each of the bit strings can be compared in a straightforward binary fashion without further analysis and the order of the comparison be equivalent to the order obtained by the sign result of parameter c. The structure of the bits strings chosen by the implementation shall also allow an external process to delimit the different levels. It is the responsibility of the implementer to choose the appropriate method to meet this requirement.
The bit strings are:
bit string 1: processable binary string corresponding to the character string 1 of the set of parameters a.
bit string 2: processable binary string corresponding to the character string 2 of the set of parameters a.
Parameter e (input parameter):
(this parameter is required for conformance levels greater than 3)
name of the ordering table to be provided as parameter c to API 3.
Process of API 1 (COMPCAR)
This API shall be processed to give results equivalent to the following:
1. Convert character string 1 and character string 2 through API 3 (CARABIN) into bit string 1 and bit string 2. Return the result of the conversion in the set of parameters d.
2. Operate API 2 (COMPBIN) to get comparison result in parameter c.
Binding considerations
This API can be used to perform a function suited to the exact specification of the C language function strcoll(), which is less general in scope. In the same way, it can be used directly or interfaced to perform similar functions for the specific requirements of other programming languages.
This API shall operate as follows:
0. Name of this API for binding purposes: COMPBIN
1. Parameters
Set of parameters a (input parameters):
bit string 1 (a1): referenced predigested bit string
bit string 2 (a2): reference predigested bit string
Parameter b (input parameter): level of precision (value fixed to 0 for any conformance level smaller than 5):
a value between 0 and n; this value determines after which level the binary comparison should stop if equivalence (approximate equality) is going to be detected, in case the two parameters a are not absolutely equal. In all cases of inequality, the operation is performed on all available levels to determine if the referenced parameter a1 comes before or after the reference parameter a2.
If the specified level of precision is zero or greater than the last available level, each of the missing levels is implicitly considered to bear the smallest value possible for the latter levels for comparison purposes. The meaning of values less than zero for this parameter is reserved in this standard for future use.
See discussion in description of parameter b of API 1.
Parameter c (output parameter):
A number is returned whose sign determines an order even in case of equality. It is to be noted that the values mentioned here are to be considered as logical values. Implementations are free to use other values. However, functionality shall be the same as described here.
The sign has the following meaning:
negative sign: referenced bit string 1 (a1) comes before reference bit string 2 (a2)
positive sign: referenced bit string 1 (a1) comes after reference bit string 2 (a2)
Note: In the case of absolute equality, a negative sign is returned by convention. In the case of equivalence, either a negative or a positive sign is possible, because strings are also unequal in addition to being equivalent.
The absolute value of the number determines the possible following cases:
case 1: absolute equality (even if case equivalence has been detected; this goes beyond equivalence)
case 2: equivalence (at precision level required by parameter b)
case 3: values compared significantly unequal or unequivalent
Process of API 2 (COMPBIN)
This API shall be processed to give results equivalent to the following:
1. Compare bit string 1 to bit string 2 numerically.
2. In case of absolute equality, return case 1 with a negative value.
3. In case of inequality, retain which string comes before the other one.
4. Repeat the comparison, ignoring the levels that are not significant according to parameter b.
5. In case of equality, return case 2 after adjusting the sign according to the order retained in step 3. This indicates an equivalence according to the programmed specifications.
6. In case of inequality, return case 3 after adjusting the sign according to the order retained in step 3.
Note: there are more efficient ways to accomplish what precedes. The steps indicated here are logical steps to help clarify functionality. No specific process is mandated by this standard. What shall be respected is that results shall be the same as what is indicated above.
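The logical steps above can be sketched as follows in Python. This is a minimal illustration, not a mandated process. Several points are assumptions of this sketch: each predigested bit string is represented as a tuple of per-level subkeys (which keeps the levels delimited), both keys have the same number of levels, the function is named compbin, and a precision of 0 (or one beyond the last level) is read as "all available levels are significant". The returned number is a logical value: its sign gives the order and its absolute value gives the case number.

```python
def compbin(key1, key2, precision=0):
    """Compare two multilevel keys; return sign * case number (1, 2 or 3).

    key1, key2 -- tuples of subkeys, one tuple of integers per level
    precision  -- last level taken into account when detecting equivalence
    """
    if precision < 0:
        # Negative values are reserved for future use.
        raise ValueError("negative precision is reserved")
    # Steps 1 and 2: full comparison; absolute equality is case 1, negative.
    if key1 == key2:
        return -1
    # Step 3: retain the order obtained on all available levels.
    sign = -1 if key1 < key2 else 1
    # Step 4: compare again, ignoring the levels beyond the precision.
    levels = len(key1)
    if 0 < precision < levels:
        levels = precision
    # Step 5: equal on the significant levels means equivalence (case 2).
    if key1[:levels] == key2[:levels]:
        return sign * 2
    # Step 6: otherwise the strings are significantly unequal (case 3).
    return sign * 3


# Hypothetical keys for "alpha" and "ALPHA": same letters (level 1), same
# diacritics (level 2), different case (level 3), nothing at level 4.
key_lower = ((1, 2, 3, 4, 1), (1, 1, 1, 1, 1), (1, 1, 1, 1, 1), ())
key_upper = ((1, 2, 3, 4, 1), (1, 1, 1, 1, 1), (2, 2, 2, 2, 2), ())

up_to_level_2 = compbin(key_lower, key_upper, 2)
up_to_level_3 = compbin(key_lower, key_upper, 3)
identical = compbin(key_lower, key_lower, 2)
```

With these sample keys, a comparison up to level 2 reports equivalence while still giving an order, and a comparison up to level 3 reports a significant difference.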
Binding considerations
This API can be used to perform a function suited to the exact specifications of the C language functions strcmp() or strncmp() which are less general in scope. In the same way, it can be used directly or interfaced to perform similar functions for the specific requirements of other programming languages.
This API shall operate as follows:
0. Name of this API for binding purposes: CARABIN
1. Parameters
Parameter a (input parameter): a coded character string
Parameter b (output parameter): a structured binary string
This bit string shall be coded in a way that is equivalent to formation of the multilevel binary keys described in 5.3. Such a bit string is the result of the digestion of the input character string through the transformation table described in normative annex 1, with the provision that this table can have been tailored according to 5.5.
The structure of the bit strings chosen by the implementation shall also allow an external process to delimit the different levels. It is the responsibility of the implementer to choose the appropriate method to meet this requirement.
Parameter c (input parameter):
(this parameter is required for conformance levels greater than 3)
name of the ordering table to be used for the conversion.
Process of API 3
This API shall be processed to give results equivalent to the following:
1. Digest the input character string into a binary string respecting the requirements of clause 5.3.
2. Return the binary string obtained as parameter b.
Binding considerations
This API can be used to perform a function suited to the exact specification of the C language function strxfrm(). In the same way, it can be used directly or interfaced to perform the same function for the specific requirements of other programming languages.
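The relationship between the three APIs mirrors the one between the C functions named in the binding considerations: comparing two strings directly gives the same order as transforming each string and comparing the results. The following sketch illustrates this with Python's standard locale module, which wraps strcoll() and strxfrm(). It is an analogy under stated assumptions, not an implementation of COMPCAR, COMPBIN or CARABIN: the helper names are hypothetical, the "C" collation locale is selected only to keep the result deterministic, and, unlike API 1, strcoll() returns zero for equal strings.

```python
import locale

# Assumption: the portable "C" collation locale keeps the sketch deterministic.
locale.setlocale(locale.LC_COLLATE, "C")


def sign(number):
    """Reduce a comparison result to -1, 0 or 1."""
    return (number > 0) - (number < 0)


def compare_direct(string1, string2):
    """Direct comparison of character strings (role played by API 1)."""
    return sign(locale.strcoll(string1, string2))


def compare_via_keys(string1, string2):
    """Transform each string (role of API 3), then compare keys (role of API 2)."""
    key1 = locale.strxfrm(string1)
    key2 = locale.strxfrm(string2)
    return (key1 > key2) - (key1 < key2)


pairs = [("alpha", "beta"), ("beta", "alpha"), ("alpha", "alpha")]
results = [(compare_direct(a, b), compare_via_keys(a, b)) for a, b in pairs]
```

Both routes give the same sign for every pair, which is the property that allows the transformed keys to be stored and compared later without access to the original strings.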
The default table or tailored tables shall be used at conformance levels greater than 1. For conformance level number 1, although it is recommended that the default table be used, it is not required and any fixed multilevel table can be used.
The user is responsible for tailoring the ordering table to the application's requirements. If there is no tailoring done, then the default table shall be used. The default table is acceptable for one or more natural languages of each of the writing systems explicitly described. Adaptations may be necessary for specific languages using one or many of these scripts.
See clause 7.2 for the tailoring mechanism whose results are used by the comparison operation engine.
The character transformation table can be considered as a matrix of n lines, n being the number of characters in the repertoire. In each line, 4 levels are described by default. This default can be extended in the tailoring phase by the end-user. Any conforming implementation shall have provisions for handling a depth of at least 7 levels. The user shall take care that in case of tailoring, levels be adjusted so that script <SPECIAL>, whose ordering is done at the last level in the default, be normally processed separately and at the last level, even if the maximum number of levels specified after tailoring is not equal to 4. This will avoid collisions with any extra levels added by tailoring. It is highly recommended that only four levels be used in tailoring, the fourth one being the level reserved to special characters. This is the only way this standard can guarantee that nothing will be broken; otherwise thorough and skillful thinking by the implementer will be required, the minimum being
The table is separated into sections, one section for each script. Each section is assigned a sequential number corresponding to its order of appearance. The header of each section is named for clarity. The header describes transformation properties for each level of the script. These properties are tailored for the peculiarities of the script relative to the ordering process.
One of the tailoring possibilities is to change the relative order of a whole script relative to other scripts. Separation of the table into named sections will simplify that requirement, as well as serving to describe script properties.
The scanning direction (forward or backward) used to process the string at each level is a property of each script. These properties can be changed according to the language. Clause 5.5 describes tailoring.
Another property is the possibility of comparing the numerical values representing the positions of the characters in the two strings before comparing the weights assigned to those characters.
Note: The scanning direction (forward or backward) is not normally related to the natural writing direction of a script. The scanning direction applies only to the order processing in relation with the logical sequence of the coded character string.
According to ISO/IEC 10646, for scripts written right to left, such as Arabic, the lowest positions in the logical sequence of characters correspond to the rightmost characters of a string (from the point of view of their natural sequence). Conversely, for the Latin script, written left to right, the lowest positions in the logical sequence of characters correspond to the leftmost characters of the string (from the point of view of their natural presentation sequence).
Therefore, scanning forward starts with the lowest positions in the logical sequence, while scanning backward starts from the highest positions.
To be more precise: in ISO/IEC 10646, Arabic is artificially separated into two scripts: the logical, intrinsic Arabic, coded independently of shapes, and the presentation forms. Either allows Arabic to be coded completely, but intrinsic Arabic is normally preferred for better processing, while the presentation forms are preferred by some presentation-oriented applications.
Intrinsic Arabic is coded in the logical order, while presentation forms are coded in presentation order. The first of these two scripts is described in the default under the header <ARABINT>, standing for the normal coding, called intrinsic Arabic. The second one is described under the header <ARABFOR>, standing for Arabic forms. Scanning properties of these two artificial sections differ, the first one being scanned forward, the second one being scanned backward, for the first three default levels.
A series of m subkeys is formed out of a character string composing a comparison field; m is the maximum number of levels described in either the default ordering table or the tailored ordering table. The following paragraphs describe these formations. In the default table, m is equal to 4.
For each character string, a corresponding vector is built (another bit string) which is not used in the comparison process and which describes to which script each character of the input character string belongs. This data will be used subsequently to determine how each token of each subkey is formed.
During forward scanning of each character of the input character string, a token is concatenated to the script identifier vector, which is initially empty. The token corresponds to the value assigned to the script to which the character definition of the character in process belongs. The value of the script is the logical number assigned implicitly to the script name header of the table section in which is located the character definition. If, due to tailoring, the character definition is moved before or after another character definition, it becomes part of the script whose name header comes before the new character definition.
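The formation of the script identifier vector can be sketched as follows in Python. The script numbers follow the order in which the scripts are declared in annex 1; the function name script_vector and the tiny character-to-script mapping are hypothetical and serve only as an illustration. Characters that are not explicitly defined are treated as belonging to <SPECIAL>, as the default table does.

```python
# Logical numbers assigned implicitly by the order of the script declarations.
SCRIPT_NUMBER = {
    "<SPECIAL>": 1, "<LATIN>": 2, "<ARABINT>": 3, "<ARABFOR>": 4,
    "<HEBREU>": 5, "<GREC>": 6, "<CYRIL>": 7, "<HAN>": 8,
}

# Hypothetical excerpt of the table: which section defines each character.
CHAR_SCRIPT = {
    "a": "<LATIN>", "b": "<LATIN>", "c": "<LATIN>",
    "\u03b1": "<GREC>", "\u03b2": "<GREC>",  # Greek alpha and beta
    "-": "<SPECIAL>", " ": "<SPECIAL>",
}


def script_vector(string):
    """Scan the string forward and collect one script number per character."""
    vector = []  # the vector is initially empty
    for char in string:
        script = CHAR_SCRIPT.get(char, "<SPECIAL>")  # undefined -> special
        vector.append(SCRIPT_NUMBER[script])
    return vector


sample_vector = script_vector("ab-\u03b1\u03b2")
```

The vector takes no part in the comparison itself; it only tells the later steps which script properties apply to each token.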
For i varying from 1 to m minus 1 (from 1 to 3 if the default is used), form subkey level i in the following way:
During forward scanning of each character of the input character string, a token is obtained. The token corresponds to the transformation value of that character at level i.
Note: In the default definition, characters of script <SPECIAL> are ignored from level 1 through 3. The definition of these characters can be tailored to make any of them a part of another script. The script <SPECIAL> is the first script to be defined in the default table. It contains special characters that are not, stricto sensu, a specific part of any natural language script - for example, "dingbats" of ISO/IEC 10646, or punctuation for most scripts.
The scanning properties for the level i being processed shall be carefully monitored. When there is a change in scanning direction at level i and the new direction is backward, stacking of the tokens will be done at the position where the change of direction has occurred. Therefore, when such a condition occurs, the application shall retain the current position in the output subkey i as position p (push position).
According to scanning direction assigned to the level i of the script whose identification corresponds to the character being processed, the obtained token is either added (concatenated) at the end of subkey i (which behaves like a list), or pushed at position p of subkey i (which then behaves like a stack). Subkey i is initially empty.
This is the equivalent of backward or forward scanning of the input string for that level. This property of scanning direction is given for each level of each script and is a script property. Each script header gives, for each level, the scanning direction property of the script.
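The list and stack behaviour described above can be sketched as follows in Python. The function name form_subkey, the dictionary-based tables and the sample weights are assumptions made for this illustration; the sketch also assumes that a character ignored at level i leaves both the subkey and the retained push position untouched, a point the text does not settle. The sample treats a script whose diacritics level is scanned backward.

```python
def form_subkey(string, weight, direction):
    """Form one subkey (level i) for a string scanned forward.

    weight    -- character -> token at level i, or None when ignored
    direction -- character -> "forward" or "backward" (script property)
    """
    subkey = []            # subkey i is initially empty
    push_position = 0      # position p retained on a change to backward
    previous = "forward"
    for char in string:
        token = weight.get(char)
        if token is None:
            continue       # assumption: ignored characters change nothing
        current = direction[char]
        if current != previous and current == "backward":
            push_position = len(subkey)   # retain position p
        previous = current
        if current == "backward":
            subkey.insert(push_position, token)   # behaves like a stack
        else:
            subkey.append(token)                  # behaves like a list
    return subkey


# Hypothetical level-2 weights: 1 = no accent, 2 = acute, 3 = circumflex.
weight2 = {"c": 1, "o": 1, "t": 1, "e": 1, "\u00e9": 2, "\u00f4": 3}
backward = {char: "backward" for char in weight2}

words = ["c\u00f4t\u00e9", "cot\u00e9", "c\u00f4te", "cote"]
ordered = sorted(words, key=lambda word: form_subkey(word, weight2, backward))

# Mixed directions: two forward tokens followed by two backward tokens.
mixed = form_subkey(
    "abxy",
    {"a": 10, "b": 11, "x": 20, "y": 21},
    {"a": "forward", "b": "forward", "x": "backward", "y": "backward"},
)
```

With backward scanning, the last accent of a word becomes the most significant one at that level, which is why the word carrying only a final acute accent sorts after the word carrying only a circumflex on an earlier letter.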
Normally, in alphabetic scripts (and in the default), levels represent the following decomposition for each character:
level 1: base level of each script. This level corresponds to the basic letters of the alphabet for that script, if the script is alphabetic, and to each character of the script if the script is ideographic or syllabic;
level 2: the level corresponding to diacritical marks affecting each basic character of the script. For some scripts, diacritics are always considered an integral part of the basic letters of the alphabet, and are not considered at this second level, but rather at the first. For example, N TILDE in Spanish is considered a basic letter of the Latin script. Therefore, tailoring for Spanish will change the definition of N TILDE from "the weight of an N in the first level and a tilde weight in the second level" to "the weight of an N TILDE (placed after N and before O) in the first level, and indication of the absence of extra diacritics in the second level"
level 3: the level corresponding to case or to variant character shape that affects each basic character of the script.
Note: Whatever the number of levels, except for level m, tailorable tables should not assign values for a given character to any level greater than 1 if no value is assigned to level 1. Otherwise full predictability of the results would not be guaranteed by this International Standard.
During forward scanning of each character of the input character string, a pair of tokens is concatenated to the end of subkey level m, which is initially empty. The first token of the pair corresponds to the logical (sequential) position, in the original character string, of the character being processed. The second token corresponds to the weight assigned to that character at level m of the table, that is, at level m of script <SPECIAL>. When the character is not assigned a value at level m in the table, it is ignored for the formation of subkey level m and no pair is concatenated.
This level is common to all scripts. In this standard, it is attached to the first script (under the header <SPECIAL>). Its property is positional in an absolute way: the numerical value of the position in the original string takes precedence over the weight assigned to the special character which occupies this position. For this level, the character string is always scanned forward in the logical string sequence.
In the table, this behaviour is described using the parameter couple "forward, position". To be conformant to this international standard, the parameter couple "backward, position" shall never be specified for level m. These two parameters shall be considered mutually exclusive.
In the default table, the first script (whose header is named <SPECIAL>) exclusively includes characters that are not considered part of the set of basic characters of any script - for example, special characters such as SPACE, HYPHEN, and "dingbats" of ISO/IEC 10646.
In the default table, definitions of these characters for levels 1 to 3 are such that they are ignored at these levels and values are exclusively assigned to level m (m being equal to 4 in the default).
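The formation of subkey level m can be sketched as follows in Python. The function name form_last_subkey and the use of 1-based positions are assumptions of this sketch; the two weights are the sequence numbers shown for SPACE and HYPHEN-MINUS in the comments of annex 1 and are used here for illustration only.

```python
# Illustrative level-m weights for two special characters.
SPECIAL_WEIGHT = {" ": 32, "-": 37}


def form_last_subkey(string, special_weight):
    """Form subkey level m as a flat sequence of (position, weight) pairs."""
    subkey = []  # subkey level m is initially empty
    for position, char in enumerate(string, start=1):  # forward scanning
        token = special_weight.get(char)
        if token is None:
            continue  # not assigned at level m: no pair is concatenated
        subkey.extend((position, token))  # position first, then weight
    return subkey


early_hyphen = form_last_subkey("co-op", SPECIAL_WEIGHT)
late_hyphen = form_last_subkey("coo-p", SPECIAL_WEIGHT)
no_special = form_last_subkey("coop", SPECIAL_WEIGHT)
two_specials = form_last_subkey("a b-c", SPECIAL_WEIGHT)
```

Because the position comes first in each pair, a special character occurring earlier in the string outranks any difference in the weights of the special characters themselves.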
This extra clause has been removed from previous drafts. It was intended for processing combining characters dynamically. There are more static solutions possible which will require tailoring if combining sequences are to be processed as single collating elements.
This requirement shall be met for conformance levels greater than 2.
The posthandling phase is part of the formation of a binary comparison key. Once the binary key has been formed out of the data specified in the table, the posthandling phase shall be invoked (see discussion about the potential purposes of such a phase in annex B). The result of the posthandling phase shall be returned as subkey level m+1 (level 5 in the default table, where m=4).
Tables 1 through 4 are formed out of the LC_COLLATE specification data described in the following paragraphs. Each collating element definition of the default table contains 4 explicit values. Each value corresponds to an internally-used token.
Normative Annex 1 gives the international default ordering table used as a template for tailoring localized applications working on the full repertoire of ISO/IEC 10646 (the Universal Multiple-Octet Coded Character Set).
A programming language or an application conforming to this international standard shall respect all the requirements of clause 5 of this document. Five levels of conformance are provided. The subclauses of clause 5 explicitly identify which requirements shall be met at each conformance level. The conformance levels have a cumulative effect: a given level of conformance implies that all requirements of the lower conformance levels shall be met. The different conformance levels can be abstracted as follows:
Level 1: only a limited implementation is required, as follows:
- fixed precision for equivalences, then limited to absolute equality;
- API 1 can be implemented without the combined APIs 2 and 3, or vice versa;
- the binary strings usable for direct comparisons do not have to be returned if API 1 is used;
Level 2: the possibility to invoke tables tailored from the default table is required;
Level 3: prehandling and posthandling of data to be compared is required;
Level 4: the possibility to designate a particular ordering table is required;
Level 5: processing of equivalences at different levels of precision is required.
The symbolic data specified in the default table of Annex 1 is conformant to the specification described in ISO/IEC 14652. Explanation of the structure of this table and of its different parameters is to be found in that International standard.
International standard ISO/IEC 14652 specifies how the symbolic data of the table described in Annex 1 can be tailored according to local user requirements.
Note: In this draft, annexes identified with a digit are intended to be normative. Annexes identified with a letter are intended to be informative.
LC_COLLATE
COLL_WEIGHT_MAX=4
# Déclaration des systèmes d'écriture / Declaration of scripts
script <SPECIAL>
script <LATIN>
script <ARABINT>
script <ARABFOR>
script <HEBREU>
script <GREC>
script <CYRIL>
script <HAN>
# Déclaration des symboles internes / Declaration of internal symbols
#
# SYMB N° Expl.
#
collating-symbol <RES-1>
#
# <ARABINT>/<ARABFOR>
#
# collating-symbol <ANO> # 2 normal --> voir/see <MIN>
collating-symbol <AIS> # 3 isol.
collating-symbol <AFI> # 4 final
collating-symbol <AII> # 5 initial
collating-symbol <AME> # 6 medial/médian
#
collating-symbol <MIN> # 7 minuscule/minuscule (bas de casse/lower case)
collating-symbol <IMI> # 8 inférieur min./subscript min. (indice/index)
collating-symbol <EMI> # 9 supér. min./superscript min. (exposant/exponent)
collating-symbol <CAP> # 10 capitale/capital (haut de casse/upper case)
collating-symbol <AMI> # 8 minuscule grecque/Greek lower case
collating-symbol <ICA> # 11 inférieur en capitale/subscript capital
collating-symbol <ECA> # 12 supérieur en capitale/superscript capital
#
# <ARABINT>/<ARABFOR>
#
collating-symbol <AMA> # 13 accent madda
collating-symbol <AHA> # 14 accent hamza
collating-symbol <AHW> # 14-1 accent hamza/waw
collating-symbol <AHS> # 14-2 accent hamza under / hamza souscrit
collating-symbol <AYE> # 14-3 accent under yeh / accent souscrit du ya'
collating-symbol <YBA> # 14-4 accent hamza/yeh barree
#
collating-symbol <BAS> # 15 de base/basic (non accentué/non-accented)
#
collating-symbol <PCL> # 16 particulier/peculiar
collating-symbol <LIG> # 17 ligature/ligature
collating-symbol <ACA> # 18 accent aigu/acute accent
collating-symbol <GRA> # 20 accent grave/grave accent
collating-symbol <BRE> # 21 brève/breve
collating-symbol <CIR> # 22 accent circonflexe/circumflex accent
collating-symbol <CAR> # 23 caron/caron
collating-symbol <RNE> # 24 rond supérieur/ring above
collating-symbol <REU> # 25 tréma/diaeresis (ou/or umlaut)
collating-symbol <DAC> # 26 double ac. aigu/double acute ac.
collating-symbol <TIL> # 27 tilde/tilde
collating-symbol <PCT> # 28 point/dot
collating-symbol <OBL> # 29 barre oblique/oblique
collating-symbol <CDI> # 30 cédille/cedilla
collating-symbol <OGO> # 31 ogonek/ogonek
collating-symbol <MAC> # 32 macron/macron
#
# GREC
#
collating-symbol <TNS> # accent aigu/tonos/acute accent
collating-symbol <DLT> # tréma/dialytica/diaeresis
collating-symbol <DTT> # dialytika tonos
#
collating-symbol <0>
collating-symbol <1>
collating-symbol <2>
collating-symbol <3>
collating-symbol <4>
collating-symbol <5>
collating-symbol <6>
collating-symbol <7>
collating-symbol <8>
collating-symbol <9>
#
collating-symbol <a>
collating-symbol <b>
collating-symbol <c>
collating-symbol <d>
collating-symbol <e>
collating-symbol <f>
collating-symbol <g>
collating-symbol <h>
collating-symbol <i>
collating-symbol <j>
collating-symbol <k>
collating-symbol <l>
collating-symbol <m>
collating-symbol <n>
collating-symbol <o>
collating-symbol <p>
collating-symbol <q>
collating-symbol <r>
collating-symbol <s>
collating-symbol <t>
collating-symbol <u>
collating-symbol <v>
collating-symbol <w>
collating-symbol <x>
collating-symbol <y>
collating-symbol <z>
#
# <ARABINT>/<ARABFOR>
#
collating-symbol <hamza>
collating-symbol <alef>
collating-symbol <beh>
collating-symbol <peh>
collating-symbol <teh_marbuta>
collating-symbol <teh>
collating-symbol <tteh>
collating-symbol <theh>
collating-symbol <jeem>
collating-symbol <tcheh>
collating-symbol <hah>
collating-symbol <khah>
collating-symbol <dal>
collating-symbol <ddal>
collating-symbol <thal>
collating-symbol <reh>
collating-symbol <rreh>
collating-symbol <zain>
collating-symbol <jeh>
collating-symbol <seen>
collating-symbol <sheen>
collating-symbol <sad>
collating-symbol <dad>
collating-symbol <tah>
collating-symbol <zah>
collating-symbol <ain>
collating-symbol <ghain>
collating-symbol <feh>
collating-symbol <qaf>
collating-symbol <kaf>
collating-symbol <keheh>
collating-symbol <gaf>
collating-symbol <lam>
collating-symbol <meem>
collating-symbol <noon>
collating-symbol <noon_ghunna>
collating-symbol <heh>
collating-symbol <heh_yeh>
collating-symbol <waw>
collating-symbol <alef_maksura>
collating-symbol <yeh_barree>
#
# <HEBREU>
#
collating-symbol <alef>
collating-symbol <bet>
collating-symbol <gimel>
collating-symbol <dalet>
collating-symbol <he>
collating-symbol <vav>
collating-symbol <zayin>
collating-symbol <het>
collating-symbol <tet>
collating-symbol <yod>
collating-symbol <kaf_fin>
collating-symbol <kaf>
collating-symbol <lamed>
collating-symbol <mem_fin>
collating-symbol <mem>
collating-symbol <nun_fin>
collating-symbol <nun>
collating-symbol <samekh>
collating-symbol <ayin>
collating-symbol <pe_fin>
collating-symbol <pe>
collating-symbol <tsad_fin>
collating-symbol <tsadi>
collating-symbol <qof>
collating-symbol <resh>
collating-symbol <shin>
collating-symbol <tav>
#
# GREC
#
collating-symbol <ALPHA>
collating-symbol <BETA>
collating-symbol <GAMMA>
collating-symbol <DELTA>
collating-symbol <EPSILON>
collating-symbol <ZETA>
collating-symbol <ETA>
collating-symbol <THETA>
collating-symbol <IOTA>
collating-symbol <KAPPA>
collating-symbol <LAMBDA>
collating-symbol <MU>
collating-symbol <NU>
collating-symbol <XI>
collating-symbol <OMICRON>
collating-symbol <PI>
collating-symbol <RHO>
collating-symbol <SIGMA>
collating-symbol <TAU>
collating-symbol <UPSILON>
collating-symbol <PHI>
collating-symbol <KHI>
collating-symbol <PSI>
collating-symbol <OMEGA>
#
# CYRIL
#
collating-symbol <CYR-A>
collating-symbol <CYR-BE>
collating-symbol <CYR-VE>
collating-symbol <CYR-GHE>
collating-symbol <CYR-DE>
collating-symbol <CYR-GZHE>
collating-symbol <CYR-DJE>
collating-symbol <CYR-IE>
collating-symbol <UKR-IE>
collating-symbol <CYR-IO>
collating-symbol <CYR-ZHE>
collating-symbol <CYR-ZE>
collating-symbol <CYR-DZE>
collating-symbol <CYR-I>
collating-symbol <UKR-I>
collating-symbol <UKR-YI>
collating-symbol <CYR-IBRE>
collating-symbol <CYR-JE>
collating-symbol <CYR-KA>
collating-symbol <CYR-EL> collating-symbol <CYR-LJE> collating-symbol <CYR-EM> collating-symbol <CYR-EN> collating-symbol <CYR-NJE> collating-symbol <CYR-O> collating-symbol <CYR-PE> collating-symbol <CYR-ER> collating-symbol <CYR-ES> collating-symbol <CYR-TE> collating-symbol <CYR-KJE> collating-symbol <CYR-TSHE> collating-symbol <CYR-OU> collating-symbol <CYR-OUBRE> collating-symbol <CYR-EF> collating-symbol <CYR-HA> collating-symbol <CYR-TSE> collating-symbol <CYR-TSHE> collating-symbol <CYR-DCHE> collating-symbol <CYR-SHA> collating-symbol <CYR-SHTSHA> collating-symbol <CYR-SIGDUR> collating-symbol <CYR-YEROU> collating-symbol <CYR-SIGMOUIL> collating-symbol <CYR-E> collating-symbol <CYR-YOU> collating-symbol <CYR-YA> # Ordre des symboles internes / Order of internal symbols # # SYMB. N� # <RES-1> <MIN> # forme de base (bas de casse, arabe intrinsèque, # hébreu intrinsèque, etc. # basic form (lower case, intrinsic Arabic # intrinsic Hebrew and so on) # # <ARABINT>/<ARABFOR> # #<ANO> # voir <MIN> <AIS> # isol. # 3 <AFI> # final # 4 <AII> # initial # 5 <AME> # medial/m<e'>dian # 6 # <IMI> # 7 <EMI> # 8 <CAP> # 9 <ICA> # 10 <ECA> # 11 <AMI> #alternate lower case/ # 12 # #minuscules spéciales après majuscules # <ARABINT>/<ARABFOR> # <AMA> # accent madda #13 <AHA> # accent hamza #14 <AHW> # accent hamza/waw #14 1 <AHS> # accent hamza under / hamza souscrit #14 2 <AYE> # accent under yeh / accent souscrit du ya' #14 3 <YBA> # accent hamza/yeh barree #14 4 # <BAS> # 15 # <PCL> # 16 <LIG> # 17 <ACA> # 18 <GRA> # 19 <BRE> # 20 <CIR> # 21 <CAR> # 22 <RNE> # 23 <REU> # 24 <DAC> # 25 <TIL> # 26 <PCT> # 27 <OBL> # 28 <CDI> # 29 <OGO> # 30 <MAC> # 31 # # GREC # <TNS> # accent aigu/tonos/acute accent <DLT> # tr<e'>ma/dialytica/diaeresis <DTT> # dialytika tonos
# <0> # 48 <1> # 49 <2> # 50 <3> # 51 <4> # 52 <5> # 53 <6> # 54 <7> # 55 <8> # 56 <9> # 57 # <a> # 97 <b> # 98 <c> # 99 <d> # 100 <e> # 101 <f> # 102 <g> # 103 <h> # 104 <i> # 105 <j> # 106 <k> # 107 <l> # 108 <m> # 109 <n> # 110 <o> # 111 <p> # 112 <q> # 113 <r> # 114 <s> # 115 <t> # 116 <u> # 117 <v> # 118 <w> # 119 <x> # 120 <y> # 121 <z> # 122 <th> # 122b # # <ARABINT>/<ARABFOR> # <hamza> <alef> <beh> <peh> <teh_marbuta> <teh> <tteh> <theh> <jeem> <tcheh> <hah> <khah> <dal> <ddal> <thal> <reh> <rreh> <zain> <jeh> <seen> <sheen> <sad> <dad> <tah> <zah> <ain> <ghain> <feh> <qaf> <kaf> <keheh> <gaf> <lam> <meem> <noon> <noon_ghunna> <heh> <heh_yeh> <waw> <alef_maksura> <yeh_barree>
#
# <HEBREU>
#
<alef>
<bet>
<gimel>
<dalet>
<he>
<vav>
<zayin>
<het>
<tet>
<yod>
<kaf_fin>
<kaf>
<lamed>
<mem_fin>
<mem>
<nun_fin>
<nun>
<samekh>
<ayin>
<pe_fin>
<pe>
<tsad_fin>
<tsadi>
<qof>
<resh>
<shin>
<tav>
#
#GREC
#
<ALPHA>
<BETA>
<GAMMA>
<DELTA>
<EPSILON>
<ZETA>
<ETA>
<THETA>
<IOTA>
<KAPPA>
<LAMBDA>
<MU>
<NU>
<XI>
<OMICRON>
<PI>
<RHO>
<SIGMA>
<TAU>
<UPSILON>
<PHI>
<KHI>
<PSI>
<OMEGA>
#
#CYRIL
#
<CYR-A>
<CYR-BE>
<CYR-VE>
<CYR-GHE>
<CYR-DE>
<CYR-GZHE>
<CYR-DJE>
<CYR-IE>
<UKR-IE>
<CYR-IO>
<CYR-ZHE>
<CYR-ZE>
<CYR-DZE>
<CYR-I>
<UKR-I>
<UKR-YI>
<CYR-IBRE>
<CYR-JE>
<CYR-KA>
<CYR-EL>
<CYR-LJE>
<CYR-EM>
<CYR-EN>
<CYR-NJE>
<CYR-O>
<CYR-PE>
<CYR-ER>
<CYR-ES>
<CYR-TE>
<CYR-KJE>
<CYR-TSHE>
<CYR-OU>
<CYR-OUBRE>
<CYR-EF>
<CYR-HA>
<CYR-TSE>
<CYR-TSHE>
<CYR-DCHE>
<CYR-SHA>
<CYR-SHTSHA>
<CYR-SIGDUR>
<CYR-YEROU>
<CYR-SIGMOUIL>
<CYR-E>
<CYR-YOU>
<CYR-YA>
order_start <SPECIAL>;forward;backward;forward;forward,position
#
# Tout caractère non précisément défini sera considéré comme caractère spécial
# et considéré uniquement au dernier niveau.
#
# Any character not precisely specified will be considered as a special
# character and considered only at the last level.
#
<U0000>......<U7FFFFFFF> IGNORE;IGNORE;IGNORE;<U0000>......<U7FFFFFFF>
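#
# Illustration (informative only; <Uxxxx> is a placeholder, not an entry of
# this table): for a character <Uxxxx> that has no explicit entry in the
# sections below, the range rule above acts as if the table contained
#     <Uxxxx> IGNORE;IGNORE;IGNORE;<Uxxxx>
# so the character is ignored at levels 1 to 3 and carries a weight only at
# level 4, the last level of the default table.
#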
#
# SYMB. N° GLY
#
<U0020> IGNORE;IGNORE;IGNORE;<U0020> # 32 <SP>
<U005F> IGNORE;IGNORE;IGNORE;<U005F> # 33 _
<U0332> IGNORE;IGNORE;IGNORE;<U0332> # 34 <"_>
<U00AF> IGNORE;IGNORE;IGNORE;<U00AF> # 35 - (MACRON)
<U00AD> IGNORE;IGNORE;IGNORE;<U00AD> # 36 <SHY>
<U002D> IGNORE;IGNORE;IGNORE;<U002D> # 37 -
<U002C> IGNORE;IGNORE;IGNORE;<U002C> # 38 ,
<U003B> IGNORE;IGNORE;IGNORE;<U003B> # 39 ;
<U003A> IGNORE;IGNORE;IGNORE;<U003A> # 40 :
<U0021> IGNORE;IGNORE;IGNORE;<U0021> # 41 !
<U00A1> IGNORE;IGNORE;IGNORE;<U00A1> # 42 ¡
<U003F> IGNORE;IGNORE;IGNORE;<U003F> # 43 ?
<U00BF> IGNORE;IGNORE;IGNORE;<U00BF> # 44 ¿
<U002F> IGNORE;IGNORE;IGNORE;<U002F> # 45 /
<U0338> IGNORE;IGNORE;IGNORE;<U0338> # 46 <"/>
<U002E> IGNORE;IGNORE;IGNORE;<U002E> # 47 .
<U00B7> IGNORE;IGNORE;IGNORE;<U00B7> # 58 ·
<U00B8> IGNORE;IGNORE;IGNORE;<U00B8> # 59 ¸
<U0328> IGNORE;IGNORE;IGNORE;<U0328> # 60 <";>
<U0027> IGNORE;IGNORE;IGNORE;<U0027> # 61 '
<U2018> IGNORE;IGNORE;IGNORE;<U2018> # 62 <'6>
<U2019> IGNORE;IGNORE;IGNORE;<U2019> # 63 <'9>
<U0022> IGNORE;IGNORE;IGNORE;<U0022> # 64 "
<U201C> IGNORE;IGNORE;IGNORE;<U201C> # 65 <"6>
<U201D> IGNORE;IGNORE;IGNORE;<U201D> # 66 <"9>
<U00AB> IGNORE;IGNORE;IGNORE;<U00AB> # 67 «
<U00BB> IGNORE;IGNORE;IGNORE;<U00BB> # 68 »
<U0028> IGNORE;IGNORE;IGNORE;<U0028> # 69 (
<U207D> IGNORE;IGNORE;IGNORE;<U207D> # 70 <(S>
<U0029> IGNORE;IGNORE;IGNORE;<U0029> # 71 )
<U207E> IGNORE;IGNORE;IGNORE;<U207E> # 72 <)S>
<U005B> IGNORE;IGNORE;IGNORE;<U005B> # 73 [
<U005D> IGNORE;IGNORE;IGNORE;<U005D> # 74 ]
<U007B> IGNORE;IGNORE;IGNORE;<U007B> # 75 {
<U007D> IGNORE;IGNORE;IGNORE;<U007D> # 76 }
<U00A7> IGNORE;IGNORE;IGNORE;<U00A7> # 77 §
<U00B6> IGNORE;IGNORE;IGNORE;<U00B6> # 78 ¶
<U00A9> IGNORE;IGNORE;IGNORE;<U00A9> # 79 ©
<U00AE> IGNORE;IGNORE;IGNORE;<U00AE> # 80 ®
<U2122> IGNORE;IGNORE;IGNORE;<U2122> # 81 <TM>
<U0040> IGNORE;IGNORE;IGNORE;<U0040> # 82 @
<U00A4> IGNORE;IGNORE;IGNORE;<U00A4> # 83 ¤
<U00A2> IGNORE;IGNORE;IGNORE;<U00A2> # 84 ¢
<U0024> IGNORE;IGNORE;IGNORE;<U0024> # 85 $
<U00A3> IGNORE;IGNORE;IGNORE;<U00A3> # 86 £
<U00A5> IGNORE;IGNORE;IGNORE;<U00A5> # 87 ¥
<U002A> IGNORE;IGNORE;IGNORE;<U002A> # 88 *
<U005C> IGNORE;IGNORE;IGNORE;<U005C> # 89 \
<U0026> IGNORE;IGNORE;IGNORE;<U0026> # 90 &
<U0023> IGNORE;IGNORE;IGNORE;<U0023> # 91 #
<U0025> IGNORE;IGNORE;IGNORE;<U0025> # 92 %
<U207B> IGNORE;IGNORE;IGNORE;<U207B> # 93 <-S>
<U002B> IGNORE;IGNORE;IGNORE;<U002B> # 94 +
<U207A> IGNORE;IGNORE;IGNORE;<U207A> # 95 <+S>
<U00B1> IGNORE;IGNORE;IGNORE;<U00B1> # 96 ±
<U00B4> IGNORE;IGNORE;IGNORE;<0> # 123 ´
<U0060> IGNORE;IGNORE;IGNORE;<1> # 124 `
<U0306> IGNORE;IGNORE;IGNORE;<2> # 125 <"(>
<U005E> IGNORE;IGNORE;IGNORE;<3> # 126 ^
<U030C> IGNORE;IGNORE;IGNORE;<4> # 127 <"<>
<U030A> IGNORE;IGNORE;IGNORE;<5> # 128 <"0>
<U00A8> IGNORE;IGNORE;IGNORE;<6> # 129 ¨
<U030B> IGNORE;IGNORE;IGNORE;<7> # 130 <"">
<U007E> IGNORE;IGNORE;IGNORE;<8> # 131 ~
<U0307> IGNORE;IGNORE;IGNORE;<9> # 132 <".>
<U00F7> IGNORE;IGNORE;IGNORE;<a> # 133 ÷
<U00D7> IGNORE;IGNORE;IGNORE;<b> # 134 ×
<U2260> IGNORE;IGNORE;IGNORE;<c> # 135 <!=>
<U003C> IGNORE;IGNORE;IGNORE;<d> # 136 <
<U2264> IGNORE;IGNORE;IGNORE;<e> # 137 <=<>
<U003D> IGNORE;IGNORE;IGNORE;<f> # 138 =
<U2265> IGNORE;IGNORE;IGNORE;<g> # 139 </>=>
<U003E> IGNORE;IGNORE;IGNORE;<h> # 140 >
<U00AC> IGNORE;IGNORE;IGNORE;<i> # 141 ¬
<U007C> IGNORE;IGNORE;IGNORE;<j> # 142 |
<U00A6> IGNORE;IGNORE;IGNORE;<k> # 143 |
<U00B0> IGNORE;IGNORE;IGNORE;<l> # 144 °
<U00B5> IGNORE;IGNORE;IGNORE;<m> # 145 m
<U2126> IGNORE;IGNORE;IGNORE;<n> # 146 <Om>
<U220E> IGNORE;IGNORE;IGNORE;<o> # 147 <FP>
<U250C> IGNORE;IGNORE;IGNORE;<p> # 148 <_V/>>
<U252C> IGNORE;IGNORE;IGNORE;<q> # 149 <_V->
<U2510> IGNORE;IGNORE;IGNORE;<r> # 150 <_V<w>
<U251C> IGNORE;IGNORE;IGNORE;<s> # 151 <_!/>>
<U253C> IGNORE;IGNORE;IGNORE;<t> # 152 <_!->
<U2524> IGNORE;IGNORE;IGNORE;<u> # 153 <_!<>
<U2514> IGNORE;IGNORE;IGNORE;<v> # 154 <_A/>>
<U2534> IGNORE;IGNORE;IGNORE;<w> # 155 <_-A>
<U2518> IGNORE;IGNORE;IGNORE;<x> # 156 <_A<>
<U2502> IGNORE;IGNORE;IGNORE;<y> # 157 <_!>
<U2500> IGNORE;IGNORE;IGNORE;<z> # 158 <_->
#
<U2501> IGNORE;IGNORE;IGNORE;<U2501> # 159 <_=>
<U2190> IGNORE;IGNORE;IGNORE;<U2190> # 160 <<->
<U2192> IGNORE;IGNORE;IGNORE;<U2192> # 161 <-/>>
<U20D1> IGNORE;IGNORE;IGNORE;<U20D1> # 162 <"7>
<U2191> IGNORE;IGNORE;IGNORE;<U2191> # 163 <-!>
<U2193> IGNORE;IGNORE;IGNORE;<U2193> # 164 <-v>
<U266A> IGNORE;IGNORE;IGNORE;<U266A> # 165 <_d!>
<U2571> IGNORE;IGNORE;IGNORE;<U2571> # 166 <_/>//>
<U2572> IGNORE;IGNORE;IGNORE;<U2572> # 167 <_<\>
<U25E2> IGNORE;IGNORE;IGNORE;<U25E2> # 168 <_./>//>
<U25E3> IGNORE;IGNORE;IGNORE;<U25E3> # 169 <_.<\>
#
# <ARABINT>/<ARABFOR>
#
<U060C> IGNORE;IGNORE;IGNORE;<U060C>
<U061B> IGNORE;IGNORE;IGNORE;<U061B>
<U061F> IGNORE;IGNORE;IGNORE;<U061F>
<U0640> IGNORE;IGNORE;IGNORE;<U0640>
<U066A> IGNORE;IGNORE;IGNORE;<U066A>
<U066B> IGNORE;IGNORE;IGNORE;<U066B>
<U066C> IGNORE;IGNORE;IGNORE;<U066C>
<U066D> IGNORE;IGNORE;IGNORE;<U066D>
<U064B> IGNORE;IGNORE;IGNORE;<U064B> #<fathatan_no>
<UFE70> IGNORE;IGNORE;IGNORE;<UFE70> #<fathatan_is>
<UFE71> IGNORE;IGNORE;IGNORE;<UFE71> #<fathatan_me>
<U064C> IGNORE;IGNORE;IGNORE;<U064C> #<dammatan_no>
<UFE72> IGNORE;IGNORE;IGNORE;<UFE72> #<dammatan_is>
<U064D> IGNORE;IGNORE;IGNORE;<U064D> #<kasratan_no> <UFE74> IGNORE;IGNORE;IGNORE;<UFE74> #<kasratan_is>
<U064E> IGNORE;IGNORE;IGNORE;<U064E> #<fatha_no> <UFE76> IGNORE;IGNORE;IGNORE;<UFE76> #<fatha_is> <UFE77> IGNORE;IGNORE;IGNORE;<UFE77> #<fatha_me>
<U064F> IGNORE;IGNORE;IGNORE;<U064F> #<damma_no> <UFE78> IGNORE;IGNORE;IGNORE;<UFE78> #<damma_is> <UFE79> IGNORE;IGNORE;IGNORE;<UFE79> #<damma_me>
<U0650> IGNORE;IGNORE;IGNORE;<U0650> #<kasra_no> <UFE7A> IGNORE;IGNORE;IGNORE;<UFE7A> #<kasra_is> <UFE7B> IGNORE;IGNORE;IGNORE;<UFE7B> #<kasra_me>
<U0651> IGNORE;IGNORE;IGNORE;<U0651> #<shadda_no> <UFE7C> IGNORE;IGNORE;IGNORE;<UFE7C> #<shadda_is> <UFE7D> IGNORE;IGNORE;IGNORE;<UFE7D> #<shadda_me>
<U0652> IGNORE;IGNORE;IGNORE;<U0652> #<sukun_no> <UFE7E> IGNORE;IGNORE;IGNORE;<UFE7E> #<sukun_is> <UFE7F> IGNORE;IGNORE;IGNORE;<UFE7F> #<sukun_me>
# # <HEBREU> #
<U05B0> IGNORE;IGNORE;IGNORE;<U05B0> #point_sheva <U05B1> IGNORE;IGNORE;IGNORE;<U05B1> #point_hataf_segol <U05B2> IGNORE;IGNORE;IGNORE;<U05B2> #point_hataf_patah <U05B3> IGNORE;IGNORE;IGNORE;<U05B3> #point_hataf_qamats <U05B4> IGNORE;IGNORE;IGNORE;<U05B4> #point_hiriq <U05B5> IGNORE;IGNORE;IGNORE;<U05B5> #point_tsere <U05B6> IGNORE;IGNORE;IGNORE;<U05B6> #point_segol <U05B7> IGNORE;IGNORE;IGNORE;<U05B7> #point_patah <U05B8> IGNORE;IGNORE;IGNORE;<U05B8> #point_qamats <U05B9> IGNORE;IGNORE;IGNORE;<U05B9> #point_holam <U05BB> IGNORE;IGNORE;IGNORE;<U05BB> #point_qubuts <U05BC> IGNORE;IGNORE;IGNORE;<U05BC> #point_dagesh <U05BD> IGNORE;IGNORE;IGNORE;<U05BD> #point_meteg <U05BE> IGNORE;IGNORE;IGNORE;<U05BE> #maqaf <U05BF> IGNORE;IGNORE;IGNORE;<U05BF> #point_rafe <U05C0> IGNORE;IGNORE;IGNORE;<U05C0> #paseq <U05C1> IGNORE;IGNORE;IGNORE;<U05C1> #point_shin_dot <U05C2> IGNORE;IGNORE;IGNORE;<U05C2> #point_sin_dot <U05C3> IGNORE;IGNORE;IGNORE;<U05C3> #sof pasuq order_start <LATIN>;forward;backward;forward;forward,position
#
<U00A0> <U0020>;<BAS>;<MIN>;IGNORE # 170 <NBSP>
#
<U0030> <0>;<BAS>;<MIN>;IGNORE # 171 0
<U0031> <1>;<BAS>;<MIN>;IGNORE # 172 1
<U0032> <2>;<BAS>;<MIN>;IGNORE # 173 2
<U0033> <3>;<BAS>;<MIN>;IGNORE # 174 3
<U0034> <4>;<BAS>;<MIN>;IGNORE # 175 4
<U0035> <5>;<BAS>;<MIN>;IGNORE # 176 5
<U0036> <6>;<BAS>;<MIN>;IGNORE # 177 6
<U0037> <7>;<BAS>;<MIN>;IGNORE # 178 7
<U0038> <8>;<BAS>;<MIN>;IGNORE # 179 8
<U0039> <9>;<BAS>;<MIN>;IGNORE # 180 9
#
<U215B> <0>;<GRA>;<MIN>;IGNORE # 181 <18>
<U00BC> <0>;<BRE>;<MIN>;IGNORE # 182 ¼
<U215C> <0>;<CIR>;<MIN>;IGNORE # 183 <38>
<U215D> <0>;<RNE>;<MIN>;IGNORE # 184 <58>
<U215E> <0>;<DAC>;<MIN>;IGNORE # 185 <78>
<U00BD> <0>;<CAR>;<MIN>;IGNORE # 186 ½
<U00BE> <0>;<REU>;<MIN>;IGNORE # 187 ¾
<U2070> <0>;<BAS>;<EMI>;IGNORE # 188 <0S>
<U00B9> <1>;<BAS>;<EMI>;IGNORE # 189 ¹
<U00B2> <2>;<BAS>;<EMI>;IGNORE # 190 ²
<U00B3> <3>;<BAS>;<EMI>;IGNORE # 191 ³
<U2074> <4>;<BAS>;<EMI>;IGNORE # 192 <4S>
<U2075> <5>;<BAS>;<EMI>;IGNORE # 193 <5S>
<U2076> <6>;<BAS>;<EMI>;IGNORE # 194 <6S>
<U2077> <7>;<BAS>;<EMI>;IGNORE # 195 <7S>
<U2078> <8>;<BAS>;<EMI>;IGNORE # 196 <8S>
<U2079> <9>;<BAS>;<EMI>;IGNORE # 197 <9S>
#
<U0061> <a>;<BAS>;<MIN>;IGNORE # 198 a
<U00AA> <a>;<PCL>;<EMI>;IGNORE # 199 ª
<U00E1> <a>;<ACA>;<MIN>;IGNORE # 200 á
<U00E0> <a>;<GRA>;<MIN>;IGNORE # 201 à
<U00E2> <a>;<CIR>;<MIN>;IGNORE # 202 â
<U00E3> <a>;<TIL>;<MIN>;IGNORE # 203 ã
<U00E4> <a>;<REU>;<MIN>;IGNORE # 204 ä
<U00E5> <a>;<RNE>;<MIN>;IGNORE # 205 å
<U0103> <a>;<BRE>;<MIN>;IGNORE # 206 <a(>
<U0105> <a>;<OGO>;<MIN>;IGNORE # 207 <a;>
<U0101> <a>;<MAC>;<MIN>;IGNORE # 208 <a->
<U00E6> <a><e>;<LIG><LIG>;<MIN><MIN>;IGNORE # 209 æ
<U0062> <b>;<BAS>;<MIN>;IGNORE # 210 b
<U0063> <c>;<BAS>;<MIN>;IGNORE # 211 c
<U00E7> <c>;<CDI>;<MIN>;IGNORE # 212 ç
<U0107> <c>;<ACA>;<MIN>;IGNORE # 213 <c'>
<U0109> <c>;<CIR>;<MIN>;IGNORE # 214 <c/>>
<U010D> <c>;<CAR>;<MIN>;IGNORE # 215 <c<>
<U010B> <c>;<PCT>;<MIN>;IGNORE # 216 <c.>
<U0064> <d>;<BAS>;<MIN>;IGNORE # 217 d
<U00F0> <d>;<PCL>;<MIN>;IGNORE # 218 ð
<U010F> <d>;<CAR>;<MIN>;IGNORE # 219 <d<> <U0111> <d>;<OBL>;<MIN>;IGNORE # 220 <d//> <U0065> <e>;<BAS>;<MIN>;IGNORE # 221 e <U00E9> <e>;<ACA>;<MIN>;IGNORE # 222 é <U00E8> <e>;<GRA>;<MIN>;IGNORE # 223 è <U00EA> <e>;<CIR>;<MIN>;IGNORE # 224 ê <U00EB> <e>;<REU>;<MIN>;IGNORE # 225 ë <U011B> <e>;<CAR>;<MIN>;IGNORE # 226 <e<> <U0117> <e>;<PCT>;<MIN>;IGNORE # 227 <e.> <U0119> <e>;<OGO>;<MIN>;IGNORE # 228 <e;> <U0113> <e>;<MAC>;<MIN>;IGNORE # 229 <e-> <U0066> <f>;<BAS>;<MIN>;IGNORE # 230 f <U0067> <g>;<BAS>;<MIN>;IGNORE # 231 g <U011F> <g>;<BRE>;<MIN>;IGNORE # 232 <g(> <U011D> <g>;<CIR>;<MIN>;IGNORE # 233 <g/>> <U0121> <g>;<PCT>;<MIN>;IGNORE # 234 <g.> <U0123> <g>;<CDI>;<MIN>;IGNORE # 235 <g,> <U0068> <h>;<BAS>;<MIN>;IGNORE # 236 h <U0125> <h>;<CIR>;<MIN>;IGNORE # 237 <h/>> <U0127> <h>;<OBL>;<MIN>;IGNORE # 238 <h//> <U0069> <i>;<BAS>;<MIN>;IGNORE # 239 i <U00ED> <i>;<ACA>;<MIN>;IGNORE # 240 í <U00EC> <i>;<GRA>;<MIN>;IGNORE # 241 ì <U00EE> <i>;<CIR>;<MIN>;IGNORE # 242 î <U00EF> <i>;<REU>;<MIN>;IGNORE # 243 ï <U0131> <i>;<PCL>;<MIN>;IGNORE # 244 <i.> <U0129> <i>;<TIL>;<MIN>;IGNORE # 245 <i?> <U012F> <i>;<OGO>;<MIN>;IGNORE # 246 <i;> <U012B> <i>;<MAC>;<MIN>;IGNORE # 247 <i-> <U0133> <i><j>;<LIG><LIG>;<MIN><MIN>;IGNORE # 248 <ij> <U006A> <j>;<BAS>;<MIN>;IGNORE # 249 j <U0135> <j>;<CIR>;<MIN>;IGNORE # 250 <j/>> <U006B> <k>;<BAS>;<MIN>;IGNORE # 251 k <U0138> <k>;<PCL>;<MIN>;IGNORE # 252 <kk> <U0137> <k>;<CDI>;<MIN>;IGNORE # 253 <k,> <U006C> <l>;<BAS>;<MIN>;IGNORE # 254 l <U013A> <l>;<ACA>;<MIN>;IGNORE # 255 <l'> <U013E> <l>;<CAR>;<MIN>;IGNORE # 256 <l<> <U0142> <l>;<OBL>;<MIN>;IGNORE # 257 <l//> <U013C> <l>;<CDI>;<MIN>;IGNORE # 258 <l,> <U0140> <l>;<PCT>;<MIN>;IGNORE # 259 <l.> <U006D> <m>;<BAS>;<MIN>;IGNORE # 260 m <U006E> <n>;<BAS>;<MIN>;IGNORE # 261 n <U00F1> <n>;<TIL>;<MIN>;IGNORE # 262 ñ <U0149> <n>;<PCL>;<MIN>;IGNORE # 263 <'n> <U0144> <n>;<ACA>;<MIN>;IGNORE # 264 <n'> <U0148> <n>;<CAR>;<MIN>;IGNORE # 265 <n<> <U0146> <n>;<CDI>;<MIN>;IGNORE # 266 <n,> <U014B> 
<n><g>;<LIG><LIG>;<MIN><MIN>;IGNORE # 267 <ng> <U006F> <o>;<BAS>;<MIN>;IGNORE # 268 o <U00BA> <o>;<PCL>;<EMI>;IGNORE # 269 � <U00F3> <o>;<ACA>;<MIN>;IGNORE # 270 ó <U00F2> <o>;<GRA>;<MIN>;IGNORE # 271 ò <U00F4> <o>;<CIR>;<MIN>;IGNORE # 272 ô <U00F5> <o>;<TIL>;<MIN>;IGNORE # 273 õ <U00F6> <o>;<REU>;<MIN>;IGNORE # 274 ö <U00F8> <o>;<OBL>;<MIN>;IGNORE # 275 ø <U0151> <o>;<DAC>;<MIN>;IGNORE # 276 <o"> <U014D> <o>;<MAC>;<MIN>;IGNORE # 277 <o-> <U0153> <o><e>;<LIG><LIG>;<MIN><MIN>;IGNORE # 278 <oe> <U0070> <p>;<BAS>;<MIN>;IGNORE # 279 p <U0071> <q>;<BAS>;<MIN>;IGNORE # 280 q <U0072> <r>;<BAS>;<MIN>;IGNORE # 281 r <U0155> <r>;<ACA>;<MIN>;IGNORE # 282 <r'> <U0159> <r>;<CAR>;<MIN>;IGNORE # 283 <r<> <U0157> <r>;<CDI>;<MIN>;IGNORE # 284 <r,> <U0073> <s>;<BAS>;<MIN>;IGNORE # 285 s <U015B> <s>;<ACA>;<MIN>;IGNORE # 286 <s'> <U015D> <s>;<CIR>;<MIN>;IGNORE # 287 <s/>> <U0161> <s>;<CAR>;<MIN>;IGNORE # 288 <s<> <U015F> <s>;<CDI>;<MIN>;IGNORE # 289 <s,> <U00DF> <s><s>;<LIG><LIG>;<MIN><MIN>;IGNORE # 290 � <U0074> <t>;<BAS>;<MIN>;IGNORE # 291 t <U0165> <t>;<CAR>;<MIN>;IGNORE # 292 <t<> <U0167> <t>;<OBL>;<MIN>;IGNORE # 293 <t//> <U0163> <t>;<CDI>;<MIN>;IGNORE # 294 <t,> <U0075> <u>;<BAS>;<MIN>;IGNORE # 296 u <U00FA> <u>;<ACA>;<MIN>;IGNORE # 297 ú <U00F9> <u>;<GRA>;<MIN>;IGNORE # 298 ù <U00FB> <u>;<CIR>;<MIN>;IGNORE # 299 û <U00FC> <u>;<REU>;<MIN>;IGNORE # 300 ü <U016D> <u>;<BRE>;<MIN>;IGNORE # 301 <u(> <U016F> <u>;<RNE>;<MIN>;IGNORE # 302 <u0> <U0171> <u>;<DAC>;<MIN>;IGNORE # 303 <u"> <U0169> <u>;<TIL>;<MIN>;IGNORE # 304 <u?> <U0173> <u>;<OGO>;<MIN>;IGNORE # 305 <u;> <U016B> <u>;<MAC>;<MIN>;IGNORE # 306 <u-> <U0076> <v>;<BAS>;<MIN>;IGNORE # 307 v <U0077> <w>;<BAS>;<MIN>;IGNORE # 308 w <U0175> <w>;<CIR>;<MIN>;IGNORE # 309 <w/>> <U0078> <x>;<BAS>;<MIN>;IGNORE # 310 x <U0079> <y>;<BAS>;<MIN>;IGNORE # 311 y <U00FD> <y>;<ACA>;<MIN>;IGNORE # 312 � <U00FF> <y>;<REU>;<MIN>;IGNORE # 313 _ <U0177> <y>;<CIR>;<MIN>;IGNORE # 314 <y/>> <U007A> <z>;<BAS>;<MIN>;IGNORE # 315 z <U017A> 
<z>;<ACA>;<MIN>;IGNORE # 316 <z'> <U017E> <z>;<CAR>;<MIN>;IGNORE # 317 <z<> <U017C> <z>;<PCT>;<MIN>;IGNORE # 318 <z.> <U00FE> <th>;<BAS>;<MIN>;IGNORE # 318b � # <U0041> <a>;<BAS>;<CAP>;IGNORE # 319 A <U00C1> <a>;<ACA>;<CAP>;IGNORE # 320 Á <U00C0> <a>;<GRA>;<CAP>;IGNORE # 321 À <U00C2> <a>;<CIR>;<CAP>;IGNORE # 322 Â <U00C3> <a>;<TIL>;<CAP>;IGNORE # 323 Ã <U00C4> <a>;<REU>;<CAP>;IGNORE # 324 Ä <U00C5> <a>;<RNE>;<CAP>;IGNORE # 325 Å <U0102> <a>;<BRE>;<CAP>;IGNORE # 326 <A(> <U0104> <a>;<OGO>;<CAP>;IGNORE # 327 <A;> <U0100> <a>;<MAC>;<CAP>;IGNORE # 328 <A-> <U00C6> <a><e>;<LIG><LIG>;<CAP><CAP>;IGNORE # 329 Æ <U0042> <b>;<BAS>;<CAP>;IGNORE # 330 B <U0043> <c>;<BAS>;<CAP>;IGNORE # 331 C <U00C7> <c>;<CDI>;<CAP>;IGNORE # 332 Ç <U0106> <c>;<ACA>;<CAP>;IGNORE # 333 <C'> <U0108> <c>;<CIR>;<CAP>;IGNORE # 334 <C/>> <U010C> <c>;<CAR>;<CAP>;IGNORE # 335 <C>> <U010A> <c>;<PCT>;<CAP>;IGNORE # 336 <C.> <U0044> <d>;<BAS>;<CAP>;IGNORE # 337 D <U00D0> <d>;<PCL>;<CAP>;IGNORE # 338 � <U010E> <d>;<CAR>;<CAP>;IGNORE # 339 <D<> <U0110> <d>;<OBL>;<CAP>;IGNORE # 340 <D//> <U0045> <e>;<BAS>;<CAP>;IGNORE # 341 E <U00C9> <e>;<ACA>;<CAP>;IGNORE # 342 É <U00C8> <e>;<GRA>;<CAP>;IGNORE # 343 È <U00CA> <e>;<CIR>;<CAP>;IGNORE # 344 Ê <U00CB> <e>;<REU>;<CAP>;IGNORE # 345 Ë <U011A> <e>;<CAR>;<CAP>;IGNORE # 346 <E<> <U0116> <e>;<PCT>;<CAP>;IGNORE # 347 <E.> <U0118> <e>;<OGO>;<CAP>;IGNORE # 348 <E;> <U0112> <e>;<MAC>;<CAP>;IGNORE # 349 <E-> <U0046> <f>;<BAS>;<CAP>;IGNORE # 350 F <U0047> <g>;<BAS>;<CAP>;IGNORE # 351 G <U011E> <g>;<BRE>;<CAP>;IGNORE # 352 <G(> <U011C> <g>;<CIR>;<CAP>;IGNORE # 353 <G/>> <U0120> <g>;<PCT>;<CAP>;IGNORE # 354 <G.> <U0122> <g>;<CDI>;<CAP>;IGNORE # 355 <G,> <U0048> <h>;<BAS>;<CAP>;IGNORE # 356 H <U0124> <h>;<CIR>;<CAP>;IGNORE # 357 <H/>> <U0126> <h>;<OBL>;<CAP>;IGNORE # 358 <H//> <U0049> <i>;<BAS>;<CAP>;IGNORE # 359 I <U00CD> <i>;<ACA>;<CAP>;IGNORE # 360 Í <U00CC> <i>;<GRA>;<CAP>;IGNORE # 361 Ì <U00CE> <i>;<CIR>;<CAP>;IGNORE # 362 Î <U00CF> <i>;<REU>;<CAP>;IGNORE # 363 Ï <U0130> 
<i>;<PCL>;<CAP>;IGNORE # 364 <I.> <U0128> <i>;<TIL>;<CAP>;IGNORE # 365 <I?> <U012E> <i>;<OGO>;<CAP>;IGNORE # 366 <I;> <U012A> <i>;<MAC>;<CAP>;IGNORE # 367 <I-> <U0132> <i><j>;<LIG><LIG>;<CAP><CAP>;IGNORE # 368 <IJ> <U004A> <j>;<BAS>;<CAP>;IGNORE # 369 J <U0134> <j>;<CIR>;<CAP>;IGNORE # 370 <J/>> <U004B> <k>;<BAS>;<CAP>;IGNORE # 371 K <U0136> <k>;<CDI>;<CAP>;IGNORE # 372 <K,> <U004C> <l>;<BAS>;<CAP>;IGNORE # 373 L <U0139> <l>;<ACA>;<CAP>;IGNORE # 374 <L'> <U013D> <l>;<CAR>;<CAP>;IGNORE # 375 <L<> <U0141> <l>;<OBL>;<CAP>;IGNORE # 376 <L//> <U013B> <l>;<CDI>;<CAP>;IGNORE # 377 <L,> <U013F> <l>;<PCT>;<CAP>;IGNORE # 378 <L.> <U004D> <m>;<BAS>;<CAP>;IGNORE # 379 M <U004E> <n>;<BAS>;<CAP>;IGNORE # 380 N <U00D1> <n>;<TIL>;<CAP>;IGNORE # 381 Ñ <U0143> <n>;<ACA>;<CAP>;IGNORE # 382 <N'> <U0147> <n>;<CAR>;<CAP>;IGNORE # 383 <N<> <U0145> <n>;<CDI>;<CAP>;IGNORE # 384 <N,> <U014A> <n><g>;<LIG><LIG>;<CAP><CAP>;IGNORE # 385 <NG> <U004F> <o>;<BAS>;<CAP>;IGNORE # 386 O <U00D3> <o>;<ACA>;<CAP>;IGNORE # 387 Ó <U00D2> <o>;<GRA>;<CAP>;IGNORE # 388 Ò <U00D4> <o>;<CIR>;<CAP>;IGNORE # 389 Ô <U00D5> <o>;<TIL>;<CAP>;IGNORE # 390 Õ <U00D6> <o>;<REU>;<CAP>;IGNORE # 391 Ö <U00D8> <o>;<OBL>;<CAP>;IGNORE # 392 Ø <U0150> <o>;<DAC>;<CAP>;IGNORE # 393 <O"> <U014C> <o>;<MAC>;<CAP>;IGNORE # 394 <O-> <U0152> <o><e>;<LIG><LIG>;<CAP><CAP>;IGNORE # 395 <OE> <U0050> <p>;<BAS>;<CAP>;IGNORE # 396 P <U0051> <q>;<BAS>;<CAP>;IGNORE # 397 Q <U0052> <r>;<BAS>;<CAP>;IGNORE # 398 R <U0154> <r>;<ACA>;<CAP>;IGNORE # 399 <R'> <U0158> <r>;<CAR>;<CAP>;IGNORE # 400 <R<> <U0156> <r>;<CDI>;<CAP>;IGNORE # 401 <R,> <U0053> <s>;<BAS>;<CAP>;IGNORE # 402 S <U015A> <s>;<ACA>;<CAP>;IGNORE # 403 <S'> <U015C> <s>;<CIR>;<CAP>;IGNORE # 404 <S/>> <U0160> <s>;<CAR>;<CAP>;IGNORE # 405 <S<> <U015E> <s>;<CDI>;<CAP>;IGNORE # 406 <S,> <U0054> <t>;<BAS>;<CAP>;IGNORE # 407 T <U0164> <t>;<CAR>;<CAP>;IGNORE # 408 <T<> <U0166> <t>;<OBL>;<CAP>;IGNORE # 409 <T//> <U0162> <t>;<CDI>;<CAP>;IGNORE # 410 <T,> <U0055> <u>;<BAS>;<CAP>;IGNORE # 412 U 
<U00DA> <u>;<ACA>;<CAP>;IGNORE # 413 Ú
<U00D9> <u>;<GRA>;<CAP>;IGNORE # 414 Ù
<U00DB> <u>;<CIR>;<CAP>;IGNORE # 415 Û
<U00DC> <u>;<REU>;<CAP>;IGNORE # 416 Ü
<U016C> <u>;<BRE>;<CAP>;IGNORE # 417 <U(>
<U016E> <u>;<RNE>;<CAP>;IGNORE # 418 <U0>
<U0170> <u>;<DAC>;<CAP>;IGNORE # 419 <U">
<U0168> <u>;<TIL>;<CAP>;IGNORE # 420 <U?>
<U0172> <u>;<OGO>;<CAP>;IGNORE # 421 <U;>
<U016A> <u>;<MAC>;<CAP>;IGNORE # 422 <U->
<U0056> <v>;<BAS>;<CAP>;IGNORE # 423 V
<U0057> <w>;<BAS>;<CAP>;IGNORE # 424 W
<U0174> <w>;<CIR>;<CAP>;IGNORE # 425 <W/>>
<U0058> <x>;<BAS>;<CAP>;IGNORE # 426 X
<U0059> <y>;<BAS>;<CAP>;IGNORE # 427 Y
<U00DD> <y>;<ACA>;<CAP>;IGNORE # 428 Ý
<U0176> <y>;<CIR>;<CAP>;IGNORE # 429 <Y/>>
<U0178> <y>;<REU>;<CAP>;IGNORE # 430 <Y:>
<U005A> <z>;<BAS>;<CAP>;IGNORE # 431 Z
<U0179> <z>;<ACA>;<CAP>;IGNORE # 432 <Z'>
<U017D> <z>;<CAR>;<CAP>;IGNORE # 433 <Z<>
<U017B> <z>;<PCT>;<CAP>;IGNORE # 434 <Z.>
<U00DE> <th>;<BAS>;<CAP>;IGNORE # 411 Þ
order_start <ARABINT>;forward;forward;forward;forward,position
<U0660> <0>;<BAS>;<MIN>;IGNORE <U06F0> <0>;<PCL>;<MIN>;IGNORE <U0661> <1>;<BAS>;<MIN>;IGNORE <U06F1> <1>;<PCL>;<MIN>;IGNORE <U0662> <2>;<BAS>;<MIN>;IGNORE <U06F2> <2>;<PCL>;<MIN>;IGNORE <U0663> <3>;<BAS>;<MIN>;IGNORE <U06F3> <3>;<PCL>;<MIN>;IGNORE <U0664> <4>;<BAS>;<MIN>;IGNORE <U06F4> <4>;<PCL>;<MIN>;IGNORE <U0665> <5>;<BAS>;<MIN>;IGNORE <U06F5> <5>;<PCL>;<MIN>;IGNORE <U0666> <6>;<BAS>;<MIN>;IGNORE <U06F6> <6>;<PCL>;<MIN>;IGNORE <U0667> <7>;<BAS>;<MIN>;IGNORE <U06F7> <7>;<PCL>;<MIN>;IGNORE <U0668> <8>;<BAS>;<MIN>;IGNORE <U06F8> <8>;<PCL>;<MIN>;IGNORE <U0669> <9>;<BAS>;<MIN>;IGNORE <U06F9> <9>;<PCL>;<MIN>;IGNORE
<U0621> <hamza>;<BAS>;<MIN>;IGNORE
<U0622> <alef>;<AMA>;<MIN>;IGNORE
<U0623> <alef>;<AHA>;<MIN>;IGNORE
<U0625> <alef>;<AHS>;<MIN>;IGNORE
<U0627> <alef>;<BAS>;<MIN>;IGNORE
<U0628> <beh>;<BAS>;<MIN>;IGNORE
<U067E> <peh>;<BAS>;<MIN>;IGNORE
<U0629> <teh_marbuta>;<BAS>;<MIN>;IGNORE
<U062A> <teh>;<BAS>;<MIN>;IGNORE
<U0679> <tteh>;<BAS>;<MIN>;IGNORE
<U062B> <theh>;<BAS>;<MIN>;IGNORE
<U062C> <jeem>;<BAS>;<MIN>;IGNORE
<U0686> <tcheh>;<BAS>;<MIN>;IGNORE
<U062D> <hah>;<BAS>;<MIN>;IGNORE
<U062E> <khah>;<BAS>;<MIN>;IGNORE
<U062F> <dal>;<BAS>;<MIN>;IGNORE
<U0688> <ddal>;<BAS>;<MIN>;IGNORE
<U0630> <thal>;<BAS>;<MIN>;IGNORE
<U0631> <reh>;<BAS>;<MIN>;IGNORE
<U0691> <rreh>;<BAS>;<MIN>;IGNORE
<U0632> <zain>;<BAS>;<MIN>;IGNORE
<U0698> <jeh>;<BAS>;<MIN>;IGNORE
<U0633> <seen>;<BAS>;<MIN>;IGNORE
<U0634> <sheen>;<BAS>;<MIN>;IGNORE
<U0635> <sad>;<BAS>;<MIN>;IGNORE
<U0636> <dad>;<BAS>;<MIN>;IGNORE
<U0637> <tah>;<BAS>;<MIN>;IGNORE
<U0638> <zah>;<BAS>;<MIN>;IGNORE
<U0639> <ain>;<BAS>;<MIN>;IGNORE
<U063A> <ghain>;<BAS>;<MIN>;IGNORE
<U0641> <feh>;<BAS>;<MIN>;IGNORE
<U0642> <qaf>;<BAS>;<MIN>;IGNORE
<U0643> <kaf>;<BAS>;<MIN>;IGNORE
<U06A9> <keheh>;<BAS>;<MIN>;IGNORE
<U06AF> <gaf>;<BAS>;<MIN>;IGNORE
<U0644> <lam>;<BAS>;<MIN>;IGNORE
<U0645> <meem>;<BAS>;<MIN>;IGNORE
<U0646> <noon>;<BAS>;<MIN>;IGNORE
<U06BA> <noon_ghunna>;<BAS>;<MIN>;IGNORE
<U0647> <heh>;<BAS>;<MIN>;IGNORE
<U06C0> <heh_yeh>;<BAS>;<MIN>;IGNORE
<U0624> <waw>;<AHW>;<MIN>;IGNORE
<U0648> <waw>;<BAS>;<MIN>;IGNORE
<U0649> <alef_maksura>;<BAS>;<MIN>;IGNORE
<U0626> <alef_maksura><hamza>;<BAS><BAS>;<MIN><MIN>;IGNORE
<U064A> <alef_maksura>;<AYE>;<MIN>;IGNORE
<U06D3> <yeh_barree>;<YBA>;<MIN>;IGNORE
<U06D2> <yeh_barree>;<BAS>;<MIN>;IGNORE
order_start <ARABFOR>;backward;backward;backward;forward,position
<UFE80> <hamza>;<BAS>;<AIS>;IGNORE
<UFE81> <alef>;<AMA>;<AIS>;IGNORE <UFE82> <alef>;<AMA>;<AFI>;IGNORE <UFE83> <alef>;<AHA>;<AIS>;IGNORE <UFE84> <alef>;<AHA>;<AFI>;IGNORE <UFE87> <alef>;<AHS>;<AIS>;IGNORE <UFE88> <alef>;<AHS>;<AFI>;IGNORE <UFE8D> <alef>;<BAS>;<AIS>;IGNORE <UFE8E> <alef>;<BAS>;<AFI>;IGNORE
<UFE8F> <beh>;<BAS>;<AIS>;IGNORE <UFE90> <beh>;<BAS>;<AFI>;IGNORE <UFE91> <beh>;<BAS>;<AII>;IGNORE <UFE92> <beh>;<BAS>;<AME>;IGNORE
<UFB56> <peh>;<BAS>;<AIS>;IGNORE <UFB57> <peh>;<BAS>;<AFI>;IGNORE <UFB58> <peh>;<BAS>;<AII>;IGNORE <UFB59> <peh>;<BAS>;<AME>;IGNORE
<UFE93> <teh_marbuta>;<BAS>;<AIS>;IGNORE <UFE94> <teh_marbuta>;<BAS>;<AFI>;IGNORE
<UFE95> <teh>;<BAS>;<AIS>;IGNORE <UFE96> <teh>;<BAS>;<AFI>;IGNORE <UFE97> <teh>;<BAS>;<AII>;IGNORE <UFE98> <teh>;<BAS>;<AME>;IGNORE
<UFB66> <tteh>;<BAS>;<AIS>;IGNORE <UFB67> <tteh>;<BAS>;<AFI>;IGNORE <UFB68> <tteh>;<BAS>;<AII>;IGNORE <UFB69> <tteh>;<BAS>;<AME>;IGNORE
<UFE99> <theh>;<BAS>;<AIS>;IGNORE <UFE9A> <theh>;<BAS>;<AFI>;IGNORE <UFE9B> <theh>;<BAS>;<AII>;IGNORE <UFE9C> <theh>;<BAS>;<AME>;IGNORE
<UFE9D> <jeem>;<BAS>;<AIS>;IGNORE <UFE9E> <jeem>;<BAS>;<AFI>;IGNORE <UFE9F> <jeem>;<BAS>;<AII>;IGNORE <UFEA0> <jeem>;<BAS>;<AME>;IGNORE
<UFB7A> <tcheh>;<BAS>;<AIS>;IGNORE <UFB7B> <tcheh>;<BAS>;<AFI>;IGNORE <UFB7C> <tcheh>;<BAS>;<AII>;IGNORE <UFB7D> <tcheh>;<BAS>;<AME>;IGNORE
<UFEA1> <hah>;<BAS>;<AIS>;IGNORE <UFEA2> <hah>;<BAS>;<AFI>;IGNORE <UFEA3> <hah>;<BAS>;<AII>;IGNORE <UFEA4> <hah>;<BAS>;<AME>;IGNORE
<UFEA5> <khah>;<BAS>;<AIS>;IGNORE <UFEA6> <khah>;<BAS>;<AFI>;IGNORE <UFEA7> <khah>;<BAS>;<AII>;IGNORE <UFEA8> <khah>;<BAS>;<AME>;IGNORE
<UFEA9> <dal>;<BAS>;<AIS>;IGNORE <UFEAA> <dal>;<BAS>;<AFI>;IGNORE
<UFB88> <ddal>;<BAS>;<AIS>;IGNORE <UFB89> <ddal>;<BAS>;<AFI>;IGNORE
<UFEAB> <thal>;<BAS>;<AIS>;IGNORE <UFEAC> <thal>;<BAS>;<AFI>;IGNORE
<UFEAD> <reh>;<BAS>;<AIS>;IGNORE <UFEAE> <reh>;<BAS>;<AFI>;IGNORE
<UFB8C> <rreh>;<BAS>;<AIS>;IGNORE <UFB8D> <rreh>;<BAS>;<AFI>;IGNORE
<UFEAF> <zain>;<BAS>;<AIS>;IGNORE <UFEB0> <zain>;<BAS>;<AFI>;IGNORE
<UFB8A> <jeh>;<BAS>;<AIS>;IGNORE <UFB8B> <jeh>;<BAS>;<AFI>;IGNORE
<UFEB1> <seen>;<BAS>;<AIS>;IGNORE <UFEB2> <seen>;<BAS>;<AFI>;IGNORE <UFEB3> <seen>;<BAS>;<AII>;IGNORE <UFEB4> <seen>;<BAS>;<AME>;IGNORE
<UFEB5> <sheen>;<BAS>;<AIS>;IGNORE <UFEB6> <sheen>;<BAS>;<AFI>;IGNORE <UFEB7> <sheen>;<BAS>;<AII>;IGNORE <UFEB8> <sheen>;<BAS>;<AME>;IGNORE
<UFEB9> <sad>;<BAS>;<AIS>;IGNORE <UFEBA> <sad>;<BAS>;<AFI>;IGNORE <UFEBB> <sad>;<BAS>;<AII>;IGNORE <UFEBC> <sad>;<BAS>;<AME>;IGNORE
<UFEBD> <dad>;<BAS>;<AIS>;IGNORE <UFEBE> <dad>;<BAS>;<AFI>;IGNORE <UFEBF> <dad>;<BAS>;<AII>;IGNORE <UFEC0> <dad>;<BAS>;<AME>;IGNORE
<UFEC1> <tah>;<BAS>;<AIS>;IGNORE <UFEC2> <tah>;<BAS>;<AFI>;IGNORE <UFEC3> <tah>;<BAS>;<AII>;IGNORE <UFEC4> <tah>;<BAS>;<AME>;IGNORE
<UFEC5> <zah>;<BAS>;<AIS>;IGNORE <UFEC6> <zah>;<BAS>;<AFI>;IGNORE <UFEC7> <zah>;<BAS>;<AII>;IGNORE <UFEC8> <zah>;<BAS>;<AME>;IGNORE
<UFEC9> <ain>;<BAS>;<AIS>;IGNORE <UFECA> <ain>;<BAS>;<AFI>;IGNORE <UFECB> <ain>;<BAS>;<AII>;IGNORE <UFECC> <ain>;<BAS>;<AME>;IGNORE
<UFECD> <ghain>;<BAS>;<AIS>;IGNORE <UFECE> <ghain>;<BAS>;<AFI>;IGNORE <UFECF> <ghain>;<BAS>;<AII>;IGNORE <UFED0> <ghain>;<BAS>;<AME>;IGNORE
<UFED1> <feh>;<BAS>;<AIS>;IGNORE <UFED2> <feh>;<BAS>;<AFI>;IGNORE <UFED3> <feh>;<BAS>;<AII>;IGNORE <UFED4> <feh>;<BAS>;<AME>;IGNORE
<UFED5> <qaf>;<BAS>;<AIS>;IGNORE <UFED6> <qaf>;<BAS>;<AFI>;IGNORE <UFED7> <qaf>;<BAS>;<AII>;IGNORE <UFED8> <qaf>;<BAS>;<AME>;IGNORE
<UFED9> <kaf>;<BAS>;<AIS>;IGNORE <UFEDA> <kaf>;<BAS>;<AFI>;IGNORE <UFEDB> <kaf>;<BAS>;<AII>;IGNORE <UFEDC> <kaf>;<BAS>;<AME>;IGNORE
<UFB8E> <keheh>;<BAS>;<AIS>;IGNORE <UFB8F> <keheh>;<BAS>;<AFI>;IGNORE <UFB90> <keheh>;<BAS>;<AII>;IGNORE <UFB91> <keheh>;<BAS>;<AME>;IGNORE
<UFB92> <gaf>;<BAS>;<AIS>;IGNORE <UFB93> <gaf>;<BAS>;<AFI>;IGNORE <UFB94> <gaf>;<BAS>;<AII>;IGNORE <UFB95> <gaf>;<BAS>;<AME>;IGNORE
<UFEDD> <lam>;<BAS>;<AIS>;IGNORE <UFEDE> <lam>;<BAS>;<AFI>;IGNORE <UFEDF> <lam>;<BAS>;<AII>;IGNORE <UFEE0> <lam>;<BAS>;<AME>;IGNORE
<UFEE1> <meem>;<BAS>;<AIS>;IGNORE <UFEE2> <meem>;<BAS>;<AFI>;IGNORE <UFEE3> <meem>;<BAS>;<AII>;IGNORE <UFEE4> <meem>;<BAS>;<AME>;IGNORE
<UFEE5> <noon>;<BAS>;<AIS>;IGNORE <UFEE6> <noon>;<BAS>;<AFI>;IGNORE <UFEE7> <noon>;<BAS>;<AII>;IGNORE <UFEE8> <noon>;<BAS>;<AME>;IGNORE
<UFB9E> <noon_ghunna>;<BAS>;<AIS>;IGNORE <UFB9F> <noon_ghunna>;<BAS>;<AFI>;IGNORE
<UFEE9> <heh>;<BAS>;<AIS>;IGNORE <UFEEA> <heh>;<BAS>;<AFI>;IGNORE <UFEEB> <heh>;<BAS>;<AII>;IGNORE <UFEEC> <heh>;<BAS>;<AME>;IGNORE
<UFBA4> <heh_yeh>;<BAS>;<AIS>;IGNORE <UFBA5> <heh_yeh>;<BAS>;<AFI>;IGNORE
<UFE85> <waw>;<AHW>;<AIS>;IGNORE <UFE86> <waw>;<AHW>;<AFI>;IGNORE
<UFEED> <waw>;<BAS>;<AIS>;IGNORE <UFEEE> <waw>;<BAS>;<AFI>;IGNORE
<UFEEF> <alef_maksura>;<BAS>;<AIS>;IGNORE <UFEF0> <alef_maksura>;<BAS>;<AFI>;IGNORE
<UFE89> <alef_maksura><hamza>;<BAS><BAS>;<AIS><AIS>;IGNORE <UFE8A> <alef_maksura><hamza>;<BAS><BAS>;<AFI><AIS>;IGNORE <UFE8B> <alef_maksura><hamza>;<BAS><BAS>;<AII><AIS>;IGNORE <UFE8C> <alef_maksura><hamza>;<BAS><BAS>;<AME><AIS>;IGNORE
<UFEF1> <alef_maksura>;<AYE>;<AIS>;IGNORE <UFEF2> <alef_maksura>;<AYE>;<AFI>;IGNORE <UFEF3> <alef_maksura>;<AYE>;<AII>;IGNORE <UFEF4> <alef_maksura>;<AYE>;<AME>;IGNORE
<UFBB0> <yeh_barree>;<YBA>;<AIS>;IGNORE <UFBB1> <yeh_barree>;<YBA>;<AFI>;IGNORE
<UFBAE> <yeh_barree>;<BAS>;<AIS>;IGNORE <UFBAF> <yeh_barree>;<BAS>;<AFI>;IGNORE
<UFEF5> <lam><alef>;<BAS><AMA>;<AIS><AFI>;IGNORE <UFEF6> <lam><alef>;<BAS><AMA>;<AFI><AFI>;IGNORE
<UFEF7> <lam><alef>;<BAS><AHA>;<AIS><AFI>;IGNORE <UFEF8> <lam><alef>;<BAS><AHA>;<AFI><AFI>;IGNORE
<UFEF9> <lam><alef>;<BAS><AHS>;<AIS><AFI>;IGNORE <UFEFA> <lam><alef>;<BAS><AHS>;<AFI><AFI>;IGNORE
<UFEFB> <lam><alef>;<BAS><BAS>;<AIS><AFI>;IGNORE <UFEFC> <lam><alef>;<BAS><BAS>;<AFI><AFI>;IGNORE order_start <HEBREU>;forward;forward;forward;forward,position
<U05D0> <alef>;<BAS>;IGNORE;IGNORE
<U05D1> <bet>;<BAS>;IGNORE;IGNORE
<U05D2> <gimel>;<BAS>;IGNORE;IGNORE
<U05D3> <dalet>;<BAS>;IGNORE;IGNORE
<U05D4> <he>;<BAS>;IGNORE;IGNORE
<U05D5> <vav>;<BAS>;IGNORE;IGNORE
<U05D6> <zayin>;<BAS>;IGNORE;IGNORE
<U05D7> <het>;<BAS>;IGNORE;IGNORE
<U05D8> <tet>;<BAS>;IGNORE;IGNORE
<U05D9> <yod>;<BAS>;IGNORE;IGNORE
<U05DA> <kaf_fin>;<BAS>;IGNORE;IGNORE
<U05DB> <kaf>;<BAS>;IGNORE;IGNORE
<U05DC> <lamed>;<BAS>;IGNORE;IGNORE
<U05DD> <mem_fin>;<BAS>;IGNORE;IGNORE
<U05DE> <mem>;<BAS>;IGNORE;IGNORE
<U05DF> <nun_fin>;<BAS>;IGNORE;IGNORE
<U05E0> <nun>;<BAS>;IGNORE;IGNORE
<U05E1> <samekh>;<BAS>;IGNORE;IGNORE
<U05E2> <ayin>;<BAS>;IGNORE;IGNORE
<U05E3> <pe_fin>;<BAS>;IGNORE;IGNORE
<U05E4> <pe>;<BAS>;IGNORE;IGNORE
<U05E5> <tsad_fin>;<BAS>;IGNORE;IGNORE
<U05E6> <tsadi>;<BAS>;IGNORE;IGNORE
<U05E7> <qof>;<BAS>;IGNORE;IGNORE
<U05E8> <resh>;<BAS>;IGNORE;IGNORE
<U05E9> <shin>;<BAS>;IGNORE;IGNORE
<U05EA> <tav>;<BAS>;IGNORE;IGNORE
order_start <GREC>;forward;backward;forward
<U0391> <ALPHA>;<BAS>;<CAP>;IGNORE
<U03B1> <ALPHA>;<BAS>;<AMI>;IGNORE
<U0386> <ALPHA>;<TNS>;<CAP>;IGNORE
<U03AC> <ALPHA>;<TNS>;<AMI>;IGNORE
<U0392> <BETA>;<BAS>;<CAP>;IGNORE
<U03B2> <BETA>;<BAS>;<AMI>;IGNORE
<U03D0> <BETA>;<PCL>;<AMI>;IGNORE
<U0393> <GAMMA>;<BAS>;<CAP>;IGNORE
<U03B3> <GAMMA>;<BAS>;<AMI>;IGNORE
<U03DC> <GAMMA>;<PCL>;<CAP>;IGNORE # digamma (Coptic)
<U0394> <DELTA>;<BAS>;<CAP>;IGNORE
<U03B4> <DELTA>;<BAS>;<AMI>;IGNORE
<U03EA> <DELTA>;<PCL>;<CAP>;IGNORE # GANGIA (Coptic)
<U03EB> <DELTA>;<PCL>;<AMI>;IGNORE # gangia (Coptic)
<U0395> <EPSILON>;<BAS>;<CAP>;IGNORE
<U03B5> <EPSILON>;<BAS>;<AMI>;IGNORE
<U0388> <EPSILON>;<TNS>;<CAP>;IGNORE
<U03AD> <EPSILON>;<TNS>;<AMI>;IGNORE
<U0396> <ZETA>;<BAS>;<CAP>;IGNORE
<U03B6> <ZETA>;<BAS>;<AMI>;IGNORE
<U03E8> <ZETA>;<PCL>;<CAP>;IGNORE # HORI (Coptic)
<U03E9> <ZETA>;<PCL>;<AMI>;IGNORE # hori (Coptic)
<U0397> <ETA>;<BAS>;<CAP>;IGNORE
<U03B7> <ETA>;<BAS>;<AMI>;IGNORE
<U0389> <ETA>;<TNS>;<CAP>;IGNORE
<U03AE> <ETA>;<TNS>;<AMI>;IGNORE
<U0398> <THETA>;<BAS>;<CAP>;IGNORE
<U03B8> <THETA>;<BAS>;<AMI>;IGNORE
<U03D1> <THETA>;<PCL>;<AMI>;IGNORE
<U0399> <IOTA>;<BAS>;<CAP>;IGNORE
<U03B9> <IOTA>;<BAS>;<AMI>;IGNORE
<U038A> <IOTA>;<TNS>;<CAP>;IGNORE
<U03AF> <IOTA>;<TNS>;<AMI>;IGNORE
<U03AA> <IOTA>;<DLT>;<CAP>;IGNORE
<U03CA> <IOTA>;<DLT>;<AMI>;IGNORE
<U0390> <IOTA>;<DTT>;<AMI>;IGNORE
<U03F3> <IOTA>;<OGO>;<AMI>;IGNORE # yot
<U039A> <KAPPA>;<BAS>;<CAP>;IGNORE
<U03BA> <KAPPA>;<BAS>;<AMI>;IGNORE
<U03DE> <KAPPA>;<PCL>;<CAP>;IGNORE # koppa (Coptic)
<U03F0> <KAPPA>;<PCL>;<AMI>;IGNORE
<U03E6> <KAPPA>;<LIG>;<CAP>;IGNORE # KHEI (Coptic)
<U03E7> <KAPPA>;<LIG>;<AMI>;IGNORE # khei (Coptic)
<U039B> <LAMBDA>;<BAS>;<CAP>;IGNORE
<U03BB> <LAMBDA>;<BAS>;<AMI>;IGNORE
<U039C> <MU>;<BAS>;<CAP>;IGNORE
<U03BC> <MU>;<BAS>;<AMI>;IGNORE
<U039D> <NU>;<BAS>;<CAP>;IGNORE
<U03BD> <NU>;<BAS>;<AMI>;IGNORE
<U039E> <XI>;<BAS>;<CAP>;IGNORE
<U03BE> <XI>;<BAS>;<AMI>;IGNORE
<U039F> <OMICRON>;<BAS>;<CAP>;IGNORE
<U03BF> <OMICRON>;<BAS>;<AMI>;IGNORE
<U038C> <OMICRON>;<TNS>;<CAP>;IGNORE
<U03CC> <OMICRON>;<TNS>;<AMI>;IGNORE
<U03A0> <PI>;<BAS>;<CAP>;IGNORE
<U03C0> <PI>;<BAS>;<AMI>;IGNORE
<U03D6> <PI>;<PCL>;<AMI>;IGNORE
<U03A1> <RHO>;<BAS>;<CAP>;IGNORE
<U03C1> <RHO>;<BAS>;<AMI>;IGNORE
<U03F1> <RHO>;<PCL>;<AMI>;IGNORE
<U03A3> <SIGMA>;<BAS>;<CAP>;IGNORE
<U03C3> <SIGMA>;<BAS>;<AMI>;IGNORE
<U03C2> <SIGMA>;<PCL>;<AMI>;IGNORE
<U03DA> <SIGMA>;<PCL>;<CAP>;IGNORE # STIGMA (archaic)
<U03EC> <SIGMA>;<LIG>;<CAP>;IGNORE # SHIMA (Coptic)
<U03ED> <SIGMA>;<LIG>;<AMI>;IGNORE # shima (Coptic)
<U03F2> <SIGMA>;<OGO>;<AMI>;IGNORE
<U03A4> <TAU>;<BAS>;<CAP>;IGNORE
<U03C4> <TAU>;<BAS>;<AMI>;IGNORE
<U03EE> <TAU>;<PCL>;<CAP>;IGNORE # DEI (Coptic)
<U03EF> <TAU>;<PCL>;<AMI>;IGNORE # dei (Coptic)
<U03A5> <UPSILON>;<BAS>;<CAP>;IGNORE
<U03C5> <UPSILON>;<BAS>;<AMI>;IGNORE
<U038E> <UPSILON>;<TNS>;<CAP>;IGNORE
<U03CD> <UPSILON>;<TNS>;<AMI>;IGNORE
<U03AB> <UPSILON>;<DLT>;<CAP>;IGNORE
<U03CB> <UPSILON>;<DLT>;<AMI>;IGNORE
<U03B0> <UPSILON>;<DTT>;<AMI>;IGNORE
<U03D4> <UPSILON>;<DTT>;<CAP>;IGNORE
<U03D2> <UPSILON>;<OGO>;<CAP>;IGNORE
<U03D3> <UPSILON>;<MAC>;<CAP>;IGNORE
<U03A6> <PHI>;<BAS>;<CAP>;IGNORE
<U03C6> <PHI>;<BAS>;<AMI>;IGNORE
<U03D5> <PHI>;<PCL>;<AMI>;IGNORE
<U03E4> <PHI>;<LIG>;<CAP>;IGNORE # FEI (Coptic)
<U03E5> <PHI>;<LIG>;<AMI>;IGNORE # fei (Coptic)
<U03A7> <KHI>;<BAS>;<CAP>;IGNORE
<U03C7> <KHI>;<BAS>;<AMI>;IGNORE
<U03E0> <KHI>;<PCL>;<CAP>;IGNORE # sampi (Coptic)
<U03A8> <PSI>;<BAS>;<CAP>;IGNORE
<U03C8> <PSI>;<BAS>;<AMI>;IGNORE
<U03E2> <PSI>;<PCL>;<CAP>;IGNORE # SHEI (Coptic)
<U03E3> <PSI>;<PCL>;<AMI>;IGNORE # shei (Coptic)
<U03A9> <OMEGA>;<BAS>;<CAP>;IGNORE
<U03C9> <OMEGA>;<BAS>;<AMI>;IGNORE
<U038F> <OMEGA>;<TNS>;<CAP>;IGNORE
<U03CE> <OMEGA>;<TNS>;<AMI>;IGNORE
order_start <CYRIL>;forward;forward;forward;forward,position
<U0430> <CYR-A>;<BAS>;<MIN>;IGNORE <U0410> <CYR-A>;<BAS>;<CAP>;IGNORE <U0431> <CYR-BE>;<BAS>;<MIN>;IGNORE <U0411> <CYR-BE>;<BAS>;<CAP>;IGNORE <U0432> <CYR-VE>;<BAS>;<MIN>;IGNORE <U0412> <CYR-VE>;<BAS>;<CAP>;IGNORE <U0433> <CYR-GHE>;<BAS>;<MIN>;IGNORE <U0413> <CYR-GHE>;<BAS>;<CAP>;IGNORE <U0434> <CYR-DE>;<BAS>;<MIN>;IGNORE <U0414> <CYR-DE>;<BAS>;<CAP>;IGNORE <U0453> <CYR-GZHE>;<BAS>;<MIN>;IGNORE <U0403> <CYR-GZHE>;<BAS>;<CAP>;IGNORE <U0452> <CYR-DJE>;<BAS>;<MIN>;IGNORE <U0402> <CYR-DJE>;<BAS>;<CAP>;IGNORE <U0435> <CYR-IE>;<BAS>;<MIN>;IGNORE <U0415> <CYR-IE>;<BAS>;<CAP>;IGNORE <U0454> <UKR-IE>;<BAS>;<MIN>;IGNORE <U0404> <UKR-IE>;<BAS>;<CAP>;IGNORE <U0451> <CYR-IO>;<BAS>;<MIN>;IGNORE <U0401> <CYR-IO>;<BAS>;<CAP>;IGNORE <U0436> <CYR-ZHE>;<BAS>;<MIN>;IGNORE <U0416> <CYR-ZHE>;<BAS>;<CAP>;IGNORE <U0437> <CYR-ZE>;<BAS>;<MIN>;IGNORE <U0417> <CYR-ZE>;<BAS>;<CAP>;IGNORE <U0455> <CYR-DZE>;<BAS>;<MIN>;IGNORE <U0405> <CYR-DZE>;<BAS>;<CAP>;IGNORE <U0438> <CYR-I>;<BAS>;<MIN>;IGNORE <U0418> <CYR-I>;<BAS>;<CAP>;IGNORE <U0456> <UKR-I>;<BAS>;<MIN>;IGNORE <U0406> <UKR-I>;<BAS>;<MIN>;IGNORE <U0457> <UKR-YI>;<BAS>;<MIN>;IGNORE <U0407> <UKR-YI>;<BAS>;<CAP>;IGNORE <U0439> <CYR-IBRE>;<BAS>;<MIN>;IGNORE <U0419> <CYR-IBRE>;<BAS>;<CAP>;IGNORE <U0458> <CYR-JE>;<BAS>;<MIN>;IGNORE <U0408> <CYR-JE>;<BAS>;<CAP>;IGNORE <U043A> <CYR-KA>;<BAS>;<MIN>;IGNORE <U041A> <CYR-KA>;<BAS>;<CAP>;IGNORE <U043B> <CYR-EL>;<BAS>;<MIN>;IGNORE <U041B> <CYR-EL>;<BAS>;<CAP>;IGNORE <U0459> <CYR-LJE>;<BAS>;<MIN>;IGNORE <U0409> <CYR-LJE>;<BAS>;<CAP>;IGNORE <U043C> <CYR-EM>;<BAS>;<MIN>;IGNORE <U041C> <CYR-EM>;<BAS>;<CAP>;IGNORE <U043D> <CYR-EN>;<BAS>;<MIN>;IGNORE <U041D> <CYR-EN>;<BAS>;<CAP>;IGNORE <U045A> <CYR-NJE>;<BAS>;<MIN>;IGNORE <U040A> <CYR-NJE>;<BAS>;<CAP>;IGNORE <U043E> <CYR-O>;<BAS>;<MIN>;IGNORE <U041E> <CYR-O>;<BAS>;<CAP>;IGNORE <U043F> <CYR-PE>;<BAS>;<MIN>;IGNORE <U041F> <CYR-PE>;<BAS>;<CAP>;IGNORE <U0440> <CYR-ER>;<BAS>;<MIN>;IGNORE <U0420> <CYR-ER>;<BAS>;<CAP>;IGNORE <U0441> <CYR-ES>;<BAS>;<MIN>;IGNORE <U0421> 
<CYR-ES>;<BAS>;<CAP>;IGNORE <U0442> <CYR-TE>;<BAS>;<MIN>;IGNORE <U0422> <CYR-TE>;<BAS>;<CAP>;IGNORE <U045C> <CYR-KJE>;<BAS>;<MIN>;IGNORE <U040C> <CYR-KJE>;<BAS>;<CAP>;IGNORE <U045B> <CYR-TSHE>;<BAS>;<MIN>;IGNORE <U040B> <CYR-TSHE>;<BAS>;<CAP>;IGNORE <U0443> <CYR-OU>;<BAS>;<MIN>;IGNORE <U0423> <CYR-OU>;<BAS>;<CAP>;IGNORE <U045E> <CYR-OUBRE>;<BAS>;<MIN>;IGNORE <U040E> <CYR-OUBRE>;<BAS>;<CAP>;IGNORE <U0444> <CYR-EF>;<BAS>;<MIN>;IGNORE <U0424> <CYR-EF>;<BAS>;<CAP>;IGNORE <U0445> <CYR-HA>;<BAS>;<MIN>;IGNORE <U0425> <CYR-HA>;<BAS>;<CAP>;IGNORE <U0446> <CYR-TSE>;<BAS>;<MIN>;IGNORE <U0426> <CYR-TSE>;<BAS>;<CAP>;IGNORE <U0447> <CYR-TSHE>;<BAS>;<MIN>;IGNORE <U0427> <CYR-TSHE>;<BAS>;<CAP>;IGNORE <U045F> <CYR-DCHE>;<BAS>;<MIN>;IGNORE <U040F> <CYR-DCHE>;<BAS>;<CAP>;IGNORE <U0448> <CYR-SHA>;<BAS>;<MIN>;IGNORE <U0428> <CYR-SHA>;<BAS>;<CAP>;IGNORE <U0449> <CYR-SHTSHA>;<BAS>;<MIN>;IGNORE <U0429> <CYR-SHTSHA>;<BAS>;<CAP>;IGNORE <U044A> <CYR-SIGDUR>;<BAS>;<MIN>;IGNORE <U042A> <CYR-SIGDUR>;<BAS>;<CAP>;IGNORE <U044B> <CYR-YEROU>;<BAS>;<MIN>;IGNORE <U042B> <CYR-YEROU>;<BAS>;<CAP>;IGNORE <U044C> <CYR-SIGMOUIL>;<BAS>;<MIN>;IGNORE <U042C> <CYR-SIGMOUIL>;<BAS>;<CAP>;IGNORE <U044D> <CYR-E>;<BAS>;<MIN>;IGNORE <U042D> <CYR-E>;<BAS>;<CAP>;IGNORE <U044E> <CYR-YOU>;<BAS>;<MIN>;IGNORE <U042E> <CYR-YOU>;<BAS>;<CAP>;IGNORE <U044F> <CYR-YA>;<BAS>;<MIN>;IGNORE <U042F> <CYR-YA>;<BAS>;<CAP>;IGNORE
order_start <HAN>;forward;forward;forward;forward,position
<U4E00>......<U9FA5> <U4E00>......<U9FA5>;IGNORE;IGNORE;IGNORE
#
order_end
#
END LC_COLLATE
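The single-character entries of the table above share one shape: a code point symbol followed by four semicolon-separated level weights, optionally followed by a "#" comment. As a minimal sketch only, the following Python function reads one such entry. The function name parse_entry, the regular expression and the returned shape are assumptions made for this illustration; range entries and the order_start/order_end keywords are deliberately not handled.

```python
import re

# Hypothetical pattern for a single-character entry such as
#   <U0391> <ALPHA>;<BAS>;<CAP>;IGNORE
ENTRY = re.compile(r"^<U([0-9A-Fa-f]{4})>\s+(\S+?)\s*(?:#.*)?$")

def parse_entry(line):
    """Return (character, [weight per level]); IGNORE becomes None."""
    match = ENTRY.match(line.strip())
    if match is None:
        raise ValueError(f"not a single-character entry: {line!r}")
    char = chr(int(match.group(1), 16))
    weights = [None if w == "IGNORE" else w.strip("<>")
               for w in match.group(2).split(";")]
    return char, weights

char, weights = parse_entry("<U0391> <ALPHA>;<BAS>;<CAP>;IGNORE")
print(weights)   # ['ALPHA', 'BAS', 'CAP', None]
```

Representing IGNORE as None keeps "no weight at this level" distinct from any symbolic weight name.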
1 List with required result of the default
2 List with required result after example of tailoring
Note: In this draft, annexes identified with a digit are intended to be normative. Annexes identified with a letter are intended to be informative.
Note: These criteria have been subject to change. They represented an optimum; compromises later had to be made according to diverse circumstances.
1. The mechanism must provide a deterministic way to collate graphic character strings. Thus, if two strings of graphic characters are different when directly compared in binary, the order assigned by the mechanism should always be the same, and the strings will be considered different even if humans would consider them equivalent.
2. For each script, if this is possible, the order assigned will be culturally acceptable to a majority of users of that script.
3. The repertoire of characters supported should be at least the one defined by Level three implementation (the richest possible) of ISO/IEC 10646.
4. The ordering table will be defined keeping in mind the following points concerning internal string transformation number assignments:
- the assignments are processed as efficiently as possible if they are stored in a permanent way, and
- the assignments allow direct and correct one-pass binary comparisons between two resultant number sequences.
The table is defined this way because it is always possible to define an order between two strings by whatever complex method is used. However, real systems must have a minimum level of performance. Once assignment is made on original strings, the result must be storable without modification. Also, the result must be directly reusable for comparison purposes, without having to redo the conversion process each time. This will also enable existing systems to make comparisons with minimum changes and sometimes without having to change programs.
5. There must be a mechanism to use the table as a template, primarily to optimise the process for the user's language. In the template, the order of a series of characters may be modified by simple a posteriori declaration, without having to specify the whole table again.
6. Given the reusable comparison keys obtained (see 4), it must be possible to reconstitute the original as is without the need to preserve it. This means that the reversibility of the process must be available to applications if required.
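As an illustration of criterion 5, the following Python sketch treats a default order as a template and applies a single a posteriori declaration to it. The function name reorder_after and the list-based representation of the order are assumptions made for this example; they are not syntax defined by this draft.

```python
def reorder_after(order, anchor, moved):
    """Return a copy of `order` with the `moved` symbols placed just after `anchor`."""
    remaining = [symbol for symbol in order if symbol not in moved]
    position = remaining.index(anchor) + 1   # raises ValueError if anchor is absent
    return remaining[:position] + list(moved) + remaining[position:]

# A tiny excerpt of a default order, using symbol names from the Greek section.
default_order = ["ALPHA", "BETA", "GAMMA", "DELTA"]

# One declaration is enough; the rest of the template is left untouched.
tailored_order = reorder_after(default_order, "ALPHA", ["DELTA"])
print(tailored_order)   # ['ALPHA', 'DELTA', 'BETA', 'GAMMA']
```

Only the moved symbols are named in the declaration, which is the point of the template approach: the whole table does not have to be specified again.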
For information, the requirements listed above can already be satisfied by Canadian Standard CAN/CSA Z243.4.1 for West European languages, except that this standard is monoscript and does not support composite sequences as defined in ISO/IEC 10646. However, preliminary studies suggest that it is possible to extend the Canadian method to take into account both the multi-script requirement and the presence of composite sequences.
ISO/IEC 9945-2 (POSIX-2) allows the Canadian standard CAN/CSA Z243.4.1-1992 to be described. However, the POSIX model could require modifications to handle both the multi-script requirements and the need for composite sequences if an infinite repertoire is necessary for a given environment.
The application of this standard will not require full POSIX-2 conformance, but will be as compatible as possible with the POSIX LOCALE LC_COLLATE specification model. Otherwise, this standard will build on this specification model while attempting to make as few modifications as possible (particularly structural modifications).
Prehandling is essentially for modification and/or duplication of original records to render their fields context-independent prior to the comparison phase. Examples are:
- duplicating a string such as "41" for phonetic ordering into 3 strings for trilingual phonetic ordering usage (French, English and German):
QUARANTE-ET-UN FORTY-ONE EINUNDVIERZIG
- removing or rotating characters that are a nuisance for special ordering requirements; for example, in France, removing "de" in "de Gaulle" but not "De" in "De Gaulle", depending on whether or not the particle denotes noble origin, to give:
Gaulle (de) De Gaulle
- transforming incomplete data into full form; for example, transforming "Mc Arthur" to give "Mac Arthur"
- transforming numbers so that the result will be ordered in numerical order and not positionally or according to phonetics, for example:
Given the strings "100" and "15",
- either separate each of these numbers in different fields from the rest of text and convert them entirely in standard numeric (binary) data to be ordered numerically and not textually, or
- pad/align numbers to make sure the one-phase default ordering mechanism will process them correctly:
"015" "100"
- transforming Roman numerals into Arabic numbers after having determined the context (perhaps with the help of human interactive intervention or an expert system), as in the following French example:
CHAPITRE DIX might mean CHAPTER 010 or CHAPTER 509 ("dix" is the French word for 10, but DIX is also the Roman numeral for 509). Resolving this with total certainty generally requires context.
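The pad/align option for numbers described in the list above can be sketched in Python as follows. The function name pad_numbers and the fixed width of 3 are assumptions for this illustration; a real prehandling phase would choose a width at least as long as the longest number in the data.

```python
import re

def pad_numbers(text, width=3):
    """Left-pad every run of ASCII digits with zeros to a fixed width."""
    return re.sub(r"[0-9]+", lambda m: m.group(0).zfill(width), text)

print(sorted(["100", "15"]))                    # ['100', '15']  (plain textual order)
print(sorted(["100", "15"], key=pad_numbers))   # ['15', '100']  (numeric order after padding)
```

After padding, an ordinary left-to-right comparison of the strings agrees with numerical order, so the one-phase default ordering mechanism needs no knowledge of numbers.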
Posthandling is essentially for modifying the resulting keys, or for appending the original string to the keys, so that comparisons can still distinguish strings in cases of homography, particularly when a prehandling phase has been applied. For example, equivalences could arise if numerical values (for example, "010" and "10") have been standardized in the prehandling phase. The default ordering mechanism has no knowledge that the original strings are different in such cases, but the predictability requirement still exists.
In particular, where different coding methods have been used in the original strings to be ordered in the same process, the posthandling phase can determine internal differences which would appear exactly the same on paper for end-users (for example, an ISO 2022 input stream intermixing ISO/IEC 6937 and ISO/IEC 8859).
The Default-Tailorable Ordering Mechanism does not cover the prehandling and posthandling phases. However, the mechanism does describe these phases. The presence of the phases is mandatory even if empty processes must be defined. These empty processes can be replaced if the need occurs.
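The role of posthandling in the "010" versus "10" case can be sketched as follows. All three function names are hypothetical, and ordering_key is only a stand-in for the comparison engine: it returns the prehandled text itself rather than a real multilevel key.

```python
def prehandle(text):
    """Assumed prehandling: normalise a purely numeric field (drops leading zeros)."""
    if text.isdecimal():
        return str(int(text))
    return text

def ordering_key(text):
    """Stand-in for the key produced by the comparison engine."""
    return prehandle(text)

def posthandle(key, original):
    """Append the original string so that distinct inputs never compare equal."""
    return (key, original)

def full_key(text):
    return posthandle(ordering_key(text), text)

print(ordering_key("010") == ordering_key("10"))   # True: a tie after prehandling
print(full_key("010") < full_key("10"))            # True: the tie is broken predictably
```

The appended original is consulted only when the keys themselves are equal, so it does not disturb the order established by the mechanism.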
CAN/CSA Z243.4.1 Canadian ordering standard
CAN/CSA Z243.230 Canadian minimum software localization parameters
IBM NLTC Volume 2 reference manual
IBM Egypt and Egypt Standards
Stefan Fuchs and Israel Standards
CEN TC304 Multilingual sorting standard project
LOCALES provisionally registered in X/Open or in SC22/WG15 (DKUUG.DK Internet site)
Règles du classement alphabétique en langue française et procédure informatisée pour le tri, Alain LaBonté, Ministère des Communications du Québec, 1988 -- ISBN 2-550-19046-7
Technique de réduction - Tris informatiques à quatre clés, Alain LaBonté, Ministère des Communications du Québec, 1989 -- ISBN 2-550-19965-0
Fonctions de systèmes - Soutien des langues nationales, Alain LaBonté, Ministère des Communications du Québec, 1988
National Language Architecture - Klaus Daube, SHARE EUROPE White Paper, 1990
The principles of numeric table assignments are the following:
a) All characters are assigned a value corresponding to the identification of the script. Each script header is given a name mainly for the purposes of tailoring. However, conceptually, a number corresponding to the identification of the script can be assigned to this name, which then serves as a variable. This script identification data is informative only and does not serve in the comparison process. However, the identification data may be necessary for determining the scanning direction of diacritics for that script. This data must sometimes be retained alongside the ordering strings to meet the reversibility requirements above (capacity to reconstitute the original strings given the different subkeys that are a result of the multilevel transformation).
b) Each letter is assigned a basic normalised letter value (or a pair or a triad for ligatures). The assignment is made as first level (ideographic characters are assigned their standardised CJK order, corresponding to the order they have in ISO/IEC 10646). The assignment is in the order of the alphabet to which they belong - for example, LATIN CAPITAL LETTER E WITH CIRCUMFLEX ACCENT is assigned a numerical value corresponding to the same value attributed to LATIN SMALL LETTER E. Such a definition is valid for most Latin-script-based languages. Vietnamese would require a different definition, E CIRCUMFLEX being a base letter in this language.
c) Each letter is assigned an n-plet of values (or 2 n-plets or 3 n-plets for ligatures) as 2nd level, which corresponds to the maximum realistic number of combining characters encountered in all world scripts for a given basic letter to which it applies. When there is only one diacritic, the second and third elements of the triplet are place holders. When there is no diacritic, three place holders are provided in each triplet, and so on. For each diacritic of a triplet, a flag is put in the next-to-last level to indicate an integrated diacritic (as opposed to a combining character). Note that for level 1 conformance to ISO 10646 (or if composite sequences are all predefined by "collating-symbol" statements), the n-plet of values for each character can be made equal to a single token because no analysis of combining diacritics will ever be necessary (and the next-to-last level, reserved for future use, will be empty).
Ideographs are assigned no value for this level according to ISO/IEC 10646 level 1 of conformance. This is because ideographs will be compared against completely different values simultaneously at the first level, and thus there will be no collision in comparison operations at this level. (Ideographs are not assigned equivalencies at the first level). Levels 2 or 3 of conformance could be processed with the same model as the one for letters, for theoretical combinations.
d) Each letter is assigned a value (or a pair or a triad for ligatures) as 3rd level, corresponding to the form of the letter (for example, upper or lower case for Latin, or free-standing, initial, medial, or ending form for Arabic). Ideographs are assigned no value for this level.
e) This paragraph was removed from the previous version.
f) Each special character (a character not specifically belonging to a specific script, such as COPYRIGHT SIGN, or COMMA) is assigned a value as 4th level value. This is a world-wide common numerical value that is preceded by the position it occupies in the original string to be processed. Currently, no other level value is assigned in the default table.
g) This paragraph was removed from the previous draft.
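Principle f) can be illustrated with a small Python sketch in which each special character contributes a (position, value) pair to the 4th-level subkey. The weights in SPECIAL_WEIGHTS are invented for this example and are not taken from the default table; the function name level4_subkey is likewise an assumption.

```python
# Hypothetical weights for a few special characters (illustrative values only).
SPECIAL_WEIGHTS = {" ": 1, "-": 2, ",": 3, "'": 4}

def level4_subkey(text):
    """Return (position, weight) pairs for the special characters in `text`."""
    return [(position, SPECIAL_WEIGHTS[char])
            for position, char in enumerate(text)
            if char in SPECIAL_WEIGHTS]

print(level4_subkey("co-op"))   # [(2, 2)]
print(level4_subkey("coop"))    # []
```

Because the position comes first in each pair, two strings that differ only in where a special character occurs still receive different subkeys at this level.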
Given such table assignments, a table of scanning directions will be provided for each script and for each of the levels. Note that scanning direction is not linked to the natural script direction, since the characters are already linearly coded according to their script direction (logical direction). This is linked to the direction in which each level is processed for ordering. For example, in French, diacritical marks are scanned backward in case of first level homography: accents are not considered for ordering in French except for specifying the order of quasi-homographs. In this case, the last difference in the words determines the order, thus explaining the retrograde scanning (an example of an ordered list is: "cote", "côte", "coté", "côté"). When string direction is retrograde for a character in a given level, the value assigned to this level is placed in front of the resulting key instead of at the end for this level.
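The effect of backward scanning on the French example can be sketched in Python. The sketch assumes a two-level key only (base letters, then accents) and uses Unicode canonical decomposition as a convenient stand-in for the table lookup; the function names are hypothetical.

```python
import unicodedata

def split_levels(word):
    """Split a word into base letters (level 1) and accent marks (level 2)."""
    bases, accents = [], []
    for char in unicodedata.normalize("NFC", word):
        decomposed = unicodedata.normalize("NFD", char)
        bases.append(decomposed[0])
        accents.append(decomposed[1:])   # "" when the letter carries no accent
    return bases, accents

def forward_key(word):
    bases, accents = split_levels(word)
    return (bases, accents)

def french_key(word):
    bases, accents = split_levels(word)
    return (bases, accents[::-1])        # level 2 scanned backward

words = ["côté", "coté", "côte", "cote"]
print(sorted(words, key=french_key))    # ['cote', 'côte', 'coté', 'côté']
print(sorted(words, key=forward_key))   # ['cote', 'coté', 'côte', 'côté']
```

With backward scanning the last accent difference in the word decides the order, which reproduces the list given in the text; forward scanning swaps the two middle entries.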
Given that each subkey is established at all levels, and provided that a low-value delimiter is placed between each subkey, all subkeys can be concatenated at once and used for subsequent comparisons. (If values are carefully chosen for table-building, no low-value delimiter is necessary). Given that all the information is present, the original string provided can be reconstituted from the subkeys.
Reduction techniques exist to minimise the storage requirements of this method without affecting the comparison process if keys are to be preserved for maximum performance reasons. (see References).
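The concatenation of subkeys with a low-value delimiter can be sketched as follows. The sketch is restricted to the letters a to z with optional accents, scans every level forward, and uses invented one-byte weights; none of these choices comes from the default table, and the name sort_key is hypothetical.

```python
import string
import unicodedata

DELIMITER = b"\x00"                    # lower than every weight used below
LETTERS = set(string.ascii_lowercase)

def sort_key(word):
    """Concatenate three illustrative subkeys: base letter, accent, case."""
    level1, level2, level3 = bytearray(), bytearray(), bytearray()
    for char in unicodedata.normalize("NFC", word):
        base, *marks = unicodedata.normalize("NFD", char)
        if base.lower() not in LETTERS:
            raise ValueError(f"unsupported character: {char!r}")
        level1.append(ord(base.lower()) - ord("a") + 1)   # 1..26
        level2.append(2 if marks else 1)                  # accented or not
        level3.append(2 if base.isupper() else 1)         # capital or small
    return DELIMITER.join(bytes(level) for level in (level1, level2, level3))

print(sorted(["Cote", "côte", "cote", "coté"], key=sort_key))
# ['cote', 'Cote', 'coté', 'côte']
```

Because the delimiter is lower than any weight, a string that is a prefix of another ("cot" and "cote") sorts first under a plain byte comparison, and the stored keys can be compared in one pass without redoing the conversion.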
The basic philosophy behind the culturally-correct character string comparison engine is the following:
1. No comparison mechanism is culturally correct when it assumes that the order is based on the numerical internal values of raw character strings, whatever standard character set coding scheme is used.
2. If two strings are different, there must be a fully predictable order between them.
3. Ordering rules are language-related in a given script.
4. Whatever the language, the ordering rules are based on lexical order at the lowest level. Higher level transformation (done in a prehandling phase) produces character strings whose ordering is to be made as for any other lexical entry.
5. Each rule tentatively determines an order between two different character strings by operating a single binary comparison on binary strings that represent the result of a straightforward and context-independent transformation of the characters of each string. (Transformations typically involve ignoring, or giving a specific or generic weight to each character, or retaining the position of a character as a weight while assigning it a second weight depending on the character itself. Such transformations may be done by scanning the string forward or backward in the logical string sequence, except for the positional case which only implies the logical positions of a string).
6. Transformations can typically produce equivalencies, where two different character strings are transformed into two identical binary strings. Thus, when such cases are encountered, further series of transformations are necessary until, at a final level, all ties are resolved (at the last level, binary strings are necessarily different if two original character strings to be compared are different). If the only goal of a comparison is to determine equivalence up to a certain level of precision, then character transformation is required up to that level only.
7. The default table will define as many levels as necessary to produce a fully predictable order for two different character strings. This involves up to five comparison levels if characters of ISO/IEC 10646 level 1 are used, and up to six comparison levels if characters of ISO/IEC 10646 level 3 are used. An extra level (used for data management and not of particular significance for comparisons) is also defined (see 9 below).
8. A whole character string is transformed as many times as necessary into up to six different levels. Thus, it must be possible to deduce the original character string from all the different binary transformations concatenated into one binary string (reversibility property of the transformation process).
9. Different scripts may have different properties as to the way each level is processed. Thus, to ensure the operation will be reversed, an extra level transformation table is necessary to identify the script to which each character belongs.
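Point 6 above, comparison up to a chosen level of precision, can be sketched in Python. The subkeys shown for "resume" and "résumé" are written by hand for the illustration (base letters, accent flags, case flags), and the function name compare and its return shape are assumptions.

```python
def compare(keys_a, keys_b, max_level=None):
    """Compare two sequences of subkeys level by level.

    Returns (result, level): result is -1, 0 or 1; level is the first
    1-based level at which the keys differ, or None when they are
    equal up to max_level.
    """
    for level, (a, b) in enumerate(zip(keys_a, keys_b), start=1):
        if max_level is not None and level > max_level:
            break
        if a != b:
            return (-1 if a < b else 1), level
    return 0, None

resume_plain = ("resume", (0, 0, 0, 0, 0, 0), "llllll")
resume_acute = ("resume", (0, 1, 0, 0, 0, 1), "llllll")

print(compare(resume_plain, resume_acute))               # (-1, 2)
print(compare(resume_plain, resume_acute, max_level=1))  # (0, None)
```

Limiting the comparison to the first level treats the two strings as equivalent, while the full comparison still yields a predictable order between them.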
Removed from the previous version
Text will be added if necessary
AFNOR Z.44-001
ANSI/NISO Z39.75-199X (project at time of editing WD3)
DIN 5007
Hebrew characters not yet published in ISO/IEC 10646-1
<U0591> IGNORE;IGNORE;IGNORE;<U0591> #accent_etnahta
<U0592> IGNORE;IGNORE;IGNORE;<U0592> #accent_segol
<U0593> IGNORE;IGNORE;IGNORE;<U0593> #accent_shalshelet
<U0594> IGNORE;IGNORE;IGNORE;<U0594> #accent_zaqef_qatan
<U0595> IGNORE;IGNORE;IGNORE;<U0595> #accent_zaqef_gadol
<U0596> IGNORE;IGNORE;IGNORE;<U0596> #accent_tipeha
<U0597> IGNORE;IGNORE;IGNORE;<U0597> #accent_revia
<U0598> IGNORE;IGNORE;IGNORE;<U0598> #accent_zarqa
<U0599> IGNORE;IGNORE;IGNORE;<U0599> #accent_pashta
<U059A> IGNORE;IGNORE;IGNORE;<U059A> #accent_yetiv
<U059B> IGNORE;IGNORE;IGNORE;<U059B> #accent_tevir
<U059C> IGNORE;IGNORE;IGNORE;<U059C> #accent_geresh
<U059D> IGNORE;IGNORE;IGNORE;<U059D> #accent_geresh_muqdam
<U059E> IGNORE;IGNORE;IGNORE;<U059E> #accent_gershayim
<U059F> IGNORE;IGNORE;IGNORE;<U059F> #accent_qarney_para
<U05A0> IGNORE;IGNORE;IGNORE;<U05A0> #accent_telisha_gedola
<U05A1> IGNORE;IGNORE;IGNORE;<U05A1> #accent_pazer
<U05A3> IGNORE;IGNORE;IGNORE;<U05A3> #accent_munah
<U05A4> IGNORE;IGNORE;IGNORE;<U05A4> #accent_mahapakh
<U05A5> IGNORE;IGNORE;IGNORE;<U05A5> #accent_merkha
<U05A6> IGNORE;IGNORE;IGNORE;<U05A6> #accent_merkha_kefula
<U05A7> IGNORE;IGNORE;IGNORE;<U05A7> #accent_darga
<U05A8> IGNORE;IGNORE;IGNORE;<U05A8> #accent_qadma
<U05A9> IGNORE;IGNORE;IGNORE;<U05A9> #accent_telisha_qetana
<U05AA> IGNORE;IGNORE;IGNORE;<U05AA> #accent_yerah_ben_yomo
<U05AB> IGNORE;IGNORE;IGNORE;<U05AB> #accent_ole
<U05AC> IGNORE;IGNORE;IGNORE;<U05AC> #accent_iluy
<U05AD> IGNORE;IGNORE;IGNORE;<U05AD> #accent_dehi
<U05AE> IGNORE;IGNORE;IGNORE;<U05AE> #accent_zinor
<U05AF> IGNORE;IGNORE;IGNORE;<U05AF> #mark_masora_circle
<U05C4> IGNORE;IGNORE;IGNORE;<U05C4> #mark_upper_dot
#
# Greek characters not defined in the published UCS
# Archaic letters
#
<U03D7> <g201>;<BAS>;<AMI>;IGNORE
<U03D9> <g201>;<BAS>;<AMI>;IGNORE
# Coptic letters
<U03DB> <g202>;<BAS>;<AMI>;IGNORE
<U03DD> <g202>;<BAS>;<AMI>;IGNORE
<U03DF> <g202>;<BAS>;<AMI>;IGNORE
<U03E1> <g202>;<BAS>;<AMI>;IGNORE
#
<U03F4> <g204>;<DTT>;<AMI>;IGNORE
<U03F5> <g204>;<BAS>;<AMI>;IGNORE