[Latingp] Homoglyphs within Latin script

Bill Jouris bill.jouris at insidethestack.com
Fri Jan 5 15:45:10 UTC 2018


And yet, we sat there in Abu Dhabi and saw the Greek GP give a presentation in which they listed, as variants, code points that were the letter eta with a variety of diacritic marks. And the two members of the Integration Panel who were present made no complaint whatsoever about it.

So maybe, just maybe, the real criteria are not anywhere near as narrow as that quote would suggest.  
  Bill Jouris
Inside Products
bill.jouris at insidethestack.com
831-659-8360
925-855-9512 (direct)

      From: "Tan Tanaka, Dennis via Latingp" <latingp at icann.org>
 To: Michael Bauland <Michael.Bauland at knipp.de>; Mats Dufberg <mats.dufberg at iis.se> 
Cc: "latingp at icann.org" <latingp at icann.org>
 Sent: Friday, January 5, 2018 7:32 AM
 Subject: Re: [Latingp] Homoglyphs within Latin script
   
Michael, we received this feedback through email (attached). Look for this paragraph: 

“In the context of the Root Zone, the Procedure is quite clear in that it considers simple similarity of appearance to be outside the scope of the Root Zone LGR. In admitting exact homoglyphs, the IP has been making the argument that ‘e’ in Latin (U+0065) and ‘е’ in Cyrillic(U+0435) are not just visually indistinguishable, but that their distinct code points effectively represent a disunification by script property. (A disunification not unlike that of 01DD and 0259, which are disunified based on case, or the two sets of Arabic digits disunified largely on directional properties).”

-Dennis
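
The two distinctions drawn in the quoted paragraph (disunification by case for 01DD and 0259, and by script for U+0065 and U+0435) can be checked against the Unicode character database. The following is only a minimal sketch, not part of the IP feedback. The helper names `uppercase_code_point` and `name_prefix` are hypothetical, and because Python's standard library does not expose the Unicode Script property, the first word of the character name is used here as a rough stand-in for the script.

```python
import unicodedata


def uppercase_code_point(cp: int) -> int:
    """Return the code point of the uppercase form of a single code point."""
    upper = chr(cp).upper()
    if len(upper) != 1:
        raise ValueError(f"U+{cp:04X} does not uppercase to a single code point")
    return ord(upper)


def name_prefix(cp: int) -> str:
    """First word of the Unicode character name.

    Assumption: used only as a crude proxy for the Script property,
    which the standard library does not provide.
    """
    return unicodedata.name(chr(cp)).split()[0]


if __name__ == "__main__":
    # Lower-case look-alikes with different upper-case partners.
    for cp in (0x01DD, 0x0259):
        up = uppercase_code_point(cp)
        print(f"U+{cp:04X} {unicodedata.name(chr(cp))} -> "
              f"U+{up:04X} {unicodedata.name(chr(up))}")

    # Visually indistinguishable letters that differ by script.
    for cp in (0x0065, 0x0435):
        print(f"U+{cp:04X} {unicodedata.name(chr(cp))} ({name_prefix(cp)})")
```

The two lower-case letters map to different capitals (U+018E and U+018F), which is the case-based disunification the paragraph refers to, while Latin e and Cyrillic е are told apart only by the script they belong to.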

On 1/5/18, 10:29 AM, "Michael Bauland" <Michael.Bauland at knipp.de> wrote:

    Hi Dennis,
    
    On 05.01.2018 16:13, Tan Tanaka, Dennis wrote:
    > Hi Michael,
    > 
    > They are not the same character. They look alike in lower case, but are different in upper case (i.e. disunification by case property). The IP briefly discussed this case of 01DD and 0259 in their feedback to our Principles document and suggested that these two should not be variants. Hence my question about more evidence.
    
    sorry, I must have overlooked this. Which feedback are you talking
    about? Not the one from 2017-03-22 "GP Proposal Latin
    Script_Feedback_IP_V2F.docx", right? Is the document in the Latin GP
    drop box account?
    
    Michael
    
    
    -- 
    ____________________________________________________________________
        |      |
        | knipp |            Knipp  Medien und Kommunikation GmbH
          -------                    Technologiepark
                                    Martin-Schmeisser-Weg 9
                                    44227 Dortmund
                                    Germany
    
        Dipl.-Informatiker          Fon:    +49 231 9703-0
                                    Fax:    +49 231 9703-200
        Dr. Michael Bauland        SIP:    Michael.Bauland at knipp.de
        Software Development        E-mail: Michael.Bauland at knipp.de
    
                                    Register Court:
                                    Amtsgericht Dortmund, HRB 13728
    
                                    Chief Executive Officers:
                                    Dietmar Knipp, Elmar Knipp
    

Dear All,

Please find input from the Integration Panel in response to the call for comments on the principles documents.

Regards,
Sarmad

The Integration Panel (IP) has reviewed "Principles for Inclusion and Exclusion of Code Points in Latin Script for the Root Zone (Latin LGR)" and has the following comments:

The IP congratulates the Latin GP on the formulation of its "Principles for Inclusion and Exclusion of Code Points in Latin Script for the Root Zone (Latin LGR)". They appear to cover the important considerations and will likely serve the GP well in arriving at a list of proposed candidate code points. The IP would like to caution that the final decision on whether to include or exclude a code point may not be possible by rote application of these (or any other set of) principles, and that additional factors may have to be considered in individual cases. The IP is looking forward to the next stage of the Latin GP's work and to reviewing actual examples of draft code points.

Additional notes:

The IP would like to note that all entries in an LGR need to be in Unicode Normalization Form C (see RFC 7940) and further that IDNA requires NFC, even if it doesn't agree with the native typing order, or conventions regarding precomposed, decomposed or mixed composed usage. RFC 5890 states: "A "U-label" is an IDNA-valid string of Unicode characters, in Normalization Form C (NFC)". Because entries are normalized, dual encoding cannot exist.

In creating the repertoire each combining sequence needs to be individually justified and should be separately enumerated; combining marks should not be individually members of the repertoire.

In applying these principles, attention must be paid to the foundational documents for this work as summarized in the "Guidelines for Developing Script-Specific Label Generation Rules for Integration into the Root Zone LGR".

Further, the exclusion principles should mention explicitly that the LGR repertoire is constrained by MSR: « A code point not in the latest version of the MSR is excluded. If there is a clear need to add one, the GP will contact the Integration Panel to assess the possibility of adding one to the MSR ».

The IP has reviewed "Analysis of Variants in the Latin Script for the Root Zone" and has the following comments:

The actual guiding principle (contained in the second paragraph of the document) appears to cover the important considerations and will likely serve the GP well in arriving at a list of proposed candidate variants. The IP would like to caution that the final decision on whether to include or exclude a variant may not be possible by rote application of this (or any other) principle, and that additional factors may have to be considered in individual cases. The IP is looking forward to the next stage of the Latin GP's work and to reviewing actual examples of draft variants.

Additional notes:

The IP has some concerns about the remainder of the document. The procedure sets a very narrow limit on the kinds of cases that can be considered variants for the Root Zone; this is the basis of the statement by the IP that is quoted in a footnote. It might perhaps be better if this statement were incorporated into the definition of "scope". In that section, the opening remark about script mixing seems unconnected to the discussion that follows. A straight listing of which related scripts the GP will consider would be more useful.

The IP would like to point out that the example given in the document of Latin è (U+00E8) and Cyrillic ѐ (U+0450) may be moot because the final Cyrillic repertoire does not contain U+0450. In general, it is expected that the analysis of cross-script repertoires remain limited to code points that are in the respective scripts' LGRs or draft LGRs.

The general discussion of "classes of variants" may be "of interest to the reader", but it isn't helpful in understanding which principles the Latin GP will follow in deciding whether something is a variant or not -- most of the items discussed are not applicable in the context of the Root Zone LGR.

In the context of the Root Zone, the Procedure is quite clear in that it considers simple similarity of appearance to be outside the scope of the Root Zone LGR. In admitting exact homoglyphs, the IP has been making the argument that ‘e’ in Latin (U+0065) and ‘е’ in Cyrillic(U+0435) are not just visually indistinguishable, but that their distinct code points effectively represent a disunification by script property. (A disunification not unlike that of 01DD and 0259, which are disunified based on case, or the two sets of Arabic digits disunified largely on directional properties).

In the context of other script LGRs for the Root Zone, the IP has argued strongly against embodying rules intended to deal with spelling issues. Therefore, any orthographic variation (spelling differences) would require a very compelling case being made; the examples given may not rise to that level. For instance, ‘ss’ (U+0073 U+0073) and ‘ß’ (U+00DF) are separately available on the second level, in the .de ccTLD (and presumably others). This would strongly argue against the claim that German usage would require them to be variants - in fact the opposite might be concluded. Consideration of established practice in existing Latin-based IDNs ought to be an important principle.

The procedure makes reference to the "Least Astonishment Principle". This principle argues against solutions that produce unexpected or surprising behavior. Having the Root Zone exhibit fundamentally different design decisions with respect to variants than those found on the second level would have to be justified by strong arguments based on factors special to the Root Zone. Absent such factors, the expectation would be that the various levels are more or less compatible in their treatment of IDN labels for a given script.

Finally, the claimed normalization exceptions appear based on a misunderstanding of the normalization algorithm. In normalizing to precomposed form (Normalization Form C), the first step is to fully decompose the input string and then to reorder all combining marks in a canonical order. Because of that, the two examples of e with grave and dot below would become identical at that stage of normalization. In the final stage of the algorithm, as much of the sequence as possible is composed. But because both inputs have the same fully decomposed and reordered form, their final NFC form is identical. Or, put differently, only one of the two forms is in NFC, the other is unnormalized and as such not admissible in the LGR.
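
The normalization argument above can be reproduced with Python's standard `unicodedata` module. This is a minimal sketch under one stated assumption: the feedback does not spell out the two code point sequences the GP document used, so the inputs below (è followed by a combining dot below, and ẹ followed by a combining grave) are illustrative guesses at "e with grave and dot below", and the names `GRAVE_FIRST`, `DOT_FIRST`, `nfd`, `nfc` and `is_nfc` are hypothetical.

```python
import unicodedata

# Assumed inputs: two ways of writing "e with grave and dot below".
GRAVE_FIRST = "\u00e8\u0323"  # è (U+00E8) + COMBINING DOT BELOW (U+0323)
DOT_FIRST = "\u1eb9\u0300"    # ẹ (U+1EB9) + COMBINING GRAVE ACCENT (U+0300)


def nfd(text: str) -> str:
    """Fully decompose and canonically reorder combining marks."""
    return unicodedata.normalize("NFD", text)


def nfc(text: str) -> str:
    """Decompose, reorder, then compose as much as possible."""
    return unicodedata.normalize("NFC", text)


def is_nfc(text: str) -> bool:
    """True if the string is already in Normalization Form C."""
    return nfc(text) == text


def code_points(text: str) -> str:
    """Render a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


if __name__ == "__main__":
    for label, text in (("grave first", GRAVE_FIRST), ("dot first", DOT_FIRST)):
        print(f"{label}: input {code_points(text)}")
        print(f"  NFD: {code_points(nfd(text))}")
        print(f"  NFC: {code_points(nfc(text))}")
        print(f"  already NFC: {is_nfc(text)}")
```

Both inputs decompose to the same sequence (e, dot below, grave) because the dot below has the lower canonical combining class and is ordered first, so both end up with the same NFC form, U+1EB9 U+0300. Only the second input is already in NFC; the first is the unnormalized form that would not be admissible in the LGR.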


   