[CPWG] Variants and Process

Bill Jouris b_jouris at yahoo.com
Thu Oct 21 17:47:59 UTC 2021


Dear Olivier, 
That is the problem I see as well.  My sense is that both the Latin GP and the Integration Panel (which is the next level higher) desire primarily to minimize the number of variants.  Two codepoints which are identical, such as the Latin schwa and the Latin turned E, obviously cannot be distinguished by anyone, and so are necessarily variants.  (Although one of my fellow Panel members argued against variant status even for that specific case.)  But how strict the constraints were on making two codepoints variants reflects that desire for minimization.  Also, in at least one case, the Integration Panel requested that the Latin GP review (and modify) some variant findings because one set of codepoints which were variants of each other was "too large."  ("Too large" wasn't defined.  Nor was there any indication of why one would care.  Certainly it wouldn't impact the performance of the software doing the automatic filtering of proposed TLDs.) 
Given that a) the Panel members are experts, b) we were doing side-by-side comparisons, and c) we knew that we were looking at two different codepoints, it seemed to me that if any of us couldn't tell the difference, then neither could the average user looking at a domain name in isolation.  Setting a higher threshold seems to me like phishing, and especially pharming, enablement. 
It also might appear that having a group of codepoints which are not variants, but which users cannot really distinguish, provides a marketing opportunity.  Not to sell to bad actors, who are typically one-off buyers and so not worth pursuing, but to sell defensive registrations to legitimate registrants, who merely want to make sure that their customers find them.  Such defensive registrations would be likely to be renewed indefinitely, making them worthwhile even in a low-margin business.** 
Bill 
** 5 of the 7 members of the Latin Panel were employees of one or another of the contracted parties.  I believe most of them were sincerely making a good faith effort to do the right thing.  But their experience there may nevertheless have colored their perceptions. 

  On Thu, Oct 21, 2021 at 2:13 AM, Olivier MJ Crépin-Leblond <ocl at gih.com> wrote:
 Dear Bill,
 
 Thank you for explaining this in further detail. The problem I see with the process here is that *experts* have been used to notice a difference. Because they are experts, they might be able to see differences which the average Internet end user will not. And this is the concern I have: is the panel of experts being conservative enough in making their decisions? If there is any suspicion about two characters being variants, would a conservative approach treat them as variants?
 What is the end goal of identifying variants? If it is to avoid the use of IDNs for phishing, then the only approach possible should be a conservative approach.
 Kindest regards,
 
 Olivier
 
 On 21/10/2021 05:17, Bill Jouris via CPWG wrote:
  
 
 After some of the discussion in the chat in this morning's meeting, I feel like a little more extended discussion about variants might be helpful.  
  The repertoire for the Latin script consists of "codepoints" -- some are letters and some are letters plus diacritics.  "Variants" are pairs of codepoints which are indistinguishable.  That is, in the process that the Panel used, 5 of the 7 experts on the panel couldn't see a difference.  The Latin GP did not look at diacritics per se.  Just at codepoints which might involve diacritics.  
  Thus, a codepoint consisting of a letter with a caron diacritic ( ̌ ) and a codepoint with the same letter combined with a breve diacritic (  ̆  ) may always result in a variant pair, but only because the Panel's comparison worked out that way.  For example, a G with caron (ǧ) and a G with breve (ğ) are variants.   On the other hand, a caron and a macron ( ¯ ) never result in a variant pair.   
  However, some cases with diacritics are mixed.  For example, a codepoint consisting of a letter with a dot above ( ˙ ) and a codepoint consisting of the same letter with an acute accent result in a variant pair for the letters C (ċ vs ć), N (ṅ vs ń), and Z (ż vs ź).  But, in the Panel's original finding, not for the letters E (ė vs é) and I (i vs í).   
  (Note that a majority of the Panel found the vowels to produce variants as well.  Just not a supermajority, as required by the process the Panel had adopted.  As a result, the Panel's official position is that, in various cases not just this one, even though a majority of the experts, looking side by side, could not see a difference, the average "reasonably careful user" will somehow magically notice the difference when looking at a domain name.)  
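The gap between a simple majority and the Panel's threshold can be sketched in a few lines of Python. This is only an illustration: the 5-of-7 threshold comes from the description above, while the function name and the labels it returns are hypothetical and not part of the Panel's actual tooling.

```python
PANEL_SIZE = 7          # number of experts on the panel (from the text)
VARIANT_THRESHOLD = 5   # experts who must fail to see a difference (from the text)


def classify(cannot_distinguish: int) -> str:
    """Label a pair of codepoints by how many experts could not tell them apart.

    The labels are illustrative names, not official terminology.
    """
    if not 0 <= cannot_distinguish <= PANEL_SIZE:
        raise ValueError("count must be between 0 and the panel size")
    if cannot_distinguish >= VARIANT_THRESHOLD:
        return "variant"
    if cannot_distinguish > PANEL_SIZE / 2:
        # A majority could not see a difference, but not the supermajority.
        return "majority, but not variant"
    return "not variant"


if __name__ == "__main__":
    for count in range(PANEL_SIZE + 1):
        print(count, classify(count))
```

With these numbers, exactly one outcome (4 of 7) falls into the middle case, which is the situation described for the vowels.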
  Then we have cross-script variants, including those identified by other Panels.  For example, the Greek Panel found that the Greek letter Iota was a variant of both the Latin letter I and the Latin letter I with acute.  As a result, I and I with acute became variants. 
  But there is no Greek letter which is a variant of the Latin letter E.  So we are left with a situation where the dot above diacritic and the acute produce variants for all letters EXCEPT for the letter E.  (When I suggested that, for consistency, we should make the letter E case a variant as well, the response was "It is more important that we follow our process than that we have consistency.")   
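The way the Iota finding pulls I and I with acute together can be illustrated with a small Python sketch. It assumes that variant relations are treated as transitive, as the Iota example suggests; the function names and the tiny pair list are made up for illustration and are not the real variant tables.

```python
def build_variant_sets(pairs):
    """Group codepoints into variant sets, treating 'is a variant of' as transitive."""
    parent = {}

    def find(cp):
        # Follow parent links to the representative of cp's set.
        parent.setdefault(cp, cp)
        while parent[cp] != cp:
            parent[cp] = parent[parent[cp]]  # path compression
            cp = parent[cp]
        return cp

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for cp in list(parent):
        groups.setdefault(find(cp), set()).add(cp)
    return list(groups.values())


def are_variants(a, b, variant_sets):
    """True if both codepoints fall in the same variant set."""
    return any(a in group and b in group for group in variant_sets)


# Illustrative input: only the two Greek Panel findings mentioned above.
PAIRS = [
    ("\u03b9", "i"),       # Greek iota ~ Latin i
    ("\u03b9", "\u00ed"),  # Greek iota ~ Latin i with acute
]

if __name__ == "__main__":
    sets = build_variant_sets(PAIRS)
    print(are_variants("i", "\u00ed", sets))       # i vs i with acute
    print(are_variants("\u0117", "\u00e9", sets))  # e with dot above vs e with acute
```

Because no third codepoint links E with dot above to E with acute, nothing in this sketch ever places them in the same set, which mirrors the inconsistency described above.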
  TLDs consist of a series of codepoints.  Proposed TLDs which differ only by one or more variants from another TLD will automatically be rejected by the software.  For example, .çom  would be allowed, despite its similarity to .com, because C with Cedilla is not a variant of C.  Also .сом (using Cyrillic letters) would be allowed because, while C and the Cyrillic letter Es are variants, and O and the Cyrillic letter O are variants, the letter M and the Cyrillic letter Em are not variants (the Panel was directed to ignore Upper Case when deciding what might confuse users).  But .cóm could be rejected, because O and O with acute are variants. 
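A rough Python sketch of that kind of automatic check, using only the variant relations named in the three examples above. The mapping table, the function names, and the approach of comparing canonical forms are assumptions made for illustration; the actual filtering software may work quite differently.

```python
import unicodedata

# Hypothetical table mapping each codepoint to a representative of its variant set.
# Only the relations mentioned in the examples are included.
CANONICAL = {
    "\u0441": "c",  # Cyrillic es is a variant of Latin c
    "\u043e": "o",  # Cyrillic o is a variant of Latin o
    "\u00f3": "o",  # o with acute is a variant of o
    # c with cedilla (U+00E7) and Cyrillic em (U+043C) are deliberately absent:
    # they are not variants of c and m respectively.
}


def canonical_form(label: str) -> str:
    """Replace every codepoint by the representative of its variant set."""
    label = unicodedata.normalize("NFC", label.lstrip(".").lower())
    return "".join(CANONICAL.get(ch, ch) for ch in label)


def conflicts(proposed: str, existing: str) -> bool:
    """True if the two labels are identical up to variant codepoints."""
    return canonical_form(proposed) == canonical_form(existing)


if __name__ == "__main__":
    for proposed in [".\u00e7om", ".\u0441\u043e\u043c", ".c\u00f3m"]:
        verdict = "rejected" if conflicts(proposed, ".com") else "allowed"
        print(proposed, verdict)
```

Under these assumptions only .cóm conflicts with .com; the other two pass the automatic check because a single non-variant codepoint is enough to make the labels distinct.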
  "Confusables" are pairs of codepoints which some of the experts could not distinguish, just not enough of them for the pair to be designated as variants.  Confusables are intended as suggestions for the panel which will manually review the proposed TLDs.  
  
  I hope this all will help everyone understand what we are looking at here.  
  Regards, Bill Jouris  
 -- 
Olivier MJ Crépin-Leblond, PhD
http://www.gih.com/ocl.html   