[CPWG] Variants and Process

Sun Oct 24 00:48:27 UTC 2021

Dear Mr. Bill Jouris,

Many thanks for provoking more thoughts.

I will catch up on the ICANN Community Wiki.

I am sorry, but tracking the mailing list takes more time than
tracking the ICANN Community Wiki.

I hope ICANN has a method for summarizing threads discussed on the
mailing list and posting the summaries to the ICANN Community Wiki.

Sincerely,

Gopal T V
0 9840121302
https://vidwan.inflibnet.ac.in/profile/57545
https://www.facebook.com/gopal.tadepalli

PS: The CPWG mailing list ought to be tagged automatically to the CPWG
Space on the ICANN Community Wiki. In other words, I should be seeing
all the posts from within the Wiki too.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Dr. T V Gopal
Professor
Department of Computer Science and Engineering
College of Engineering
Anna University
Chennai - 600 025, INDIA
Ph : (Off) 22351723 Extn. 3340
       (Res) 24454753
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

On 2021-10-23 23:46, Bill Jouris wrote:
> Dear Dr. Gopal,
> 
> There may be such software out there.  I can only say that, if there
> is, I am not familiar with it.  I rather suspect that, if there is,
> one of the first parameters required is "How similar, or how
> different, do you want?  Set a threshold."  Followed by "Do you
> require consistency?  That is, if two diacritics produce variants in
> some cases, must they do so in all cases?"  Which, as you say, leaves
> the most critical question still to be answered.
> 
> In the end, I think we are stuck with some variation of a "consensus
> of experts" judgement.  The more cogent question is, What kinds of
> experts?  That is, linguists?  Or experts in human perception
> (specifically visual perception)?  Or experts in the behavior of end
> users?  The IDN project has opted, essentially, for linguists --
> whether by default or actual preference I do not know.
> 
> Regards,
> 
> Bill Jouris
> 
>  On Friday, October 22, 2021, 07:32:40 PM PDT, <gopal at annauniv.edu>
> wrote:
> 
> Dear Bill Jouris,
> 
> Many thanks again for your presentation to the CPWG on 6 October 2021.
> 
> It has been a fantastic effort by your seven-member team from six
> different countries.
> 
> Ref Slide #12: UNICODE 00FE and 01A5
> 
> The quantification for decision making was based on a 5-point linear
> scale, with the seven experts using only the "2-4" range. Also, this
> was for three popular typefaces.
> 
> I know this is just one sample, and your question in the next slide,
> "How Much is Enough?", is vital.
> 
> Is there a tool / simulator that makes it all more generic for larger
> samples, different languages and different quantification scales such
> as the Likert scale?
> 
> We can then anticipate the code generator within an acceptable
> confidence interval.
> 
> Once again, a big thank you from me for such nice work and
> presentation.
> 
> Please advise.
> 
> Sincerely,
> 
> Gopal T V
> 
> On 2021-10-23 03:28, Bill Jouris via CPWG wrote:
>> Dear Roberto,
>> 
>> Not all that off-topic.  In general, you are correct that combinations
>> of letters got ignored.  For example, a Latin letter R followed by a
>> Latin letter N is, to my mind, hard to distinguish from a Latin letter
>> M.  If you saw .corn, would you realize it was about maize, rather
>> than being a normal .com?  But it didn't get considered in identifying
>> variants.
>> 
>> The Sharp S is the exception.  The panel concluded that the Sharp S
>> (ß) and a double S (ss) are variants.  Most variants are
>> bidirectional -- that is, it doesn't matter which one was registered
>> first, the other is blocked.  But this case is different.  If the name
>> with a double S is registered first, then the Sharp S is indeed
>> blocked.  However, if the name with Sharp S is registered first, then
>> the variant is considered "allocatable."  That is, the same name with a
>> double S rather than Sharp S _can_ be registered, provided:
>> 1) ALL of the instances of Sharp S in the name (if there is more than
>> one) are changed to double S, and
>> 2) the name is registered to the same registrant.
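The asymmetric Sharp S rule described above can be sketched in a few lines of Python. This is only an illustration of my reading of the message, not the Panel's or any registry's actual implementation. The function names, the status strings, and the choice to report "blocked" when either condition fails are all assumptions.

```python
SHARP_S = "\u00df"  # the Sharp S character


def fold_sharp_s(name):
    """Replace every Sharp S in the name with a double S."""
    return name.replace(SHARP_S, "ss")


def sharp_s_status(requested, requester, registered, owner):
    """Status of a requested name relative to ONE already-registered name.

    Hypothetical helper. Returns "unrelated", "taken", "allocatable",
    or "blocked".
    """
    # Names that differ by more than Sharp S vs. double S are not variants here.
    if fold_sharp_s(requested) != fold_sharp_s(registered):
        return "unrelated"
    if requested == registered:
        return "taken"
    # Allocatable only when the Sharp S form was registered first, ALL of its
    # Sharp S instances are changed to double S, and the registrant is the same.
    if (SHARP_S in registered
            and requested == fold_sharp_s(registered)
            and requester == owner):
        return "allocatable"
    # Everything else (double S registered first, partial replacement,
    # or a different registrant) is treated as blocked in this sketch.
    return "blocked"
```

For example, with the Sharp S form of "strasse" registered to one registrant, the double S form comes out as "allocatable" for that registrant and "blocked" for anyone else.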
>> 
>> On the other hand, the possibility of substituting a vowel with
>> diaeresis for the same vowel followed by E did not come up.  That is
>> the way I learned German (as an American) long ago.  But the native
>> German speakers on the Panel did not consider it worth worrying about.
>> 
>> 
>> Sorry if that doesn't totally clarify things.  But that's all I've got
>> on the subject.
>> 
>> Bill Jouris
>> 
>>  On Thursday, October 21, 2021, 11:47:20 PM PDT, Roberto Gaetano
>> <roberto_gaetano at hotmail.com> wrote:
>> 
>>  Dear Bill,
>> 
>> I wonder whether I am off-topic with this question, but here it is
>> anyway.
>> Has the Latin GP considered an additional potential confusion coming
>> from cases like the German equivalence between “ae” and “ä”
>> or “ss” and “ß”? Just to give an example, the Austrian
>> Touring Club (ÖAMTC) has the site oeamtc.at [2], as it is customary
>> in German-speaking countries to get around the problem in this way.
>> 
>> This is most probably out of scope, because the work is likely to be
>> limited to single characters and not combinations of characters, but
>> from the user’s point of view it could be a source of confusion
>> anyway.
>> 
>> Thanks,
>> Roberto
>> 
>>> On 21.10.2021, at 19:47, Bill Jouris via CPWG <cpwg at icann.org>
>>> wrote:
>>> 
>>> Dear Olivier,
>>> 
>>> That is the problem I see as well.  My sense is that both the Latin
>>> GP and the Integration Panel (which is the next level higher)
>>> desire primarily to minimize the number of variants.  Two codepoints
>>> which are identical, such as the Latin schwa and the Latin turned E,
>>> obviously cannot be distinguished by anyone, and so are necessarily
>>> variants.  (Although one of my fellow Panel members argued against
>>> variant status even for that specific case.)  But how strict the
>>> constraints were on making two codepoints variants reflects that
>>> desire for minimization.  Also, in at least one case, the Integration
>>> Panel requested that the Latin GP review (and modify) some variant
>>> findings because one set of codepoints which were variants of each
>>> other was "too large."  ("Too large" wasn't defined.  Nor was there
>>> any indication of why one would care.  Certainly it wouldn't impact
>>> the performance of the software doing the automatic filtering of
>>> proposed TLDs.)
>>> 
>>> Given that
>>> a) the Panel members are experts,
>>> b) we were doing side-by-side comparisons, and
>>> c) we knew that we were looking at two different codepoints
>>> it seemed to me that if any of us couldn't tell the difference, then
>>> neither could the average user looking at a domain name in
>>> isolation.  Setting a higher threshold seems to me like phishing,
>>> and especially pharming, enablement.
>>> 
>>> It also might appear that having a group of codepoints which are not
>>> variants, but which users cannot really distinguish, provides a
>>> marketing opportunity.  Not to sell to bad actors, who are typically
>>> one-off buyers and so not worth pursuing.  But to sell defensive
>>> registrations to legitimate registrants, who merely want to make
>>> sure that their customers find them.  Such defensive registrations
>>> would be likely to be renewed indefinitely, making them worthwhile
>>> even in a low margin business.**
>>> 
>>> Bill
>>> 
>>> ** 5 of the 7 members of the Latin Panel are employees of one or
>>> another of the contracted parties.  I believe most of them were
>>> sincerely making a good faith effort to do the right thing.  But
>>> their experience there may nevertheless have colored their
>>> perceptions.
>>> 
>>> Sent from Yahoo Mail on Android [1]
>>> 
>>> On Thu, Oct 21, 2021 at 2:13 AM, Olivier MJ Crépin-Leblond
>>> <ocl at gih.com> wrote:
>>> 
>>> Dear Bill,
>>> 
>>> thank you for explaining this in further detail. The problem I see
>>> with the process here, is that *experts* have been used to notice a
>>> difference. Because they are experts, they might be able to see
>>> differences which the average Internet end user will not. And this
>>> is the concern I have: is the panel of experts being conservative
>>> enough in making their decisions? If there is any suspicion about
>>> two characters being a variant, would a conservative approach treat
>>> them as variants?
>>> What is the end goal of identifying variants? If it is to avoid the
>>> use of IDNs for phishing, then the only approach possible should be
>>> a conservative approach.
>>> Kindest regards,
>>> 
>>> Olivier
>>> 
>>> On 21/10/2021 05:17, Bill Jouris via CPWG wrote:
>>> 
>>> After some of the discussion in the chat in this morning's meeting,
>>> I feel like a little more extended discussion about variants might
>>> be helpful.
>>> 
>>> The repertoire for the Latin script consists of "codepoints" -- some
>>> are letters and some are letters plus diacritics.  "Variants" are
>>> pairs of codepoints which are indistinguishable.  That is, in the
>>> process that the Panel used, 5 of the 7 experts on the panel
>>> couldn't see a difference.  The Latin GP did not look at diacritics
>>> per se.  Just at codepoints which might involve diacritics.
>>> 
>>> Thus, a codepoint consisting of a letter with a caron diacritic
>>> ( ̌ ) and a codepoint with the same letter combined with a breve
>>> diacritic ( ̆ ) may always result in a variant pair, but only
>>> because the Panel's comparison worked out that way.  For example, a
>>> G with caron (ǧ) and a G with breve (ğ) are variants.  On the
>>> other hand, a caron and a macron ( ¯ ) never result in a variant
>>> pair.
>>> 
>>> However, some cases with diacritics are mixed.  For example, a
>>> codepoint consisting of a letter with a dot above ( ˙ ) and a
>>> codepoint consisting of a letter with an acute accent result in a
>>> variant pair for letters C (ċ vs ć), N (ṅ vs ń), and Z (ż vs
>>> ź).  But, in the Panel's original finding, not for letters E (ė vs
>>> é) and I (i vs í).
>>> 
>>> (Note that a majority of the Panel found the vowels to produce
>>> variants as well.  Just not a supermajority, as required by the
>>> process the Panel had adopted.  As a result, the Panel's official
>>> position is that, in various cases not just this one, even though a
>>> majority of the experts, looking side by side, could not see a
>>> difference, the average "reasonably careful user" will somehow
>>> magically notice the difference when looking at a domain name.)
>>> 
>>> Then we have cross-script variants, including those identified by
>>> other Panels.  For example, the Greek Panel found that the Greek
>>> letter Iota was a variant both of the Latin letter I and the Latin
>>> letter I with acute.  As a result I and I with acute became
>>> variants.
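The Iota example above reads like a transitive closure: if A is a variant of B, and B of C, then A and C land in the same variant set. A minimal Python sketch of that idea, using a small union-find, follows. It assumes variant relations are always closed transitively, which the message illustrates with one case but does not state as a general rule, and the function name is made up for illustration.

```python
def variant_sets(pairs):
    """Group codepoints into sets by closing variant pairs transitively."""
    parent = {}

    def find(x):
        # Locate the representative of x's set, shortening the path on the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two sets

    groups = {}
    for codepoint in list(parent):
        groups.setdefault(find(codepoint), set()).add(codepoint)
    return list(groups.values())


# Pairs taken from the example in the message (lower-case forms):
# Greek iota ~ Latin i, and Greek iota ~ Latin i with acute.
sets = variant_sets([("\u03b9", "i"), ("\u03b9", "\u00ed")])
```

Here `sets` contains a single set holding all three codepoints, so Latin i and i with acute end up as variants of each other even though no pair listed them directly. The same mechanism is how one set of mutual variants can grow large.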
>>> 
>>> But there is no Greek letter which is a variant of the Latin letter
>>> E.  So we are left with a situation where the dot above diacritic
>>> and the acute produce variants for all letters EXCEPT for the letter
>>> E.  (When I suggested that, for consistency, we should make the
>>> letter E case a variant as well, the response was "It is more
>>> important that we follow our process than that we have
>>> consistency.")
>>> 
>>> TLDs consist of a series of codepoints.  Proposed TLDs which differ
>>> _only_ by one or more variants from another TLD will automatically
>>> be rejected by the software.  For example, .çom would be allowed,
>>> despite its similarity to .com, because C with Cedilla is not a
>>> variant of C.  Also .сом (using Cyrillic letters) would be allowed
>>> because, while C and the Cyrillic letter Es are variants, and O and
>>> the Cyrillic letter O are variants, the letter M and the Cyrillic
>>> letter Em are not variants (the Panel was directed to ignore Upper
>>> Case when deciding what might confuse users).  But .cóm would be
>>> rejected, because O and O with acute are variants.
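The automatic check described above can be sketched roughly as follows in Python. The variant table holds only the three pairs named in the message. The function names are hypothetical, and the sketch compares labels position by position, so it does not cover multi-codepoint cases such as Sharp S versus double S. The real software may well work differently.

```python
# Variant pairs named in the message; a real table would be far larger.
VARIANT_PAIRS = [
    ("c", "\u0441"),  # Latin c and Cyrillic es
    ("o", "\u043e"),  # Latin o and Cyrillic o
    ("o", "\u00f3"),  # Latin o and o with acute
]


def build_variant_index(pairs):
    """Map each codepoint to the set of codepoints listed as its variants."""
    index = {}
    for a, b in pairs:
        index.setdefault(a, set()).add(b)
        index.setdefault(b, set()).add(a)
    return index


VARIANTS = build_variant_index(VARIANT_PAIRS)


def conflicts_with(candidate, existing):
    """True if the labels are identical or differ ONLY by variant codepoints."""
    if len(candidate) != len(existing):
        return False
    return all(x == y or y in VARIANTS.get(x, set())
               for x, y in zip(candidate, existing))
```

With this table, `conflicts_with("c\u00f3m", "com")` is true, while the C with Cedilla label and the all-Cyrillic label both pass, because Cedilla C and Cyrillic Em have no variant entry.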
>>> 
>>> "Confusables" are pairs of codepoints which some of the experts
>>> could not distinguish, just not enough to be designated as variants.
>>> Confusables are intended as suggestions for the panel which will
>>> manually review the proposed TLDs.
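The variant / confusable distinction can be written as a simple threshold rule. The Python sketch below assumes that "5 of the 7" means at least 5, and that a single expert failing to see a difference is enough for "confusable". Neither boundary is spelled out in the thread, and the function name and labels are illustrative only.

```python
PANEL_SIZE = 7
VARIANT_THRESHOLD = 5  # supermajority described in the thread


def classify_pair(cannot_distinguish,
                  panel_size=PANEL_SIZE,
                  threshold=VARIANT_THRESHOLD):
    """Classify a codepoint pair by how many experts saw no difference."""
    if not 0 <= cannot_distinguish <= panel_size:
        raise ValueError("vote count must be between 0 and the panel size")
    if cannot_distinguish >= threshold:
        return "variant"
    if cannot_distinguish > 0:
        return "confusable"  # assumed lower bound: at least one expert
    return "distinct"
```

Under these assumptions a simple majority of 4 out of 7 yields "confusable" rather than "variant", which is the situation described for the letters E and I with dot above versus acute.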
>>> 
>>> I hope this all will help everyone understand what we are looking at
>>> here.
>>> 
>>> Regards,
>>> Bill Jouris
>>> 
>>> _______________________________________________
>>> CPWG mailing list
>>> CPWG at icann.org
>>> https://mm.icann.org/mailman/listinfo/cpwg
>>> 
>>> _______________________________________________
>>> By submitting your personal data, you consent to the processing of
>>> your personal data for purposes of subscribing to this mailing list
>>> accordance with the ICANN Privacy Policy
>>> (https://www.icann.org/privacy/policy) and the website Terms of
>>> Service (https://www.icann.org/privacy/tos). You can visit the
>>> Mailman link above to change your membership status or
>>> configuration, including unsubscribing, setting digest-style
>>> delivery or disabling delivery altogether (e.g., for a vacation),
>>> and so on.
>>> 
>>> --
>>> Olivier MJ Crépin-Leblond, PhD
>>> http://www.gih.com/ocl.html
>> 
>> Links:
>> ------
>> [1] https://go.onelink.me/107872968?pid=InProduct&c=Global_Internal_YGrowth_AndroidEmailSig__AndroidUsers&af_wl=ym&af_sub1=Internal&af_sub2=Global_YGrowth&af_sub3=EmailSignature
>> [2] http://oeamtc.at
> 