[CPWG] Variants and Process
gopal at annauniv.edu
Sun Oct 24 00:48:27 UTC 2021
Dear Mr. Bill Jouris,
Many thanks for provoking more thoughts.
I will catch up on the ICANN Community Wiki.
I am sorry to say that the mailing list needs more tracking time than
the ICANN Community Wiki.
I hope there is a method in ICANN to summarize threads being discussed
on the mailing list and post the summaries to the ICANN Community Wiki.
Sincerely,
Gopal T V
0 9840121302
https://vidwan.inflibnet.ac.in/profile/57545
https://www.facebook.com/gopal.tadepalli
PS: The CPWG mailing list ought to be automatically tagged to the CPWG
Space on the ICANN Community Wiki. In other words, I should be seeing all
the posts from within the Wiki too.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Dr. T V Gopal
Professor
Department of Computer Science and Engineering
College of Engineering
Anna University
Chennai - 600 025, INDIA
Ph : (Off) 22351723 Extn. 3340
(Res) 24454753
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
On 2021-10-23 23:46, Bill Jouris wrote:
> Dear Dr. Gopal,
>
> There may be such software out there. I can only say that, if there
> is, I am not familiar with it. I rather suspect that, if there is,
> one of the first parameters required is "How similar, or how
> different, do you want? Set a threshold." Followed by "Do you
> require consistency? That is, if two diacritics produce variants in
> some cases, must they do so in all cases?" Which, as you say, leaves
> the most critical question still to be answered.
>
> In the end, I think we are stuck with some variation of a "consensus
> of experts" judgement. The more cogent question is, What kinds of
> experts? That is, linguists? Or experts in human perception
> (specifically visual perception)? Or experts in the behavior of end
> users? The IDN project has opted, essentially, for linguists --
> whether by default or actual preference I do not know.
>
> Regards,
>
> Bill Jouris
>
> On Friday, October 22, 2021, 07:32:40 PM PDT, <gopal at annauniv.edu>
> wrote:
>
> Dear Bill Jouris,
>
> Many thanks again for your presentation to the CPWG on 6 October 2021.
>
> It has been a fantastic effort by your seven-member team from six
> different countries.
>
> Ref Slide #12: UNICODE 00FE and 01A5
>
> The quantification for decision making was based on a 5-point linear
> scale, with the seven experts using only the "2-4" range, and this for
> three popular typefaces.
>
> I know this is just one sample, and your question on the next slide,
> "How Much is Enough?", is vital.
>
> Is there a tool / simulator that makes it all more generic for larger
> samples, different languages and different quantification scales such as
> the Likert Scale?
>
> We can then anticipate the code generator within an acceptable confidence
> interval.
>
> Once again, a big thank you from me for such nice work and
> presentation.
>
> Please advise.
>
> Sincerely,
>
> Gopal T V
> 0 9840121302
> https://vidwan.inflibnet.ac.in/profile/57545
> https://www.facebook.com/gopal.tadepalli
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Dr. T V Gopal
> Professor
> Department of Computer Science and Engineering
> College of Engineering
> Anna University
> Chennai - 600 025, INDIA
> Ph : (Off) 22351723 Extn. 3340
> (Res) 24454753
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> On 2021-10-23 03:28, Bill Jouris via CPWG wrote:
>> Dear Roberto,
>>
>> Not all that off-topic. In general, you are correct that combinations
>> of letters got ignored. For example, a Latin letter R, followed by a
>> Latin letter N is, to my mind, hard to distinguish from a Latin letter
>> M. If you saw .corn, would you realize it was about maize, rather
>> than being a normal .com? But it didn't get considered in identifying
>> variants.
>>
>> The Sharp S is the exception. The panel concluded that the Sharp S
>> (ß) and a double S (ss) are variants. Most variants are
>> bidirectional -- that is, it doesn't matter which one was registered
>> first, the other is blocked. But this case is different. If the name
>> with a double S is registered first, then the Sharp S is indeed
>> blocked. However, if the name with Sharp S is registered first, then
>> the variant is considered "allocatable." That is, the same name with a
>> double S rather than Sharp S _can_ be registered, provided:
>> 1) ALL of the instances of Sharp S in the name (if there is more than
>> one) are changed to double S, and
>> 2) the name is registered to the same registrant.
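The asymmetric Sharp S rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this message, not ICANN's software; the function names, the registrant identifiers and the sample labels are all hypothetical.

```python
SHARP_S = "\u00df"  # ß


def to_double_s(label: str) -> str:
    """Replace every Sharp S with 'ss' (condition 1: ALL instances change)."""
    return label.replace(SHARP_S, "ss")


def disposition(registered: str, registrant: str,
                applied: str, applicant: str) -> str:
    """Return 'not a variant', 'blocked' or 'allocatable' for `applied`.

    Simplification (an assumption): two labels count as Sharp S variants
    when they become identical once every ß is written as ss.
    """
    if registered == applied or to_double_s(registered) != to_double_s(applied):
        return "not a variant"
    # Only one direction is allocatable: the Sharp S name was registered
    # first, the new name replaces ALL ß with ss, and the registrant matches.
    if (SHARP_S in registered
            and applied == to_double_s(registered)
            and applicant == registrant):
        return "allocatable"
    return "blocked"


print(disposition("strasse", "A", "stra\u00dfe", "A"))  # double S came first
print(disposition("stra\u00dfe", "A", "strasse", "A"))  # Sharp S came first
```

With the double S name registered first, the Sharp S name is blocked even for the same registrant; in the other direction the double S name is allocatable, but only to the same registrant.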
>>
>> On the other hand, the possibility of substituting a vowel with
>> diaeresis for the same vowel followed by E did not come up. That is
>> the way I learned German (as an American) long ago. But the native
>> German speakers on the Panel did not consider it worth worrying about.
>>
>> Sorry if that doesn't totally clarify things. But that's all I've got
>> on the subject.
>>
>> Bill Jouris
>>
>> On Thursday, October 21, 2021, 11:47:20 PM PDT, Roberto Gaetano
>> <roberto_gaetano at hotmail.com> wrote:
>>
>> Dear Bill,
>>
>> I wonder whether I am off-topic with this question, but here it is
>> anyway.
>> Has the Latin GP considered an additional potential confusion coming
>> from cases like the German equivalency between “ae” and “ä”
>> or “ss” and “ß”? Just to give an example, the Austrian
>> Touring Club (ÖAMTC) has the site oeamtc.at [2], as it is customary
>> in German-speaking countries to get around the problem in this way.
>>
>> This is most probably out of scope, because the work is likely to be
>> limited to single characters and not combinations of characters, but
>> from the user’s point of view it could be a source of confusion
>> anyway.
>>
>> Thanks,
>> Roberto
>>
>>> On 21.10.2021, at 19:47, Bill Jouris via CPWG <cpwg at icann.org>
>>> wrote:
>>>
>>> Dear Olivier,
>>>
>>> That is the problem I see as well. My sense is that both the Latin
>>> GP, and the Integration Panel (which is the next level higher)
>>> desire primarily to minimize the number of variants. Two codepoints
>>> which are identical, such as the Latin schwa and the Latin turned E,
>>> obviously cannot be distinguished by anyone, and so are necessarily
>>> variants. (Although one of my fellow Panel members argued against
>>> variant status even for that specific case.) But how strict the
>>> constraints were on making two codepoints variants reflects that
>>> desire for minimization. Also, in at least one case, the Integration
>>> Panel requested the Latin GP review (and modify) some variant findings
>>> because one set of codepoints which were variants of each other was
>>> "too large." ("Too large" wasn't defined. Nor was there indication
>>> of why one would care. Certainly it wouldn't impact the performance
>>> of the software doing the automatic filtering of proposed TLDs.)
>>>
>>> Given that
>>> a) the Panel members are experts,
>>> b) we were doing side-by-side comparisons, and
>>> c) we knew that we were looking at two different codepoints
>>> it seemed to me that if any of us couldn't tell the difference, then
>>> neither could the average user looking at a domain name in
>>> isolation. Setting a higher threshold seems to me like phishing,
>>> and especially pharming, enablement.
>>>
>>> It also might appear that having a group of codepoints which are not
>>> variants, but which users cannot really distinguish, provides a
>>> marketing opportunity. Not to sell to bad actors, who are typically
>>> one-off buyers and so not worth pursuing. But to sell defensive
>>> registrations to legitimate registrants, who merely want to make
>>> sure that their customers find them. Such defensive registrations
>>> would be likely to be renewed indefinitely, making them worthwhile
>>> even in a low margin business.**
>>>
>>> Bill
>>>
>>> ** 5 of the 7 members of the Latin Panel being employees of one or
>>> another of the contracted parties. I believe most of them were
>>> sincerely making a good faith effort to do the right thing. But
>>> their experience there may nevertheless have colored their
>>> perceptions.
>>>
>>> Sent from Yahoo Mail on Android [1]
>>>
>>> On Thu, Oct 21, 2021 at 2:13 AM, Olivier MJ Crépin-Leblond
>>> <ocl at gih.com> wrote:
>>>
>>> Dear Bill,
>>>
>>> thank you for explaining this in further detail. The problem I see
>>> with the process here, is that *experts* have been used to notice a
>>> difference. Because they are experts, they might be able to see
>>> differences which the average Internet end user will not. And this
>>> is the concern I have: is the panel of experts being conservative
>>> enough in making their decisions? If there is any suspicion about
>>> two characters being a variant, would a conservative approach treat
>>> them as variants?
>>> What is the end goal of identifying variants? If it is to avoid the
>>> use of IDNs for phishing, then the only approach possible should be
>>> a conservative approach.
>>> Kindest regards,
>>>
>>> Olivier
>>>
>>> On 21/10/2021 05:17, Bill Jouris via CPWG wrote:
>>>
>>> After some of the discussion in the chat in this morning's meeting,
>>> I feel like a little more extended discussion about variants might
>>> be helpful.
>>>
>>> The repertoire for the Latin script consists of "codepoints" -- some
>>> are letters and some are letters plus diacritics. "Variants" are
>>> pairs of codepoints which are indistinguishable. That is, in the
>>> process that the Panel used, 5 of the 7 experts on the panel
>>> couldn't see a difference. The Latin GP did not look at diacritics
>>> per se. Just at codepoints which might involve diacritics.
>>>
>>> Thus, a codepoint consisting of a letter with a caron diacritic ( ̌ )
>>> and a codepoint with the same letter combined with a breve
>>> diacritic ( ̆ ) may always result in a variant pair, but only
>>> because the Panel's comparison worked out that way. For example, a
>>> G with caron (ǧ) and a G with breve (ğ) are variants. On the
>>> other hand, a caron and a macron ( ¯ ) never result in a variant
>>> pair.
>>>
>>> However, some cases with diacritics are mixed. For example, a
>>> codepoint consisting of a letter with a dot above ( ˙ ) and a
>>> codepoint consisting of a letter with an acute accent result in a
>>> variant pair for letters C (ċ vs ć), N (ṅ vs ń), and Z (ż vs
>>> ź ). But, in the Panel's original finding, not for letters E (ė vs
>>> é), and I (i vs í).
>>>
>>> (Note that a majority of the Panel found the vowels to produce
>>> variants as well. Just not a supermajority, as required by the
>>> process the Panel had adopted. As a result, the Panel's official
>>> position is that, in various cases not just this one, even though a
>>> majority of the experts, looking side by side, could not see a
>>> difference, the average "reasonably careful user" will somehow
>>> magically notice the difference when looking at a domain name.)
>>>
>>> Then we have cross-script variants, including those identified by
>>> other Panels. For example, the Greek Panel found that the Greek
>>> letter Iota was a variant both of the Latin letter I and the Latin
>>> letter I with acute. As a result I and I with acute became
>>> variants.
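The way a cross-script finding pulls two Latin codepoints together can be sketched with a small union-find in Python. It assumes that variant relations are treated as transitive, which is what the Iota example above suggests; the function name is hypothetical and the two input pairs are the only ones taken from this message.

```python
def variant_sets(pairs):
    """Group codepoints into variant sets, treating variant pairs as transitive."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for codepoint in list(parent):
        groups.setdefault(find(codepoint), set()).add(codepoint)
    return list(groups.values())


# Greek Panel findings quoted above: Iota ~ Latin I, Iota ~ Latin I with acute.
greek_findings = [("\u03b9", "i"), ("\u03b9", "\u00ed")]
print(variant_sets(greek_findings))
```

Both pairs end up in a single set of three codepoints, so Latin I and I with acute are variants of each other even though the Latin Panel's own comparison did not say so.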
>>>
>>> But there is no Greek letter which is a variant of the Latin letter
>>> E. So we are left with a situation where the dot above diacritic
>>> and the acute produce variants for all letters EXCEPT for the letter
>>> E. (When I suggested that, for consistency, we should make the
>>> letter E case a variant as well, the response was "It is more
>>> important that we follow our process than that we have
>>> consistency.")
>>>
>>> TLDs consist of a series of codepoints. Proposed TLDs which differ
>>> _only_ by one or more variants from another TLD will
>>> automatically be rejected by the software. For example, .çom
>>> would be allowed, despite its similarity to .com, because C with
>>> Cedilla is not a variant of C. Also .сом (using Cyrillic
>>> letters) would be allowed because, while C and the Cyrillic letter
>>> Es are variants, and O and the Cyrillic letter O are variants, the
>>> letter M and the Cyrillic letter Em are not variants (the Panel was
>>> directed to ignore Upper Case when deciding what might confuse
>>> users). But .cóm would be rejected, because O and O with acute are
>>> variants.
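The automatic check described above can be sketched as a position-by-position comparison in Python. This is a simplified illustration, not the actual filtering software: it assumes each label is a sequence of single precomposed codepoints, it ignores multi-character variants such as Sharp S versus double S, and the table holds only the three variant pairs named in this message. The function names are hypothetical.

```python
# Variant pairs named in the message; everything else is treated as distinct.
VARIANT_PAIRS = {
    ("o", "\u00f3"),  # Latin o / o with acute
    ("c", "\u0441"),  # Latin c / Cyrillic es
    ("o", "\u043e"),  # Latin o / Cyrillic o
}


def is_variant(a: str, b: str) -> bool:
    """True when two codepoints are identical or listed as a variant pair."""
    return a == b or (a, b) in VARIANT_PAIRS or (b, a) in VARIANT_PAIRS


def differs_only_by_variants(proposed: str, existing: str) -> bool:
    """True when `proposed` differs from `existing` only at variant positions."""
    if proposed == existing or len(proposed) != len(existing):
        return False
    return all(is_variant(p, e) for p, e in zip(proposed, existing))


for label in ("\u00e7om", "\u0441\u043e\u043c", "c\u00f3m"):
    verdict = "rejected" if differs_only_by_variants(label, "com") else "allowed"
    print(label, verdict)
```

The three labels are the ones from the paragraph above: the C with Cedilla label and the all-Cyrillic label each contain at least one position that is not a variant, so they pass, while the label with O with acute differs from .com only by a variant.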
>>>
>>> "Confusables" are pairs of codepoints which some of the experts
>>> could not distinguish, just not enough to be designated as variants.
>>> Confusables are intended as suggestions for the panel which will
>>> manually review the proposed TLDs.
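The difference between a variant and a confusable comes down to a vote count, which a short Python sketch makes concrete. The 5-of-7 figure is from this message; reading it as "at least 5", treating any smaller non-zero count as a confusable, and the function name itself are assumptions.

```python
def classify_pair(cannot_distinguish: int, panel_size: int = 7,
                  threshold: int = 5) -> str:
    """Classify a codepoint pair from the number of experts who saw no difference."""
    if not 0 <= cannot_distinguish <= panel_size:
        raise ValueError("vote count must be between 0 and the panel size")
    if cannot_distinguish >= threshold:
        return "variant"      # supermajority reached: handled automatically
    if cannot_distinguish > 0:
        return "confusable"   # flagged for the manual review panel
    return "distinct"


# A simple majority (4 of 7) is not enough for variant status.
print(classify_pair(4))
print(classify_pair(5))
```

Under these assumptions a pair that 4 of the 7 experts could not tell apart is only a confusable, which is the situation described earlier for E and I with dot above versus acute.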
>>>
>>> I hope this all will help everyone understand what we are looking at
>>> here.
>>>
>>> Regards,
>>> Bill Jouris
>>>
>>> _______________________________________________
>>> CPWG mailing list
>>> CPWG at icann.org
>>> https://mm.icann.org/mailman/listinfo/cpwg
>>>
>>> _______________________________________________
>>> By submitting your personal data, you consent to the processing of
>>> your personal data for purposes of subscribing to this mailing list
>>> accordance with the ICANN Privacy Policy
>>> (https://www.icann.org/privacy/policy) and the website Terms of
>>> Service (https://www.icann.org/privacy/tos). You can visit the
>>> Mailman link above to change your membership status or
>>> configuration, including unsubscribing, setting digest-style
>>> delivery or disabling delivery altogether (e.g., for a vacation),
>>> and so on.
>>>
>>> --
>>> Olivier MJ Crépin-Leblond, PhD
>>> http://www.gih.com/ocl.html
>>
>> Links:
>> ------
>> [1] https://go.onelink.me/107872968?pid=InProduct&c=Global_Internal_YGrowth_AndroidEmailSig__AndroidUsers&af_wl=ym&af_sub1=Internal&af_sub2=Global_YGrowth&af_sub3=EmailSignature
>> [2] http://oeamtc.at