[Latingp] Variant cross-script analysis worksheets

Wed May 30 17:52:35 UTC 2018

Dear Meikal, et al

The meeting minutes<https://docs.google.com/document/d/13KIVQlkHYc6_ib1ZDSbxRcLVlSHnhAVvc99ZrGwOxT0/edit> may not answer all the questions, but they do shed light on what transpired during the Brussels workshop and what the panel members agreed on. The Principles<https://docs.google.com/document/d/1IrT_kfildf1SumYUqjkaIkMT-TYx9IRqtuPMV4YvKXU/edit?usp=sharing> document (still in development) is based on those decisions in Brussels.

-Dennis

From: Meikal Mumin <meikal.mumin at uni-koeln.de>
Date: Tuesday, May 29, 2018 at 10:14 AM
To: Bill Jouris <bill.jouris at insidethestack.com>, Dennis Tan Tanaka <dtantanaka at verisign.com>, Michael Bauland <Michael.Bauland at knipp.de>, Sarmad Hussain <sarmad.hussain at icann.org>
Cc: Latin GP <LatinGP at icann.org>
Subject: [EXTERNAL] Re: [Latingp] Variant cross-script analysis worksheets

Dear colleagues,

I thought I'd try to take up this thread again after some silence. Hopefully Michael is back from a nice holiday and can chime in on the discussion too (I think he might not have received Sarmad's earlier email).

Obviously I cannot say what was decided in Brussels, since I could not join the group, which is why I had tried to put the question to our subgroup.

I think Sarmad has provided us with nearly all additional references we should consider as guidance on how to approach this highly complex task. My conclusion is that it is more complex than reducing things to "homoglyphs" but I do not think that (at least linguistically) we have a strong definition of homoglyphs, which would clearly set them apart from confusables, near-confusables, near-homoglyphs (and all the other terms we may have used to find an understanding of one another and the issues at hand).

Regarding the 'minority report' by Bill: skimming over it, I thought it would form an excellent basis for our chapter on variants, and in my eyes this work should not go to waste. Equally, we can re-use Sarmad's summary and expand it to integrate it into the introduction of the variant section of the proposal. As I tried to argue in the last teleconference, I believe it is important that we present not only the results of our work, but also the path we took to arrive at them, which means we have to discuss, at least briefly, the different considerations guiding our work.

As a pragmatic step I would suggest we continue for the moment with the very useful tables Dennis created, adding the few additional variant pairs I had suggested in comments. I don't think including them in our 2-pass review would be too much overhead, and if both reviewers come to the same conclusion that those potential variant pairs have a 3-5 rating, that is, that they are in fact not variants, we also do not need to have a theoretical discussion on the difference between homoglyphs, near-homoglyphs, confusables, etc. In this way, our decisions would be driven by a careful analysis of the data rather than by any preconceptions about what categorical relationship exists between some of these code points; that is, an a posteriori rather than an a priori analysis, if you will.
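
The 2-pass review described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the panel's actual procedure: the function name `classify_pair` is hypothetical, and so are the treatment of ratings 1-2 as "variant" and the handling of split ratings. The only rule taken from this message is that a pair rated 3-5 by both reviewers is not a variant.

```python
def classify_pair(rating_a: int, rating_b: int) -> str:
    """Combine two independent 1-5 ratings for one candidate variant pair."""
    for rating in (rating_a, rating_b):
        if not 1 <= rating <= 5:
            raise ValueError(f"rating must be between 1 and 5, got {rating}")
    # From the message: both reviewers rating 3-5 means "not a variant".
    if rating_a >= 3 and rating_b >= 3:
        return "not a variant"
    # Assumption: both reviewers rating 1-2 means the pair is a variant.
    if rating_a <= 2 and rating_b <= 2:
        return "variant"
    # Assumption: split ratings are referred back to the group.
    return "needs discussion"


if __name__ == "__main__":
    # Hypothetical ratings from reviewer A and reviewer B.
    for ratings in [(4, 5), (1, 2), (2, 4)]:
        print(ratings, "->", classify_pair(*ratings))
```

Under these assumptions only the split cases would need a discussion of categories at all; every other pair is settled by the ratings themselves.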

I hope this is helpful but let's keep up the discussion. I think we were making good progress with the tables and the 1-5 rating scale (rather than a binary choice).

Best wishes,

Meikal

On 19 May 2018 at 05:10, Sarmad Hussain <sarmad.hussain at icann.org<mailto:sarmad.hussain at icann.org>> wrote:
Dear All,

This is indeed a complex matter to address, and it therefore requires this continued discussion.  It may also be useful here to refer back to the RZ-LGR Procedure<https://www.icann.org/en/system/files/files/lgr-procedure-20mar13-en.pdf>.

The RZ-LGR Procedure, while defining “IDN variants” says that:
·  “An IDN variant, as understood here, is an alternate code point (or sequence of code points) that could be substituted for a code point (or sequence of code points) in a candidate label to create a variant label that is considered the “same” in some measure by a given community of Internet users.”

However, the Procedure also acknowledges immediately following the definition that:
·  “There is not general agreement of what that sameness requires, and many of the things people seem to want from that sameness are not technically achievable.”

While noting the benefits of defining IDN variants, the Procedure also acknowledges the limitations:
·  “The primary benefit of the LGR process is as a mechanism that delivers hands-off evaluation for these aspects.
·  “By doing so, the process may not be able to replace case-by-case analysis altogether: there will still be a role for additional types of review, such as for String Similarity, and which are not included in the LGR process.”
So, not all matters can be settled in the LGR.  A line has to be drawn between “same” and “similar”.

The LGR Procedure does note what it is desirable to have in the scope of the LGR:
·  “the LGR process is designed to clear the table of all the straightforward, non-subjective cases, mainly by returning a “blocked” disposition.
·  “Even for variants based on visual similarity, there exists a subset of evaluation rules that could be applied in an automated manner, obviating the need for further case-by case or even contextual review.”

But it notes that this should not go too far into the string similarity discussion:
·  “While the process described here could be expanded to address cases of visual similarity, that is not the primary intention”
·  “Finally, in investigating the possible variant relations, Generation Panels should ignore cases where the relation is based exclusively on aspects of visual similarity.”

One could infer from these statements in the RZ-LGR Procedure that:
1.       If two code points are considered “same” by the user community, these should be included as IDN variants (this is not limited to visual similarity, but could also include semantic equivalence, as in Chinese, orthographic conventions or spelling simplification, as in Arabic, homophonic relations, as in Ethiopic, etc., as determined by the respective script community)
2.       The “straightforward, non-subjective cases” of visual similarity could be included as IDN variants and blocked
3.       Beyond these, the analysis goes into the realm of string similarity review, which is beyond the intention of the LGR
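
The three inferences above can be read as a simple triage order. The sketch below is only one possible reading: the names `Disposition` and `triage` are hypothetical, the two boolean inputs stand for judgments a Generation Panel would have to make itself, and checking "same" before "straightforward visual similarity" is an assumption rather than something the Procedure prescribes.

```python
from enum import Enum


class Disposition(Enum):
    VARIANT = "IDN variant"
    VARIANT_BLOCKED = "IDN variant, blocked"
    OUT_OF_SCOPE = "string similarity review (outside the LGR)"


def triage(same_for_community: bool, straightforward_visual: bool) -> Disposition:
    """Map a candidate code point pair onto the three inferred cases."""
    # Inference 1: "same" for the user community, on any basis.
    if same_for_community:
        return Disposition.VARIANT
    # Inference 2: a straightforward, non-subjective case of visual similarity.
    if straightforward_visual:
        return Disposition.VARIANT_BLOCKED
    # Inference 3: everything else belongs to string similarity review.
    return Disposition.OUT_OF_SCOPE


if __name__ == "__main__":
    print(triage(False, True).value)
```

The code does not answer the hard question, which is how a panel decides whether either input is true for a given pair; that is where the line between “same” and “similar” has to be drawn.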

Generation Panels have been asked to draw the line based on these guidelines provided in the RZ-LGR Procedure.  For example, Cyrillic GP agreed to consider homoglyph relations with other related scripts for this purpose.  Neo-Brahmi GP has used a slightly different technique: it considers as cross-script variants those code points which members of both script communities in question find “indistinguishable”, even if these are not homoglyphs (see the blog<https://www.icann.org/news/blog/the-south-asian-eleven-progress-on-supporting-idns-in-scripts-from-the-region> for some more details).

Of course, the Latin GP also needs to draw these lines in its analysis to identify within-script and cross-script IDN variant cases.

Regards,
Sarmad

From: Latingp [mailto:latingp-bounces at icann.org<mailto:latingp-bounces at icann.org>] On Behalf Of Bill Jouris
Sent: Saturday, May 19, 2018 5:28 AM
To: Tan Tanaka, Dennis <dtantanaka at verisign.com<mailto:dtantanaka at verisign.com>>; Meikal Mumin <meikal at mumin.de<mailto:meikal at mumin.de>>
Cc: Tan Tanaka, Dennis via Latingp <latingp at icann.org<mailto:latingp at icann.org>>
Subject: Re: [Latingp] Variant cross-script analysis worksheets

It's been clear for some time, even before Brussels, that you think we should only look at homoglyphs.  (Also that you don't think that there are any in-script homoglyphs.  See the discussion about the schwa and the turned e.)

But there is a world of difference between agreeing, and merely deciding not to waste time arguing with a closed mind.  Which, for me, is what happened in the discussion in Brussels.

Bill Jouris
Inside Products
bill.jouris at insidethestack.com<mailto:bill.jouris at insidethestack.com>
831-659-8360
925-855-9512 (direct)

________________________________
From: "Tan Tanaka, Dennis" <dtantanaka at verisign.com<mailto:dtantanaka at verisign.com>>
To: Bill Jouris <bill.jouris at insidethestack.com<mailto:bill.jouris at insidethestack.com>>; Meikal Mumin <meikal at mumin.de<mailto:meikal at mumin.de>>
Cc: Michael Bauland <Michael.Bauland at knipp.de<mailto:Michael.Bauland at knipp.de>>; "Tan Tanaka, Dennis via Latingp" <latingp at icann.org<mailto:latingp at icann.org>>
Sent: Friday, May 18, 2018 1:43 PM
Subject: Re: [Latingp] Variant cross-script analysis worksheets

I believe we delimited the scope of variants for the Latin script in the face to face meeting in Brussels, did we not?

From: Bill Jouris <bill.jouris at insidethestack.com<mailto:bill.jouris at insidethestack.com>>
Reply-To: Bill Jouris <bill.jouris at insidethestack.com<mailto:bill.jouris at insidethestack.com>>
Date: Friday, May 18, 2018 at 2:18 PM
To: Dennis Tan Tanaka <dtantanaka at verisign.com<mailto:dtantanaka at verisign.com>>, Meikal Mumin <meikal at mumin.de<mailto:meikal at mumin.de>>
Cc: Michael Bauland <Michael.Bauland at knipp.de<mailto:Michael.Bauland at knipp.de>>, "Tan Tanaka, Dennis via Latingp" <latingp at icann.org<mailto:latingp at icann.org>>
Subject: [EXTERNAL] Re: [Latingp] Variant cross-script analysis worksheets

It is pretty clear, if one reads the MSR-3 document, that we are supposed to deal with Variants.  Which include, but are NOT limited to, homoglyphs.

Bill Jouris
Inside Products
bill.jouris at insidethestack.com<mailto:bill.jouris at insidethestack.com>
831-659-8360
925-855-9512 (direct)

________________________________
From: "Tan Tanaka, Dennis" <dtantanaka at verisign.com<mailto:dtantanaka at verisign.com>>
To: Meikal Mumin <meikal at mumin.de<mailto:meikal at mumin.de>>
Cc: "bill.jouris at insidethestack.com<mailto:bill.jouris at insidethestack.com>" <bill.jouris at insidethestack.com<mailto:bill.jouris at insidethestack.com>>; Michael Bauland <Michael.Bauland at knipp.de<mailto:Michael.Bauland at knipp.de>>; "Tan Tanaka, Dennis via Latingp" <latingp at icann.org<mailto:latingp at icann.org>>
Sent: Friday, May 18, 2018 10:20 AM
Subject: Re: [Latingp] Variant cross-script analysis worksheets

"we must deal with such confusable characters or sequences of characters in the context of variants"

No, we don’t. Confusability is not in scope. We established that the Latin panel will deal with homoglyphs or near-homoglyphs (i.e. font variation) in the cross-script context.
