[Latingp] From IP: Diacritics below a security risk?

Thu Aug 30 00:20:07 UTC 2018

I take a couple of things from this letter from the IP: 
- First, the IP considers security to be a major priority in our consideration of variants and/or confusibles. 
- In pursuit of that end, the IP considers it important whether users will be able to distinguish between code points, especially when presented with only one code point rather than two alternatives next to each other for comparison.  The IP's note talks specifically about diacritics below the line, but those are hardly the only cases where this is an issue.

Among languages written using the Latin script, pretty much all of them use the basic 26 letters (code points 0061 - 007A).  But we have identified over 100 additional glyphs in our repertoire.  All of those have one characteristic in common: the vast majority of Internet users are not familiar with any language which uses them.  Because users are not familiar with those code points, they will have difficulty distinguishing between them.

Consider, just by way of example, these 4 code points:
- ă (0103)
- ǎ (01CE)
- ā (0101)
- ã (00E3)
Someone who is not familiar with more than one of these will, inevitably, perceive the one that he is familiar with whenever presented with any of the 4.  People see what they expect to see, what is familiar.  That happens even when they might be physically capable of distinguishing between two code points IF they were presented with them side-by-side.  Because, in the kind of phishing attack discussed in the e-mail, they aren't presented with two options.  They are presented with something that looks like what they are expecting to see, and don't see anything sufficiently amiss to doubt it.
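
To make the comparison concrete, here is a minimal Python sketch that looks up the four code points listed above in the standard Unicode database. It only uses the standard `unicodedata` module; the helper name `describe` is made up for this illustration. It shows that all four decompose to the same base letter and differ only in a single combining mark above.

```python
import unicodedata

# The four code points from the list above.
CODE_POINTS = [0x0103, 0x01CE, 0x0101, 0x00E3]

def describe(cp):
    """Return (character name, base letter, names of combining marks)."""
    ch = chr(cp)
    # NFD splits a precomposed letter into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", ch)
    base = decomposed[0]
    marks = [unicodedata.name(m) for m in decomposed[1:]]
    return unicodedata.name(ch), base, marks

for cp in CODE_POINTS:
    name, base, marks = describe(cp)
    print(f"U+{cp:04X}  {name}  base={base!r}  marks={marks}")
```

This says nothing about how similar the marks look in any given font; it only shows what the code points have in common structurally.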

A similar exercise can be done with:
- è (00E8) 
- é (00E9)
- ė (0117)
Someone who is familiar with two of them can probably distinguish between those two.  But someone who (like a far larger number of Internet users) is only familiar with one will see what he knows.  And someone who (like the vast majority of Internet users who use the Latin script) is not familiar with any of them will only notice that there is something above the letter -- but not at all what that something is. 
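
The "sees that there is something above the letter, but not what" case can be modeled very crudely in Python by throwing away every combining mark and comparing what is left. This is only a sketch under that assumption: the function names are invented here, the sample labels are hypothetical, and this is not the confusable-skeleton algorithm from any Unicode specification.

```python
import unicodedata

def strip_marks(label):
    """Drop all nonspacing combining marks (category Mn) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", label)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

def same_without_marks(a, b):
    """True if two different labels become identical once marks are removed."""
    return a != b and strip_marks(a) == strip_marks(b)

# Hypothetical labels that differ only in the diacritic on the final letter.
for other in ["caf\u00e8", "caf\u0117"]:
    print(other, same_without_marks("caf\u00e9", other))
```

A reader who knows none of the three letters is, in effect, applying something like `strip_marks` without realizing it.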

This is not to say that every letter plus diacritic is a variant of that same letter with any other diacritic.  Far from it.  But what we do have is something like this.  Take the letter A.  Our repertoire includes 24 variations, which gives almost 290 pairs.  Of those, one (0103 and 01CE) cannot be distinguished by eye at normal type sizes.  But another 51 are close enough that a normal user (i.e. someone who is not a trained linguist, who has not spent the last two years immersed in the various Latin script code points) will readily mistake one for another.  
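
As a rough check on the size of the problem, the number of unordered pairs among n code points is n(n-1)/2. The short Python sketch below assumes that "pairs" means unordered pairs of distinct code points; the helper name is made up. The tallies in the paragraph above (1, 51, about 170) come from going through the pairs by hand and are not reproduced by this code.

```python
from math import comb

def unordered_pairs(n):
    """Number of ways to pick 2 distinct items out of n, ignoring order."""
    return comb(n, 2)

# 24 variations alone, and 24 variations plus the plain letter.
print(unordered_pairs(24))  # 276
print(unordered_pairs(25))  # 300
```

The exact total therefore depends on which code points are counted, but either way it is a few hundred pairs for a single base letter.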

Now one might argue that all of those should be left as Confusibles.  (Ignoring the detail that there are an additional 170 or so pairs which really are confusibles.)  But what, exactly, is the benefit of refusing to acknowledge that they are variants?

I confess I cannot see any benefit to the Internet user community of drastically constricting what we consider a variant.  The closest thing to a benefit that I can see is that, by declining to spend the time actually checking the various pairs, we can finish sooner.  But that is a benefit to us as individuals who are on the Latin GP; it isn't a benefit to the people who will be using the results of our efforts. 

 Bill Jouris
Inside Products
bill.jouris at insidethestack.com
831-659-8360
925-855-9512 (direct)

   From: Sarmad Hussain <sarmad.hussain at icann.org>
 To: Latin GP <latingp at icann.org> 
 Sent: Tuesday, August 28, 2018 11:58 PM
 Subject: [Latingp] From IP: Diacritics below a security risk?

 Dear Latin GP members,
Kindly find below some feedback from IP for your consideration.
Regards,
Sarmad

TO: LatinGP
FROM: IP

There are recent and widely published examples of phishing attacks using Latin IDNs in which the key features involved were diacritics below the letter. Here is an example:

[image scrubbed]

Of all diacritics, diacritics below can be difficult to distinguish or be prone to clipping -- there is less space below the baseline than between the typical lowercase glyph and the top of the line. The example given above shows a further interaction with URL underlining - and not all display engines actually do as nice a job interrupting the underline as in the screen shot above. For example, here is how one system will render this (using a designated UI font - Segoe UI):

[image scrubbed]

Note, this code point (U+1E33) is in the MSR, as is U+1E35 (LATIN SMALL LETTER K WITH LINE BELOW). The second example contains U+1E35 -- while the effect does not show equally at all type sizes, from 12pt and below the LINE BELOW is reliably hidden. Here are the two examples at 10pt:

[image scrubbed]

The issue is not limited to "K". We see "B", "D", "L" and "N" with both DOT and LINE BELOW and "M" and "H" with DOT BELOW, all on the same page in the MSR.

It can be argued users have no working understanding of typography and would not reliably interpret small gaps or bulges in the underline as being related to an unfamiliar code point. This appears to make all diacritics below security-sensitive; however, the initial determination belongs to the relevant GPs.

Note by the way that the Devanagari LGR treats sequences containing NUKTA (a dot below) as variants in at least some cases and recent community comments for that script are calling for more variant sequences. However, while the feature is graphically analogous (dot below), each script works differently and there is no single a priori solution.
The IP would like to encourage the LatinGP (and any other GP facing cases like this) to explicitly examine this example and other cases like it, where code points can become indistinguishable in common usage scenarios for IDNs, and formally conclude whether and how to take these into account when designing their LGR.

At this point, the IP would expect the GP to:
* explicitly discuss this and other scenarios like it
* evaluate whether they constitute a security risk to the Root Zone
* come up with a reasoned decision as to whether and how to address them in the design of the Latin GP; and finally
* document both the decision and its rationale.

In coming to a decision, the GP may resolve:
1) to make them variants
2) to list them for attention as confusable
3) to take no action, because the GP feels that they do not represent a special security risk.

As part of the review of the Latin LGR, the IP will look at the background and rationale offered by the Latin GP in coming to its conclusion; note that if the IP feels that the facts considered and rationale documented do not support the conclusion reached by the GP it may raise objections at that time.
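
One way a GP could start the "explicitly examine" step for the cases in the IP's note is to enumerate which code points carry a mark below the baseline. The Python sketch below is a rough illustration only: it assumes that "diacritic below" can be approximated by canonical combining class 220 (Below) in the Unicode database, and the function name `marks_below` is made up. Attached marks such as cedilla and ogonek use class 202 and would need to be added separately if the GP wants them covered.

```python
import unicodedata

# Canonical combining class 220 means "Below" in the Unicode Character Database.
BELOW = 220

def marks_below(label):
    """Return the names of all class-220 combining marks found in a label."""
    decomposed = unicodedata.normalize("NFD", label)
    return [unicodedata.name(c) for c in decomposed
            if unicodedata.combining(c) == BELOW]

# The two code points named in the IP's note, plus plain "k" for contrast.
for ch in ["\u1e33", "\u1e35", "k"]:
    print(f"U+{ord(ch):04X}", marks_below(ch))
```

A check like this only flags candidates; whether a flagged pair is actually indistinguishable at typical sizes and with underlining still has to be judged by looking at rendered text.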
_______________________________________________
Latingp mailing list
Latingp at icann.org
https://mm.icann.org/mailman/listinfo/latingp

-------------- next part --------------
A non-text attachment was scrubbed...
Name: image002.png
Type: image/png
Size: 877 bytes
Desc: not available
URL: <http://mm.icann.org/pipermail/latingp/attachments/20180830/4092c10d/image002-0001.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.png
Type: image/png
Size: 131066 bytes
Desc: not available
URL: <http://mm.icann.org/pipermail/latingp/attachments/20180830/4092c10d/image001-0001.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image004.png
Type: image/png
Size: 776 bytes
Desc: not available
URL: <http://mm.icann.org/pipermail/latingp/attachments/20180830/4092c10d/image004-0001.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image003.png
Type: image/png
Size: 938 bytes
Desc: not available
URL: <http://mm.icann.org/pipermail/latingp/attachments/20180830/4092c10d/image003-0001.png>