[UA-discuss] FW:I-D Action: draft-klensin-idna-rfc5891bis-00.txt

Mon Mar 13 16:40:03 UTC 2017

On 3/13/2017 7:46 AM, nalini.elkins at insidethestack.com wrote:
> It is an interesting problem. For example, we took one 6 character name of a business which is trademarked & ran it through my algorithm, we came up with over 1 million possible permutations. This is because you can use more than one character look-alike.
>
> Lest you think that this doesn't happen, we have already found names registered which use more than one confusable.  And, I have only just started my testing.

I keep coming back to the concept of "perceptual distance".

When you look at individual code points and call them "confusable", you 
assert that each such pair has a perceptual distance small enough to 
fall below a certain threshold. But that threshold is not zero, nor is 
the actual perceptual distance between most code points that are 
considered confusable.

There are two interesting issues with this.

One is that you may have two pairs of confusables that have one 
common member, but the other two members are far enough apart in 
perceptual distance to no longer meet your threshold of confusability.

The other is that code points by themselves are really not relevant for 
this, because the real metric should be the perceptual distance between 
labels (or even FQDNs).

Just based on labels: if you simultaneously substitute more than one 
code point in a label with a potential confusable, the result may be 
that the label is now further apart in perceptual space from the 
original label than if you had only substituted one code point at a 
time. The reason is that people read words, and having a single code 
point altered may not interfere with the process of reading that word, 
but once you change two or more, the situation is different.

As a result, I would tentatively conclude that your claim that those 1 
million permutations are all equally confusable with the original label 
is likely specious. It is reasonable to suspect that a good portion of 
those labels would look distinctly "odd", if your substitution is based 
on ordinary single-code-point confusability thresholds.
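
To make the point concrete, here is a minimal sketch in Python. Everything 
in it is an assumption for illustration: the CONFUSABLES table, the integer 
distance units, and above all the idea that the distance of a label is the 
plain sum of its per-substitution distances (real perception of whole 
words is unlikely to be that simple). It only shows that the number of 
permutations and the number of permutations under a threshold are two 
different counts.

```python
from itertools import product

# Hypothetical per-code-point confusables with made-up perceptual
# distances in arbitrary integer units (0 = identical appearance).
CONFUSABLES = {
    "a": [("\u0430", 1)],             # Cyrillic small a
    "e": [("\u0435", 1)],             # Cyrillic small ie
    "l": [("1", 4), ("I", 3)],        # digit one, capital i
    "o": [("0", 4), ("\u043e", 1)],   # digit zero, Cyrillic small o
}

def variants(label):
    """Yield (variant, distance) for every combination of substitutions.

    Assumption: label distance is the sum of per-substitution distances.
    """
    options = [[(ch, 0)] + CONFUSABLES.get(ch, []) for ch in label]
    for combo in product(*options):
        text = "".join(ch for ch, _ in combo)
        dist = sum(d for _, d in combo)
        yield text, dist

def count_within(label, threshold):
    """Return (all variants, variants at or under threshold).

    The original label itself is excluded from both counts.
    """
    all_variants = [(t, d) for t, d in variants(label) if t != label]
    close = [t for t, d in all_variants if d <= threshold]
    return len(all_variants), len(close)
```

With these toy numbers, count_within("apple", 4) gives (11, 7): eleven 
permutations exist, but only seven stay within 4 units of the original.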

That said, there are some labels in certain scripts for which the 
variant code points are true equivalents (whether visual, phonetic or 
semantic). In those cases, making multiple substitutions can result in 
rather large multiplicities of fully equivalent labels.

(Note that if one starts with zero, or almost zero, distance in 
perceptual space, then even multiple substitutions can be expected to 
result in negligible perceptual distance between variant labels; but 
that is not usually the case for the kinds of instances considered under 
"confusable".)

Finally, in considering labels, you'll pick up the 'rn' vs. 'm' issue: 
that is, confusables are not 1:1 in code point space; they may be 1:n 
or even n:m. (The same is true for variants that represent true 
equivalence.)
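
As a rough sketch of what n:m substitution means in practice, the Python 
below replaces sequences rather than single code points. The 
SEQUENCE_CONFUSABLES table and the function name are hypothetical and 
chosen only for illustration; a real table would be much larger and 
script-specific.

```python
# Hypothetical sequence-level confusables; source and target sequences
# may differ in length, so the mapping is not 1:1 in code point space.
SEQUENCE_CONFUSABLES = {
    "rn": ["m"],
    "m": ["rn"],
    "vv": ["w"],
    "w": ["vv"],
}

def single_substitutions(label):
    """Return the set of labels reachable by one sequence substitution."""
    results = set()
    for source, targets in SEQUENCE_CONFUSABLES.items():
        start = label.find(source)
        while start != -1:
            for target in targets:
                results.add(
                    label[:start] + target + label[start + len(source):]
                )
            # Continue searching after this match position.
            start = label.find(source, start + 1)
    return results
```

For example, single_substitutions("modern") yields "modem" and "rnodern", 
neither of which has the same length as the original label.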

A./

PS: A test of whether variants are true equivalences is whether they 
satisfy not only symmetry but also transitivity. Anything with a 
measurable perceptual distance is likely not transitive; just think of 
two labels that are (barely) not confusable and now imagine an "average 
shape" label between them. The latter would be confusable with both, so 
the three together would not obey the transitivity constraint.
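
The "average shape" argument can be checked with a toy model. In the 
Python sketch below, shapes are reduced to positions on a line and 
distance is the absolute difference; that one-dimensional model, the 
function names, and the threshold are all assumptions made purely for 
illustration.

```python
from itertools import permutations

def confusable(x, y, threshold):
    """Hypothetical 1-D model: shapes are numbers, distance is |x - y|."""
    return abs(x - y) <= threshold

def is_transitive(items, related):
    """True if related(a, b) and related(b, c) always imply related(a, c)."""
    return all(
        related(a, c)
        for a, b, c in permutations(items, 3)
        if related(a, b) and related(b, c)
    )

# Two shapes just beyond the threshold of 4, plus their average.
shapes = [0, 3, 6]
threshold_relation = lambda x, y: confusable(x, y, 4)
```

Here is_transitive(shapes, threshold_relation) returns False: 0 and 3 are 
confusable, 3 and 6 are confusable, but 0 and 6 are not. With a threshold 
of 0 (true equivalence only) the relation is trivially transitive.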
