[vip] Descriptive terminology

Cary Karp ck at nic.museum
Sat Sep 3 11:08:57 UTC 2011


Quoting Patrik (under a different subject heading):

> One thing we also have are two words spelled the same, but pronounced
> differently that means different things.
>
> Example (in Swedish):
>
> kista [chi:sta] : A suburb of Stockholm, where lots of IT industry is located
>
> kista [chista]  : A coffin

This is a perfect illustration of the concern that attaches to
homographs. The textbook definition of that term is, "two different
words in a language that are spelled the same." If we clarify that in
the basic terms of our own discussion, we might add, "written with the
same sequence of abstract characters and instantiated with the same
sequence of glyphs." The two words Patrik uses are normally
disambiguated by an upper-case initial letter in the first of them --
"Kista" -- but that device is not available in IDNA2008. It could
otherwise be argued that the upper-case distinction means that "Kista"
and "kista" are not true homographs in the textbook sense, but in the
discussion of IDNs the extra degree of freedom is useful.

Since a label has no intrinsic attribute of language and there is no
protocol restriction on the number of scripts that may appear in it, it
is also possible to write "kista" using a Cyrillic, rather than a
Latin, final letter. That gives "kistа", and since the CYRILLIC SMALL
LETTER A and the LATIN SMALL LETTER A are commonly represented with the
same glyph, the Swedish and the hybrid strings are visually identical.
Not confusable -- identical. If we are comfortable freeing the term
"homograph" from the requirement that it apply to words in the same
language, and are further willing to drop the requirement that the
objects of comparison be words at all, then "kista" and "kistа" may
also be termed homographs.
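
As an illustration only (this is not part of the original message, and
the helper name `describe` is hypothetical), a minimal Python sketch
using the standard `unicodedata` module shows that the two strings
differ in their final code point even though they are commonly drawn
with the same glyphs:

```python
import unicodedata

def describe(label):
    """Return (code point, Unicode name) pairs for each character in label."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in label]

latin = "kista"         # every letter is Latin
hybrid = "kist\u0430"   # final letter is CYRILLIC SMALL LETTER A

# The strings compare unequal although they usually render identically.
print(latin == hybrid)

# Only the last code point differs between the two labels.
print(describe(latin)[-1])
print(describe(hybrid)[-1])
```

The comparison is made on code points, which is all that a protocol
sees; nothing in the sketch consults a font or a rendering engine.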

This new sense of that term has become deeply entrenched in the
discussion of IDNs but I would like to call it into question. By using
it as a general descriptor for several different forms of the variance
that we are addressing, we are obscuring pivotal distinctions among
them. I urgently suggest that we expand our descriptive terminology with
the term "homoglyph" to designate situations such as the one used in the
Cyrillic/Latin illustration above.

Two sequences of identical glyphs used to represent different sequences
of code points can and do appear in the IDN space. No attribute of
either script uniformity or language is imposed on them. Labeling them
separately as homoglyphs allows them to be distinguished immediately
from cases where there really is a homographic concern in the accepted
textbook sense.

Establishing this distinction may prove of particular utility when
focusing on what may be the most urgent issue confronting us. That is
the one that arises when two users with identically labeled keyboards,
typing the same sequence of abstract characters, producing the same
sequence of displayed glyphs, have nonetheless generated two different
sequences of code points.
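
The message does not name a specific case. As an assumed illustration
only, the sketch below uses ARABIC LETTER YEH (U+064A) and ARABIC
LETTER FARSI YEH (U+06CC), which are commonly drawn alike in initial
and medial positions and which different keyboard layouts may emit. It
contrasts them with a precomposed/combining pair that Unicode
normalization does unify. The helper name `same_after_nfc` is
hypothetical.

```python
import unicodedata

def same_after_nfc(a, b):
    """Compare two strings after Unicode NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Canonically equivalent sequences: NFC folds them together.
precomposed = "\u00F6"   # LATIN SMALL LETTER O WITH DIAERESIS
combining = "o\u0308"    # o followed by COMBINING DIAERESIS
print(same_after_nfc(precomposed, combining))

# Distinct letters with no canonical equivalence: NFC leaves them apart.
arabic_yeh = "\u064A"    # ARABIC LETTER YEH
farsi_yeh = "\u06CC"     # ARABIC LETTER FARSI YEH
print(same_after_nfc(arabic_yeh, farsi_yeh))
```

Under these assumptions, normalization alone resolves the first pair
but not the second, which is why the second kind of case needs separate
treatment.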

/Cary
