[vip] Suggested meta-questions to think about

Daniel Kalchev daniel at digsys.bg
Tue Jun 21 03:14:33 UTC 2011


Patrik,

You have a good summary of the terminology/topic definition problem we keep stumbling upon.

Here are my comments, biased towards Cyrillic and Bulgarian (I claim to have sufficient understanding of both):

On Jun 20, 2011, at 14:44 , Patrik Fältström wrote:

> A.1. Two characters in Unicode really are to be treated as being equivalent. I presume one could say that the Hangul SC/TC issues fall in that category.

This is a problem with how Unicode was defined. The initial definition of Unicode could be simplistically described as "here is a bunch of code tables that various companies, scripts, languages, countries have designed and use, let's stick them together in a common larger table and call it Unicode". More recently Unicode has improved, but not much for some "stable" scripts such as Cyrillic. It remains a collection of interlinked (but independent) code tables. Probably there is no other real solution anyway.

One example, related to Bulgarian/Cyrillic: as you remember, the Bulgarian IDN ccTLD application was refused on the basis of alleged display similarity with (as someone anonymous suggested) .br. This came as a shock to our community and experts, because there is really no similarity between those letters.

We did some research afterwards and found out that this really is a font confusion issue. In Bulgaria, the Cyrillic script is presented to children and students using a graphical representation that is quite different from the one found in some computer fonts. We therefore asked experts, and they identified a Unicode character that graphically represents the 'cyrillic ghe' properly. It is included in the Cyrillic script in Unicode 6.0 (we can discuss the technical details separately). Our experts say there are 11 characters in Bulgarian/Cyrillic that need fixing in Unicode.

However, the Unicode modification process is (very) slow and it is nearly unrealistic to expect that those two characters will be marked as 'equivalent' any time soon.

The font issue is very big and it relates to 'variants' very much. We need to address it somehow.
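
To illustrate why this is a font issue and not an encoding issue, here is a minimal Python sketch. It assumes the label in question is the two Cyrillic letters be and ghe (the message above does not spell the string out), and the helper name `describe` is purely illustrative. It only shows that the Cyrillic and Latin letters are distinct code points that no Unicode normalization form merges; it says nothing about how a given font draws them.

```python
import unicodedata

def describe(label):
    """Return (character, code point, Unicode name) for each character."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in label]

# Assumed labels: Cyrillic be + ghe, and Latin b + r.
cyrillic_label = "\u0431\u0433"
latin_label = "br"

cyrillic = describe(cyrillic_label)
latin = describe(latin_label)

for row in cyrillic + latin:
    print(*row)

# Compatibility normalization (NFKC) does not map one label onto the other,
# so any perceived similarity comes from glyph shapes, not from Unicode data.
normalized = unicodedata.normalize("NFKC", cyrillic_label)
print(normalized == latin_label)
```

Any similarity check would therefore have to be done against specific fonts or a confusables list, not against the code points themselves.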

> A.2. Two different spellings of the same word in the same script and same language, like color/colour.

This brings up the main issue I have with 'variants'. Again, my take is on Cyrillic. In my opinion, in Cyrillic the variants are label (word) based and not character based. In the Bulgarian language, there is a direct 1:1 relationship between how a word is pronounced and how it is spelled. The Cyrillic script was originally designed to have this attribute. Therefore, such cases are rare. Where they exist, they are well known.

I understand the perception for years has been that we look at character variants, but for many scripts these rarely have any meaning. Perhaps we should review this in depth for each individual script and possibly treat different scripts differently.

> A.3. Same word in the same language in two different scripts (bulgarian)

There is only one script for Bulgarian: Cyrillic. I would love to learn your source for this information.

There are however issues as you describe with Serbian, where both Latin and Cyrillic are official scripts and there are examples of distorted words written in both scripts (we are not looking into this), or the same word written in both scripts, which is very common. Probably with other languages/scripts as well.

> A.4. Same word in two different languages

I don't believe our small community can resolve this or even touch the subject below the surface. Especially, given the extremely short time frame.


> 
> And then there are many A.1.1, A.1.2, A.2.1 etc, and I did even hear today people say "two variants are two different accepted spellings of the same word that _sound_ the same". I do not even know where to put that.
> 

You do not speak Bulgarian, nor do you write in Cyrillic. If you did, this would make perfect sense to you.

I could imagine this issue is present in other non-Latin based scripts as well -- or even in some Latin based languages.


To sum it up, your questions make me think you too consider variants to be label (word) based.

There is one more case. In the Russian language, for example, there is the E character with two dots. With the mass usage of computers it was largely ignored, and now many write the words that originally contained this character with the "ordinary" E. Russian-speaking people can recognize these cases and read the "ordinary" E as they would read the E with dots, because the human brain actually reads the word, not the characters.

We should consider and probably document many cases of such word simplification that came with the mass introduction of computers in our life.

Now, there are Russian words that contain both the E with dots and the "normal" E. In these words, it is not appropriate to use these characters interchangeably, which is yet another proof, that variants are word based.
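
A small Python sketch of this point, under stated assumptions: the character-level variant table `CHAR_VARIANTS` and both helper functions are hypothetical, and the sample word is the Russian adverb spelled with an ordinary E, then shcha, then the E with dots. Swapping the two letters independently at every position generates labels that no Russian speaker would accept, while the word-level simplification yields only the one spelling people actually use.

```python
from itertools import product

YE = "\u0435"  # Cyrillic small letter ie, the "ordinary" E
YO = "\u0451"  # Cyrillic small letter io, the E with two dots

# Hypothetical character-level variant table: the two letters are
# treated as interchangeable wherever either one occurs.
CHAR_VARIANTS = {YE: YE + YO, YO: YE + YO}

def char_level_variants(label):
    """Expand a label by swapping every E / E-with-dots independently."""
    choices = [CHAR_VARIANTS.get(ch, ch) for ch in label]
    return {"".join(combo) for combo in product(*choices)}

def fold_yo(label):
    """Word-level simplification: write the E with dots as the ordinary E."""
    return label.replace(YO, YE)

word = YE + "\u0449" + YO          # contains both letters
generated = char_level_variants(word)
accepted = {word, fold_yo(word)}   # the full and the simplified spelling

print(sorted(generated))             # four candidate labels
print(sorted(generated - accepted))  # two of them are not real spellings
```

The folding goes in one direction only: the dotted letter may be simplified, but an ordinary E cannot be promoted to the dotted one without knowing the word.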

Finally, the E with dots is not going to go away, so we cannot just ignore the issue and suggest everyone should stop using it in this Internet era. My observation is that in recent years, as computers became more powerful, people started to make fewer sacrifices with their languages/scripts.

> B.1. Should an application with more TLDs than one be counted as one application if the TLDs in the application are variants of each other? And if so, should there be only one fee per application?
> B.2. Should two different variants be able to be managed by two different registries or not, and if not, what should happen with the variants? One primary and others like the bundling tactics in some TLDs (i.e. choice between "yes delegation" or "just block for other to register")?
> 

As far as I understood, our study groups are not to touch any policy issues. These questions are very interesting and important, and must be considered in detail, but probably outside the scope of this study. Especially considering the short time frame.

Another comment I have here is that the bias of our study seems to be towards TLD variants. It is my belief that our work will be extremely useful at any DNS level, especially when it comes to your next question. DNS is to be treated the same at each level.

Therefore, I believe we need to drop the "TLD" part of the terms and treat all labels generically -- any variant application at TLD level might be subject to additional policy if we do not have a good working technical design/implementation.

> And then there might be a technical question in there...
> C.1. Given two domain names are variants of each other, is there something that can be done in the DNS from a technical point of view to express that, or can we only do delegations?
> 

There are many things that can be done. Perhaps the most significant is to amend DNS to use Unicode. Or plug in IDNA (keeping the protocol simple and stable, as Unicode is far from simple and stable).

In my opinion, only a protocol-level technical solution makes sense in the long run for implementing variants. In any case, what a variant is, how to identify it, and how to keep the list of variants up to date (languages change) have to be worked out first.
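
For reference, this is roughly what the IDNA option looks like in practice: the DNS protocol keeps carrying plain ASCII labels, and the Unicode handling lives in the application. The sketch below uses the `idna` codec from the Python standard library, which implements the older IDNA 2003 rules, so treat it as an illustration and not as a statement about current registry practice. The sample label (Cyrillic be + ghe) is an assumption chosen for illustration.

```python
# Assumed sample label: Cyrillic be + ghe.
label = "\u0431\u0433"

# ToASCII: the Unicode label becomes an ASCII-compatible "xn--" label,
# which is what is actually stored and queried in the DNS.
ace = label.encode("idna")
print(ace)

# ToUnicode: the application converts the ASCII form back for display.
round_trip = ace.decode("idna")
print(round_trip == label)
```

Note that nothing in this encoding expresses that two labels are variants of each other; each variant simply becomes a separate, unrelated ASCII label.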

Daniel Kalchev
Register.BG



