[ChineseGP] proposal to eliminate the divergence between us

Thu Jan 8 11:01:17 UTC 2015

Dear Chris,

If I understand this correctly, here is an example from Arabic script which could be relevant:

U+06A9 (ک) and U+06AA (ڪ) are distinct letters (not variants) in the Sindhi language (see http://www.omniglot.com/writing/sindhi.htm).

However, these code points are considered variants of U+0643 (ك) by other language communities (e.g. the Arabic language community).

Therefore, the ArabicGP is treating them as variants (see Table 3 in https://community.icann.org/download/attachments/47253587/Arabic%20Variant%20Analysis%20for%20LGR%200.8.pdf?version=2&modificationDate=1419700233000&api=v2).

Regards,
Sarmad

From: Dillon, Chris [mailto:c.dillon at ucl.ac.uk] 
Sent: Monday, January 05, 2015 3:09 PM
To: Wang Wei; yoshiro.yoneya at jprs.co.jp; hotta at jprs.co.jp
Cc: ChineseGP at icann.org; Sarmad Hussain
Subject: RE: [ChineseGP] proposal to eliminate the divergence between us

Dear colleagues,

新年快樂

明けましておめでとうございます

새 해 福 많이 바드세요

Or: Happy New Year!

I am wondering whether there may be a way of making the proposal below work, without the JGP’s having to define variant sets and mappings (well, only a small number in scenario 2).

Scripts used by many languages, for example Cyrillic and Arabic (I’ll leave out Latin as it is used by so many languages it may cause confusion) may be in a situation where some implementations of the script define variants (cf. SC and TC) and some don’t (cf. Japanese). One possible approach could be that languages which don’t define variants inherit the variant sets and mappings from the languages using the script that do define variants. I’m copying Sarmad in on this email, as this is a phenomenon which may have occurred in the work of one of the other GPs.

I reckon this approach would work for cases 1, 3 and 4 below. (Actually 5 too as long as there are no examples of it…)

That only leaves us with cases in scenario 2 such as 栞 (a variant which only exists in the Japanese table) for which a mapping to 刊 and 刋 would need to be created. For all other cases, the SC/TC mappings would be inherited.
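
One way to picture the inheritance idea is as a merge of variant sets. The Python sketch below is only an illustration: the function name `inherit_variant_sets`, its parameters and the tiny sample data are assumptions made for this example and are not part of any GP's tooling. The 栞 mapping is the same assumed example used for scenario 2 below.

```python
def inherit_variant_sets(repertoire, donor_sets, extra_sets=()):
    """Give every code point in `repertoire` a variant set.

    donor_sets: disjoint variant sets defined by the tables that have variants
                (here standing in for the SC/TC sets).
    extra_sets: additional mappings the inheriting table must define itself;
                each one is merged with every donor set it overlaps.
    """
    merged = [set(s) for s in donor_sets]
    for extra in extra_sets:
        extra = set(extra)
        overlapping = [s for s in merged if s & extra]
        for s in overlapping:
            extra |= s          # absorb the overlapping donor set
            merged.remove(s)
        merged.append(extra)
    lookup = {cp: frozenset(s) for s in merged for cp in s}
    # Code points found in no set keep a singleton set (no variants).
    return {cp: lookup.get(cp, frozenset({cp})) for cp in repertoire}


# Hypothetical sample data based on the examples quoted in this thread.
japanese_repertoire = {"愛", "刊", "刋", "栞", "辻"}
sc_tc_sets = [{"愛", "爱"}, {"刊", "刋"}]
japanese_only_mappings = [{"栞", "刊", "刋"}]   # the assumed scenario 2 case

result = inherit_variant_sets(japanese_repertoire, sc_tc_sets, japanese_only_mappings)
```

In this sketch only the 栞 mapping has to be supplied by the inheriting table; every other set comes from the donor sets unchanged.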

Regards,

Chris.

--

Research Associate in Linguistic Computing, Centre for Digital Humanities, UCL, Gower St, London WC1E 6BT Tel +44 20 7679 1599 (int 31599) www.ucl.ac.uk/dis/people/chrisdillon

From: chinesegp-bounces at icann.org [mailto:chinesegp-bounces at icann.org] On Behalf Of Wang Wei
Sent: 29 December 2014 07:54
To: yoshiro.yoneya at jprs.co.jp; hotta at jprs.co.jp
Cc: ChineseGP at icann.org
Subject: [ChineseGP] proposal to eliminate the divergence between us

Dear Yoneya San and Hotta San

Please kindly accept my belated but best wishes for Christmas and the New Year.

Recently, we carried out the following work, which I outline here for your comments:

Every Hanzi in the CGP repertoire belongs to a variant mapping set under the current rules borrowed from CDNC (the minimum set size is 1, which means that the code point has no variants). Every Kanji code point in the JGP repertoire may likewise belong to a variant mapping set (we acknowledge that there are no variants in JPRS practice so far, but we assume that the JGP repertoire will have some kind of variant mapping definition).

All the variant mapping sets can be divided into FIVE scenarios:

1)      the variant mapping set in JPRS ⊂ the variant mapping set in CDNC

In CGP 

愛 611B (0);爱(86),愛(886);愛(0),爱(0);

爱 7231 (0);爱(86),愛(886);愛(0),爱(0);

 In JGP 

愛611B(2,3);611B(2,3);

2)      the variant mapping set in CDNC ⊂ the variant mapping set in JPRS

In CGP: 

刊520A (0);刊520A(86),刊520A(886);刊(0),刋(0); 

刋520B (0);刊520A(86),刊520A(886);刊(0),刋(0);

In JGP: 

刊 520A(2,3);520A(2,3);

刋 520B(2,3);520B(2,3);

栞 681E(2,3);681E(2,3); 

*: this example is ONLY an assumption

3)      the variant mapping set in CDNC = the variant mapping set in JPRS

In CGP 

一4E00 (0); 一4E00(86),一4E00(886); 一(0),壱(0),壹(0),弌(0); 

壱58F1 (0); 壹58F9(86),壹58F9(886); 一(0),壱(0),壹(0),弌(0); 

壹58F9 (0); 壹58F9(86),壹58F9(886); 一(0),壱(0),壹(0),弌(0); 

弌5F0C (0); 一4E00(86),一4E00(886); 一(0),壱(0),壹(0),弌(0); 

In JGP:

一 4E00(2,3);4E00(2,3);

壱 58F1(2,3);58F1(2,3);

壹 58F9(2,3);58F9(2,3); 

弌 5F0C(2,3);5F0C(2,3);

*: this example is ONLY an assumption

4)      the variant mapping set in CDNC ∩ the variant mapping set in JPRS = ∅

The code point exists ONLY in the JGP table

辻8FBB(2,3);8FBB(2,3); 

5)      the variant mapping set in CDNC ∩ the variant mapping set in JPRS ≠ ∅

and

the variant mapping set in CDNC ≠ the variant mapping set in JPRS

No specific example so far
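
The five scenarios can be read as plain set relations between the two variant mapping sets. The following Python sketch is a minimal illustration under that reading; the function name `classify` and the use of hexadecimal code point strings are assumptions made for this example. Scenario 5 is taken here to mean "overlapping, but neither set contains the other".

```python
def classify(cdnc, jprs):
    """Return the scenario number (1-5) for a pair of variant mapping sets."""
    if cdnc == jprs:
        return 3                 # identical sets
    if not (cdnc & jprs):
        return 4                 # no code point in common
    if jprs < cdnc:
        return 1                 # JPRS set is a proper subset of the CDNC set
    if cdnc < jprs:
        return 2                 # CDNC set is a proper subset of the JPRS set
    return 5                     # partial overlap only


# Examples taken from the listing above (scenario 2 is an assumed example).
scenario_1 = classify({"611B", "7231"}, {"611B"})
scenario_2 = classify({"520A", "520B"}, {"520A", "520B", "681E"})
scenario_4 = classify(set(), {"8FBB"})   # 辻 is absent from the CGP table
```

The disjointness test comes before the subset tests because an empty set is a subset of every set, so a code point missing from one table would otherwise be reported as scenario 1 or 2.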


In the past, we discussed the variant problem many times, but mainly in terms of two types: allocatable and blocked. However, we think that another type in the XML draft, "out-of-repertoire", which was recommended in Asmus' mail, may help to resolve the conflict between JGP and CGP.

The basic principle is "any variant label with a code point out-of-repertoire is invalid". We think this “out-of-repertoire” type and consequent “invalid” action will tremendously decrease the complexity of variant mapping coordination between us.

For scenario 1:

In CGP 

愛 611B (0);爱(86),愛(886);愛(0),爱(0);

爱 7231 (0);爱(86),愛(886);愛(0),爱(0);

In JGP 

愛 611B(2,3);611B(2,3);

JGP takes 爱 7231 into its variant mapping set, but marks it as “out-of-repertoire” and applies the “invalid” action in the WLG process, which means that 爱 7231 will never be generated in any label.

JGP LGR:

<language>und-Jpan</language>

<char cp="611B" tag="sc:Hani">

    <var cp="611B" type="alloc" comment="identity" /> 

    <var cp="7231" type="out-of-repertoire-var" /> <!--Hans, JGP should exist.-->

</char>

WLE rules:

<action disp="invalid" any-variant="out-of-repertoire-var" 

comment="any variant label with a code point out of repertoire is invalid"/>

<action disp="allocatable" all-variants="alloc"  />
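
To make the effect of the JGP table and rules above concrete, here is a small Python sketch that enumerates the variant labels of a label and assigns each a disposition. It is a simplified model, not an LGR implementation: the names `JGP_VARIANTS` and `jgp_dispositions` are hypothetical, and treating any code point without a table entry as an identity mapping of type "alloc" is an assumption of this example.

```python
from itertools import product

# Variant table modelled on the JGP entry for 611B shown above.
JGP_VARIANTS = {
    "愛": [("愛", "alloc"), ("爱", "out-of-repertoire-var")],
}


def jgp_dispositions(label):
    """Map every variant label of `label` to its disposition."""
    per_position = [JGP_VARIANTS.get(ch, [(ch, "alloc")]) for ch in label]
    result = {}
    for combo in product(*per_position):
        variant = "".join(cp for cp, _ in combo)
        types = {t for _, t in combo}
        if "out-of-repertoire-var" in types:
            # Any variant label with a code point out of repertoire is invalid.
            result[variant] = "invalid"
        elif types == {"alloc"}:
            result[variant] = "allocatable"
        else:
            raise ValueError(f"no rule matches variant types {types}")
    return result


single = jgp_dispositions("愛")
double = jgp_dispositions("愛愛")
```

For the one-character label, only 愛 itself comes out as allocatable, and the label containing 爱 is invalid, which matches the intent that 爱 7231 never appears in a generated label.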

CGP LGR:

<language>und-Hani</language>

<char cp="611B" tag="sc:Hani">

    <var cp="611B" type="trad" comment="identity" /> <!-- Jpan -->

    <var cp="7231" type="simp" />

</char>

<char cp="7231" tag="sc:Hani">

    <var cp="611B" type="trad" /> <!-- Jpan -->

    <var cp="7231" type="simp" comment="identity" />

</char>

WLE rules:

         <action disp="blocked" any-variant="block" />

         <action disp="allocatable" only-variants="simp both" />

         <action disp="allocatable" only-variants="trad both" />

         <action disp="blocked" any-variant="simp trad" />

         <action disp="allocatable" comment="catch-all" />

For scenario 2:

In CGP: 

刊520A (0);刊520A(86),刊520A(886);刊(0),刋(0); 

刋520B (0);刊520A(86),刊520A(886);刊(0),刋(0);

In JGP: 

刊 520A(2,3);520A(2,3);

刋 520B(2,3);520B(2,3);

栞 681E(2,3);681E(2,3); 

Now it is CGP’s turn to take 栞 681E into its variant mapping set, but mark it as “out-of-repertoire” and apply the “invalid” action in the WLG process, which means that 栞 681E will never be generated in any label.

CGP LGR

<language>und-Hani</language>

<char cp="520A" tag="sc:Hani">

    <var cp="520A" type="both" comment="identity" />

    <var cp="520B" type="block" />

    <var cp="681E" type="out-of-repertoire-var" /> <!-- Jpan -->

</char>

<char cp="520B" tag="sc:Hani">

    <var cp="520A" type="both" />

    <var cp="520B" type="block" comment="identity" />

    <var cp="681E" type="out-of-repertoire-var" /> <!-- Jpan -->

</char>

<char cp="681E" tag="sc:Hani"> <!-- Jpan -->

    <var cp="520A" type="block" />

    <var cp="520B" type="block" />

    <var cp="681E" type="out-of-repertoire-var" comment="identity"/> 

</char>

WLE rules:

         <action disp="invalid" any-variant="out-of-repertoire-var" 

comment="any variant label with a code point out of repertoire is invalid"/>

         <action disp="blocked" any-variant="block" />

         <action disp="allocatable" only-variants="simp both" />

         <action disp="allocatable" only-variants="trad both" />

         <action disp="blocked" any-variant="simp trad" />

         <action disp="allocatable" comment="catch-all" />

JGP LGR:

<language>und-Jpan</language>

<char cp="520A" tag="sc:Hani">

    <var cp="520A" type="alloc" comment="identity" />

    <var cp="520B" type="block" />

    <var cp="681E" type="block" />

</char>

<char cp="520B" tag="sc:Hani">

    <var cp="520A" type="block" />

    <var cp="520B" type="alloc" comment="identity" />

    <var cp="681E" type="block" /> 

</char>

<char cp="681E" tag="sc:Hani">

    <var cp="520A" type="block" />

    <var cp="520B" type="block" />

    <var cp="681E" type="alloc" comment="identity"/> 

</char>

WLE rules:

 <action disp="blocked" any-variant="block" />

 <action disp="allocatable" all-variants="alloc"  />
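
The CGP rules for scenario 2 above depend on the order of the actions: the first action that matches decides the disposition. The Python sketch below models that ordering for the 520A entry. It is a simplified reading and not an LGR implementation: the names `CGP_RULES`, `evaluate` and `dispositions` are hypothetical, and "only-variants" is approximated here as "every position has one of the listed types".

```python
from itertools import product

# (disposition, predicate, variant types), evaluated from top to bottom.
CGP_RULES = [
    ("invalid", "any", {"out-of-repertoire-var"}),
    ("blocked", "any", {"block"}),
    ("allocatable", "only", {"simp", "both"}),
    ("allocatable", "only", {"trad", "both"}),
    ("blocked", "any", {"simp", "trad"}),
    ("allocatable", "catch-all", set()),
]

# Variant table modelled on the CGP entry for 520A shown above.
CGP_VARIANTS = {
    "刊": [("刊", "both"), ("刋", "block"), ("栞", "out-of-repertoire-var")],
}


def evaluate(types, rules):
    """Return the disposition of the first rule that matches `types`."""
    for disp, kind, wanted in rules:
        if kind == "catch-all":
            return disp
        if kind == "any" and any(t in wanted for t in types):
            return disp
        if kind == "only" and all(t in wanted for t in types):
            return disp
    raise ValueError("no rule matched")


def dispositions(label, table, rules):
    """Map every variant label of `label` to its disposition.

    Every code point of `label` must have an entry in `table`.
    """
    per_position = [table[ch] for ch in label]
    result = {}
    for combo in product(*per_position):
        variant = "".join(cp for cp, _ in combo)
        result[variant] = evaluate([t for _, t in combo], rules)
    return result


one_char = dispositions("刊", CGP_VARIANTS, CGP_RULES)
two_char = dispositions("刊刊", CGP_VARIANTS, CGP_RULES)
```

Because the "invalid" action is listed first, a variant label containing 栞 is invalid even when it also contains the blocked 刋.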

For Scenario 3:

In CGP 

一4E00 (0); 一4E00(86),一4E00(886); 一(0),壱(0),壹(0),弌(0); 

壱58F1 (0); 壹58F9(86),壹58F9(886); 一(0),壱(0),壹(0),弌(0); 

壹58F9 (0); 壹58F9(86),壹58F9(886); 一(0),壱(0),壹(0),弌(0); 

弌5F0C (0); 一4E00(86),一4E00(886); 一(0),壱(0),壹(0),弌(0); 

In JGP:

一 4E00(2,3);4E00(2,3); 

壱 58F1(2,3);58F1(2,3);

壹 58F9(2,3);58F9(2,3); 

弌 5F0C(2,3);5F0C(2,3); 

JGP needs to create its own mapping set including all four code points above, together with the corresponding rules; otherwise, this case will fall into scenario 1.

For Scenario 4:

For example, a UNIQUE code point that exists ONLY in the JGP table:

 辻8FBB(2,3);8FBB(2,3); 

CGP will probably not include this code point in its repertoire.

No extra work or rules are needed.

For Scenario 5:

Actually, we have not found any code points that fit this scenario.

But the solution would follow scenario 1 or 2, for example:

For JGP, “C” will be included but marked as “out-of-repertoire”

For CGP, “A” will be included but marked as “out-of-repertoire”

In conclusion, the “out-of-repertoire” type and the “invalid” action provide us with a conservative and simple way to reach consensus on the variant mappings and rules.

According to our analysis of the CGP table and the JPRS table:

4983 code points fit Scenario 1

840 code points fit Scenario 3

170 code points fit Scenario 4

Since JGP has not yet decided whether variant relationships exist in the JGP repertoire, we do not have figures for scenario 3 and scenario 5. However, we believe that the above solution can also be applied to scenarios 3 and 5, no matter what kind of variant mapping JGP produces.

The above is our proposal for settling the divergence at minimum cost for both of us.

What do you think about it? Looking forward to your reply.

Best Regards,

Wei Wang
