[Neobrahmigp] [Ext] Re: FW: Corpus review for punjabi

Sat May 12 16:59:57 UTC 2018

Thank you, Dr. Lehal, for such an elaborate analysis and feedback.

The analysis shows that the label-level rules proposed for the Gurmukhi script are working as intended.

We will pass this feedback to the integration panel.

Regards,
Sarmad

From: Dr. G. S. Lehal (ਗੁਰਪ੍ਰੀਤ ਸਿੰਘ ਲਹਿਲ) [mailto:gslehal at gmail.com] 
Sent: Saturday, May 12, 2018 1:18 AM
To: Sarmad Hussain <sarmad.hussain at icann.org>; Dr. AJAY D A T A <ajay at data.in>
Cc: neo brahmi <neobrahmigp at icann.org>; Pitinan Kooarmornpatana <pitinan.koo at icann.org>
Subject: [Ext] Re: FW: Corpus review for punjabi

Hello all,

I had a detailed look at the invalid labels for the rule (Follows-only-C-N-or-specific-V-or-M-and-precedes-only-C3-or-specific-CN). This rule corresponds to wrong usage of the Gurmukhi addak ੱ (0A71). The main reasons I observed for these invalid labels are:

1.     Typing mistake

2.     Typists not being sure where to use addak and where not to use it according to Gurmukhi script rules. Many Punjabi users are in fact confused about this, which results in wrong labels being generated.

The main typing mistakes I observed in the corpus are:

1.     Two consecutive addaks (not allowed in Gurmukhi script)

Examples: ਉੱੱਚ (0A09 0A71 0A71 0A1A), ਉਪਲੱੱਬਧ (0A09 0A2A 0A32 0A71 0A71 0A2C 0A27) and ਅੱੱਡਾ (0A05 0A71 0A71 0A21 0A3E)

2.     Swapping addak with the preceding matra. Examples:

ਚੱੈਸ‎ (0A1A 0A71 0A48 0A38)

‎ਪ੍ਰੱੈਸ‎ (0A2A 0A4D 0A30 0A71 0A48 0A38)

3.     Putting addak at the end of a word. This does not make sense, as addak is used to geminate the sound of the following consonant. Two such examples in the corpus are ਉੱ (0A09 0A71) and ਉਪਲੱ (0A09 0A2A 0A32 0A71).
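
As a rough illustration, the three typing mistakes listed above can be detected with a simple scan over the code points. This is only a sketch in Python and not part of the LGR tooling; the function name and the matra set are assumptions made for the example.

```python
ADDAK = "\u0A71"

# Dependent vowel signs (matras) of the Gurmukhi block; assumed membership.
MATRAS = set("\u0A3E\u0A3F\u0A40\u0A41\u0A42\u0A47\u0A48\u0A4B\u0A4C")


def addak_typing_mistakes(label: str) -> list[str]:
    """Return the typing-mistake patterns found after each addak in a label."""
    found = []
    for i, ch in enumerate(label):
        if ch != ADDAK:
            continue
        # Character right after this addak, or None at the end of the label.
        nxt = label[i + 1] if i + 1 < len(label) else None
        if nxt is None:
            found.append("addak at end")
        elif nxt == ADDAK:
            found.append("double addak")
        elif nxt in MATRAS:
            found.append("addak before matra")
    return found


if __name__ == "__main__":
    # Code point sequences taken from the examples above.
    for sample in ("\u0A09\u0A71\u0A71\u0A1A", "\u0A1A\u0A71\u0A48\u0A38", "\u0A09\u0A71"):
        print([f"{ord(c):04X}" for c in sample], addak_typing_mistakes(sample))
```

A label with a single addak followed by a consonant passes this scan; whether that consonant may actually be geminated is a separate question that the scan does not address.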

The mistakes related to the usage of addak that I observed in the corpus are:

1.     Wrong usage of addak. Many typists are not aware where to put addak and where not to put it. They have put addak after any short vowel without knowing whether the following consonant can be geminated. Some such examples found in the corpus are ਉਜੱੜ (0A09 0A1C 0A71 0A5C), ਉਡੱਣ (0A09 0A21 0A71 0A23) and ਉਪਲੱਹਧ (0A09 0A2A 0A32 0A71 0A39 0A27).

2.     Addak followed by a long vowel. According to Gurmukhi rules, addak has to be followed by one of a specific set of consonants only, and not by any vowel. But there are a few instances where it was followed by the long vowel ਈ (0A08) or ਆ (0A06), making the label invalid. Some examples are: ਉਸਰੱਈਏ (0A09 0A38 0A30 0A71 0A08 0A0F), ਉਰੱਈ (0A09 0A30 0A71 0A08) and ਅਤਿੱਆਚਾਰ (0A05 0A24 0A3F 0A71 0A06 0A1A 0A3E 0A30).

3.     Writing addak after a long vowel. Addak is not allowed after most of the long vowels, but many typists who are not fluent in Punjabi place it after such vowels, which generates invalid labels. Two examples from the corpus are: ਊੱਠਣੀ (0A0A 0A71 0A20 0A23 0A40) and ਓੱਪੋ (0A13 0A71 0A2A 0A4B).

4.     Addak written after matra kanna ਾ (0A3E). This is invalid according to Gurmukhi rules, but a similar pattern exists in Devanagari for writing English words. So a person fluent in Hindi may use this combination when writing in Gurmukhi. I found many English words in the corpus written this way in Gurmukhi. A few examples are: ਅਨਲਾੱਕ (0A05 0A28 0A32 0A3E 0A71 0A15), ਅਲਾੱਟਮੈਂਟ (0A05 0A32 0A3E 0A71 0A1F 0A2E 0A48 0A02 0A1F) and ਕਰਾੱਸ (0A15 0A30 0A3E 0A71 0A38).
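
The usage mistakes above could be flagged along the following lines. This is a minimal Python sketch under stated assumptions: the function name is hypothetical, and the two `*_SEEN` sets contain only the characters that occur in the corpus examples quoted above. They are not the consonant or long-vowel classes of the LGR proposal, which this message does not spell out.

```python
ADDAK = "\u0A71"
KANNA = "\u0A3E"  # matra aa

# Independent vowels of the Gurmukhi block (assumed membership).
VOWELS = set("\u0A05\u0A06\u0A07\u0A08\u0A09\u0A0A\u0A0F\u0A10\u0A13\u0A14")

# Illustrative only: just the characters that occur in the examples above,
# not the classes defined in the LGR proposal.
LONG_VOWELS_SEEN = set("\u0A0A\u0A13")
NOT_GEMINATED_SEEN = set("\u0A5C\u0A23\u0A39")


def addak_usage_problems(label):
    """List the usage problems found around each addak in a label."""
    problems = []
    for i, ch in enumerate(label):
        if ch != ADDAK:
            continue
        prev = label[i - 1] if i > 0 else None
        nxt = label[i + 1] if i + 1 < len(label) else None
        if nxt in NOT_GEMINATED_SEEN:
            problems.append("consonant after addak is not geminated")
        if nxt in VOWELS:
            problems.append("vowel after addak")
        if prev in LONG_VOWELS_SEEN:
            problems.append("addak after long vowel")
        if prev == KANNA:
            problems.append("addak after kanna")
    return problems


if __name__ == "__main__":
    # 0A15 0A30 0A3E 0A71 0A38, one of the kanna examples above.
    print(addak_usage_problems("\u0A15\u0A30\u0A3E\u0A71\u0A38"))
```

The checks are independent of each other, so a single addak can be reported under more than one heading.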

‎All this has resulted in the high number of invalid labels being generated containing addak.

Coming to the invalid labels for the rule (Follows-only-C-or-N), we found that in nearly 70% of the cases the errors are due to matras getting merged with vowels. The matra ੁ (0A41) was frequently merged with the vowel ਉ (0A09), while the matra ੂ (0A42) was merged with the vowels ਉ (0A09) or ਊ (0A0A). The matra ੇ (0A47) was merged with the vowels ਏ (0A0F), ਉ (0A09) or ਊ (0A0A). Interestingly, the visual shape of the word does not change when these matras are merged with these specific vowels (Table 1). Fortunately, the WLE rules catch these potential candidates for phishing attacks, since the words in the first column look exactly the same as the words in the corresponding second column. So an additional advantage of these WLE rules is that they capture possible phishing attacks.

Table 1: Words with merged matras

Word with merged matra | Word without merged matra
ਉੁਸ (0A09 0A41 0A38) | ਉਸ (0A09 0A38)
ਊੂਧਵ (0A0A 0A42 0A27 0A35) | ਊਧਵ (0A0A 0A27 0A35)
ਤੇਂਦੂਏੇ (0A24 0A47 0A02 0A26 0A42 0A0F 0A47) | ਤੇਂਦੂਏ (0A24 0A47 0A02 0A26 0A42 0A0F)
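
A minimal Python sketch of how the merged forms in Table 1 map back to their visual twins. The pair list contains only the vowel and matra combinations reported above; the function names are hypothetical, and this is not how the WLE rules themselves are implemented.

```python
# (vowel, matra) pairs reported above as leaving the word shape unchanged.
MERGED_PAIRS = {
    ("\u0A09", "\u0A41"),
    ("\u0A09", "\u0A42"),
    ("\u0A0A", "\u0A42"),
    ("\u0A0F", "\u0A47"),
    ("\u0A09", "\u0A47"),
    ("\u0A0A", "\u0A47"),
}


def strip_merged_matras(label):
    """Drop a matra when it directly follows a vowel it merges with."""
    out = []
    for ch in label:
        if out and (out[-1], ch) in MERGED_PAIRS:
            continue  # the matra adds nothing visible, so skip it
        out.append(ch)
    return "".join(out)


def has_merged_matra(label):
    """True if the label contains at least one of the merged pairs."""
    return strip_merged_matras(label) != label


if __name__ == "__main__":
    merged = "\u0A09\u0A41\u0A38"  # first row of Table 1
    clean = strip_merged_matras(merged)
    print([f"{ord(c):04X}" for c in clean], has_merged_matra(merged))
```

Two labels that reduce to the same stripped form are candidates for visual confusion, which is the phishing concern raised above.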

Another issue I came across was a new vowel+matra combination, ਅੋ = ਅ (0A05) + ੋ (0A4B). This is a totally illegal combination, but surprisingly there were many words containing it. Examples: ਮਾਅੋ (0A2E 0A3E 0A05 0A4B), ਦਿਅੋਗੋ (0A26 0A3F 0A05 0A4B 0A17 0A4B) and ਪਾਅੋਲੋ (0A2A 0A3E 0A05 0A4B 0A32 0A4B). In real life no one uses this combination.

In fact, many of the invalid labels are very rarely generated in real life, and it is surprising to see so many such combinations present in the corpus.

Thanks

On Tue, May 8, 2018 at 11:31 PM, Sarmad Hussain <sarmad.hussain at icann.org> wrote:

Dear Dr. Lehal, All,

Thank you for sharing the updated LGR proposal for Gurmukhi script.  The Integration Panel (IP) is currently reviewing it and developing the feedback document.

In the meantime, they have run a corpus of Punjabi in Gurmukhi script with the test results attached and summarized below.  In the summary, the IP has identified some cases which show invalid labels with a slightly high percentage (in red below).  You can review the actual labels in the data file attached, which is marked up accordingly.

The IP would like to share this data and the summary below with the NBGP for the GP to reconfirm that the failing labels should actually fail - and it is not the case that the indicated rules are too restrictive.  

We aim to share the IP feedback document next week.  Please let us know if you have any questions.

Regards,
Sarmad

=============

Corpus: https://github.com/unicode-org/unilex/tree/master/data/frequency

Full Test results attached.

A./

SUMMARY

    Total Labels processed: 171388 of which
         valid labels:   163289
         invalid labels: 7391
         skipped labels: 708 of which
            duplicate labels:      21
            broken labels:         11   <-- rejected by IDN library as not NFC or otherwise malformed
            contain join controls: 287  <-- are these stylistic or orthographic?
            start w/ wrong script: 389  (contamination)

Number of invalid labels by reason:
   4742 instances of not in repertoire
   173 instances of out-of-repertoire variant
   167 instances of invalid context (Follows-only-specific-V-or-M)    0.1%
   238 instances of invalid context (Follows-only-C-or-N-and-precedes-only-C2)    0.15%
   285 instances of invalid context (Follows-only-C-N-or-specific-V-or-M)    0.17%
   61 instances of invalid context (Follows-only-C1)
   833 instances of invalid context (Follows-only-C-or-N)    0.5%
   892 instances of invalid context (Follows-only-C-N-or-specific-V-or-M-and-precedes-only-C3-or-specific-CN)    0.6%

** rough indication of percentage; higher percentage failures may indicate either that certain typos are common or that
** a rule is too restrictive. The following example shows some of the contexts detected for one of the rules - for more detail
** and actual labels see attached.

  Contexts not matching rule "Follows-only-C-or-N":
    [:Bindi:]  ⚓=[:Matra:]
    [:Matra:]  ⚓=[:Matra:]
    [:Tippi:]  ⚓=[:Matra:]
    [:Vowel:]  ⚓=[:Matra:]
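
For reference, the summary figures above can be cross-checked with a few lines of Python. This sketch assumes the percentages are taken relative to the total number of labels processed, which the report does not state explicitly; the variable and function names are made up for the example.

```python
TOTAL_PROCESSED = 171388
VALID, INVALID, SKIPPED = 163289, 7391, 708

skipped_by_kind = {
    "duplicate labels": 21,
    "broken labels": 11,
    "contain join controls": 287,
    "start w/ wrong script": 389,
}

invalid_by_reason = {
    "not in repertoire": 4742,
    "out-of-repertoire variant": 173,
    "Follows-only-specific-V-or-M": 167,
    "Follows-only-C-or-N-and-precedes-only-C2": 238,
    "Follows-only-C-N-or-specific-V-or-M": 285,
    "Follows-only-C1": 61,
    "Follows-only-C-or-N": 833,
    "Follows-only-C-N-or-specific-V-or-M-and-precedes-only-C3-or-specific-CN": 892,
}


def percent_of_total(count, total=TOTAL_PROCESSED):
    """Share of all processed labels, in percent, rounded to two decimals."""
    return round(100 * count / total, 2)


def totals_are_consistent():
    """Check that each breakdown adds up to the figure it belongs to."""
    return (
        VALID + INVALID + SKIPPED == TOTAL_PROCESSED
        and sum(skipped_by_kind.values()) == SKIPPED
        and sum(invalid_by_reason.values()) == INVALID
    )


if __name__ == "__main__":
    print("totals consistent:", totals_are_consistent())
    for reason, count in invalid_by_reason.items():
        print(f"{count:5d}  {percent_of_total(count):5.2f}%  {reason}")
```

Under that assumption the three breakdowns add up, and 892 of 171388 comes to about 0.52%, which fits the report's note that its percentages are only a rough indication.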

Test Label Coverage:
Repertoire (code points):  56 of  56. {0A02 0A05-0A0A 0A0F-0A10 0A13-0A28 0A2A-0A30 0A32 0A35 0A38-0A39 0A3C 0A3E-0A42 0A47-0A48 0A4B-...}
Repertoire not covered:   0 of  56. {}
Out of Repertoire:         80. [{0027 002E 0030-003A 0061-0062 0064-0065 0067 0069-006A 006C 0070 0073 0075 0078 00E0 00E2 00ED-00EE 0901-0902 0906-0909 090F 0913 0915-0918 091A-091D 091F-0924 0926-0928 092A 092C-0930 0932 0935-0939 093C 093E-0942 0947-0948 094B 094D 0A6B 0A72-0A74}]  <-- excluded code points highlighted

Tag Values:                12 of  12.
    Addak
    Bindi
    C1
    C2
    Consonant
    M1
    Matra
    Nukta
    Tippi
    V1
    Virama
    Vowel
Named Classes:             13 of  13.
    A
    B
    C
    C1
    C2
    C3
    M
    M1
    M2
    N
    V
    V1
    V2

Context Rules matched:      6 of   6.

    Follows-only-C-or-N-and-precedes-only-C2
    Follows-only-C-or-N
    Follows-only-specific-V-or-M
    Follows-only-C-N-or-specific-V-or-M-and-precedes-only-C3-or-specific-CN
    Follows-only-C1
    Follows-only-C-N-or-specific-V-or-M

Context Rules failed:       6 of   6.
    Follows-only-C-N-or-specific-V-or-M-and-precedes-only-C3-or-specific-CN
    Follows-only-specific-V-or-M
    Follows-only-C-or-N
    Follows-only-C-or-N-and-precedes-only-C2
    Follows-only-C-N-or-specific-V-or-M
    Follows-only-C1

When Rules defined: (required context)
    Follows-only-specific-V-or-M
    Follows-only-C1
    Follows-only-C-or-N
    Follows-only-C-or-N-and-precedes-only-C2
    Follows-only-C-N-or-specific-V-or-M
    Follows-only-C-N-or-specific-V-or-M-and-precedes-only-C3-or-specific-CN

Not-When Rules defined: (prohibited context)
     (none)

-- 

Dr. Gurpreet Singh Lehal,
Professor, Department of Computer Science

Dean, Faculty of Computing Sciences

Director,  Research Centre for Punjabi Language Technology,
Punjabi University, Patiala.
India-147002

https://en.wikipedia.org/wiki/Gurpreet_Singh_Lehal

Phone : +91-9815473767 (M)
url : www.learnpunjabi.org
