[Neobrahmigp] [Ext] Re: FW: Corpus review for punjabi

Dr. G. S. Lehal (ਗੁਰਪ੍ਰੀਤ ਸਿੰਘ ਲਹਿਲ) gslehal at gmail.com
Sat May 12 17:48:07 UTC 2018


You are most welcome, Dr. Sarmad.

On Sat, May 12, 2018 at 10:29 PM, Sarmad Hussain <sarmad.hussain at icann.org>
wrote:

> Thank you Dr. Lehal for such an elaborate analysis and feedback.
>
>
>
> The analysis shows that the label level rules proposed for the Gurmukhi
> script are working as intended.
>
>
>
> We will pass this feedback to the integration panel.
>
>
>
> Regards,
> Sarmad
>
>
>
> *From:* Dr. G. S. Lehal (ਗੁਰਪ੍ਰੀਤ ਸਿੰਘ ਲਹਿਲ) [mailto:gslehal at gmail.com]
> *Sent:* Saturday, May 12, 2018 1:18 AM
> *To:* Sarmad Hussain <sarmad.hussain at icann.org>; Dr. AJAY D A T A <
> ajay at data.in>
> *Cc:* neo brahmi <neobrahmigp at icann.org>; Pitinan Kooarmornpatana <
> pitinan.koo at icann.org>
> *Subject:* [Ext] Re: FW: Corpus review for punjabi
>
>
>
> Hello all,
>
> I had a detailed look at the invalid labels for the rule
> (Follows-only-C-N-or-specific-V-or-M-and-precedes-only-C3-or-specific-CN).
> This rule catches wrong usage of the Gurmukhi addak ੱ (0A71). The main
> reasons I observed for these invalid labels are:
>
> 1.     Typing mistake
>
> 2.     The typist is not sure where addak may and may not be used
> according to Gurmukhi script rules. Many Punjabi users are confused
> about this, which results in wrong labels being generated.
>
> The main typing mistakes I observed in the corpus are:
>
> 1.     Two consecutive addaks (not allowed in Gurmukhi script)
>
> Example ਉੱੱਚ‎ (0A09 0A71 0A71 0A1A), ‎ਉਪਲੱੱਬਧ‎ (0A09 0A2A 0A32 0A71 0A71
> 0A2C 0A27) ‎ਅੱੱਡਾ‎ (0A05 0A71 0A71 0A21 0A3E)
>
> 2.     Swapping addak with preceding matra. Examples
>
> ਚੱੈਸ‎ (0A1A 0A71 0A48 0A38)
>
> ‎ਪ੍ਰੱੈਸ‎ (0A2A 0A4D 0A30 0A71 0A48 0A38)
>
> 3.     Putting addak at the end of a word (this does not make sense, as
> addak is used to geminate the sound of the following consonant). Two such
> examples in the corpus are: ਉੱ (0A09 0A71) and ਉਪਲੱ (0A09 0A2A 0A32 0A71)
>
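The three typing mistakes listed above can be checked mechanically. The following is a minimal sketch, not the LGR implementation: the function name `addak_typos` and the pattern labels are hypothetical, and the matra set is assumed to be the Gurmukhi dependent vowel signs as encoded in Unicode.

```python
ADDAK = "\u0A71"
# Assumption: dependent vowel signs (matras) of the Gurmukhi block in Unicode.
MATRAS = set("\u0A3E\u0A3F\u0A40\u0A41\u0A42\u0A47\u0A48\u0A4B\u0A4C")


def addak_typos(label):
    """Return the typing-mistake patterns (as described above) found in label."""
    found = []
    # 1. Two consecutive addaks.
    if ADDAK * 2 in label:
        found.append("double addak")
    # 2. Addak typed before a matra instead of after it.
    if any(a == ADDAK and b in MATRAS for a, b in zip(label, label[1:])):
        found.append("addak before matra")
    # 3. Addak as the last code point of the label.
    if label.endswith(ADDAK):
        found.append("addak at end")
    return found


# Examples taken from the code point sequences quoted in the message.
print(addak_typos("\u0A09\u0A71\u0A71\u0A1A"))  # 0A09 0A71 0A71 0A1A
print(addak_typos("\u0A1A\u0A71\u0A48\u0A38"))  # 0A1A 0A71 0A48 0A38
print(addak_typos("\u0A09\u0A71"))              # 0A09 0A71
```

Labels are written with escapes so that the code points match the hex listings exactly, independent of how the glyphs render.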
> The mistakes related to usage of addak that I observed in the corpus are:
>
> 1.     Wrong usage of addak. Many typists are not aware of where to put
> addak and where not to put it. They have put addak after any short vowel
> without knowing whether the following consonant can be geminated. Some such
> examples found in the corpus are ਉਜੱੜ (0A09 0A1C 0A71 0A5C), ਉਡੱਣ (0A09
> 0A21 0A71 0A23) and ਉਪਲੱਹਧ (0A09 0A2A 0A32 0A71 0A39 0A27)
>
> 2.     Addak followed by a long vowel: According to Gurmukhi rules,
> addak may be followed only by a specific set of consonants and NOT by
> any vowel. But there are a few instances where it was followed by the long
> vowel ਈ (0A08) or ਆ (0A06), making the label invalid. Some examples
> are: ਉਸਰੱਈਏ (0A09 0A38 0A30 0A71 0A08 0A0F), ਉਰੱਈ (0A09 0A30 0A71
> 0A08) and ਅਤਿੱਆਚਾਰ (0A05 0A24 0A3F 0A71 0A06 0A1A 0A3E 0A30)
>
> 3.     Writing addak after a long vowel. Addak is not allowed after
> most of the long vowels, but many typists who are not fluent in Punjabi
> place it after such vowels, which generates invalid labels. Two examples
> from the corpus are: ਊੱਠਣੀ (0A0A 0A71 0A20 0A23 0A40) and ਓੱਪੋ (0A13 0A71
> 0A2A 0A4B)
>
> 4.     Addak following the matra kanna ਾ (0A3E). (This is invalid
> according to Gurmukhi rules, but a similar pattern exists in Devanagari for
> writing English words. So a person fluent in Hindi who writes in
> Gurmukhi may use this combination.) Many English words I found in the
> corpus were written this way in Gurmukhi. A few examples are: ਅਨਲਾੱਕ
> (0A05 0A28 0A32 0A3E 0A71 0A15), ਅਲਾੱਟਮੈਂਟ (0A05 0A32 0A3E 0A71 0A1F 0A2E
> 0A48 0A02 0A1F) and ਕਰਾੱਸ (0A15 0A30 0A3E 0A71 0A38)
>
> All of this has resulted in the high number of invalid labels
> containing addak.
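Two of the usage mistakes above depend only on the immediate neighbours of addak, so they can be sketched without knowing the LGR's consonant classes. This is an illustration under assumptions: the names `code_points` and `addak_usage_errors` are hypothetical, the vowel set is the Gurmukhi independent vowels as encoded in Unicode, and the check for consonants that cannot be geminated is left out because the message does not list that set.

```python
ADDAK = "\u0A71"
KANNA = "\u0A3E"
# Assumption: independent vowels of the Gurmukhi block in Unicode.
VOWELS = set("\u0A05\u0A06\u0A07\u0A08\u0A09\u0A0A\u0A0F\u0A10\u0A13\u0A14")


def code_points(label):
    """Format a label as space-separated hex code points, as in the message."""
    return " ".join(f"{ord(ch):04X}" for ch in label)


def addak_usage_errors(label):
    """Report addak directly followed by a vowel or directly after kanna."""
    errors = []
    for i, ch in enumerate(label):
        if ch != ADDAK:
            continue
        prev = label[i - 1] if i > 0 else ""
        nxt = label[i + 1] if i + 1 < len(label) else ""
        if nxt in VOWELS:
            errors.append("addak followed by vowel")
        if prev == KANNA:
            errors.append("addak after kanna")
    return errors


label = "\u0A15\u0A30\u0A3E\u0A71\u0A38"
print(code_points(label), addak_usage_errors(label))
```

A real validator would also need the rule's permitted consonant set, which would have to come from the Gurmukhi LGR proposal itself.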
>
> Coming to the invalid labels for the rule invalid context
> (Follows-only-C-or-N), we found that in nearly 70% of the cases the
> errors are due to matras getting merged with vowels. The matra ੁ (0A41)
> was frequently merged with the vowel ਉ (0A09), while the matra ੂ (0A42)
> was merged with the vowels ਉ (0A09) or ਊ (0A0A). The matra ੇ (0A47)
> was merged with the vowels ਏ (0A0F), ਉ (0A09) or ਊ (0A0A). An interesting
> point is that the visual shape of the word does not change
> when these matras get merged with these specific vowels (Table 1).
> Fortunately, the WLE rules catch these potential candidates for
> phishing attacks, since the words in the first column look exactly the
> same as the words in the second column. So an additional
> advantage of these WLE rules is that they capture possible phishing
> attacks.
>
> Table 1: Words with merged matras
>
> Word with merged matra                          | Word without merged matra
> ਉੁਸ (0A09 0A41 0A38)                            | ਉਸ (0A09 0A38)
> ਊੂਧਵ (0A0A 0A42 0A27 0A35)                      | ਊਧਵ (0A0A 0A27 0A35)
> ਤੇਂਦੂਏੇ (0A24 0A47 0A02 0A26 0A42 0A0F 0A47)    | ਤੇਂਦੂਏ (0A24 0A47 0A02 0A26 0A42 0A0F)
>
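To illustrate why the merged-matra labels are confusable, the sketch below removes a matra that directly follows one of the vowels it was reported to merge with, which yields the visually identical label from Table 1. The mapping `MERGED` contains only the matra and vowel pairs named in the message; the function name `strip_merged_matras` is hypothetical and this is not part of the WLE rules.

```python
# Matra -> vowels it was observed to merge with (pairs named in the message).
MERGED = {
    "\u0A41": {"\u0A09"},
    "\u0A42": {"\u0A09", "\u0A0A"},
    "\u0A47": {"\u0A0F", "\u0A09", "\u0A0A"},
}


def strip_merged_matras(label):
    """Drop each matra that directly follows a vowel it visually merges with."""
    out = []
    for ch in label:
        if out and ch in MERGED and out[-1] in MERGED[ch]:
            continue  # invisible duplicate: skip it
        out.append(ch)
    return "".join(out)


merged = "\u0A09\u0A41\u0A38"   # 0A09 0A41 0A38
plain = "\u0A09\u0A38"          # 0A09 0A38
print(strip_merged_matras(merged) == plain)  # True
```

Two labels that become equal after this step look the same but differ in code points, which is the situation the rule rejects.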
> Another issue I came across was the formation of a new vowel+matra
> combination, ਅੋ = ਅ (0A05) + ੋ (0A4B). This is a totally illegal
> combination, but surprisingly there were many words containing it.
> Examples: ਮਾਅੋ (0A2E 0A3E 0A05 0A4B), ਦਿਅੋਗੋ (0A26 0A3F 0A05 0A4B 0A17 0A4B)
> and ਪਾਅੋਲੋ (0A2A 0A3E 0A05 0A4B 0A32 0A4B). In real life no one uses
> this combination.
>
> In fact, many of the invalid labels are very rarely generated in real life,
> and it is surprising to see so many such combinations present in the corpus.
>
> Thanks
>
>
>
> On Tue, May 8, 2018 at 11:31 PM, Sarmad Hussain <sarmad.hussain at icann.org>
> wrote:
>
> Dear Dr. Lehal, All,
>
>
>
> Thank you for sharing the updated LGR proposal for Gurmukhi script.
> The Integration Panel (IP) is currently reviewing it and developing the
> feedback document.
>
>
>
> In the meantime, they have run a corpus of Punjabi in Gurmukhi script, with
> the test results attached and summarized below.  In the summary, the IP has
> identified some cases which show invalid labels with a slightly high
> percentage (in red below).  You can review the actual labels in the data
> file attached, which is marked up accordingly.
>
>
>
> The IP would like to share this data and the summary below with the NBGP
> for the GP to reconfirm that the failing labels should actually fail - and
> it is not the case that the indicated rules are too restrictive.
>
>
>
> We aim to share the IP feedback document next week.  Please let us know if
> you have any questions.
>
>
>
> Regards,
> Sarmad
>
>
>
> =============
>
>
>
> Corpus: https://github.com/unicode-org/unilex/tree/master/data/frequency
>
> Full Test results attached.
>
> A./
>
> SUMMARY
>
>     Total Labels processed: 171388 of which
>          valid labels:   163289
>          invalid labels: 7391
>          skipped labels: 708 of which
>             duplicate labels:      21
>             broken labels:         11   <-- rejected by IDN library as not NFC or otherwise malformed
>             contain join controls: 287  <-- are these stylistic or orthographic?
>             start w/ wrong script: 389  (contamination)
>
> Number of invalid labels by reason:
>    4742 instances of not in repertoire
>    173 instances of out-of-repertoire variant
>    167 instances of invalid context (Follows-only-specific-V-or-M)   0.1%
>    238 instances of invalid context (Follows-only-C-or-N-and-precedes-only-C2)   0.15%
>    285 instances of invalid context (Follows-only-C-N-or-specific-V-or-M)   0.17%
>    61 instances of invalid context (Follows-only-C1)
>    833 instances of invalid context (Follows-only-C-or-N)   0.5%
>    892 instances of invalid context (Follows-only-C-N-or-specific-V-or-M-and-precedes-only-C3-or-specific-CN)   0.6%
>
> ** rough indication of percentage; higher percentage failures may indicate
> ** either that certain typos are common or that a rule is too restrictive.
> ** The following example shows some of the contexts detected for one of the
> ** rules - for more detail and actual labels see attached.
>
>   Contexts not matching rule "Follows-only-C-or-N":
>     [:Bindi:]  ⚓=[:Matra:]
>     [:Matra:]  ⚓=[:Matra:]
>     [:Tippi:]  ⚓=[:Matra:]
>     [:Vowel:]  ⚓=[:Matra:]
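The contexts listed above name the class of the character found directly before a matra. A rough sketch of how such a report could be derived is given below. The class membership used here is an assumption based on the Unicode Gurmukhi block (bindi 0A02, tippi 0A70, the dependent vowel signs, the independent vowels); the actual tag values are defined in the LGR proposal, and the names `char_class` and `failing_matra_contexts` are hypothetical.

```python
BINDI = "\u0A02"
TIPPI = "\u0A70"
# Assumption: class membership taken from the Unicode Gurmukhi block.
MATRAS = set("\u0A3E\u0A3F\u0A40\u0A41\u0A42\u0A47\u0A48\u0A4B\u0A4C")
VOWELS = set("\u0A05\u0A06\u0A07\u0A08\u0A09\u0A0A\u0A0F\u0A10\u0A13\u0A14")


def char_class(ch):
    """Map a character to one of the four reported classes, else 'Other'."""
    if ch == BINDI:
        return "Bindi"
    if ch == TIPPI:
        return "Tippi"
    if ch in MATRAS:
        return "Matra"
    if ch in VOWELS:
        return "Vowel"
    return "Other"


def failing_matra_contexts(label):
    """Class of the preceding character for each matra in a reported context."""
    contexts = []
    for prev, cur in zip(label, label[1:]):
        if cur in MATRAS and char_class(prev) != "Other":
            contexts.append(char_class(prev))
    return contexts


print(failing_matra_contexts("\u0A09\u0A41\u0A38"))  # 0A09 0A41 0A38
```

A matra at the very start of a label has no preceding character and is not covered by this sketch.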
>
>
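The totals and the rough percentages in the summary above can be recomputed from the reported counts. The snippet below is only a cross-check of the arithmetic, using the numbers as given; the variable and function names are illustrative.

```python
TOTAL = 171388
VALID, INVALID, SKIPPED = 163289, 7391, 708

# Invalid labels by reason, as reported in the summary.
BY_REASON = {
    "not in repertoire": 4742,
    "out-of-repertoire variant": 173,
    "Follows-only-specific-V-or-M": 167,
    "Follows-only-C-or-N-and-precedes-only-C2": 238,
    "Follows-only-C-N-or-specific-V-or-M": 285,
    "Follows-only-C1": 61,
    "Follows-only-C-or-N": 833,
    "Follows-only-C-N-or-specific-V-or-M-and-precedes-only-C3-or-specific-CN": 892,
}


def percent(count, total=TOTAL):
    """Share of all processed labels, in percent, rounded to two decimals."""
    return round(100 * count / total, 2)


# The three top-level buckets account for every processed label,
# and the per-reason counts account for every invalid label.
print(VALID + INVALID + SKIPPED == TOTAL)
print(sum(BY_REASON.values()) == INVALID)

for reason, count in BY_REASON.items():
    print(f"{count:5d}  {percent(count):5.2f}%  {reason}")
```

Computed this way, the two largest context failures come to about 0.49% and 0.52% of all processed labels, in line with the rounded figures quoted in the summary.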
> *Test Label Coverage:*
> Repertoire (code points):  56 of  56. {0A02 0A05-0A0A 0A0F-0A10 0A13-0A28
> 0A2A-0A30 0A32 0A35 0A38-0A39 0A3C 0A3E-0A42 0A47-0A48 0A4B-...}
> Repertoire not covered:   0 of  56. {}
> Out of Repertoire:         80. [{0027 002E 0030-003A 0061-0062 0064-0065
> 0067 0069-006A 006C 0070 0073 0075 0078 00E0 00E2 00ED-00EE 0901-0902
> 0906-0909 090F 0913 0915-0918 091A-091D 091F-0924 0926-0928 092A 092C-0930
> 0932 0935-0939 093C 093E-0942 0947-0948 094B 094D 0A6B 0A72-0A74}]  <--
> excluded code points highlighted
>
> Tag Values:                12 of  12.
>     Addak
>     Bindi
>     C1
>     C2
>     Consonant
>     M1
>     Matra
>     Nukta
>     Tippi
>     V1
>     Virama
>     Vowel
> Named Classes:             13 of  13.
>     A
>     B
>     C
>     C1
>     C2
>     C3
>     M
>     M1
>     M2
>     N
>     V
>     V1
>     V2
>
> Context Rules matched:      6 of   6.
>
>     Follows-only-C-or-N-and-precedes-only-C2
>     Follows-only-C-or-N
>     Follows-only-specific-V-or-M
>     Follows-only-C-N-or-specific-V-or-M-and-precedes-only-C3-or-specific-CN
>     Follows-only-C1
>     Follows-only-C-N-or-specific-V-or-M
>
> Context Rules failed:       6 of   6.
>     Follows-only-C-N-or-specific-V-or-M-and-precedes-only-C3-or-specific-CN
>     Follows-only-specific-V-or-M
>     Follows-only-C-or-N
>     Follows-only-C-or-N-and-precedes-only-C2
>     Follows-only-C-N-or-specific-V-or-M
>     Follows-only-C1
>
> When Rules defined: (required context)
>     Follows-only-specific-V-or-M
>     Follows-only-C1
>     Follows-only-C-or-N
>     Follows-only-C-or-N-and-precedes-only-C2
>     Follows-only-C-N-or-specific-V-or-M
>     Follows-only-C-N-or-specific-V-or-M-and-precedes-only-C3-or-specific-CN
>
> Not-When Rules defined: (prohibited context)
>      (none)
>
>
>
>
>
>
>
> --
>
> Dr. Gurpreet Singh Lehal,
> Professor, Department of Computer Science
>
> Dean, Faculty of Computing Sciences
>
> Director,  Research Centre for Punjabi Language Technology,
> Punjabi University, Patiala.
> India-147002
>
> https://en.wikipedia.org/wiki/Gurpreet_Singh_Lehal
>
>
>
> Phone : +91-9815473767 (M)
> url : www.learnpunjabi.org
>



-- 
Dr. Gurpreet Singh Lehal,
Professor, Department of Computer Science
Dean, Faculty of Computing Sciences
Director,  Research Centre for Punjabi Language Technology,
Punjabi University, Patiala.
India-147002

https://en.wikipedia.org/wiki/Gurpreet_Singh_Lehal

Phone : +91-9815473767 (M)
url : www.learnpunjabi.org