[Neobrahmigp] [Ext] Re: FW: Corpus review for punjabi
Dr. G. S. Lehal (ਗੁਰਪ੍ਰੀਤ ਸਿੰਘ ਲਹਿਲ)
gslehal at gmail.com
Sat May 12 17:48:07 UTC 2018
You are most welcome, Dr. Sarmad.
On Sat, May 12, 2018 at 10:29 PM, Sarmad Hussain <sarmad.hussain at icann.org>
wrote:
> Thank you Dr. Lehal for such an elaborate analysis and feedback.
>
>
>
> The analysis shows that the label level rules proposed for the Gurmukhi
> script are working as intended.
>
>
>
> We will pass this feedback to the integration panel.
>
>
>
> Regards,
> Sarmad
>
>
>
> *From:* Dr. G. S. Lehal (ਗੁਰਪ੍ਰੀਤ ਸਿੰਘ ਲਹਿਲ) [mailto:gslehal at gmail.com]
> *Sent:* Saturday, May 12, 2018 1:18 AM
> *To:* Sarmad Hussain <sarmad.hussain at icann.org>; Dr. AJAY D A T A <
> ajay at data.in>
> *Cc:* neo brahmi <neobrahmigp at icann.org>; Pitinan Kooarmornpatana <
> pitinan.koo at icann.org>
> *Subject:* [Ext] Re: FW: Corpus review for punjabi
>
>
>
> Hello all,
>
> I had a detailed look at the invalid labels for the rule
> (Follows-only-C-N-or-specific-V-or-M-and-precedes-only-C3-or-specific-CN).
> This rule covers incorrect usage of the Gurmukhi addak ੱ (0A71). The main
> reasons I observed for these invalid labels are:
>
> 1. Typing mistakes
>
> 2. Typists not being sure where to use addak and where not to use it
> according to Gurmukhi script rules. Many Punjabi users are in fact
> confused about this, which results in wrong labels being generated.
>
> The main typing mistakes I observed in the corpus are:
>
> 1. Two consecutive addaks (not allowed in Gurmukhi script)
>
> Examples: ਉੱੱਚ (0A09 0A71 0A71 0A1A), ਉਪਲੱੱਬਧ (0A09 0A2A 0A32 0A71 0A71
> 0A2C 0A27) and ਅੱੱਡਾ (0A05 0A71 0A71 0A21 0A3E)
>
> 2. Swapping addak with the matra that should precede it. Examples:
>
> ਚੱੈਸ (0A1A 0A71 0A48 0A38)
>
> ਪ੍ਰੱੈਸ (0A2A 0A4D 0A30 0A71 0A48 0A38)
>
> 3. Putting addak at the end of a word (this does not make sense, as addak
> is used to geminate the sound of the following consonant). Two such
> examples in the corpus are: ਉੱ (0A09 0A71) and ਉਪਲੱ (0A09 0A2A 0A32 0A71)
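The three typing mistakes listed above can be detected mechanically. The following is a minimal Python sketch, not the LGR tooling used by the IP. The function name `addak_typing_mistakes` is hypothetical, and `MATRAS` is assumed to be the set of Gurmukhi dependent vowel signs as defined by Unicode.

```python
ADDAK = "\u0A71"

# Assumption: Gurmukhi dependent vowel signs (matras) per the Unicode block.
MATRAS = set("\u0A3E\u0A3F\u0A40\u0A41\u0A42\u0A47\u0A48\u0A4B\u0A4C")


def addak_typing_mistakes(label):
    """Return a description for each addak that matches one of the
    three typing mistakes described in the email."""
    problems = []
    for i, ch in enumerate(label):
        if ch != ADDAK:
            continue
        nxt = label[i + 1] if i + 1 < len(label) else None
        if nxt is None:
            problems.append("addak at end of label")
        elif nxt == ADDAK:
            problems.append("two consecutive addaks")
        elif nxt in MATRAS:
            problems.append("addak placed before a matra")
    return problems


# Examples taken from the email, written as escapes to avoid ambiguity.
print(addak_typing_mistakes("\u0A09\u0A71\u0A71\u0A1A"))  # 0A09 0A71 0A71 0A1A
print(addak_typing_mistakes("\u0A1A\u0A71\u0A48\u0A38"))  # 0A1A 0A71 0A48 0A38
print(addak_typing_mistakes("\u0A09\u0A71"))              # 0A09 0A71
```

Only the character immediately after each addak is inspected, so the check covers just these three patterns and says nothing about whether the following consonant may be geminated.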
>
> The mistakes related to the usage of addak that I observed in the corpus are:
>
> 1. Wrong usage of addak. Many typists are not aware of where to put
> addak and where not to put it. They have put addak after any short vowel
> without knowing whether the following consonant can be geminated. Some such
> examples found in the corpus are ਉਜੱੜ (0A09 0A1C 0A71 0A5C), ਉਡੱਣ (0A09
> 0A21 0A71 0A23) and ਉਪਲੱਹਧ (0A09 0A2A 0A32 0A71 0A39 0A27)
>
> 2. Addak followed by a long vowel: According to Gurmukhi rules,
> addak has to be followed by one of a specific set of consonants and NOT by
> any vowel. But there are a few instances where it was followed by the long
> vowel ਈ (0A08) or ਆ (0A06), making the label invalid. Some examples
> are: ਉਸਰੱਈਏ (0A09 0A38 0A30 0A71 0A08 0A0F), ਉਰੱਈ (0A09 0A30 0A71
> 0A08) and ਅਤਿੱਆਚਾਰ (0A05 0A24 0A3F 0A71 0A06 0A1A 0A3E 0A30)
>
> 3. Writing addak after a long vowel. Addak is not allowed to be
> written after most of the long vowels, but many typists who are not fluent
> in Punjabi place it after such vowels, resulting in invalid labels.
> Two examples from the corpus are: ਊੱਠਣੀ (0A0A 0A71 0A20 0A23
> 0A40) and ਓੱਪੋ (0A13 0A71 0A2A 0A4B)
>
> 4. Addak written after matra kanna ਾ (0A3E). This is invalid
> according to Gurmukhi rules, but a similar pattern exists in Devanagari for
> writing English words. So a person fluent in Hindi who writes in Gurmukhi
> may use this combination. Many English words I found in the corpus were
> written in Gurmukhi in this way. A few examples are: ਅਨਲਾੱਕ
> (0A05 0A28 0A32 0A3E 0A71 0A15), ਅਲਾੱਟਮੈਂਟ (0A05 0A32 0A3E 0A71 0A1F 0A2E
> 0A48 0A02 0A1F) and ਕਰਾੱਸ (0A15 0A30 0A3E 0A71 0A38)
>
> All this has resulted in the high number of invalid labels containing
> addak.
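The usage mistakes above all concern what stands immediately before or after addak. The sketch below illustrates that idea in Python. It is deliberately incomplete: the sets `BAD_BEFORE` and `BAD_AFTER` contain only the code points named as problematic in this email, not the actual V, M and C3 classes of the LGR, which are not reproduced here. All names in the code are hypothetical.

```python
ADDAK = "\u0A71"

# Assumption: illustrative sets built only from the examples in the email.
BAD_BEFORE = {
    "\u0A0A": "long vowel 0A0A",
    "\u0A13": "long vowel 0A13",
    "\u0A3E": "matra kanna 0A3E",
}
BAD_AFTER = {
    "\u0A06": "vowel 0A06",
    "\u0A08": "vowel 0A08",
    "\u0A5C": "consonant 0A5C",
    "\u0A23": "consonant 0A23",
    "\u0A39": "consonant 0A39",
}


def addak_usage_mistakes(label):
    """Return (position, description) pairs for each addak whose
    neighbour is one of the problematic code points listed above."""
    found = []
    for i, ch in enumerate(label):
        if ch != ADDAK:
            continue
        if i > 0 and label[i - 1] in BAD_BEFORE:
            found.append((i, "addak after " + BAD_BEFORE[label[i - 1]]))
        if i + 1 < len(label) and label[i + 1] in BAD_AFTER:
            found.append((i, "addak before " + BAD_AFTER[label[i + 1]]))
    return found


print(addak_usage_mistakes("\u0A09\u0A1C\u0A71\u0A5C"))        # 0A09 0A1C 0A71 0A5C
print(addak_usage_mistakes("\u0A15\u0A30\u0A3E\u0A71\u0A38"))  # 0A15 0A30 0A3E 0A71 0A38
```

A real check would instead test membership in the allowed classes defined by the rule, so that any neighbour outside those classes fails, rather than listing known bad neighbours.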
>
> Coming to the invalid labels for the rule (Follows-only-C-or-N), we found
> that in nearly 70% of the cases the errors are due to matras getting merged
> with vowels. The matra ੁ (0A41) was frequently merged with the vowel ਉ (0A09),
> while the matra ੂ (0A42) was merged with the vowels ਉ (0A09) or ਊ (0A0A).
> The matra ੇ (0A47) got merged with the vowels ਏ (0A0F), ਉ (0A09) or ਊ (0A0A).
> An interesting thing to note is that the visual shape of the word does not
> change when these matras get merged with these specific vowels (Table 1).
> Fortunately, the WLE rules catch these potential candidates for phishing
> attacks, since the words in the first column look exactly the same as the
> words in the corresponding second column. So an additional advantage of
> these WLE rules is that they capture possible phishing attacks.
>
> Table 1: Words with merged matras
>
> Word with merged matra | Word without merged matra
> ਉੁਸ (0A09 0A41 0A38) | ਉਸ (0A09 0A38)
> ਊੂਧਵ (0A0A 0A42 0A27 0A35) | ਊਧਵ (0A0A 0A27 0A35)
> ਤੇਂਦੂਏੇ (0A24 0A47 0A02 0A26 0A42 0A0F 0A47) | ਤੇਂਦੂਏ (0A24 0A47 0A02 0A26 0A42 0A0F)
>
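The pairs in Table 1 look the same when rendered but are different code point sequences, so a plain string comparison tells them apart. The Python sketch below lists the code points of a label and derives the second-column form from the first by dropping a matra that directly follows an independent vowel. The helper names and the two character sets (taken from the Unicode Gurmukhi block) are assumptions for illustration; this is not proposed as an automatic correction step.

```python
# Assumption: independent vowels and matras per the Unicode Gurmukhi block.
INDEPENDENT_VOWELS = set(
    "\u0A05\u0A06\u0A07\u0A08\u0A09\u0A0A\u0A0F\u0A10\u0A13\u0A14"
)
MATRAS = set("\u0A3E\u0A3F\u0A40\u0A41\u0A42\u0A47\u0A48\u0A4B\u0A4C")


def code_points(label):
    """Format a label as space-separated hex code points."""
    return " ".join(f"{ord(ch):04X}" for ch in label)


def drop_merged_matras(label):
    """Remove any matra that directly follows an independent vowel."""
    out = []
    for ch in label:
        if ch in MATRAS and out and out[-1] in INDEPENDENT_VOWELS:
            continue
        out.append(ch)
    return "".join(out)


merged = "\u0A09\u0A41\u0A38"   # first column, row 1 of Table 1
plain = "\u0A09\u0A38"          # second column, row 1 of Table 1
print(code_points(merged), "|", code_points(plain))
print(merged == plain, drop_merged_matras(merged) == plain)
```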
>
>
>
>
>
>
> Another issue I came across was the formation of a new vowel+matra
> combination, ਅੋ = ਅ (0A05) + ੋ (0A4B). This is a totally illegal
> combination, but surprisingly there were many words containing it. Examples:
> ਮਾਅੋ (0A2E 0A3E 0A05 0A4B), ਦਿਅੋਗੋ (0A26 0A3F 0A05 0A4B 0A17 0A4B)
> and ਪਾਅੋਲੋ (0A2A 0A3E 0A05 0A4B 0A32 0A4B). In real life no one uses
> this combination.
>
> In fact, many of the invalid labels are very rarely generated in real life,
> and it is surprising to see so many such combinations present in the corpus.
>
> Thanks
>
>
>
> On Tue, May 8, 2018 at 11:31 PM, Sarmad Hussain <sarmad.hussain at icann.org>
> wrote:
>
> Dear Dr. Lehal, All,
>
>
>
> Thank you for sharing the updated LGR proposal for Gurmukhi script.
> The Integration Panel is currently reviewing it and developing the feedback
> document.
>
>
>
> In the meantime, they have run a corpus of Punjabi in Gurmukhi script with
> the test results attached and summarized below. In the summary, IP has
> identified some cases which show invalid labels with a slightly high
> percentage (in red below). You can review the actual labels in the data
> file attached, which is marked up accordingly.
>
>
>
> The IP would like to share this data and the summary below with the NBGP
> for the GP to reconfirm that the failing labels should actually fail - and
> it is not the case that the indicated rules are too restrictive.
>
>
>
> We aim to share the IP feedback document next week. Please let us know if
> you have any questions.
>
>
>
> Regards,
> Sarmad
>
>
>
> =============
>
>
>
> Corpus: https://github.com/unicode-org/unilex/tree/master/data/frequency
>
> Full Test results attached.
>
> A./
>
> SUMMARY
>
> Total Labels processed: 171388 of which
>     valid labels: 163289
>     invalid labels: 7391
>     skipped labels: 708 of which
>         duplicate labels: 21
>         broken labels: 11 <-- rejected by IDN library as not NFC or other malformed
>         contain join controls: 287 <-- are these stylistic or orthographic?
>         start w/ wrong script: 389 (contamination)
>
> Number of invalid labels by reason:
>     4742 instances of not in repertoire
>     173 instances of out-of-repertoire variant
>     167 instances of invalid context (Follows-only-specific-V-or-M) 0.1%
>     238 instances of invalid context (Follows-only-C-or-N-and-precedes-only-C2) 0.15%
>     285 instances of invalid context (Follows-only-C-N-or-specific-V-or-M) 0.17%
>     61 instances of invalid context (Follows-only-C1)
>     833 instances of invalid context (Follows-only-C-or-N) 0.5%
>     892 instances of invalid context (Follows-only-C-N-or-specific-V-or-M-and-precedes-only-C3-or-specific-CN) 0.6%
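The counts in the summary can be cross-checked with a few lines of Python. The sketch below confirms that the per-reason counts add up to the reported number of invalid labels and recomputes each share. The denominator is an assumption: the report only calls its percentages a rough indication, so the total number of labels processed is used here, and the results need not match the rounded figures in the report exactly.

```python
TOTAL_PROCESSED = 171388
VALID, INVALID, SKIPPED = 163289, 7391, 708

# Counts copied from the "Number of invalid labels by reason" list.
invalid_by_reason = {
    "not in repertoire": 4742,
    "out-of-repertoire variant": 173,
    "Follows-only-specific-V-or-M": 167,
    "Follows-only-C-or-N-and-precedes-only-C2": 238,
    "Follows-only-C-N-or-specific-V-or-M": 285,
    "Follows-only-C1": 61,
    "Follows-only-C-or-N": 833,
    "Follows-only-C-N-or-specific-V-or-M-and-precedes-only-C3-or-specific-CN": 892,
}


def percent_of_total(count, total=TOTAL_PROCESSED):
    """Share of `count` in `total`, as a percentage rounded to 2 places."""
    return round(100.0 * count / total, 2)


# Consistency checks on the reported totals.
print(VALID + INVALID + SKIPPED == TOTAL_PROCESSED)
print(sum(invalid_by_reason.values()) == INVALID)

for reason, count in invalid_by_reason.items():
    print(f"{count:5d}  {percent_of_total(count):5.2f}%  {reason}")
```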
>
> ** rough indication of percentage; higher percentage failures may indicate
> ** either that certain typos are common or that a rule is too restrictive.
> ** The following example shows some of the contexts detected for one of the
> ** rules - for more detail and actual labels see attached.
>
> Contexts not matching rule "Follows-only-C-or-N":
> [:Bindi:] ⚓=[:Matra:]
> [:Matra:] ⚓=[:Matra:]
> [:Tippi:] ⚓=[:Matra:]
> [:Vowel:] ⚓=[:Matra:]
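The contexts listed above can be reproduced with a small classifier: tag each character, then report every matra whose preceding character is neither a consonant nor a nukta. This is a hedged Python sketch, not the IP's tool. The tag-to-code-point mapping is an assumption based on Unicode character names and the visible part of the repertoire in this report, and the tags `Start` and `Other` are placeholders that do not appear in the report.

```python
def _chars(*ranges):
    """Build a set of characters from inclusive (low, high) ranges."""
    out = set()
    for low, high in ranges:
        out.update(chr(cp) for cp in range(low, high + 1))
    return out


# Assumption: tag membership inferred from Unicode, not from the LGR file.
TAGS = {
    "Consonant": _chars((0x0A15, 0x0A28), (0x0A2A, 0x0A30), (0x0A32, 0x0A32),
                        (0x0A35, 0x0A35), (0x0A38, 0x0A39)),
    "Vowel": _chars((0x0A05, 0x0A0A), (0x0A0F, 0x0A10), (0x0A13, 0x0A14)),
    "Matra": _chars((0x0A3E, 0x0A42), (0x0A47, 0x0A48), (0x0A4B, 0x0A4C)),
    "Nukta": {"\u0A3C"},
    "Bindi": {"\u0A02"},
    "Tippi": {"\u0A70"},
}


def tag_of(ch):
    """Return the tag name for a character, or "Other" if unclassified."""
    for name, members in TAGS.items():
        if ch in members:
            return name
    return "Other"


def matra_context_failures(label):
    """List the contexts in which a matra does not follow a consonant or nukta."""
    failures = []
    for i, ch in enumerate(label):
        if tag_of(ch) != "Matra":
            continue
        prev = tag_of(label[i - 1]) if i > 0 else "Start"
        if prev not in ("Consonant", "Nukta"):
            failures.append(f"[:{prev}:] \u2693=[:Matra:]")
    return failures


print(matra_context_failures("\u0A09\u0A41\u0A38"))  # vowel followed by matra
print(matra_context_failures("\u0A15\u0A3E"))        # consonant followed by matra
```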
>
>
> *Test Label Coverage:*
> Repertoire (code points): 56 of 56. {0A02 0A05-0A0A 0A0F-0A10 0A13-0A28
> 0A2A-0A30 0A32 0A35 0A38-0A39 0A3C 0A3E-0A42 0A47-0A48 0A4B-...}
> Repertoire not covered: 0 of 56. {}
> Out of Repertoire: 80. [{0027 002E 0030-003A 0061-0062 0064-0065
> 0067 0069-006A 006C 0070 0073 0075 0078 00E0 00E2 00ED-00EE 0901-0902
> 0906-0909 090F 0913 0915-0918 091A-091D 091F-0924 0926-0928 092A 092C-0930
> 0932 0935-0939 093C 093E-0942 0947-0948 094B 094D 0A6B 0A72-0A74}] <--
> excluded code points highlighted
>
> Tag Values: 12 of 12.
>     Addak
>     Bindi
>     C1
>     C2
>     Consonant
>     M1
>     Matra
>     Nukta
>     Tippi
>     V1
>     Virama
>     Vowel
> Named Classes: 13 of 13.
>     A
>     B
>     C
>     C1
>     C2
>     C3
>     M
>     M1
>     M2
>     N
>     V
>     V1
>     V2
>
> Context Rules matched: 6 of 6.
>     Follows-only-C-or-N-and-precedes-only-C2
>     Follows-only-C-or-N
>     Follows-only-specific-V-or-M
>     Follows-only-C-N-or-specific-V-or-M-and-precedes-only-C3-or-specific-CN
>     Follows-only-C1
>     Follows-only-C-N-or-specific-V-or-M
>
> Context Rules failed: 6 of 6.
>     Follows-only-C-N-or-specific-V-or-M-and-precedes-only-C3-or-specific-CN
>     Follows-only-specific-V-or-M
>     Follows-only-C-or-N
>     Follows-only-C-or-N-and-precedes-only-C2
>     Follows-only-C-N-or-specific-V-or-M
>     Follows-only-C1
>
> When Rules defined: (required context)
>     Follows-only-specific-V-or-M
>     Follows-only-C1
>     Follows-only-C-or-N
>     Follows-only-C-or-N-and-precedes-only-C2
>     Follows-only-C-N-or-specific-V-or-M
>     Follows-only-C-N-or-specific-V-or-M-and-precedes-only-C3-or-specific-CN
>
> Not-When Rules defined: (prohibited context)
>     (none)
>
>
>
>
>
>
>
> --
>
> Dr. Gurpreet Singh Lehal,
> Professor, Department of Computer Science
>
> Dean, Faculty of Computing Sciences
>
> Director, Research Centre for Punjabi Language Technology,
> Punjabi University, Patiala.
> India-147002
>
> https://en.wikipedia.org/wiki/Gurpreet_Singh_Lehal
>
>
>
> Phone : +91-9815473767 (M)
> url : www.learnpunjabi.org
>
--
Dr. Gurpreet Singh Lehal,
Professor, Department of Computer Science
Dean, Faculty of Computing Sciences
Director, Research Centre for Punjabi Language Technology,
Punjabi University, Patiala.
India-147002
https://en.wikipedia.org/wiki/Gurpreet_Singh_Lehal
Phone : +91-9815473767 (M)
url : www.learnpunjabi.org