[Neobrahmigp] FW: Corpus review for punjabi

Sat May 12 08:18:07 UTC 2018

Hello all,

I had a detailed look at the invalid labels for the rule
(Follows-only-C-N-or-specific-V-or-M-and-precedes-only-C3-or-specific-CN).
This rule catches wrong usage of the Gurmukhi addak ੱ (0A71). The main
reasons I observed for these invalid labels are:

1.     Typing mistake

2.     Typist not sure about where to use addak and where not to use it
according to Gurmukhi script rules. Actually many Punjabi users are
confused about it, which results in wrong labels being generated.

The main typing mistakes I observed in the corpus are:

1.     Two consecutive addaks (not allowed in Gurmukhi script)

Example ਉੱੱਚ‎ (0A09 0A71 0A71 0A1A), ‎ਉਪਲੱੱਬਧ‎ (0A09 0A2A 0A32 0A71 0A71
0A2C 0A27) ‎ਅੱੱਡਾ‎ (0A05 0A71 0A71 0A21 0A3E)

2.     Swapping addak with the preceding matra. Examples:

ਚੱੈਸ‎ (0A1A 0A71 0A48 0A38)

‎ਪ੍ਰੱੈਸ‎ (0A2A 0A4D 0A30 0A71 0A48 0A38)

3.     Putting addak at the end of a word (this does not make sense, as addak
is used to geminate the sound of the following consonant). Two such examples
in the corpus are: ਉੱ (0A09 0A71) and ਉਪਲੱ (0A09 0A2A 0A32 0A71)

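As a rough illustration only (this is not the tooling used for the corpus run), the three typing mistakes listed above could be flagged with a short Python sketch. The function name addak_typos and the message strings are hypothetical; the matra list is the set of Gurmukhi dependent vowel signs encoded in Unicode, which may differ from the Matra class in the LGR.

```python
ADDAK = "\u0A71"
# Gurmukhi dependent vowel signs (matras) as encoded in Unicode.
# Assumption: the LGR's own Matra class may be a subset of these.
MATRAS = set("\u0A3E\u0A3F\u0A40\u0A41\u0A42\u0A47\u0A48\u0A4B\u0A4C")


def addak_typos(label):
    """Return the names of the addak typing mistakes found in one label."""
    found = []
    for i, ch in enumerate(label):
        if ch != ADDAK:
            continue
        # Code point that follows this addak, or None at the end of the label.
        nxt = label[i + 1] if i + 1 < len(label) else None
        if nxt is None:
            found.append("addak at end")
        elif nxt == ADDAK:
            found.append("double addak")
        elif nxt in MATRAS:
            found.append("addak before matra")
    return found
```

For instance, addak_typos("\u0A1A\u0A71\u0A48\u0A38") reports the swapped addak and matra in the second example, while a label such as "\u0A09\u0A71\u0A1A" (single addak followed by a consonant) yields an empty list.
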
The mistakes related to usage of addak, which I observed in the corpus are:

1.     Wrong usage of addak. Many typists are not aware where to put addak
and where not to put it. They have put addak after any short vowel without
knowing whether the following consonant can be geminated. Some such examples
found in the corpus are ਉਜੱੜ (0A09 0A1C 0A71 0A5C), ਉਡੱਣ (0A09 0A21 0A71
0A23) and ਉਪਲੱਹਧ (0A09 0A2A 0A32 0A71 0A39 0A27)

2.     Addak followed by a long vowel: According to Gurmukhi rules, addak
has to be followed by one of a specific set of consonants and NOT by any
vowel. But there are a few instances where it was followed by the long vowel
ਈ (0A08) or ਆ (0A06), making it an invalid label in the corpus. Some examples
are: ਉਸਰੱਈਏ (0A09 0A38 0A30 0A71 0A08 0A0F), ਉਰੱਈ (0A09 0A30 0A71 0A08) and
ਅਤਿੱਆਚਾਰ (0A05 0A24 0A3F 0A71 0A06 0A1A 0A3E 0A30)

3.     Writing addak after a long vowel. Addak is not allowed to be
written after most of the long vowels, but many typists who are not fluent
in Punjabi place it after such vowels, which generates invalid labels.
Two examples from the corpus are: ਊੱਠਣੀ (0A0A 0A71 0A20 0A23 0A40) and
ਓੱਪੋ (0A13 0A71 0A2A 0A4B)

4.     Addak followed by matra kanna ਾ (0A3E). (This is invalid according
to Gurmukhi rules, but a similar pattern exists in Devanagari for writing
English words in Devanagari. So if a person fluent in Hindi writes in
Gurmukhi, he may use this combination.) Many English words I found in the
corpus were written in this way in Gurmukhi. A few examples are: ਅਨਲਾੱਕ
(0A05 0A28 0A32 0A3E 0A71 0A15), ਅਲਾੱਟਮੈਂਟ (0A05 0A32 0A3E 0A71 0A1F 0A2E
0A48 0A02 0A1F) and ਕਰਾੱਸ (0A15 0A30 0A3E 0A71 0A38)

‎All this has resulted in the high number of invalid labels being generated
containing addak.
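
The usage mistakes above depend on what comes immediately before and after the addak, so they can be sketched as a context check. This is only an illustration: the sets below contain just the code points that appear in the examples in this mail and are NOT the C3, V or M classes of the LGR, and the function name addak_usage_errors is hypothetical.

```python
ADDAK = "\u0A71"
KANNA = "\u0A3E"
# Gurmukhi independent vowels as encoded in Unicode.
INDEPENDENT_VOWELS = set("\u0A05\u0A06\u0A07\u0A08\u0A09\u0A0A\u0A0F\u0A10\u0A13\u0A14")
# Assumption: only the long vowels seen before addak in the examples above.
LONG_VOWELS_SEEN = {"\u0A0A", "\u0A13"}
# Assumption: only the consonants seen after addak in the examples above.
NOT_GEMINATED_SEEN = {"\u0A5C", "\u0A23", "\u0A39"}


def addak_usage_errors(label):
    """Return the names of the addak usage mistakes found in one label."""
    errors = []
    for i, ch in enumerate(label):
        if ch != ADDAK:
            continue
        # Neighbours of the addak; empty string at either edge of the label.
        prev = label[i - 1] if i > 0 else ""
        nxt = label[i + 1] if i + 1 < len(label) else ""
        if prev in LONG_VOWELS_SEEN:
            errors.append("addak after long vowel")
        if prev == KANNA:
            errors.append("addak after kanna")
        if nxt in INDEPENDENT_VOWELS:
            errors.append("addak followed by vowel")
        if nxt in NOT_GEMINATED_SEEN:
            errors.append("following consonant not geminated")
    return errors
```

A real check would take the permitted preceding and following classes from the LGR itself rather than from these example-derived sets.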

Coming to the invalid labels for the invalid-context rule
(Follows-only-C-or-N), we found that in nearly 70% of the cases the errors
are due to matras getting merged with vowels. The matra ੁ (0A41) was
frequently merged with the vowel ਉ (0A09), while the matra ੂ (0A42) was
merged with the vowels ਉ (0A09) or ਊ (0A0A). The matra ੇ (0A47) got merged
with the vowels ਏ (0A0F), ਉ (0A09) or ਊ (0A0A). Interestingly, the visual
shape of the word does not change when these matras get merged with these
specific vowels (Table 1). Fortunately, the WLE rules capture these
potential candidates for phishing attacks, since the words in the first
column look exactly the same as the words in the corresponding second
column. So an additional advantage of these WLE rules is that they capture
possible phishing attacks.

Table 1: Words with merged matras

Word with merged matra | Word without merged matra
ਉੁਸ (0A09 0A41 0A38) | ਉਸ (0A09 0A38)
ਊੂਧਵ (0A0A 0A42 0A27 0A35) | ਊਧਵ (0A0A 0A27 0A35)
ਤੇਂਦੂਏੇ (0A24 0A47 0A02 0A26 0A42 0A0F 0A47) | ਤੇਂਦੂਏ (0A24 0A47 0A02 0A26 0A42 0A0F)

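The relation between the two columns of Table 1 can be sketched in Python as follows. This is a minimal illustration, assuming only the vowel and matra pairs named in the paragraph above; the names MERGED_PAIRS, strip_merged_matras and code_points are hypothetical and not part of any WLE tooling.

```python
# (vowel, matra) pairs reported above as rendering like the bare vowel.
MERGED_PAIRS = {
    ("\u0A09", "\u0A41"),
    ("\u0A09", "\u0A42"),
    ("\u0A0A", "\u0A42"),
    ("\u0A0F", "\u0A47"),
    ("\u0A09", "\u0A47"),
    ("\u0A0A", "\u0A47"),
}


def strip_merged_matras(label):
    """Drop any matra that directly follows a vowel it merges with."""
    out = []
    for ch in label:
        if out and (out[-1], ch) in MERGED_PAIRS:
            continue  # skip the matra; the preceding vowel stays in place
        out.append(ch)
    return "".join(out)


def code_points(label):
    """Format a label as space-separated hex code points, as used in this mail."""
    return " ".join(f"{ord(ch):04X}" for ch in label)
```

Applied to the first row of Table 1, code_points(strip_merged_matras("\u0A09\u0A41\u0A38")) gives "0A09 0A38": two labels that look the same on screen but differ in their code point sequences, which is why rejecting the merged form is useful.
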
Another issue I came across was the formation of a new vowel+matra
combination, ਅੋ = ਅ (0A05) + ੋ (0A4B). This is a totally illegal combination,
but surprisingly there were many words containing it. Examples:
ਮਾਅੋ (0A2E 0A3E 0A05 0A4B), ਦਿਅੋਗੋ (0A26 0A3F 0A05 0A4B 0A17 0A4B) and
ਪਾਅੋਲੋ (0A2A 0A3E 0A05 0A4B 0A32 0A4B). In real life no one uses this
combination.

In fact, many of the invalid labels are very rarely generated in real life,
and it is surprising to see so many such combinations present in the corpus.
Thanks

On Tue, May 8, 2018 at 11:31 PM, Sarmad Hussain <sarmad.hussain at icann.org>
wrote:

> Dear Dr. Lehal, All,
>
>
>
> Thank you for sharing the updated LGR proposal for Gurmukhi script.
> Integration panel is currently reviewing it and developing the feedback
> document.
>
>
>
> In the meantime, they have run a corpus of Punjabi in Gurmukhi script with
> the test results attached and summarized below.  In the summary, IP has
> identified some cases which show invalid labels with a slightly high
> percentage (in red below).  You can review the actual labels in the data
> file attached, which is marked up accordingly.
>
>
>
> The IP would like to share this data and the summary below with the NBGP
> for the GP to reconfirm that the failing labels should actually fail - and
> it is not the case that the indicated rules are too restrictive.
>
>
>
> We aim to share the IP feedback document next week.  Please let us know if
> you have any questions.
>
>
>
> Regards,
> Sarmad
>
>
>
> =============
>
>
>
> Corpus: https://github.com/unicode-org/unilex/tree/master/data/frequency
>
> Full Test results attached.
>
> A./
>
> SUMMARY
>
>     Total Labels processed: 171388 of which
>          valid labels:   163289
>          invalid labels: 7391
>          skipped labels: 708 of which
>             duplicate labels:      21
>             broken labels:         11  <-- rejected by IDN library as not NFC or other malformed
>             contain join controls: 287 <-- are these stylistic or orthographic?
>             start w/ wrong script: 389 (contamination)
>
> Number of invalid labels by reason:
>    4742 instances of not in repertoire
>    173 instances of out-of-repertoire variant
>    167 instances of invalid context (Follows-only-specific-V-or-M)  0.1%
>    238 instances of invalid context (Follows-only-C-or-N-and-precedes-only-C2)  0.15%
>    285 instances of invalid context (Follows-only-C-N-or-specific-V-or-M)  0.17%
>    61 instances of invalid context (Follows-only-C1)
>    833 instances of invalid context (Follows-only-C-or-N)  0.5%
>    892 instances of invalid context (Follows-only-C-N-or-specific-V-or-M-and-precedes-only-C3-or-specific-CN)  0.6%
>
> ** rough indication of percentage; higher percentage failures may indicate
> either that certain typos are common or that
> ** a rule is too restrictive. The following example shows some of the
> contexts detected for one of the rules - for more detail
> ** and actual labels see attached.
>
>   Contexts not matching rule "Follows-only-C-or-N":
>     [:Bindi:]  ⚓=[:Matra:]
>     [:Matra:]  ⚓=[:Matra:]
>     [:Tippi:]  ⚓=[:Matra:]
>     [:Vowel:]  ⚓=[:Matra:]
>
>
> *Test Label Coverage:*
> Repertoire (code points):  56 of  56. {0A02 0A05-0A0A 0A0F-0A10 0A13-0A28
> 0A2A-0A30 0A32 0A35 0A38-0A39 0A3C 0A3E-0A42 0A47-0A48 0A4B-...}
> Repertoire not covered:   0 of  56. {}
> Out of Repertoire:         80. [{0027 002E 0030-003A 0061-0062 0064-0065
> 0067 0069-006A 006C 0070 0073 0075 0078 00E0 00E2 00ED-00EE 0901-0902
> 0906-0909 090F 0913 0915-0918 091A-091D 091F-0924 0926-0928 092A 092C-0930
> 0932 0935-0939 093C 093E-0942 0947-0948 094B 094D 0A6B 0A72-0A74}]  <--
> excluded code points highlighted
>
> Tag Values:                12 of  12.
>     Addak
>     Bindi
>     C1
>     C2
>     Consonant
>     M1
>     Matra
>     Nukta
>     Tippi
>     V1
>     Virama
>     Vowel
> Named Classes:             13 of  13.
>     A
>     B
>     C
>     C1
>     C2
>     C3
>     M
>     M1
>     M2
>     N
>     V
>     V1
>     V2
>
> Context Rules matched:      6 of   6.
>
>     Follows-only-C-or-N-and-precedes-only-C2
>     Follows-only-C-or-N
>     Follows-only-specific-V-or-M
>     Follows-only-C-N-or-specific-V-or-M-and-precedes-only-C3-or-specific-CN
>     Follows-only-C1
>     Follows-only-C-N-or-specific-V-or-M
>
> Context Rules failed:       6 of   6.
>     Follows-only-C-N-or-specific-V-or-M-and-precedes-only-C3-or-specific-CN
>     Follows-only-specific-V-or-M
>     Follows-only-C-or-N
>     Follows-only-C-or-N-and-precedes-only-C2
>     Follows-only-C-N-or-specific-V-or-M
>     Follows-only-C1
>
> When Rules defined: (required context)
>     Follows-only-specific-V-or-M
>     Follows-only-C1
>     Follows-only-C-or-N
>     Follows-only-C-or-N-and-precedes-only-C2
>     Follows-only-C-N-or-specific-V-or-M
>     Follows-only-C-N-or-specific-V-or-M-and-precedes-only-C3-or-specific-CN
>
> Not-When Rules defined: (prohibited context)
>      (none)
>
>
>

-- 
Dr. Gurpreet Singh Lehal,
Professor, Department of Computer Science
Dean, Faculty of Computing Sciences
Director,  Research Centre for Punjabi Language Technology,
Punjabi University, Patiala.
India-147002

https://en.wikipedia.org/wiki/Gurpreet_Singh_Lehal

Phone : +91-9815473767 (M)
url : www.learnpunjabi.org