[Neobrahmigp] FW: Corpus review for punjabi
Sarmad Hussain
sarmad.hussain at icann.org
Tue May 8 18:01:41 UTC 2018
Dear Dr. Lehal, All,
Thank you for sharing the updated LGR proposal for the Gurmukhi script. The Integration Panel (IP) is currently reviewing it and preparing the feedback document.
In the meantime, the IP has run a corpus of Punjabi in Gurmukhi script; the test results are attached and summarized below. In the summary, the IP has identified some rules for which the percentage of invalid labels is somewhat high (in red below). You can review the actual labels in the attached data file, which is marked up accordingly.
The IP would like to share this data and the summary below with the NBGP so that the GP can reconfirm that the failing labels should indeed fail, and that the indicated rules are not too restrictive.
We aim to share the IP feedback document next week. Please let us know if you have any questions.
Regards,
Sarmad
=============
Corpus: https://github.com/unicode-org/unilex/tree/master/data/frequency
Full Test results attached.
A./
SUMMARY
Total Labels processed: 171388 of which
valid labels: 163289
invalid labels: 7391
skipped labels: 708 of which
duplicate labels: 21
broken labels: 11 <-- rejected by IDN library as not NFC or otherwise malformed
contain join controls: 287 <-- are these stylistic or orthographic?
start w/ wrong script: 389 (contamination)
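The skip categories above could be approximated with a check along the following lines. This is a minimal sketch, not the IP's actual tool: the function name `skip_reason`, the order of the checks, and the use of the Gurmukhi block (U+0A00 to U+0A7F) as the script test are all assumptions made for illustration.

```python
import unicodedata

JOIN_CONTROLS = {"\u200C", "\u200D"}  # ZWNJ and ZWJ


def skip_reason(label, seen):
    """Return why a label would be skipped, or None if it should be tested.

    Hypothetical helper: the categories mirror the summary, but the
    logic and the order of the checks are assumed.
    """
    # "broken": empty or not in Unicode Normalization Form C
    if not label or unicodedata.normalize("NFC", label) != label:
        return "broken"
    # "duplicate": already seen earlier in the corpus
    if label in seen:
        return "duplicate"
    seen.add(label)
    # "join control": contains ZWJ or ZWNJ
    if any(ch in JOIN_CONTROLS for ch in label):
        return "join control"
    # "wrong script": first code point lies outside the Gurmukhi block
    if not ("\u0A00" <= label[0] <= "\u0A7F"):
        return "wrong script"
    return None
```

A caller would keep one `seen` set for the whole corpus and tally the returned reasons; labels for which the function returns `None` would go on to the repertoire and context-rule tests.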
Number of invalid labels by reason:
4742 instances of not in repertoire
173 instances of out-of-repertoire variant
167 instances of invalid context (Follows-only-specific-V-or-M) 0.1%
238 instances of invalid context (Follows-only-C-or-N-and-precedes-only-C2) 0.15%
285 instances of invalid context (Follows-only-C-N-or-specific-V-or-M) 0.17%
61 instances of invalid context (Follows-only-C1)
833 instances of invalid context (Follows-only-C-or-N) 0.5%
892 instances of invalid context (Follows-only-C-N-or-specific-V-or-M-and-precedes-only-C3-or-specific-CN) 0.6%
** rough indication of percentage; higher-percentage failures may indicate either that certain typos are common or that
** a rule is too restrictive. The following example shows some of the contexts detected for one of the rules - for more detail
** and the actual labels, see the attached file.
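The rough percentages can be recomputed from the counts in the summary. The sketch below assumes the denominator is the total number of labels processed; the message does not say which base was used, so the results may differ slightly from the rounded figures quoted above. The names `invalid_by_reason` and `percent_of_total` are illustrative.

```python
TOTAL_LABELS = 171388  # "Total Labels processed" from the summary

# Counts copied from "Number of invalid labels by reason"
invalid_by_reason = {
    "not in repertoire": 4742,
    "out-of-repertoire variant": 173,
    "Follows-only-specific-V-or-M": 167,
    "Follows-only-C-or-N-and-precedes-only-C2": 238,
    "Follows-only-C-N-or-specific-V-or-M": 285,
    "Follows-only-C1": 61,
    "Follows-only-C-or-N": 833,
    "Follows-only-C-N-or-specific-V-or-M-and-precedes-only-C3-or-specific-CN": 892,
}


def percent_of_total(count, total=TOTAL_LABELS):
    """Share of all processed labels, in percent (assumed denominator)."""
    return round(100.0 * count / total, 2)


for reason, count in invalid_by_reason.items():
    print(f"{count:5d}  {percent_of_total(count):5.2f}%  {reason}")
```

The per-reason counts sum to 7391, the number of invalid labels reported in the summary.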
Contexts not matching rule "Follows-only-C-or-N":
[:Bindi:] ⚓=[:Matra:]
[:Matra:] ⚓=[:Matra:]
[:Tippi:] ⚓=[:Matra:]
[:Vowel:] ⚓=[:Matra:]
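To make the failing contexts above concrete, here is a minimal sketch of how a "Follows-only-C-or-N" check on a Matra might look. The class membership is an assumption: the consonant and matra ranges are taken from the repertoire listing in this message (the matra range is truncated there, so only the visible code points are used), and the actual class definitions in the LGR proposal may differ. The function name is hypothetical.

```python
# Assumed class membership, for illustration only; the LGR proposal
# is the authority on what C, N and Matra actually contain.
CONSONANT = [(0x0A15, 0x0A28), (0x0A2A, 0x0A30), (0x0A32, 0x0A32),
             (0x0A35, 0x0A35), (0x0A38, 0x0A39)]
NUKTA = [(0x0A3C, 0x0A3C)]
# Only the matra code points visible in the (truncated) repertoire listing
MATRA = [(0x0A3E, 0x0A42), (0x0A47, 0x0A48), (0x0A4B, 0x0A4B)]


def in_ranges(ch, ranges):
    """True if the code point of ch falls in any (low, high) range."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in ranges)


def follows_only_c_or_n_failures(label):
    """Return indexes of matras not immediately preceded by C or N."""
    failures = []
    for i, ch in enumerate(label):
        if not in_ranges(ch, MATRA):
            continue
        prev = label[i - 1] if i > 0 else None
        if prev is None or not (in_ranges(prev, CONSONANT)
                                or in_ranges(prev, NUKTA)):
            failures.append(i)
    return failures
```

Under these assumptions, a matra following a consonant or nukta passes, while a matra following another matra or an independent vowel is reported, which corresponds to the `[:Matra:] ⚓=[:Matra:]` and `[:Vowel:] ⚓=[:Matra:]` contexts listed above.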
Test Label Coverage:
Repertoire (code points): 56 of 56. {0A02 0A05-0A0A 0A0F-0A10 0A13-0A28 0A2A-0A30 0A32 0A35 0A38-0A39 0A3C 0A3E-0A42 0A47-0A48 0A4B-...}
Repertoire not covered: 0 of 56. {}
Out of Repertoire: 80. [{0027 002E 0030-003A 0061-0062 0064-0065 0067 0069-006A 006C 0070 0073 0075 0078 00E0 00E2 00ED-00EE 0901-0902 0906-0909 090F 0913 0915-0918 091A-091D 091F-0924 0926-0928 092A 092C-0930 0932 0935-0939 093C 093E-0942 0947-0948 094B 094D 0A6B 0A72-0A74}] <-- excluded code points highlighted
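The code point listings use space-separated hexadecimal values and ranges. A small parser such as the following sketch can expand them for cross-checking against the reported counts; the function name `parse_code_points` is illustrative and not part of any tool mentioned here.

```python
def parse_code_points(spec):
    """Expand a listing like '0027 0030-003A' into a set of integers."""
    code_points = set()
    for token in spec.split():
        start, _, end = token.partition("-")
        first = int(start, 16)
        last = int(end, 16) if end else first
        code_points.update(range(first, last + 1))
    return code_points


# Copied from the "Out of Repertoire" line of the summary
OUT_OF_REPERTOIRE = (
    "0027 002E 0030-003A 0061-0062 0064-0065 0067 0069-006A 006C 0070 "
    "0073 0075 0078 00E0 00E2 00ED-00EE 0901-0902 0906-0909 090F 0913 "
    "0915-0918 091A-091D 091F-0924 0926-0928 092A 092C-0930 0932 "
    "0935-0939 093C 093E-0942 0947-0948 094B 094D 0A6B 0A72-0A74"
)

out_of_repertoire = parse_code_points(OUT_OF_REPERTOIRE)
print(len(out_of_repertoire))
```

For the listing above this yields 80 code points, consistent with the reported total.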
Tag Values: 12 of 12.
Addak
Bindi
C1
C2
Consonant
M1
Matra
Nukta
Tippi
V1
Virama
Vowel
Named Classes: 13 of 13.
A
B
C
C1
C2
C3
M
M1
M2
N
V
V1
V2
Context Rules matched: 6 of 6.
Follows-only-C-or-N-and-precedes-only-C2
Follows-only-C-or-N
Follows-only-specific-V-or-M
Follows-only-C-N-or-specific-V-or-M-and-precedes-only-C3-or-specific-CN
Follows-only-C1
Follows-only-C-N-or-specific-V-or-M
Context Rules failed: 6 of 6.
Follows-only-C-N-or-specific-V-or-M-and-precedes-only-C3-or-specific-CN
Follows-only-specific-V-or-M
Follows-only-C-or-N
Follows-only-C-or-N-and-precedes-only-C2
Follows-only-C-N-or-specific-V-or-M
Follows-only-C1
When Rules defined: (required context)
Follows-only-specific-V-or-M
Follows-only-C1
Follows-only-C-or-N
Follows-only-C-or-N-and-precedes-only-C2
Follows-only-C-N-or-specific-V-or-M
Follows-only-C-N-or-specific-V-or-M-and-precedes-only-C3-or-specific-CN
Not-When Rules defined: (prohibited context)
(none)
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: Disp-Unilex-punjabi-Guru-20180501.log
URL: <http://mm.icann.org/pipermail/neobrahmigp/attachments/20180508/7b10ecd0/Disp-Unilex-punjabi-Guru-20180501-0001.log>