[Neobrahmigp] FW: Corpus review for punjabi

Sarmad Hussain sarmad.hussain at icann.org
Tue May 8 18:01:41 UTC 2018


Dear Dr. Lehal, All,

Thank you for sharing the updated LGR proposal for the Gurmukhi script.  The Integration Panel (IP) is currently reviewing it and preparing its feedback document.

In the meantime, the IP has run a corpus of Punjabi in Gurmukhi script against the proposal; the test results are attached and summarized below.  In the summary, the IP has identified some rules for which the percentage of invalid labels is somewhat high (in red below).  You can review the actual labels in the attached data file, which is marked up accordingly.

The IP would like to share this data and the summary below with the NBGP so that the GP can reconfirm that the failing labels should actually fail, and that the indicated rules are not too restrictive.

We aim to share the IP feedback document next week.  Please let us know if you have any questions.

Regards,
Sarmad

=============


Corpus: https://github.com/unicode-org/unilex/tree/master/data/frequency

Full Test results attached.

A./

SUMMARY

    Total Labels processed: 171388 of which
         valid labels:   163289
         invalid labels: 7391
         skipped labels: 708 of which
            duplicate labels:      21
            broken labels:         11  <-- rejected by the IDN library as not NFC or otherwise malformed
            contain join controls: 287 <-- are these stylistic or orthographic?
            start w/ wrong script: 389 (contamination)

Number of invalid labels by reason:
   4742 instances of not in repertoire
    173 instances of out-of-repertoire variant
    167 instances of invalid context (Follows-only-specific-V-or-M)   0.1%
    238 instances of invalid context (Follows-only-C-or-N-and-precedes-only-C2)   0.15%
    285 instances of invalid context (Follows-only-C-N-or-specific-V-or-M)   0.17%
     61 instances of invalid context (Follows-only-C1)
    833 instances of invalid context (Follows-only-C-or-N)   0.5%
    892 instances of invalid context (Follows-only-C-N-or-specific-V-or-M-and-precedes-only-C3-or-specific-CN)   0.6%
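
As a cross-check of the figures above, the following minimal Python sketch recomputes the totals and the per-reason shares. The counts are copied from this summary; the denominator (all processed labels) is an assumption, since the report does not state which base its rough percentages use, and the `share` helper is purely illustrative.

```python
# Counts copied from the summary in this message.
total_processed = 171388
valid, invalid, skipped = 163289, 7391, 708

skipped_breakdown = {
    "duplicate labels": 21,
    "broken labels": 11,
    "contain join controls": 287,
    "start w/ wrong script": 389,
}

invalid_by_reason = {
    "not in repertoire": 4742,
    "out-of-repertoire variant": 173,
    "Follows-only-specific-V-or-M": 167,
    "Follows-only-C-or-N-and-precedes-only-C2": 238,
    "Follows-only-C-N-or-specific-V-or-M": 285,
    "Follows-only-C1": 61,
    "Follows-only-C-or-N": 833,
    "Follows-only-C-N-or-specific-V-or-M-and-precedes-only-C3-or-specific-CN": 892,
}

# The three levels of the summary should be mutually consistent.
assert valid + invalid + skipped == total_processed
assert sum(skipped_breakdown.values()) == skipped
assert sum(invalid_by_reason.values()) == invalid


def share(count, base=total_processed):
    """Percentage of `base`; the base is an assumed denominator."""
    return 100.0 * count / base


# Print reasons from most to least frequent.
for reason, count in sorted(invalid_by_reason.items(), key=lambda kv: -kv[1]):
    print(f"{count:6d}  {share(count):5.2f}%  {reason}")
```

With this denominator the two largest context-rule failures come out at about 0.49% and 0.52% of all processed labels, in line with the rough figures quoted.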

** rough indication of percentage; higher-percentage failures may indicate either that certain typos are common or that
** a rule is too restrictive. The following example shows some of the contexts detected for one of the rules; for more detail
** and the actual labels, see the attached file.

  Contexts not matching rule "Follows-only-C-or-N":
    [:Bindi:]  ⚓=[:Matra:]
    [:Matra:]  ⚓=[:Matra:]
    [:Tippi:]  ⚓=[:Matra:]
    [:Vowel:]  ⚓=[:Matra:]
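
To make the listing above concrete, here is a rough Python sketch of how a "follows only C or N" check could flag these contexts. It works on tag values rather than code points. It assumes that class C corresponds to the tag Consonant and class N to the tag Nukta, and that the anchor is a code point tagged Matra, as in the listing; the actual class and rule definitions are in the LGR proposal, not in this message, and `failing_contexts` is a hypothetical helper, not part of any LGR tool.

```python
# Assumption: class C == tag "Consonant", class N == tag "Nukta".
ALLOWED_BEFORE = {"Consonant", "Nukta"}


def failing_contexts(tags, anchor="Matra"):
    """Return (preceding_tag, anchor) pairs that violate the sketched rule.

    `tags` holds one tag value per code point of a label. An anchor at the
    start of the label has nothing before it and is reported with None.
    """
    failures = []
    for i, tag in enumerate(tags):
        if tag != anchor:
            continue
        before = tags[i - 1] if i > 0 else None
        if before not in ALLOWED_BEFORE:
            failures.append((before, tag))
    return failures


# A matra after a consonant, or after consonant + nukta, passes.
print(failing_contexts(["Consonant", "Matra"]))
print(failing_contexts(["Consonant", "Nukta", "Matra"]))
# A matra after an independent vowel or another matra is flagged.
print(failing_contexts(["Vowel", "Matra"]))
print(failing_contexts(["Consonant", "Matra", "Matra"]))
```

Under these assumptions, the four contexts listed above (Bindi, Matra, Tippi, or Vowel before a Matra) are exactly the kind of pair this check reports.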


Test Label Coverage:
Repertoire (code points):  56 of  56. {0A02 0A05-0A0A 0A0F-0A10 0A13-0A28 0A2A-0A30 0A32 0A35 0A38-0A39 0A3C 0A3E-0A42 0A47-0A48 0A4B-...}
Repertoire not covered:   0 of  56. {}
Out of Repertoire:         80. [{0027 002E 0030-003A 0061-0062 0064-0065 0067 0069-006A 006C 0070 0073 0075 0078 00E0 00E2 00ED-00EE 0901-0902 0906-0909 090F 0913 0915-0918 091A-091D 091F-0924 0926-0928 092A 092C-0930 0932 0935-0939 093C 093E-0942 0947-0948 094B 094D 0A6B 0A72-0A74}]  <-- excluded code points highlighted
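
The out-of-repertoire list above can be expanded and tallied with a short Python sketch. The range list is copied from this report; `expand_ranges` is an illustrative helper, and the grouping into blocks relies only on the standard Unicode block boundaries for Devanagari (U+0900 to U+097F) and Gurmukhi (U+0A00 to U+0A7F).

```python
def expand_ranges(spec):
    """Expand a space-separated list such as '0A05-0A0A 0A0F' into code points."""
    code_points = []
    for item in spec.split():
        first, _, last = item.partition("-")
        start = int(first, 16)
        end = int(last, 16) if last else start
        code_points.extend(range(start, end + 1))
    return code_points


# Copied from the "Out of Repertoire" line of the report.
OUT_OF_REPERTOIRE = (
    "0027 002E 0030-003A 0061-0062 0064-0065 0067 0069-006A 006C 0070 0073 "
    "0075 0078 00E0 00E2 00ED-00EE 0901-0902 0906-0909 090F 0913 0915-0918 "
    "091A-091D 091F-0924 0926-0928 092A 092C-0930 0932 0935-0939 093C "
    "093E-0942 0947-0948 094B 094D 0A6B 0A72-0A74"
)

cps = expand_ranges(OUT_OF_REPERTOIRE)

# Group by Unicode block to see where the stray code points come from.
devanagari = [cp for cp in cps if 0x0900 <= cp <= 0x097F]
gurmukhi = [cp for cp in cps if 0x0A00 <= cp <= 0x0A7F]
other = [cp for cp in cps if cp < 0x0900]

print(len(cps), len(other), len(devanagari), len(gurmukhi))
```

The expansion yields 80 code points, matching the count in the report: 29 from the Latin range, 47 Devanagari, and 4 Gurmukhi code points outside the proposed repertoire.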

Tag Values:                12 of  12.
    Addak
    Bindi
    C1
    C2
    Consonant
    M1
    Matra
    Nukta
    Tippi
    V1
    Virama
    Vowel
Named Classes:             13 of  13.
    A
    B
    C
    C1
    C2
    C3
    M
    M1
    M2
    N
    V
    V1
    V2
Context Rules matched:      6 of   6.

    Follows-only-C-or-N-and-precedes-only-C2
    Follows-only-C-or-N
    Follows-only-specific-V-or-M
    Follows-only-C-N-or-specific-V-or-M-and-precedes-only-C3-or-specific-CN
    Follows-only-C1
    Follows-only-C-N-or-specific-V-or-M

Context Rules failed:       6 of   6.
    Follows-only-C-N-or-specific-V-or-M-and-precedes-only-C3-or-specific-CN
    Follows-only-specific-V-or-M
    Follows-only-C-or-N
    Follows-only-C-or-N-and-precedes-only-C2
    Follows-only-C-N-or-specific-V-or-M
    Follows-only-C1

When Rules defined: (required context)
    Follows-only-specific-V-or-M
    Follows-only-C1
    Follows-only-C-or-N
    Follows-only-C-or-N-and-precedes-only-C2
    Follows-only-C-N-or-specific-V-or-M
    Follows-only-C-N-or-specific-V-or-M-and-precedes-only-C3-or-specific-CN

Not-When Rules defined: (prohibited context)
     (none)


-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: Disp-Unilex-punjabi-Guru-20180501.log
URL: <http://mm.icann.org/pipermail/neobrahmigp/attachments/20180508/7b10ecd0/Disp-Unilex-punjabi-Guru-20180501-0001.log>

