[Neobrahmigp] Malayalam LGR Proposal 20190218

Sarmad Hussain sarmad.hussain at icann.org
Mon Mar 4 16:59:12 UTC 2019


Dear Veena and NBGP members,

Thank you for sharing the revised version of the Malayalam proposal.  Please find below feedback on the proposal.  IP would like you to consider some further details.

Kindly update and share the proposal for further review by the IP members.

Regards,
Sarmad

  _____  

To: NeoBrahmi Generation Panel
From: Integration Panel

The Integration Panel has reviewed the updated draft for the Malayalam LGR dated 2019-02-20.

We are noting good progress, but there remain some major pieces that still need work to remove inconsistencies and errors, as well as quite a number of detailed suggestions for additional improvements of the documentation of the proposal.

The biggest issue is that the description and implementation of the various rules in the XML and the DOCx do not yet match. While the listing of sequences in the XML now matches that in the DOCx, Section 6.1 in the document contains the wrong explanation of how to handle 0D33 0D33 pairs, among other inconsistencies.

For Section 7.1.1 we noted that 0D7B was added to the context for 0D4D in the DOCx but is not implemented in the XML. Is that an oversight?

A further significant issue arises in the context of the new sequence Halant+RA. This must be addressed before the LGR can be finalized.



A few of the suggested changes amount to minor but essential corrections (e.g. fixing some code points in the XML and other details in the rules). The remaining items represent editorial issues.

- Integration Panel

  

Detailed Recommendations:

 

DOC:

(1) Section 7.1.1, the listing of members for "R" needs commas

(2) Section 7.1.1, the listing of members for "R", the "glyph" for 0D4D needs a space for better layout - at least on some versions of Word this reorders around the opening parenthesis.

(3) There are several discrepancies between the rules stated in the XML and the rules stated in Section 7.1.1

We list here the corresponding rule name from the XML and whether a rule matches the document or not.




(3.1) Rule 1:     H must be preceded by C or the M ു (0D41) or the L ൻ (0D7B)

    For this rule, there are several discrepancies between DOCx and XML as well as between Section 7.1.1 and Section 6.1 (1).




    In the XML this rule is implemented as "follows-only-C-or-0D41". There is no mention of 0D7B.

    Also, for the new sequence 0D4D 0D30, this rule is changed to "follows-only-C"
    (H as part of this sequence could not be preceded by 0D41 or 0D7B as written)

    Accordingly rule 1 needs to be restated, to be brought in alignment with 
    the XML - and  - the XML needs to be amended to match the rule 
    with respect to 0D7B if that is still the intent of the GP.

     [In section 6.1, there is discussion about needing to allow 0D7B 0D4D 0D31, 
     but this is not defined as a variant nor allowed in the current iteration of
     the LGR. If necessary change the discussion from "not disallowed" to "disallowed"]

     If it is the intent to allow an H to be preceded also by U+0D7B, the XML needs to be amended accordingly.




(3.2) Rule 2:     M must be preceded by C 

    Matches the XML's "follows-only-C" applied to cp's tagged as Matra.

    ( the use of "only" should be dropped in this and all other rule names 
     in the XML as it is implied by the use as a required context )

 

(3.3) Rule 3:     B must be preceded by C, V or M

    Matches the XML's "follows-only-C-V-or-M" applied to Anusvaram

 

(3.4) Rule 4:     X must be preceded by C, V or M

    Matches the XML's "follows-only-C-V-or-M" applied to Visargam

 

(3.5) Rule 5:     L cannot be preceded by B, X or H 

    Matches the XML's "follows-B-X-or-H" used as "not-when" applied to Chillu

 

(3.6) Rule 6:     Label does not begin with L 

    Matches the XML's "begins-with-L" used as a trigger for an action with disposition "invalid"

 

(3.7) Rule 7:     The ള (0D33) cannot immediately follow ള (0D33)

    This matches the XML's "followed-by-0D33" used as "not-when" context. However, because
    0D33 pairs that are part of defined sequences are exempt from this rule,
    the rule should be restated:

        Rule 7:     The character ള (0D33) cannot immediately follow ള (0D33), except as part of a defined sequence

 

(3.8) Rule 8: The റ (0D31) cannot immediately follow റ (0D31) 

    This matches the XML's "followed-by-0D31" used as "not-when" context. However, because
    0D31 pairs that are part of defined sequences are exempt from this rule, the rule should be restated:

        Rule 8:     The character റ (0D31) cannot immediately follow റ (0D31), except as part of a defined sequence

 

(4) The discussion of "Set 2" in Section 6.1 no longer matches the 
      solution proposed in the XML. This passage needs to be extensively rewritten as follows:

 

(4.1) The text documents the earlier solution which disallowed some sequences.

      This text should be removed, as it does not describe the actual solution, which
      involves variant sequences and context rules.

      "Therefore, NBGP has decided not to define Set 2 as variants, but to handle this case by using a WLE rule. The rule... "

      -->

       "Therefore, NBGP has decided to define a rule (rule 7 in Section 7).."

       and replace the following paragraph with new text:

"The sequences U+0D33 U+0D33    ( ളള ) / U+0D33 U+0D4D U+0D33  ( ള്ള ) and U+0D33 U+0D33 U+0D4D U+0D33  ( ളള്ള ) / U+0D33 U+0D4D U+0D33 U+0D33  ( ള്ളള ) have been defined as variant pairs. However, these sequences and variants are further constrained by context rules on both sequences and variants. To make the "null" variant well-behaved, none of the sequences, nor U+0D33 ( ള ), may be followed by a further U+0D33 . That limits all occurrences of U+0D33 to singletons or explicitly enumerated sequences. At the same time, the variant mappings are not defined if a sequence follows U+0D33 U+0D4D or follows U+0D4D U+0D33, in other words, if it is part of a longer sequence of 0D33 ( ള ) joined by Halant."

(4.2) An explanation of the context rules involving "R" needs to be provided 

Immediately after it, add a paragraph:

"If a reordrant matra follows a sequence it would graphically intervene, thus making the sequences no longer variants. Therefore, the variants are also not defined if a sequence is followed by a reordrant matra. These two context rules are combined into the single context on the variant mapping:

     V1: A variant preceded by 0D33+Halant or followed by 0D33 or R or Halant+0D33 is not defined"





 (4.3) The description of the analogous case of U+0D31 needs to be fixed:

Change: 




    "but instead of depending on that weak assumption, a WLE rule has been added." 

To:

    "but instead of depending on that weak assumption, sequences and variants have been defined in an entirely analogous manner to U+0D33 with a variant context:


     V2: A variant preceded by 0D31+Halant or followed by 0D31 or R or Halant+0D31 is not defined" 
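
As an illustration only, the variant contexts V1 and V2 can be sketched as a single predicate in Python. This is not the LGR XML implementation: the function name `variant_defined` and its arguments are hypothetical, and the membership of R is taken from the definition quoted from Section 7.1.1 further below.

```python
HALANT = "\u0D4D"

# R as quoted from Section 7.1.1 of the draft: five matras plus Halant+RA.
R_ITEMS = ("\u0D46", "\u0D47", "\u0D48", "\u0D4A", "\u0D4B", "\u0D4D\u0D30")


def variant_defined(label: str, start: int, end: int, base: str) -> bool:
    """Return False when V1 (base U+0D33) or V2 (base U+0D31) blocks the variant.

    label[start:end] is the sequence whose variant mapping is being considered.
    The function name and signature are hypothetical, for illustration only.
    """
    before, after = label[:start], label[end:]
    # "preceded by base+Halant"
    if before.endswith(base + HALANT):
        return False
    # "followed by base, or by Halant+base, or by R"
    blocked_after = (base, HALANT + base) + R_ITEMS
    return not after.startswith(blocked_after)
```

With `base` set to U+0D33 this mirrors V1, and with U+0D31 it mirrors V2; a sequence at the very end of a label has nothing following it, so only the "preceded by" condition can block its variant.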




(5) The added "community input" in the appendix is not easy to follow: it is unclear
      what conclusions the GP drew from the feedback and what changes were made
      or not made in response.

       Perhaps the appendix could start with an opening paragraph:

       "This appendix contains copies of all input related to the case of ള (0D33) + ള (0D33). For the adopted solution see  (Section 6.1)."




(6) Reference to MSR needs to be to final public version of the MSR:

[MSR]     Integration Panel, "Maximal Starting Repertoire — MSR-4 Overview and Rationale", 7 February 2019 https://www.icann.org/en/system/files/files/msr-4-overview-25jan19-en.pdf (Accessed on 18th February, 2019)

(7) The definitions in Section 7.1.1 define a category "R". This category has
      several issues:

    (a) It is not referred to in any of the rules 
          (because it appears in variant contexts only, see (4.1))

     (b) It contains not just code points, but one sequence 
           (see XML.9 below)

     (c) It is not discussed in the document, except in an appendix.

     Perhaps a note should be added to 7.1.1 that "R" is used in variant 
    contexts and point the reader to Section 6.1 for details.

 

(8) The technical term in Unicode for R is "reordrant" matra, and the IP recommends following that terminology where possible (instead of "reordering" matra).




 This comment (8) applies to the XML as well.




 (9) See discussion below for the XML on the sequence U+0D4D U+0D30: depending on how that feedback is resolved, the new sequence may prove unnecessary and could be removed. Otherwise, it would be helpful to have a bit more explanation of how Halant+RA functions in a limited way as a reordrant matra and what the implications of that are for IDNs.

XML

(1) XML passes tool

 

(2) Lines 370 and 402 each have a bogus code point: 
      D433 - presumably 0D33 and 0D31 respectively are intended instead

 

(3) rule "follows-0D33 ..." has an extra "l" in "follows".

 

(4) reference to MSR-4 needs to be to final public version:

[MSR-4]     Integration Panel, "Maximal Starting Repertoire — MSR-4 Overview and Rationale", 7 February 2019 https://www.icann.org/en/system/files/files/msr-4-overview-25jan19-en.pdf (Accessed on 18th February, 2019)

(5) Line 359 and 391: the comment has an extra space before "precede" and "follwowed" should be spelled "followed"

 

(6) On line 102 the word "prededed" should be "preceded".

 

(7) On line 101 the code point 0D33 should be 0D31

 

(8) On line 67, the last word should be "conjunct".

 

(9) The "variable" R from the rules in Section 7.1.1 is implemented as a "class"; this means it is not
     possible to account for the sequence <U+0D4D U+0D30> as part of the definition of "R".

 

    A possible fix would be to also define a rule R as follows:

    

    <rule name="R" comment="Reordrant Malayalam matras, including sequence U+0D4D U+0D30">
        <choice>
            <class by-ref="R" />
            <char cp="0D4D 0D30" />
        </choice>
    </rule>

 

     and to change any <class by-ref="R"> on lines 354 and 386 to <rule by-ref="R" />

 

(10) " or reordrant vowel" should become  " or reordrant matra" in two comments on rules.

 

(11) After text has been added to the proposal document the [TBD] on line 68 should become:

More details in Sections 6.1 "In-script Variants" and 7.1.1 "Variables or definitions" of the [Proposal]

(12) The <description> could use expanded documentation on the context rules for variants.

        Suggested text for the end of the "Variants" section:

 

<p>Context Rules for Variants: some of the variants defined in this LGR are "effective null variants", that is,
    some code points in the source map to "nothing" in the target with all other code points unchanged.
    (Because mappings are symmetric, it does not matter whether it is the forward or reverse mapping that
    maps to "null"). Such variants require a context rule to keep the variant set well-behaved. Symmetry requires
    the same context rule for both forward and reverse mappings.</p>
   
    <p>In other cases, the sequences or code points making up source and target are constrained by context
    rules on the code points. In such a case, any variants require context rules that match the intersection
    between the contexts for both source and target; otherwise a sequence might be considered valid in some
    variant label when it would not be valid in an equivalent context in an original label.</p>

(13) Suggested text for the end of the WLE section (add above "The rules are:")

    <p>Note: the Reordrant Matras include one sequence. That requires an auxiliary rule R in addition to class R.</p>

 

(14) The description of several of the character classes could be edited as follows to align better with the names for the character classes being described as well as general copy editing:

 

    <p>Consonant: Malayalam is written in an abugida script derived ultimately from Brāhmī in which 
    every consonant carries an inherent a. More details in Section 3.8, "The Structure of 
    Malayalam Script" of the [Proposal].</p>

    <p>Matra: Vowels other than the inherent vowel are written as vowel diacritics. They are referred to as Matras, 
    when they follow consonants. More details in Section 3.8, "The Structure of Malayalam Script" of the [Proposal].</p>
    
    <p>Halant: A consonant can be combined with another consonant or conjunct 
    using the halant encoded as U+0D4D MALAYALAM SIGN VIRAMA. This strips off the implicit vowel. 
     More details in Section 3.8, "The Structure of Malayalam Script" of the [Proposal].</p>

    <p>Anusvaram: In Malayalam, anusvara, represented as ം (0D02), simply represents a consonant /m/ after a vowel,
    though this /m/ may be assimilated to another nasal consonant. More details in Section 3.8 "The Structure of Malayalam 
    Script" of the [Proposal].</p>

    <p>Visargam: /വിസർഗം,/ (visargam), or visarga, represents a consonant /h/ after a vowel, 
    and is transliterated as ḥ. Like the anusvara, it is a special symbol, and is never followed by an 
    inherent vowel or another vowel. More details in Section 3.8, "The Structure of Malayalam 
    Script" of the [Proposal].</p>

    <p>Chillu: Chillu letters, aka "Chillaksharam", represent pure consonants without any vowel sound. 
    More details in Section 3.8, "The Structure of Malayalam Script" of the [Proposal].</p>

(15) the word "-only-" can be deleted from all rule names in the XML as it is redundant

(16) The list of rules in the <description> could be numbered (and split into separate lists) with additional information as follows:

    <p>The rules are: </p>
     <ul>
         <li>1. H: must be preceded by C or 0D41</li>
         <li>2. M: must be preceded by C</li>
        <li>3. B: must be preceded by C, V or M</li>
        <li>4. X: must be preceded by C, V or M</li>
        <li>5. L: cannot be preceded by B, X or H</li>
        <li>6. A label does not begin with L</li> 
     </ul>
     <p>The following context rules  apply to code points U+0D33 and U+0D31 as well as to sequences ending in these code points:</p>
     <ul>
        <li>7. The character ള (0D33) cannot immediately follow ള (0D33), except as part of a defined sequence</li>
        <li>8. The character റ (0D31) cannot immediately follow റ (0D31), except as part of a defined sequence</li>
     </ul>
    <p>The following context rules apply to variants:</p>
     <ul>
        <li>V1: A variant preceded by 0D33+Halant or followed by 0D33 or R or Halant+0D33 is not defined</li>
        <li>V2: A variant preceded by 0D31+Halant or followed by 0D31 or R or Halant+0D31 is not defined</li>
     </ul>
    
    <p>More details in Section 6.1 "In-script Variants" and Section 7, "Whole Label Evaluation Rules (WLE)" of the [Proposal]</p>
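
To make the reading of rules 1 to 6 concrete, here is a minimal Python sketch of them as a left-to-right check. It is an illustration under stated assumptions, not the LGR implementation: the class ranges in `classify` are rough Unicode-block ranges assumed for this example (the actual repertoire is narrower), the function names are hypothetical, and rule 1 follows the XML reading (C or 0D41, without 0D7B). Rules 7, 8, V1 and V2 are not covered.

```python
from typing import Optional


def classify(ch: Optional[str]) -> Optional[str]:
    """Map a code point to a class letter (assumed, approximate ranges)."""
    if ch is None:
        return None
    cp = ord(ch)
    if 0x0D15 <= cp <= 0x0D39:
        return "C"  # consonants
    if 0x0D05 <= cp <= 0x0D14:
        return "V"  # independent vowels
    if 0x0D3E <= cp <= 0x0D4C:
        return "M"  # matras
    if cp == 0x0D4D:
        return "H"  # halant
    if cp == 0x0D02:
        return "B"  # anusvaram
    if cp == 0x0D03:
        return "X"  # visargam
    if 0x0D7A <= cp <= 0x0D7F:
        return "L"  # chillu
    return "?"


def first_violation(label: str) -> Optional[int]:
    """Return the number of the first rule (1 to 6) violated, or None."""
    prev = None
    for i, ch in enumerate(label):
        cls, prev_cls = classify(ch), classify(prev)
        if cls == "H" and not (prev_cls == "C" or prev == "\u0D41"):
            return 1
        if cls == "M" and prev_cls != "C":
            return 2
        if cls == "B" and prev_cls not in ("C", "V", "M"):
            return 3
        if cls == "X" and prev_cls not in ("C", "V", "M"):
            return 4
        if cls == "L":
            if i == 0:
                return 6
            if prev_cls in ("B", "X", "H"):
                return 5
        prev = ch
    return None
```

For example, a label starting with a Halant would be reported under rule 1, and a label starting with a Chillu under rule 6.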


(18) Some reviewers found it difficult to relate rules 7 and 8 to the context rules defined. Add a further note at the end of the rules section:

<p>Note: the implementation of Rules 7 and 8 relies on the fact that a context rule is not evaluated between code points in the same sequence. For example, if a label contains two adjacent U+0D33 U+0D33 surrounded by other code points, the two code points can only be interpreted as the sequence U+0D33 U+0D33 ളള because a singleton U+0D33 ള is not allowed to be followed by another U+0D33 ള.</p>
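
The point of this note can be sketched in a few lines of Python. This is only an illustration: `rule7_ok` is a hypothetical helper that checks rule 7 at the boundaries between repertoire items of one given partition, which is where a context rule is evaluated; it never looks inside an item.

```python
from typing import List

LLA, HALANT, KA = "\u0D33", "\u0D4D", "\u0D15"


def rule7_ok(parts: List[str]) -> bool:
    """Check rule 7 for one partition of a label into repertoire items.

    An item ending in U+0D33 (the singleton or a defined sequence) must not
    be immediately followed by another U+0D33. Pairs inside a single item
    are never examined, which is what exempts the defined sequences.
    """
    return not any(
        a.endswith(LLA) and b.startswith(LLA)
        for a, b in zip(parts, parts[1:])
    )
```

Here `rule7_ok([KA, LLA, LLA, KA])` fails because the two singletons are adjacent items, while `rule7_ok([KA, LLA + LLA, KA])` passes because the pair is a single item.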

(19) update the comments on the following rules as follows:

<rule name= "followed-by-0D33" comment="Section 7, WLE 7. The character ള (0D33) cannot immediately follow ള (0D33), except as part of a defined sequence">




 <rule name= "followed-by-0D31" comment="Section 7, WLE 8. The character റ (0D31) cannot immediately follow റ (0D31), except as part of a defined sequence">

<rule name= "follows-0D33-0D4D-or-followed-by-0D33-or-0D4D-0D33-or-R" comment="Section 6.1, V1: variant not defined if preceded by 0D33+Halant or followed by Halant+0D33 or 0D33 or R">




<rule name= "follows-0D31-0D4D-or-followed-by-0D31-or-0D4D-0D31-or-R" comment="Section 6.1, V2: variant not defined if preceded by 0D31+Halant or followed by Halant+0D31 or 0D31 or R"> 

            (Move the rule up, so it follows rule 7; reorder the remaining unnumbered <rule> elements so those 
             referring to 0D33 occur consistently before those referring to 0D31).






(20) Change all instances of "prevent variant if" in XML to "variant not defined if" to match language elsewhere.




 (21) Definition of sequence U+0D4D U+0D30



Given the following excerpt from the repertoire table (from the XML, but shown here as formatted in the HTML format):


U+0D4D | ് | Malayalam | MALAYALAM SIGN VIRAMA | [106] | Halant | follows-only-C-or-0D41 | ✔

U+0D4D U+0D30 | ്ര | [Malayalam] | MALAYALAM SIGN VIRAMA + MALAYALAM LETTER RA | | | follows-only-C | ✔

we notice that the sequence U+0D4D U+0D30  has a *more restrictive* context rule than the singleton (0D4D) that starts the sequence. As a result the difference in context rule becomes *ineffective*. We believe that this reflects a common misunderstanding about how "partitions" work in a label in the evaluation of context rules.

In a label .... 0D41 0D4D 0D30 .... the partition .... {0D41} {0D4D} {0D30} .... would lead to a valid label (given that there is no context rule for 0D30). Therefore, the alternate partition .... {0D41} {0D4D 0D30} ...., even though it generates an invalid label, is ignored. It does not somehow "veto" or "override" the other legal partition.
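
The partition argument above can be sketched in Python. This is a toy model, not the LGR tool: the repertoire is cut down to the five items needed, the consonant class is an assumed two-member stand-in, and all function names are hypothetical.

```python
from typing import Callable, Dict, Iterator, List, Optional

KA, RA, SIGN_U, HALANT = "\u0D15", "\u0D30", "\u0D41", "\u0D4D"

# Assumption: a tiny consonant class, just enough for the demonstration.
CONSONANTS = {KA, RA}

Context = Optional[Callable[[Optional[str]], bool]]


def follows_c(prev: Optional[str]) -> bool:
    return prev in CONSONANTS


def follows_c_or_0d41(prev: Optional[str]) -> bool:
    return prev in CONSONANTS or prev == SIGN_U


# Repertoire items mapped to their required context (None = no context rule).
REPERTOIRE: Dict[str, Context] = {
    KA: None,
    RA: None,
    SIGN_U: follows_c,
    HALANT: follows_c_or_0d41,
    HALANT + RA: follows_c,
}


def partitions(label: str) -> Iterator[List[str]]:
    """Yield every way of splitting label into repertoire items."""
    if not label:
        yield []
        return
    for item in REPERTOIRE:
        if label.startswith(item):
            for rest in partitions(label[len(item):]):
                yield [item] + rest


def partition_is_valid(parts: List[str]) -> bool:
    """Evaluate each item's context against the preceding code point."""
    prev = None
    for item in parts:
        rule = REPERTOIRE[item]
        if rule is not None and not rule(prev):
            return False
        prev = item[-1]
    return True


def label_is_valid(label: str) -> bool:
    # One valid partition is enough; invalid ones do not veto it.
    return any(partition_is_valid(p) for p in partitions(label))
```

For the label 0D15 0D41 0D4D 0D30 this model finds two partitions. The one using the sequence {0D4D 0D30} fails "follows-only-C", but the one using the singletons passes, so `label_is_valid` accepts the label.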

If it is somehow important to prevent Halant+Ra from following vowel sign U, then the context rule for Halant could be changed to

    follows-only-C-or-0D41-and-precedes-anything-but-0D30

That would force any combination of 0D4D 0D30 to use the defined sequence and its context rule. Writing such a rule requires expressing the equivalent of [^\u0D30] in Regex notation. The formulation would be a bit involved, but not too much. 
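
In regex terms, the tightened context for the bare Halant could look roughly like the following Python sketch. It is not LGR XML, and it rests on assumptions: the range U+0D15 to U+0D39 stands in for class C, and `halant_ok_as_singleton` is a hypothetical helper name. Note also that the negative lookahead used here accepts a Halant at the end of a label, whereas a literal [^\u0D30] would require some code point to follow; which of the two is intended would be for the GP to decide.

```python
import re

# Assumption: U+0D15..U+0D39 stands in for class C; U+0D41 is vowel sign U.
# Lookbehind = "follows C or 0D41"; negative lookahead = "not before 0D30".
BARE_HALANT = re.compile("(?<=[\u0D15-\u0D39\u0D41])\u0D4D(?!\u0D30)")


def halant_ok_as_singleton(label: str, pos: int) -> bool:
    """True if the U+0D4D at index pos satisfies the tightened context."""
    allowed = {m.start() for m in BARE_HALANT.finditer(label)}
    return pos in allowed
```

Under this context, a Halant directly before 0D30 can no longer be read as a singleton, so only the sequence U+0D4D U+0D30, with its own "follows-only-C" context, remains available for that position.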

We further note that the sequence occurs one more time in the LGR for the following definition: 

        R    →     Re-Ordering Matra 
                    R =  ( െ, േ,  ൈ, ൊ, ോ,  ് + ര)
            U+0D46 (െ) U+0D47 (േ)U+0D48 (ൈ) U+0D4A (ൊ) U+0D4B (ോ) and [U+0D4D (്) U+0D30 (ര)]

It is not possible for a context rule to directly affect such a definition. Therefore, for the purpose of the definition, any occurrence of U+0D4D U+0D30 would be a "reordrant matra".

The definition of R is used *only* in cases where this sequence follows either 0D31 or 0D33. Even though it is the case that following 0D41 (or probably any non-consonant) this sequence does not act like a reordrant matra, the definition is never invoked in these cases, therefore, it would not be necessary to give this sequence a restricted context for the purpose of defining R. 

Finally, it is not a requirement that a sequence cited in the body of a rule must be listed as a sequence in the repertoire. The latter is only necessary if the sequence is to participate in context rule evaluation (that is, if the sequence can take the place of an "anchor" in a rule). That's not the case for "R" in the Malayalam LGR.

The IP does not have enough data to make a final recommendation, because that would depend on the intent of the GP:

(A) We performed some searches (after prefixing everything with a consonant 0D15), and found no matches for കു്ര , but matches for കു് (that is, without the RA). That seems to indicate that the GP is correct in that such a sequence of Halant+RA does not occur following an 0D41.

(B) However, that does not necessarily mean that such a sequence must be prohibited for LGRs - that would be a separate conclusion.

(C) On at least one test system used, the sequence displays fine; that is perhaps not surprising because RA itself is not a combining mark (കു്ര). Taking the U out in the middle of the sequence gets ക്ര, which is easily found in online documents -- and which displays the Halant+RA before the consonant.

The IP can only give this conditional feedback:

*	If the restriction is not required (that is, the only cost is overproducing some labels that are "nonsense" but still recognizable) then the IP would recommend deleting that sequence from the repertoire, on the grounds that the rule is a "spelling rule".
*	If there is a valid claim that allowing this sequence following 0D41 is a concern from a user confusion/security aspect, then the GP would need to provide a corrected version of the context rule for the bare code point that actually works as intended.
*	Neither of the options impacts the definition of R from the perspective of the LGR; linguistically, the sequence cannot follow 0D41 and still be reordrant (but that's not important here).

TXT

testing TBD: further testing awaits corrections of errors and omissions in the normative part of the XML.

  

  _____  

 
