[Comments-malayalam-tamil-25sep18] A quick review of the Malayalam proposal

Wed Nov 7 16:00:24 UTC 2018

- §3, §3.1–§3.3: It is unclear what purpose such a lengthy and detailed introduction to the script’s history serves. Move it to an appendix, or just remove it.

- §3.5, “Sanskrit, although it falls under EGIDS 4, is not considered in Malayalam script LGR because Malayalam is rarely used to write Sanskrit.”: The Sanskrit language’s Malayalam writing system should have its own EGIDS rating for such an evaluation.

- §3.6, “ICANN's Maximal Starting Repertoire (MSR) for IDN LGR is based on these exclusion rules for ZWJ and ZWNJ.”: Based on exactly which rules?

- §3.6, “But there are no identified cases where a missing ZWNJ forms another valid word with different meaning.”: What’s this discussion of attested “another valid word with different meaning” meant to reflect? A wrong spelling is simply another word, and whether this “another word” means something is a vocabulary problem, which is not really relevant here.

- §3.6, “Missing ZWJ means, the word is a different word with different meaning. This is very rare — …”: This pair is not relevant because the first word uses a ZWJ only because of its chillu, while chillus have atomic encodings.

- §3.6, “Missing ZWJ never means a spelling mistake, but just a writing style.”: It’s plausible to try to distinguish “a spelling mistake” and “a writing style” (which can be better put as “a spelling style” though, given what the example implies). However—

    * The example is not relevant because it uses ZWJ for a chillu.

    * Basically this whole section on ZWJ and ZWNJ requirements probably needs to be preceded by a section that discusses the ZWJ-using structures that can also be safely encoded without ZWJ, so that this group of ZWJ use cases can be excluded.

    * Also, it’s actually unclear why “Missing ZWJ never means a spelling mistake”, as ZWJ is specified by the Unicode Standard (see Table 12-36, Use of Joiners in Malayalam, in the Core Specification 11.0) as able to request a consonant stack, a use that is discussed in the first case as a matter of spelling mistakes.

    * The differentiation between a spelling mistake and a writing/spelling style also largely depends on the exact orthography being followed.

    * Note that whether a ZWJ or a ZWNJ is required is highly dependent on fonts. The first case only requires a ZWNJ because Windows’ default Malayalam font, Nirmala UI, as an inappropriately produced font, forms a lot of undesired consonant stacks despite being largely a reformed-orthography font. For such an LGR analysis, a survey of commonly used Malayalam fonts apparently needs to be carried out first.
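To illustrate the point about chillus above, here is a minimal sketch that folds the legacy <consonant, VIRAMA, ZWJ> chillu spellings into their atomic characters and then lists the joiners that remain to be analyzed. The function names are hypothetical, and the fold is deliberately naive and context-free (it would, for example, also rewrite the <NA, VIRAMA, ZWJ> prefix of a longer sequence), so it is only a starting point, not a proposed rule.

```python
import unicodedata

ZWJ = "\u200d"
ZWNJ = "\u200c"
VIRAMA = "\u0d4d"

# Legacy <consonant, VIRAMA, ZWJ> spellings and the atomic chillu characters.
LEGACY_TO_ATOMIC = {
    "\u0d23" + VIRAMA + ZWJ: "\u0d7a",  # NNA -> CHILLU NN
    "\u0d28" + VIRAMA + ZWJ: "\u0d7b",  # NA  -> CHILLU N
    "\u0d30" + VIRAMA + ZWJ: "\u0d7c",  # RA  -> CHILLU RR
    "\u0d32" + VIRAMA + ZWJ: "\u0d7d",  # LA  -> CHILLU L
    "\u0d33" + VIRAMA + ZWJ: "\u0d7e",  # LLA -> CHILLU LL
    "\u0d15" + VIRAMA + ZWJ: "\u0d7f",  # KA  -> CHILLU K
}


def fold_legacy_chillus(text):
    """Replace each legacy chillu sequence with its atomic chillu (naive, context-free)."""
    for legacy, atomic in LEGACY_TO_ATOMIC.items():
        text = text.replace(legacy, atomic)
    return text


def remaining_joiners(text):
    """Return (index, character name) for every ZWJ/ZWNJ left after folding chillus."""
    folded = fold_legacy_chillus(text)
    return [(i, unicodedata.name(ch)) for i, ch in enumerate(folded) if ch in (ZWJ, ZWNJ)]
```

Only the joiners reported by `remaining_joiners` would then need the spelling-mistake versus spelling-style discussion.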

- §3.7, Script and Orthography: Unclear why the consonant letter ള ḷa is missing.

- §3.7, Anusvaram and Visargam: “… and hence is traditionally treated as a kind of vowel sign.”: There’s no causality here. Signs like anusvara and visarga are traditionally categorized together with vowel signs (and the category is not necessarily comparable to the modern concept of vowel) because they all are dependent signs that modify a base letter.

- §3.7, Chillu letters (Chillaksharam) and Samvruthokarams, “Chillaksharam is an original feature of Malayalam used only with 6 consonants at present.”: Other rare chillus (in addition to chillu K) should be discussed more broadly.

- §3.7, Chillu letters (Chillaksharam) and Samvruthokarams, “Any consonant can be followed by consonant … The chandrakkala alone at the end of a word is treated as Samvruthokaram.”: The paragraph is filled with conflicting statements. Making a clear distinction between the actual written structures and the intended phonetic sequence is important.

- §3.7, Chillu letters (Chillaksharam) and Samvruthokarams, “Chandrakkala coming within a word (followed by other character(s) of the word) denotes a conjunct letter formed by the character(s) preceding and following the chandrakkala.”: Unclear if this is talking about written structures (then a visible chandrakkala sign has nothing to do with a written conjunct) or the general conjunct encoding (then the conjunct is not a letter but a sequence of consonant characters and chandrakkalas that can probably be rendered as a visual structure of conjunct).

- §3.7, Chillu letters (Chillaksharam) and Samvruthokarams, “Examples of Samvruthokaram:”: The document should use a specific orthography by default and explicitly call out when a non-default orthography is discussed for some reason. Here the examples are in the traditional orthography but all the preceding content in the document is basically in the reformed orthography (eg, the “Vowel diacritics” section), and there is no note about this inconsistency.

- §3.7, Chillu letters (Chillaksharam) and Samvruthokarams, “For the words that end in chillu, Samvruthokaram is used to make the pronunciation clearer. …”: Unclear how such a phonetic discussion (as well as the following four cases of “phonological transformations”) is relevant to written structures and encoding. Also unclear why only the orthography that uses an explicit vowel sign u is presented in the examples.

- §3.7, A selection of conjunct consonants, Table 5: Adjust column widths to avoid line breaks, which make the NFL row confusing.

- §5.3: See the comment below for §6.1.

- §6.1, set 1: The analysis is a mess.

    * Note that case 1a is a non-standard de facto encoding for the written structure <chillu n base, below-base rra sign>. The NBGP needs to work with the Unicode Consortium and make sure they give consistent recommendations on this problematic issue.

    * Also, as 1a is rendered as a wrong structure in Windows’ default Malayalam font Nirmala UI, it’s unclear why this encoding is not disallowed because of “rendering problem” (which makes 1b disallowed).

    * About 1b, note the only working sequence (so the only intended sequence) for Nirmala UI on Windows is <NA, VIRAMA, ZWJ, RRA>, although the standard <CHILLU N, VIRAMA, RRA> is somehow also implemented in the font (therefore can be rendered by it with a shaping engine that supports the sequence, while Windows’ shaping engine does support the sequence).

    * Then it’s unclear why “it is safe to disallow” <CHILLU N, VIRAMA, RRA> while allowing <NA, VIRAMA, RRA> when both sequences have rendering problems and only the former one is recommended by the Unicode Standard’s Core Specification.

    * As ordinary fonts shouldn’t render a character sequence intended for <chillu n base, below-base rra sign> as <chillu n base, rra base>, there is no visual confusability despite the spelling and phonetic relationship, so it’s unclear why this variant is blocked. Are other spelling alternatives to be blocked too?
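For reference, the three character sequences named in the comments on set 1 can be spelled out code point by code point. This is only a sketch for inspecting the encodings; `CANDIDATES` and `describe` are hypothetical names, and nothing here says anything about how a given font renders each sequence.

```python
import unicodedata

# The three encodings mentioned above, keyed by their character names.
CANDIDATES = {
    "<NA, VIRAMA, RRA>": "\u0d28\u0d4d\u0d31",
    "<CHILLU N, VIRAMA, RRA>": "\u0d7b\u0d4d\u0d31",
    "<NA, VIRAMA, ZWJ, RRA>": "\u0d28\u0d4d\u200d\u0d31",
}


def describe(seq):
    """List each code point of a sequence with its Unicode character name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in seq]


if __name__ == "__main__":
    for label, seq in CANDIDATES.items():
        print(label)
        for line in describe(seq):
            print("   ", line)
```

Running the same three strings through each font in a survey would show which sequences a font actually supports.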

- §6.1, set 2: See the comment below for Appendix C.

- §6.2.1, Table 10: A bad rendering of the Tamil glyph in the set 6.

- §7.1.2: Note that the document basically suggests the following pattern: `C[M][B|X] | V[B|X] | C[U+0D41]H | L`

    * Rules 5 and 6 should be safe, but it’s really unsettling to restrict something because of phonology and spelling conventions rather than written limitations.

    * Rule 7 doesn’t seem to be consistent with the restrictions suggested in §6.1 (which disallows ളള… and allows ള്). See the comment below for Appendix C.

    * Note that the Unicode Standard’s Core Specification suggests (see Table 12-33, page 504, in the referred Core Spec 10.0) that a samvruthokaram not only appears at the end of a word, but can also appear as an independent vowel letter (typically a word-initial structure) or be followed by an anusvaram. The inconsistency between the Core Spec’s claim and this document’s analysis must be addressed, and the WLE rules might need to be loosened. Note this is a typical case showing how dangerous it is to set up a restrictive pattern based not simply on written structures but on the limited known spelling conventions and phonological theories.
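The pattern summarized above can be tried out as a regular expression over class symbols. This is a rough sketch under my own assumptions: brackets are read as "optional" and `|` as alternation, a label is taken to be a run of such units, the class letters are treated as opaque symbols (their membership is defined in the proposal, not here), and `U` is a stand-in I introduce for U+0D41. It does not encode rules 5 to 7.

```python
import re

# One unit of the summarized pattern: C[M][B|X] | V[B|X] | C[U+0D41]H | L
# "U" is a placeholder symbol for U+0D41 (assumption of this sketch).
UNIT = r"(?:CM?[BX]?|V[BX]?|CU?H|L)"
LABEL = re.compile(rf"{UNIT}+")


def fits_pattern(classes):
    """True if a string of class symbols is a run of units of the pattern."""
    return LABEL.fullmatch(classes) is not None


if __name__ == "__main__":
    for classes in ["CMB", "CHC", "CUH", "CUHB", "VUH"]:
        print(classes, fits_pattern(classes))
```

Under this reading, a samvruthokaram followed by an anusvaram (`CUHB`) and one written on an independent vowel letter (`VUH`) are both rejected, which are the two cases the Core Spec table is cited for.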

- §10, Appendix A, Table A-1: The last column seems to be meant to reflect confusable renderings; if so, the renderings of sequences and atomic characters can simply be merged if the authors don’t have a word processor that allows the sequences to be rendered with correct reordering and without dotted circles.

- §10, Appendix A, “Although, Unicode defines this canonical decomposition, the Standard recommends not to use the sequence”: The Unicode Standard doesn’t recommend “not to use the sequence[s]”.

- §12, Appendix C:

    * I agree the ള്ള vs ളള pair is indeed worth discussing, since this pair is probably the single most confusable pair in the reformed orthography (while the traditional orthography naturally relies on a greater number of details) because of the structural disadvantage of the letter ള, and other comparably confusable pairs (മ്മ vs മമ, ത്ത vs തത, ക്ത vs കത, etc) are indeed significantly less confusable.

    * However, if the NBGP plans to make restrictions for such an issue, thorough and accurate research must be completed first. I don’t think the current research and considerations of either the NBGP or the IP are enough.

    * From what is presented in the document, it seems both the NBGP and the IP have been analyzing only words, not the combinations that can occur when inter-word spaces are removed from a sequence of words. However, the latter should be a key topic for the discussion, and apparently it can introduce many more of the sequences that were previously considered highly limited, eg, a much larger number of ളള.

    * Also, it’s not appropriate if the authors have been only analyzing the character sequence but not the final glyph sequence (which includes reordered glyphs, such as pre-base vowel signs, which can break an otherwise confusable sequence, eg, ളള + െ → ളളെ).

    * “The consonant ള (0D33) rarely follows another ള in Malayalam, except in the case of some place names.”: It’s unclear why the NBGP considers that attested place names and phrase contractions containing ളള can be disallowed. The “Feedback from the community” section makes a pretty clear case to me that the NBGP is again over-restricting a script/language based on limited knowledge and prescriptivist grammar.
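The two analysis gaps noted above (word boundaries and reordered glyphs) can be checked mechanically. The following is a minimal sketch with hypothetical function names: it counts ളള pairs inside words versus pairs that only arise once inter-word spaces are removed, and flags pairs followed by a vowel sign that has a pre-base part, since that part is drawn between the two ള. The list of pre-base vowel signs is my assumption of the relevant set (E, EE, AI and the two-part O, OO, AU).

```python
LLA = "\u0d33"
# Vowel signs with a glyph part drawn before the base consonant (assumed set).
PRE_BASE = set("\u0d46\u0d47\u0d48\u0d4a\u0d4b\u0d4c")


def lla_pairs(text):
    """Return (index, visually_adjacent) for each <LLA, LLA> code point pair."""
    pairs = []
    for i in range(len(text) - 1):
        if text[i] == LLA and text[i + 1] == LLA:
            following = text[i + 2] if i + 2 < len(text) else ""
            # A pre-base vowel sign on the second LLA is rendered between the two.
            pairs.append((i, following not in PRE_BASE))
    return pairs


def count_pairs(words):
    """Return (pairs inside words, extra pairs created by removing spaces)."""
    inside = sum(len(lla_pairs(w)) for w in words)
    joined = len(lla_pairs("".join(words)))
    return inside, joined - inside
```

For example, `count_pairs(["\u0d15\u0d33", "\u0d33\u0d15"])` (two synthetic strings, not real words) reports no pair inside either string but one pair created at the boundary.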

Best,
梁海 Liang Hai
https://lianghai.github.io
