[Latingp] Digraphs

Chris Dillon ccaacdi at ucl.ac.uk
Mon May 16 13:40:35 UTC 2016


Dear Meikal,

I think it's only a matter of time before combining marks are required, 
but I think we should only allow them in restricted situations.

All other code points* may be used in any position with any other code 
point(s). Combining marks would only be allowed in certain positions 
with certain other code points. If, for example, ^x (x with a 
circumflex), which does not exist as a pre-composed code point, were 
required somewhere in Africa, the combining mark ^ would only be allowed 
with x.
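
As a rough illustration of the restriction described above, here is a minimal Python sketch. The names ALLOWED_SEQUENCES, is_combining and label_is_valid are hypothetical and do not come from any LGR tool. The only entry in the table is the x-with-circumflex example from this message, and the sketch assumes the label has already been normalized to NFC.

```python
import unicodedata

# Hypothetical table: each combining mark maps to the set of base code
# points it may directly follow. The single entry is the example from the
# message (U+0302 COMBINING CIRCUMFLEX ACCENT after "x"); a real table
# would come from the panel's agreed repertoire.
ALLOWED_SEQUENCES = {
    "\u0302": {"x"},
}


def is_combining(ch: str) -> bool:
    """Return True for Unicode marks (general categories Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


def label_is_valid(label: str) -> bool:
    """Accept a label only if every combining mark follows an allowed base.

    Assumption: the label is already in NFC, so precomposed letters such
    as U+00EA arrive as single code points and are not affected.
    """
    previous = None
    for ch in label:
        if is_combining(ch):
            allowed_bases = ALLOWED_SEQUENCES.get(ch, set())
            # Reject a mark at the start of the label or after any code
            # point that is not explicitly listed for this mark.
            if previous is None or previous not in allowed_bases:
                return False
        previous = ch
    return True
```

With this table, label_is_valid("x\u0302a") is accepted, while the same mark after any other letter, or at the start of a label, is rejected. All non-combining code points pass through unchecked, which mirrors the "any position with any other code point(s)" rule.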

Is that better?

Regards,

Chris.
*as far as I know and except ß which may not be used label-initially

On 16/05/2016 14:26, Meikal Mumin wrote:
> Dear Chris,
>
> could you clarify or exemplify what you mean by "I would suggest that 
> we take the approach "combining mark X is required in the following 
> sequence(s) of code points only", rather than "combining mark X is 
> included with any other code point"."?
>
> Thanks,
>
> Meikal
>
> 2016-05-16 10:39 GMT+02:00 Dillon, Chris <c.dillon at ucl.ac.uk>:
>
>     Dear Meikal & Abdeslam,
>
>     Thank you for your emails. This correspondence is a good summary
>     of answers to difficult questions, along these lines:
>
>       * Variants may consist of more than one code point.
>       * So far we have been able to exclude combining marks, but it is
>         doubtful that that will continue to be possible once more work
>         has been done on the use of the Latin Script in Africa. I
>         would suggest that we take the approach "combining mark X is
>         required in the following sequence(s) of code points only",
>         rather than "combining mark X is included with any other code
>         point".
>       * As regards ij and most other ligatures, they would be
>         unallocatable variants, or possibly out-of-repertoire code points.
>       * I like the suggestion of waiting for the IP's informal
>         comments before releasing our draft repertoire. The Second
>         Level Team's work, however, could require a substantial effort
>         to digest and so we should probably wait.
>
>     Summary (originally in French): These emails form a useful
>     synthesis of answers to some complicated questions:
>
>       * Variants may consist of more than one Unicode letter.
>       * If we need combining marks for Unicode letters, they could be
>         used only in limited cases.
>       * Ij, etc. are perhaps a variant of i + j that could never exist
>         in a TLD, or perhaps entirely outside our repertoire.
>       * We will wait only until we receive the IP's informal comments
>         before inviting comments on our repertoire.
>
>
>     Regards,
>
>     Chris.
>
>     On 14/05/2016 10:50, Meikal Mumin wrote:
>
>         Dear colleagues,
>
>         so that clarifies that question - thanks Abdeslam.
>
>         Coming back to your questions, Chris: I believe combining
>         marks could be excluded, as was done in the case of the Arabic
>         LGR. Meanwhile, cases like ij could be declared variants of
>         the sequence i + j, provided we see a need for including the
>         former.
>
>         If ligatures are not part of MSR-2, then I assume the problem
>         has solved itself.
>
>         Best,
>
>         Meikal
>
>     Dear colleagues,
>
>     I would suggest waiting for the feedback from IP, but not for
>     anything regarding second levels.
>
>     Best,
>
>     Meikal
>
>
>
>         2016-05-11 22:27 GMT+02:00 Abdeslam Nasri
>         <abdeslam.nasri at gmail.com>:
>
>             Dear Chris and Colleagues,
>
>             Digraphs or more generally sequences of code points, can
>             be specified as variants of a single code point.
>
>             An excerpt from the LAGER specification:
>
>             "A sequence of multiple code points can be specified as a
>             variant of a single code point.  For example, the sequence of
>             LATIN SMALL LETTER O (U+006F) then LATIN SMALL LETTER E
>             (U+0065) might hypothetically be specified as a variant for
>             an LATIN SMALL LETTER O WITH DIAERESIS (U+00F6) as follows:
>
>                 <char cp="00F6">
>                     <var cp="006F 0065"/>
>                 </char>
>             "
>
>             In the typical case of digraphs, these are called the
>             precomposed and decomposed forms of a single letter.
>             Normalization should exist in Unicode in order to allow
>             these variants; otherwise, block them.
>
>             Kind Regards,
>
>             Abdeslam NASRI
>
>             2016-05-09 15:43 GMT+02:00 Dillon, Chris
>             <c.dillon at ucl.ac.uk>:
>
>                 Dear Meikal,
>
>                 Thank you for your thoughts on digraphs.
>
>                 In that case, we would have blocked variants like i,
>                 dotless i and iota, where an application for a label
>                 containing one would block applications for labels
>                 containing any of the others.
>
>                 We would also have blocked variants, digraphs like
>                 ij, which could never be allocated at all. If we need
>                 to do this, it will be necessary to describe variants
>                 for ligature code points we have not yet analysed in
>                 the Latin ranges, as they aren’t in MSR-2.
>
>                 (This distinction is what I was finding difficult
>                 during the face-to-face meeting in Marrakech.)
>
>                 Incidentally, I’m fairly sure two code points could be
>                 a variant of one. (I wonder what happens with the
>                 Arabic ligature of laam and alif that looks like Greek
>                 gamma; in Urdu the two do not combine so closely, if
>                 at all.)
>
>                 Regards,
>
>                 Chris.
>
>                 --
>
>                 Research Associate in Linguistic Computing, Centre for
>                 Digital Humanities, UCL, Gower St, London WC1E 6BT Tel
>                 +44 20 7679 1599 (int 31599)
>                 www.ucl.ac.uk/dis/people/chrisdillon
>
>                 From: Meikal Mumin [mailto:meikal.mumin at uni-koeln.de]
>                 Sent: 09 May 2016 09:38
>                 To: Dillon, Chris <c.dillon at ucl.ac.uk>
>                 Cc: latingp at icann.org
>                 Subject: Re: [Latingp] Digraphs
>
>                 Dear Chris and colleagues,
>
>                 apologies for the late reply. I believe we don't need
>                 to exclude digraphs. We could simply set them up as
>                 variants, e.g.  ij as equivalent of i + j. It could be
>                 useful to verify with IP, if it is possible to declare
>                 a sequence of two code-points as a variant of one - we
>                 had not encountered such a case with Arabic script.
>
>                 Best wishes,
>
>                 Meikal
>
>                 2016-03-29 9:54 GMT+02:00 Dillon, Chris
>                 <c.dillon at ucl.ac.uk>:
>
>                     Dear colleagues,
>
>                     Mirjana’s recent research on Montenegrin has
>                     raised some interesting issues.
>
>                     One of them is digraphs.
>
>                     Currently we have digraphs like æ and œ in our
>                     repertoire, but Dutch ij (U+0133) as in vijf ‘five’
>                     is white in MSR-2 (not compatible with IDNA 2008).
>                     Certainly many digraphs, including ij, are visually
>                     similar to their component letters. We could
>                     consider adding all digraphs to the list of
>                     criteria for exclusion, or adding them with
>                     exceptions (less good from a usability point of
>                     view). Incidentally, ß and & are probably excluded
>                     for other reasons, Longevity Principle and
>                     Punctuation, respectively.
>
>                     What do you think?
>
>                     Summary (originally in French): What should we do
>                     with the digraphs in our repertoire, allow them or not?
>
>                     Regards,
>
>                     Chris.
>
>>
>
