[UA-discuss] Regular Expression

Andre Schappo A.Schappo at lboro.ac.uk
Fri Sep 15 13:01:45 UTC 2017


Before we leave the topic of Regular Expressions -

If one does use regex then I consider it better working practice, when possible, to work at the Unicode level rather than the encoding level.

One example previously given in this discussion thread was

"^([a-zA-Z0-9.!#$%&'*+/=?^_`{|}~\u00A0-\uD7FF\uE000-\uFFFF-]|([\uD800-\uDBFF][\uDC00-\uDFFF]))+$"

This regex is working at the encoding level, specifically UTF-16. It encompasses nearly every Unicode character, including unassigned Unicode codepoints and Private Use Area (PUA) characters. I would not allow unassigned or PUA characters in an identifier.

I consider it better to work at the Unicode level. I previously gave a simple example of working at the Unicode level: "\p{Devanagari}+", which matches one or more Devanagari script characters. In this case I do not need to concern myself with codepoints, encodings, unassigned codepoints, or additions that may be made in newer versions of Unicode. The regex engine and the Unicode Consortium do that for me. It can, and frequently does, get more complicated than the simple example I have given. See UAX #31, Unicode Identifier and Pattern Syntax: http://www.unicode.org/reports/tr31/
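
As a minimal sketch of the Unicode-level approach, here is the Devanagari example in Go, whose standard regexp package accepts \p{Devanagari} as written. The ^ and $ anchors, the isDevanagari helper, and the sample strings are assumptions added for illustration. Other engines spell the property differently, so check the syntax of whichever one you use.

```go
package main

import (
	"fmt"
	"regexp"
)

// devanagariOnly relies on the script property instead of explicit
// codepoint ranges; the anchors require the whole string to match.
var devanagariOnly = regexp.MustCompile(`^\p{Devanagari}+$`)

// isDevanagari reports whether s is non-empty and made up solely of
// characters whose script is Devanagari.
func isDevanagari(s string) bool {
	return devanagariOnly.MatchString(s)
}

func main() {
	for _, s := range []string{"हिन्दी", "hindi", "हिन्दी1"} {
		fmt.Printf("%q -> %v\n", s, isDevanagari(s))
	}
}
```

Which characters count as Devanagari comes from the Unicode tables shipped with the Go release in use, so the pattern itself need not change when Unicode adds characters to the script.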

André Schappo
