<html>


<head>


<meta http-equiv="Content-Type" content="text/html; charset=utf-8">


</head>


<body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">


<br class="">


<div class="">Before we leave the topic of regular expressions:</div>


<div class=""><br class="">


</div>


<div class="">If one does use a regex, I consider it better working practice, when possible, to work at the Unicode level (characters and their properties) rather than at the encoding level (the code units of one particular encoding).</div>


<div class=""><br class="">


</div>


<div class="">One example previously given in this discussion thread was</div>


<div class=""><br class="">


</div>


<div class="">&quot;^([a-zA-Z0-9.!#$%&amp;'*&#43;/=?^_`{|}~\u00A0-\uD7FF\uE000-\uFFFF-]|([\uD800-\uDBFF][\uDC00\uDFFF]))&#43;$&quot;</div>


<div class=""><br class="">


</div>


<div class="">This regex works at the encoding level, specifically UTF-16: it spells out surrogate pairs by hand. It encompasses nearly every Unicode character, including unassigned Unicode codepoints and Private Use Area (PUA) characters. I would not allow unassigned or PUA characters in an identifier.</div>
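<div class=""><br class=""></div>
<div class="">To illustrate the PUA point, here is a minimal sketch in Go (my choice of language for illustration only). The names rangePattern and privateUse are hypothetical, and rangePattern is a simplified stand-in for the quoted regex, not a faithful port: it keeps only the two BMP codepoint ranges, written in Go's \x{...} syntax, and drops the surrogate-pair alternative because Go's regexp package matches codepoints rather than UTF-16 code units.</div>
<div class=""><br class=""></div>
<pre class="">
```go
package main

import (
	"fmt"
	"regexp"
)

// rangePattern is a hypothetical, simplified stand-in for the quoted regex:
// only the two BMP codepoint ranges are kept.
var rangePattern = regexp.MustCompile(`^[\x{00A0}-\x{D7FF}\x{E000}-\x{FFFF}]+$`)

// privateUse matches any character whose general category is Co (Private Use).
var privateUse = regexp.MustCompile(`\p{Co}`)

func main() {
	pua := "\uE000" // first codepoint of the BMP Private Use Area

	fmt.Println(rangePattern.MatchString(pua)) // true: the range admits it
	fmt.Println(privateUse.MatchString(pua))   // true: it is a PUA character
}
```
</pre>
<div class="">A range written in terms of codepoint numbers has no way of knowing that U+E000 is a Private Use character; it only sees that the number falls inside the range.</div>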


<div class=""><br class="">


</div>


<div class="">I consider it better to work at the Unicode level. I previously gave a simple example of doing so: &quot;\p{Devanagari}&#43;&quot;, which matches one or more characters of the Devanagari script. In this case I do not need to concern myself with codepoints, encodings, unassigned codepoints, or additions that may be made in newer versions of Unicode. The regex engine and the Unicode Consortium do that for me. Real cases can be, and frequently are, more complicated than the simple example I have given. See&nbsp;<a href="http://www.unicode.org/reports/tr31/" class="">http://www.unicode.org/reports/tr31/</a></div>
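<div class=""><br class=""></div>
<div class="">As a rough sketch of the script-property approach, again in Go as an illustrative choice: Go's standard regexp package accepts \p{Devanagari} directly. The variable name devanagariOnly and the sample strings are my own assumptions, and I have added ^ and $ anchors so that the whole string must be Devanagari.</div>
<div class=""><br class=""></div>
<pre class="">
```go
package main

import (
	"fmt"
	"regexp"
)

// devanagariOnly accepts strings made up solely of characters whose
// Unicode script property is Devanagari.
var devanagariOnly = regexp.MustCompile(`^\p{Devanagari}+$`)

func main() {
	fmt.Println(devanagariOnly.MatchString("नमस्ते"))   // true: all Devanagari
	fmt.Println(devanagariOnly.MatchString("namaste")) // false: Latin script
	fmt.Println(devanagariOnly.MatchString("\uE000"))  // false: Private Use
}
```
</pre>
<div class="">No codepoint ranges appear in the pattern. Which characters count as Devanagari is decided by the Unicode data shipped with the regex engine, so the result tracks whatever Unicode version that engine supports.</div>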


<div class=""><br class="">


</div>


<div class="">André Schappo</div>


<div class=""><br class="">


</div>


</body>


</html>