martin.jerabek at isis-papyrus.com
Tue Jul 1 09:27:18 UTC 2008
On 01.07.2008 10:53, Julian Cable wrote:
> So I have a practical preference for the 7-bit subset of UTF-8 with no BOM (of course I would never
> dream of calling this ASCII ;)
Well, the 7-bit subset of UTF-8 with no BOM *is* ASCII, so we might as
well call it ASCII. ;-) Pure 7-bit ASCII would of course be the most
portable encoding, but in 2008 we should no longer have to deny
non-English speakers and countries the correct spelling of their
names and places.
> If we go for UTF-8 can we be very firm about whether a BOM is required or prohibited and please make sure its
> not permitted.
Yes, definitely. One of the biggest advantages of UTF-8 is that programs
which do not support UTF-8 can usually still process UTF-8-encoded
files. There are no embedded zero bytes, and the bytes of a multi-byte
character are never equal to 7-bit ASCII characters. If a tzdata file
suddenly started with hex EF BB BF, the parser would try to interpret
these bytes as the start of a rule, and fail.
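To make the two properties and the BOM failure concrete, here is a minimal Python sketch. The sample comment, the "Rule Example" line, and the first_field helper are made up for illustration; they are not taken from the tz database or its parser. The helper merely stands in for any byte-oriented program that knows nothing about UTF-8.

```python
import codecs


def multibyte_bytes_avoid_ascii(text: str) -> bool:
    """Return True if every byte of every multi-byte character is >= 0x80."""
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1 and any(b < 0x80 for b in encoded):
            return False
    return True


def first_field(line: bytes) -> bytes:
    """Hypothetical stand-in for a parser that is unaware of UTF-8."""
    fields = line.split()  # bytes.split() splits on ASCII whitespace only
    return fields[0] if fields else b""


# Hypothetical input, not real tzdata content.
comment = "# Zürich, Kraków, Besançon"
rule = b"Rule Example 2008 only - Mar 30 2:00 1:00 S\n"

encoded_comment = comment.encode("utf-8")
no_zero_bytes = b"\x00" not in encoded_comment      # no embedded zero bytes
ascii_safe = multibyte_bytes_avoid_ascii(comment)   # no ASCII look-alikes

plain_keyword = first_field(rule)                   # keyword without a BOM
bom_keyword = first_field(codecs.BOM_UTF8 + rule)   # keyword behind EF BB BF
```

Without the BOM the first field is the keyword the parser expects. With it, the three BOM bytes are glued to the front of the keyword, so a simple comparison against "Rule" no longer matches.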
I understand the tendency to use an encoding mark for Unicode files in
the Microsoft world, and it is very useful for UTF-16 and UTF-32, but
(1) UTF-8 has only one byte order, and (2) adding it would cause more
problems than it is worth. I assume that Windows editors which support
UTF-8 can also be manually switched to UTF-8 without the need for a BOM.
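As a small illustration of point (1), the following Python sketch encodes one character in both UTF-16 byte orders and in UTF-8. The guess_utf16_order helper is a hypothetical name introduced only for this example; it shows why a mark is useful for UTF-16, where the same character has two possible byte sequences.

```python
import codecs

ch = "é"  # U+00E9

utf16_le = ch.encode("utf-16-le")  # low byte first
utf16_be = ch.encode("utf-16-be")  # high byte first
utf8 = ch.encode("utf-8")          # only one possible byte sequence


def guess_utf16_order(data: bytes) -> str:
    """Hypothetical helper: read the byte order from a leading UTF-16 BOM."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little-endian"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big-endian"
    return "unknown"  # without a mark the reader has to guess or be told
```

The two UTF-16 results contain the same bytes in opposite order, and one of them is a zero byte, whereas the UTF-8 result is the same on every platform and needs no mark to be read correctly.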
Yes, there are a few languages other than English whose script only
needs 7-bit ASCII.