Non-ASCII encoding

Tue Jul 1 09:27:18 UTC 2008

On 01.07.2008 10:53, Julian Cable wrote:
> So I have a practical preference for the 7-bit subset of UTF-8 with no BOM (of course I would never
> dream of calling this ASCII ;)
>   
Well, the 7-bit subset of UTF-8 with no BOM *is* ASCII, so we might as 
well call it ASCII. ;-) Pure 7-bit ASCII would of course be the most 
portable encoding, but in 2008 we should no longer have to deny 
non-English [1] speakers and countries the correct spelling of their 
names and places.
> If we go for UTF-8, can we be very firm about whether a BOM is required or prohibited, and please make sure it's
> not permitted.
>   
Yes, definitely. One of the biggest advantages of UTF-8 is that programs 
which do not support UTF-8 can usually still process UTF-8-encoded 
files. There are no embedded zero bytes, and every byte of a multi-byte 
character has its high bit set, so it can never be mistaken for a 7-bit 
ASCII character. If a tzdata file suddenly started with hex EF BB BF 
(the UTF-8 encoding of the BOM, U+FEFF), the parser would try to 
interpret these bytes as the start of a rule, and fail.
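
To illustrate the two properties above, here is a minimal Python sketch. 
The function names are hypothetical, and rejecting a leading BOM is only 
the policy proposed in this thread, not something the tz code is known 
to implement. The sample input is a made-up comment line, not real 
tzdata content.

```python
import codecs

# The UTF-8 encoding of U+FEFF: the three bytes EF BB BF.
UTF8_BOM = codecs.BOM_UTF8


def classify_bytes(data: bytes) -> str:
    """Hypothetical check of raw file bytes against the proposed policy."""
    if data.startswith(UTF8_BOM):
        return "rejected: leading BOM"
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return "rejected: not valid UTF-8"
    # Pure 7-bit content is valid ASCII and valid UTF-8 at the same time.
    return "ascii" if text.isascii() else "utf-8"


def multibyte_bytes_have_high_bit(text: str) -> bool:
    """True if every byte of every non-ASCII character is >= 0x80."""
    return all(
        byte >= 0x80
        for ch in text
        if ord(ch) > 0x7F
        for byte in ch.encode("utf-8")
    )


sample = "# Jeřábek, Zürich\n".encode("utf-8")
print(classify_bytes(sample))
print(classify_bytes(UTF8_BOM + sample))
print(multibyte_bytes_have_high_bit("Jeřábek, Zürich"))
```

Because no byte of a multi-byte character falls in the ASCII range, a 
parser that splits on ASCII whitespace or '#' cannot be confused by the 
non-ASCII text itself; only the BOM at the very start gets in its way.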

I understand the tendency to use an encoding mark for Unicode files in 
the Microsoft world, and it is very useful for UTF-16 and UTF-32, but 
(1) UTF-8 has only one byte order, and (2) adding it would cause more 
problems than it is worth. I assume that Windows editors which support 
UTF-8 can also be manually switched to UTF-8 without the need for a BOM.

Best regards
Martin Jerabek

[1] Yes, there are a few languages other than English whose script only 
needs 7-bit ASCII.