Non-ASCII encoding

Jonathan Leffler jonathan.leffler at gmail.com
Tue Jul 1 16:33:33 UTC 2008


On Tue, Jul 1, 2008 at 1:53 AM, Julian Cable <julian.cable at bbc.co.uk> wrote:

> -On [20080701 09:30], Martin Jerabek (martin.jerabek at isis-papyrus.com)
> wrote:
> >If more non-ASCII characters are going to be included in the tzdata
> >files, I would like to propose to define UTF-8 as the official encoding
> >of the tzdata files.
>
>
> In principle, I agree. In practice UTF-8 has at least one little quirk
> which has caused me problems:
>
> Microsoft operating systems always start UTF-8 encoded files with a Byte
> Order Mark (BOM) (http://en.wikipedia.org/wiki/Byte_Order_Mark)
>
> *nix-like operating systems never do (at least in my experience) and at
> least one perl-based xml parser running on Linux chokes on the BOM.
>

You've mis-characterized the problem.  UTF-8 doesn't have the quirk -- MS
operating systems have the quirk.  See:
http://unicode.org/faq/utf_bom.html#BOM

We can note one of the parting comments in the FAQ:

A particular protocol (e.g. Microsoft conventions for .txt files) may
require use of the BOM on certain Unicode data streams, such as files. When
you need to conform to such a protocol, use a BOM.

We can also note that none of the TZ data files are .txt files (they do not
have the extension .txt in the file name) and therefore do not need the BOM.
Alternatively, a tool can be provided that prepends a UTF-8 BOM (bytes 0xEF
0xBB 0xBF in that sequence) to the start of the file, converting it to the MS
format.
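
A minimal sketch of such a tool, in Python, just to show how little is
involved. The function names add_utf8_bom and convert_file are made up for
illustration; nothing like this ships with the tz distribution. It assumes
the input is already valid UTF-8 and only prepends the three BOM bytes if
they are not already there.

```python
import codecs


def add_utf8_bom(data: bytes) -> bytes:
    """Return data prefixed with the UTF-8 BOM (0xEF 0xBB 0xBF).

    Data that already starts with the BOM is returned unchanged, so the
    conversion can safely be applied more than once.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data


def convert_file(src_path: str, dst_path: str) -> None:
    """Copy src_path to dst_path, adding a UTF-8 BOM if it is missing.

    Hypothetical helper: files are read and written in binary mode so that
    no other byte of the content is altered.
    """
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(add_utf8_bom(data))
```

Working on raw bytes rather than decoded text keeps line endings and
everything else in the file untouched; only the first three bytes differ.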

MS operating systems are wrong here, even though they represent a large
proportion of the installed operating systems out there.  I'm not sure how
often the Olson data are handled on MS systems (probably more often than I'd
expect).

So, I would recommend that the code set be defined as UTF-8 without a BOM in
the files, and that the files be converted to UTF-8 with a BOM (for use) on
systems that need the BOM.
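
If that recommendation were adopted, a check along the following lines
could confirm that a file conforms.  This is only a sketch: the name
is_bomless_utf8 is hypothetical, and it tests nothing beyond "decodes as
UTF-8" and "does not start with the bytes 0xEF 0xBB 0xBF".

```python
import codecs


def is_bomless_utf8(data: bytes) -> bool:
    """Return True if data is valid UTF-8 and does not start with a BOM."""
    if data.startswith(codecs.BOM_UTF8):
        return False
    try:
        # Strict decoding raises UnicodeDecodeError on any invalid sequence.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Pure ASCII content passes this check as well, since ASCII is a subset of
UTF-8.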

-- 
Jonathan Leffler <jonathan.leffler at gmail.com> #include <disclaimer.h>
Guardian of DBD::Informix - v2008.0513 - http://dbi.perl.org
"Blessed are we who can laugh at ourselves, for we shall never cease to be
amused."