[tz] tabs in zone.tab

Fri Mar 2 09:44:01 UTC 2012

Robert Elz wrote:
>The final column is "comments" - there's no stated restriction on the characters
>that can be used in comments

True, and your interpretation with the comments beginning with a tab is
possible.  (I was half expecting DateTime::TimeZone (another repackaging
of the database for CPAN), which went out promptly after 2012a, to have
ended up with this interpretation, but it turns out it went with the two
tabs being a single separator.  I don't know how automated Dave Rolsky has
that release process.)  However, tab-separated-value tables conventionally
don't allow tabs to be part of the data, each tab being a separator.

Anyway, I was aware of some ambiguity here when I wrote my parser.
Quite apart from the tab issue, there's no stated restriction of the
comments to ASCII, but there's also no indication of which encoding
would be used for non-ASCII characters.  So I made the parser as strict
as possible based on the partial statement of the file format and the
(admirably regular) data actually seen.  This includes a restriction that
the comments contain only printable ASCII, and neither begin nor end
with whitespace.  On its face this isn't in accord with the receiving half
of the Postel principle, but the failure mode here isn't a total failure
of operation; it's to kick the issue up for conscious human attention.
(It emailed me.)  The design is conservative in that I've told the parser
not to guess the meaning of anything irregular.

Rather than argue about what the current syntax definition means, when
it's plainly unclear on some of the details, I'd rather resolve this by
making the definition more detailed.  I suggest that it should be defined
to match the strict syntax to which the data has heretofore adhered,
and which my parser expects.  For reference, these are the Perl regexps
that I use to parse zone.tab (in Time::OlsonTZ::Download):

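	$line =~ /\A([A-Z]{2})		# ISO 3166 two-letter country code
		\t([-+][0-9]{4}(?:[0-9]{2})?[-+][0-9]{5}(?:[0-9]{2})?)	# ISO 6709 latitude and longitude
		\t([!-~]+)		# zone name, no whitespace
		(?:\t([!-~][ -~]*[!-~]))?	# optional comment, printable ASCII, no edge whitespace
	\n\z/x;
	# Comment lines: "#" followed by anything up to the single trailing newline.
	$line =~ /\A#[^\n]*\n\z/;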

We should also have an automated test, as part of tzcode, that checks
that the file matches whatever detailed syntax is decided, and that its
content is semantically sane (refers only to defined zones, for example).
I'm happy to translate my regexps, or equivalents for whatever other
syntax we agree on, into C for this purpose.
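
To give a rough idea of what such a C translation might look like, here
is a minimal sketch of the two regexps above written as hand-rolled
checks.  It is only an illustration, not proposed tzcode: the function
names are hypothetical, it assumes an ASCII execution character set, and
it assumes each line is passed as a NUL-terminated string that still
includes its trailing newline.

```c
#include <stddef.h>
#include <string.h>

/* Printable ASCII excluding space: the Perl class [!-~]. */
static int is_graph_ascii(unsigned char c) { return c >= '!' && c <= '~'; }

/* Printable ASCII including space: the Perl class [ -~]. */
static int is_print_ascii(unsigned char c) { return c >= ' ' && c <= '~'; }

static int is_digit_ascii(unsigned char c) { return c >= '0' && c <= '9'; }

/*
 * Consume a sign followed by exactly shortlen or shortlen + 2 digits
 * (the optional two extra digits are the seconds).  Returns a pointer
 * just past the digits, or NULL if the text does not match.
 */
static const char *skip_coord(const char *p, size_t shortlen)
{
    size_t n = 0;

    if (*p != '+' && *p != '-')
        return NULL;
    p++;
    while (is_digit_ascii((unsigned char)p[n]))
        n++;
    if (n != shortlen && n != shortlen + 2)
        return NULL;
    return p + n;
}

/* Hypothetical equivalent of the first (data line) regexp. */
int zone_tab_data_line_ok(const char *line)
{
    const char *p = line;
    const char *start;

    /* Country code: [A-Z]{2}, then a tab. */
    if (!(p[0] >= 'A' && p[0] <= 'Z' && p[1] >= 'A' && p[1] <= 'Z'))
        return 0;
    p += 2;
    if (*p++ != '\t')
        return 0;

    /* Latitude (4 or 6 digits), longitude (5 or 7 digits), then a tab. */
    if ((p = skip_coord(p, 4)) == NULL)
        return 0;
    if ((p = skip_coord(p, 5)) == NULL)
        return 0;
    if (*p++ != '\t')
        return 0;

    /* Zone name: [!-~]+ */
    start = p;
    while (is_graph_ascii((unsigned char)*p))
        p++;
    if (p == start)
        return 0;

    /* Optional comment: [!-~][ -~]*[!-~], so at least two characters. */
    if (*p == '\t') {
        start = ++p;
        while (is_print_ascii((unsigned char)*p))
            p++;
        if (p - start < 2)
            return 0;
        if (!is_graph_ascii((unsigned char)start[0]) ||
            !is_graph_ascii((unsigned char)p[-1]))
            return 0;
    }

    /* Exactly one newline, then end of string. */
    return p[0] == '\n' && p[1] == '\0';
}

/* Hypothetical equivalent of the second (comment line) regexp. */
int zone_tab_comment_line_ok(const char *line)
{
    size_t len = strlen(line);

    return line[0] == '#' && len >= 2 && line[len - 1] == '\n' &&
           memchr(line, '\n', len - 1) == NULL;
}

/* A line is acceptable if it matches either form. */
int zone_tab_line_ok(const char *line)
{
    return zone_tab_comment_line_ok(line) || zone_tab_data_line_ok(line);
}
```

With these definitions, a line such as "AD\t+4230+00131\tEurope/Andorra\n"
is accepted, whereas a line with two consecutive tabs before the comment
is rejected, because the comment would then begin with a tab, which is
outside [ -~].  A real test would loop over the file and report the
offending line rather than just returning 0.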

The same goes for iso3166.tab.

-zefram