[tz] zdump new option -i for easier-to-review output

Paul Eggert eggert at cs.ucla.edu
Sun Jun 5 23:51:33 UTC 2016


Jon Skeet wrote:
> The use case I'm
> primarily interested in is validation: diffing a "golden" file with one
> generated by another tool

Yes, I should have mentioned that. I commonly compare two zdump output files 
using "diff", for example. zdump -i works well for this, too. However, it does 
not suffice to merely look at diff output. Sometimes we add new zones, for 
example, and diff output won't serve to proofread those.

> I wouldn't expect them to be dealing with this format every day

True; even I don't do that. Still, there is no need for zdump -i format to be 
self-explanatory. For example, the format need not use strftime %c format merely 
because naive users are more likely to understand %c format than ISO 8601 
format. As long as the format is reasonably clear without constantly having to 
refer to the documentation, we should be OK, and zdump -i format clears that 
relatively low bar.

 > - I don't see why we need the quoted form for the time zone ID.

The API allows the TZ environment variable (the time zone ID) to be any finite 
sequence of non-null bytes. TZ need not be UTF-8 encoded, and the bytes can 
contain newlines, etc., and zdump output should be unambiguous regardless of how 
weird TZ's value is.

 > Presumably the benefit of the proposed format is that you can copy/paste it
 > into a Unix shell to use that time zone.

No, and in general such a cut-and-paste would not work because the quotation 
scheme is not designed to be shell-compatible. The main goal is to have an 
unambiguous format that supports any TZ value allowed by the API. The scheme 
also leaves some room for future extensions to zdump -i format.
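
To make the "unambiguous for any TZ value" goal concrete, here is a rough Python 
sketch of one possible quoting scheme. It is NOT the scheme zdump -i uses; the 
function name quote_tz and the choice of three-digit octal escapes are 
assumptions made purely for illustration. The point is only that every byte 
string maps to a single printable line that can be decoded in exactly one way.

```python
def quote_tz(tz: bytes) -> str:
    """Hypothetical quoting of a TZ value; not zdump's actual scheme."""
    out = []
    for b in tz:
        ch = chr(b)
        if ch in '\\"':
            # Escape the two characters that would otherwise be ambiguous.
            out.append("\\" + ch)
        elif 0x20 <= b < 0x7F:
            # Printable ASCII is passed through unchanged.
            out.append(ch)
        else:
            # Newlines, control bytes and non-ASCII bytes become a
            # fixed-width octal escape, so the result stays on one line.
            out.append("\\%03o" % b)
    return 'TZ="' + "".join(out) + '"'


print(quote_tz(b"Europe/London"))
print(quote_tz(b"weird\nvalue"))
```

Because the result always starts with TZ=" and never contains a raw newline, it 
cannot be mistaken for a data line, whatever the TZ value looks like.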

 > the quotes and TZ= part are an unnecessary distraction IMO.

Some decoration is needed in order to make it easy to distinguish a TZ= line 
from an ordinary data line. This is because a TZ string can be almost anything: 
it can look like a data line, for example.

Anyway, if this is the worst of zdump -i's problems, we should be OK.

 > - Indicating daylight/standard with an arbitrary positive integer: if this is
 > going to be a canonical format, we need to be more precise than that.
 > Equivalent outputs should be equal. I'd also prefer it not to be an integer
 > at all, given that it's indicating a Boolean value.

tm_isdst is defined by ISO C11 and by POSIX to be an int value, so if we want 
zdump to work with all standard-conforming implementations without losing 
information, it must be able to represent an arbitrary int somehow. The existing 
zdump -v format can do it, and it would be odd if zdump -i format were to lose 
that ability.

 > - I'd *really* like colons in the UT offsets

That is mostly just a style thing. That being said, in my experience most UT 
offsets that contain hours and minutes omit colons (this includes several 
examples in the RFC-5322-format header in your email :-).

 > - I think it's simpler to think about the transition times in UT, indicated
 > with a Z in the output.

That's not my experience. Most of our sources do not base transitions on UT, and 
I typically think about local time when mulling over transitions and DST rules.

 > choosing the local time *after* the transition isn't how most people think
 > about transitions in day to day conversation.

True. But it's easy to get used to when looking at zdump -i format. Plus, users 
most likely prefer localtime to UT when thinking about transitions.

 > Just the fact that there's ambiguity

The format is documented, and if the documentation is understood correctly, 
zdump -i output has just one interpretation, so there is no ambiguity. A problem 
might arise if someone attempts to look at zdump -i output without reading the 
documentation; although such a problem could occur with any format choice, some 
formats are less confusing than others, and most likely that is what you're 
referring to.

To some extent there is a tradeoff between formats that make typos easy to find, 
and formats that are more what users typically expect. Within reason I'd rather 
make typos easy to find, as typos are a real problem!

 > - Omitting the abbreviation when it happens to be the same as the UT offset
 > makes the file harder to parse for very little benefit in my view.

First, it's trivial to parse zdump -i lines even when the abbreviation is 
omitted. For example, here's an awk script that outputs only zdump -i lines that 
correspond to DST transitions even when abbreviations are omitted:

/^[0-9]/ && NF > 3 && /[0-9]$/ {print}

Compare this to an awk script to do the same thing with tzvalidate format:

/^[0-9]/ && $(NF - 1) == "daylight" {print}

which is not significantly simpler.
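
For anyone more comfortable in Python than awk, the first filter above can be 
sketched roughly as follows. This is only an illustration: the function name 
is_dst_transition is made up here, and it relies on the same assumptions as the 
awk script, namely that a data line starts with a digit and that a DST line has 
more than three fields and ends with a digit.

```python
import re


def is_dst_transition(line: str) -> bool:
    """Mirror the awk test /^[0-9]/ && NF > 3 && /[0-9]$/ (a sketch)."""
    line = line.rstrip("\n")
    return bool(
        re.match(r"[0-9]", line)          # /^[0-9]/ : a data line, not TZ=...
        and len(line.split()) > 3         # NF > 3   : more than date, time, offset
        and re.search(r"[0-9]$", line)    # /[0-9]$/ : ends with the isdst digit
    )


lines = [
    "1981-04-01 01 +07 1",
    "1981-09-30 23 +06",
    "1983-04-01 01 +07 +08 1",
]
for line in lines:
    if is_dst_transition(line):
        print(line)
```

The three conditions map one-for-one onto the awk pattern, so this is no simpler 
than the awk version either.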

Second, I realize the improvement is of little benefit to those who do not read 
zdump output. But any unambiguous format would do for that case; we could pick 
JSON format, or XML format, or whatever. Being somewhat old-fashioned I'd like a 
text format that makes it easy for me to read zdump -i format using an ordinary 
text editor. And for me, it's quite useful that redundant abbreviations are 
omitted. Consider, for example, this output:

1981-04-01 01 +07 1
1981-09-30 23 +06
1982-04-01 01 +07 1
1982-09-30 23 +06
1983-04-01 01 +07 +08 1
1983-09-30 23 +06
1984-04-01 01 +07 1
1984-09-30 02 +06

where the (incorrect) 1983-04-01 transition sticks out like a sore thumb. In 
contrast, if the abbreviation were always output and columns always lined up, 
and the output looked like this:

1981-04-01 01 +07 +07 1
1981-09-30 23 +06 +06 0
1982-04-01 01 +07 +07 1
1982-09-30 23 +06 +06 0
1983-04-01 01 +07 +08 1
1983-09-30 23 +06 +06 0
1984-04-01 01 +07 +07 1
1984-09-30 02 +06 +06 0

the same typo is *much* harder to spot.

So it is not "very little benefit". It's a big deal to someone like me who wants 
to catch typos and who has to deal with the consequences of typos.
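
The same property helps a program as well as a human reader. As a rough sketch, 
here is some Python that flags lines that spell out an abbreviation, using the 
first example above as input. The function name explicit_abbreviation is 
hypothetical, and the code assumes that a fourth field made up only of digits is 
the isdst value rather than an abbreviation.

```python
import re


def explicit_abbreviation(line: str):
    """Return the abbreviation a line spells out, or None if it is omitted."""
    fields = line.split()
    if len(fields) < 4:
        return None                      # date, time, offset only
    candidate = fields[3]
    if re.fullmatch(r"[0-9]+", candidate):
        return None                      # assumed to be isdst, not an abbreviation
    return candidate


sample = [
    "1981-04-01 01 +07 1",
    "1981-09-30 23 +06",
    "1982-04-01 01 +07 1",
    "1982-09-30 23 +06",
    "1983-04-01 01 +07 +08 1",
    "1983-09-30 23 +06",
    "1984-04-01 01 +07 1",
    "1984-09-30 02 +06",
]
flagged = [(line, explicit_abbreviation(line))
           for line in sample if explicit_abbreviation(line) is not None]
for line, abbr in flagged:
    print(line, "<- explicit abbreviation", abbr)
```

With this input only the 1983-04-01 line is flagged. In a zone that normally 
uses alphabetic abbreviations every line would be flagged, so this is useful 
only for zones like the one in the example.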

 > for times, I'd favour at least keeping the minutes

I was tempted by that too, on the grounds that it's what readers typically 
expect. However, it makes typos harder to catch, which is a significant 
disadvantage.

I hope I've explained the significant technical advantages of zdump -i format 
for my use case (manually looking at zdump -i output, and looking at diffs of 
it). I am not surprised that its style is offputting, which is why I'm thinking 
that we may need a way for people to specify output style more flexibly than 
zdump -i versus zdump -v versus zdump -V.



