[tz] zdump new option -i for easier-to-review output

Sun Jun 5 08:26:51 UTC 2016

Right, I've now had a chance to do a bit more work on this. The various
options are committed in a github branch
<https://github.com/jskeet/nodatime/tree/tzvalidate-options> of Noda Time.

I have a few concerns about the proposed format, but I definitely agree
that we need to consider the audience and use cases. The use case I'm
primarily interested in is validation: diffing a "golden" file with one
generated by another tool. For example, to validate that Noda Time is doing
the right thing, I'd compare the output of zdump with the output of
NodaTime.TzValidate.NodaDump. Ideally, there will be no differences, so
nothing to look at. If there *are* differences, I need to be able to
understand them easily. Sometimes that will be missing lines on one side or
the other indicating a different number of transitions, sometimes it will
be differences between two lines (e.g. the transition point). In my use
case, one would rarely, if ever, be visually examining a single file to
look for anomalies, which is Paul's use case.
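
As a rough illustration of that validation workflow, here is a minimal Python sketch (not the actual Noda Time tooling; `diff_dumps` and the "candidate" data are made up for the example). It diffs a golden dump against one from another tool, so an empty result means there is nothing to look at:

```python
import difflib

def diff_dumps(golden_text, candidate_text):
    """Return unified-diff lines between two dumps; an empty list means they match."""
    return list(difflib.unified_diff(
        golden_text.splitlines(),
        candidate_text.splitlines(),
        fromfile="golden",
        tofile="candidate",
        lineterm="",
        n=0,  # no context lines: only show what actually differs
    ))

golden = """Pacific/Honolulu
Initially:           -10:31:26 standard LMT
1896-01-13 22:31:26Z -10:30:00 standard HST
1933-04-30 12:30:00Z -09:30:00 daylight HDT
"""
# Hypothetical output from another tool that gets one transition instant wrong.
candidate = golden.replace("1933-04-30 12:30:00Z", "1933-04-30 12:00:00Z")

for line in diff_dumps(golden, candidate):
    print(line)
```

With one line per transition, a wrong transition point shows up as a single removed/added pair, and a missing transition as a single removed or added line.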

In terms of the users themselves - while I'd expect them to be *somewhat*
domain experts (people writing date/time libraries) I wouldn't expect them to be
dealing with this format every day - so it should really be as clear as
possible without having to consult the man page each time. (I'd envisage
maybe having to look at the files once every six months or year.)

The other "user" to consider is the machine: there are some cases where
it's very useful to be able to parse the file easily from code. For
example, some platforms I've looked at definitely get the abbreviation
wrong in many cases, so before diffing I remove the abbreviation. That's
trivial to do in the current format - but much harder when some parts are
optional and everything is variable width.
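
To show why that is trivial today, here is a small Python sketch (the helper name is hypothetical, and it assumes lines in the currently documented layout, where the abbreviation is always the last field and always follows the standard/daylight word):

```python
INDICATORS = (" standard", " daylight")

def strip_abbreviation(line):
    """Drop the trailing abbreviation from a transition line.

    Lines that are not transition lines (e.g. the zone ID) are returned unchanged.
    """
    head, _, _abbr = line.rstrip().rpartition(" ")
    return head if head.endswith(INDICATORS) else line

sample = [
    "Pacific/Honolulu",
    "Initially:           -10:31:26 standard LMT",
    "1896-01-13 22:31:26Z -10:30:00 standard HST",
    "1933-04-30 12:30:00Z -09:30:00 daylight HDT",
]
for line in sample:
    print(strip_abbreviation(line))
```

Because every line has the indicator and the abbreviation, one rule covers everything. Once either field becomes optional, the code has to guess which field is which.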

Regarding compactness: again, this comes down to use cases. I don't
particularly mind the file being reasonably large in total, so long as each
zone is simple to look at. (I don't want multiple lines per transition, for
example.) When zipped, there's not much difference between my original
format and the smallest one I've tested (128K vs 106K). If we can make it
more compact easily, that's fine - but I personally regard that as a much
lower priority than other aspects of the format.

Okay, concerns:

   - I don't see why we need the quoted form for the time zone ID. That's
   going to be a mild pain to generate robustly in terms of escaping, and it's
   not clear what would happen for non-ASCII characters anyway. Assuming we'll
   never get a line break as part of a zone ID, I think just including the ID
   in UTF-8 is the simplest plan. Presumably the benefit of the proposed
   format is that you can copy/paste it into a Unix shell to use that time
   zone. That's certainly not a use case that I'd personally find useful, but
   the quotes and TZ= part are an unnecessary distraction IMO.
   - Indicating daylight/standard with an arbitrary positive integer: if
   this is going to be a canonical format, we need to be more precise than
   that. Equivalent outputs should be equal. I'd also prefer it not to be an
   integer at all, given that it's indicating a Boolean value... where there's
   a number, there's an expectation (IMO) that the numeric value is
   meaningful. Just changing standard/daylight to s/d makes it a lot more
   compact, but I'd prefer std/day so that the meaning is obvious. While we
   *could* omit the value for standard time, I still think there's a benefit
   in making every line consistent. Again, this comes down to a difference in
   use cases.
   - I'd *really* like colons in the UT offsets - "-103126" looks like a
   regular integer to me, whereas "-10:31:26" is fairly obviously 10 hours, 31
   minutes and 26 seconds.
   - Personally I think it's simpler to think about the transition times in
   UT, indicated with a Z in the output. In particular, choosing the local
   time *after* the transition isn't how most people think about
   transitions in day to day conversation. If I were describing the UK rules,
   I'd say that in spring we advance our clocks at 1am and in the autumn we
   move them back at 2am... whereas in this format, that would be shown as
   advancing the clocks *to* 2am and moving them back *to* 1am. Just the
   fact that there's ambiguity suggests to me that using UT everywhere is a
   clearer option. The "Z" on every line is redundant, but IMO it helps with
   clarity.
   - Omitting the abbreviation when it happens to be the same as the UT
   offset makes the file harder to parse for very little benefit in my view.
   That's taking compactness further than is useful.
   - In terms of omitting 0 minutes and 0 seconds values: for times, I'd
   favour at least keeping the minutes: "2016-06-05 21:00" still looks like a
   date and time, whereas "2016-06-05 21" looks like a date and then 21. This
   isn't as much of a concern with offsets though - "+05" is reasonably clear
   on its own.
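
To make two of those points concrete, here is a Python sketch (function names are made up for illustration; this is not code from zdump or Noda Time). The first function writes an offset with colons, optionally dropping trailing zero fields. The second converts a "local time after the transition" into the UT instant, using the UK example above with 2016 dates:

```python
from datetime import datetime, timedelta

def format_offset(total_seconds, shorten=False):
    """Format a UT offset as +HH:MM:SS, optionally dropping trailing zero fields."""
    sign = "-" if total_seconds < 0 else "+"
    hours, rest = divmod(abs(total_seconds), 3600)
    minutes, seconds = divmod(rest, 60)
    parts = [hours, minutes, seconds]
    if shorten:
        # "-10:00:00" becomes "-10", "-10:30:00" becomes "-10:30"
        while len(parts) > 1 and parts[-1] == 0:
            parts.pop()
    return sign + ":".join(f"{p:02d}" for p in parts)

def ut_of_transition(local_after, offset_after_seconds):
    """Convert the local time just after a transition to the UT instant."""
    return local_after - timedelta(seconds=offset_after_seconds)

print(format_offset(-37886))                # Honolulu LMT
print(format_offset(-37800, shorten=True))
print(format_offset(-36000, shorten=True))

# UK 2016: clocks go *to* 02:00 (+01:00) in spring and *to* 01:00 (+00:00) in autumn.
spring = ut_of_transition(datetime(2016, 3, 27, 2, 0), 3600)
autumn = ut_of_transition(datetime(2016, 10, 30, 1, 0), 0)
print(f"{spring:%Y-%m-%d %H:%M:%S}Z", f"{autumn:%Y-%m-%d %H:%M:%S}Z")
```

In UT both UK transitions come out at 01:00:00Z, which is part of why I find the UT representation easier to reason about.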

Six sample formats to compare for Honolulu (one of the examples given in
Paul's man page), in the order of the commits in the github branch. The
number is the size of the file (including headers) for all zones. All of
these still represent the transition in UT:

"Original" (currently documented tzvalidate) - 1,735,616 bytes

Pacific/Honolulu
Initially:           -10:31:26 standard LMT
1896-01-13 22:31:26Z -10:30:00 standard HST
1933-04-30 12:30:00Z -09:30:00 daylight HDT
1933-05-21 21:30:00Z -10:30:00 standard HST
1942-02-09 12:30:00Z -09:30:00 daylight HDT
1945-09-30 11:30:00Z -10:30:00 standard HST
1947-06-08 12:30:00Z -10:00:00 standard HST

Short daylight and standard indicators - 1,463,421 bytes

Pacific/Honolulu
Initially:           -10:31:26 s LMT
1896-01-13 22:31:26Z -10:30:00 s HST
1933-04-30 12:30:00Z -09:30:00 d HDT
1933-05-21 21:30:00Z -10:30:00 s HST
1942-02-09 12:30:00Z -09:30:00 d HDT
1945-09-30 11:30:00Z -10:30:00 s HST
1947-06-08 12:30:00Z -10:00:00 s HST

Shorter offsets, but still with colons - 1,240,377 bytes

Pacific/Honolulu
Initially:           -10:31:26 s LMT
1896-01-13 22:31:26Z -10:30 s HST
1933-04-30 12:30:00Z -09:30 d HDT
1933-05-21 21:30:00Z -10:30 s HST
1942-02-09 12:30:00Z -09:30 d HDT
1945-09-30 11:30:00Z -10:30 s HST
1947-06-08 12:30:00Z -10 s HST

Shorter offsets, no colons - 1,236,955 bytes

Pacific/Honolulu
Initially:           -103126 s LMT
1896-01-13 22:31:26Z -1030 s HST
1933-04-30 12:30:00Z -0930 d HDT
1933-05-21 21:30:00Z -1030 s HST
1942-02-09 12:30:00Z -0930 d HDT
1945-09-30 11:30:00Z -1030 s HST
1947-06-08 12:30:00Z -10 s HST

Variable transition times, e.g. "21" instead of "21:00:00Z" (and changing
Initially to - -) - 972,361 bytes

Pacific/Honolulu
- - -10:31:26 s LMT
1896-01-13 22:31:26 -10:30 s HST
1933-04-30 12:30 -09:30 d HDT
1933-05-21 21:30 -10:30 s HST
1942-02-09 12:30 -09:30 d HDT
1945-09-30 11:30 -10:30 s HST
1947-06-08 12:30 -10 s HST

Variable transition times, but always keeping minutes - 1,079,278 bytes

Content is the same as the above, due to all the transitions happening on
the half hour...

To show the difference between the last two options, here's Pago_Pago:

Pacific/Pago_Pago
- - +12:37:12 s LMT
1879-07-04 11:22:48 -11:22:48 s LMT
1911-01-01 11:22:48 -11 s NST
1967-04-01 11 -11 s BST
1983-11-30 11 -11 s SST

vs

Pacific/Pago_Pago
- - +12:37:12 s LMT
1879-07-04 11:22:48 -11:22:48 s LMT
1911-01-01 11:22:48 -11 s NST
1967-04-01 11:00 -11 s BST
1983-11-30 11:00 -11 s SST

(I'd prefer to keep the Z in there, admittedly - that wasn't an option I
happened to code though. It's easy enough to imagine it...)
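
For reference, the difference between those last two options can be sketched in Python as follows (an illustration only, with a made-up function name; it is not the code in the branch):

```python
from datetime import datetime

def format_transition(ut, keep_minutes):
    """Format a UT transition time, dropping zero seconds and (optionally) zero minutes."""
    text = f"{ut:%Y-%m-%d %H}"
    if ut.second:
        return text + f":{ut:%M:%S}"
    if ut.minute or keep_minutes:
        return text + f":{ut:%M}"
    return text

for ut in (datetime(1879, 7, 4, 11, 22, 48),
           datetime(1933, 4, 30, 12, 30),
           datetime(1967, 4, 1, 11, 0)):
    print(format_transition(ut, keep_minutes=False), "|",
          format_transition(ut, keep_minutes=True))
```

Only transitions that fall exactly on the hour differ between the two variants, which is why Honolulu looks identical in both.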

With all that in mind, I would *personally* prefer to stick to the
currently documented tzvalidate format. For my use cases of diffing and
machine parsing, the fixed-width format is useful, as is always specifying
both the daylight/standard indicator and the name. I could live with the
offset and time shortening, but I'd definitely prefer to have colons in the
offset, and to keep minutes in the time part.

Thoughts?

Jon

On 30 May 2016 at 22:59, Paul Eggert <eggert at cs.ucla.edu> wrote:

> Jon Skeet wrote:
>
>> I'd personally be willing to sacrifice a
>> certain amount of compactness for the sake of readability, but obviously
>> if we can get the size down a bit *without* losing readability, that would
>> be good.
>>
>
> Yes. Readability is to some extent in the eye of the beholder, and the
> proposed zdump -i format wasn't my first choice: it evolved over some time
> as I used it to look at a lot of data. To some extent the format is aimed
> at my needs, and may be less suited for novices. For example:
>
> TZ="America/Phoenix"
> - - -072818 LMT
> 1883-11-18 12 -07 MST
> 1918-03-31 03 -06 MDT 1
> 1918-10-27 01 -07 MST
> 1919-03-30 03 -06 MDT 1
> 1919-10-26 01 -07 MST
> 1942-02-09 03 -06 MWT 1
> 1943-12-31 23:01 -07 MST
> 1944-04-01 01:01 -06 MWT 1
> 1944-09-30 23:01 -07 MST
> 1967-04-30 03 -06 MDT 1
> 1967-10-29 01 -07 MST
>
> Here the columns don't line up and although this may be a bit offputting
> for some, for me it's a plus as it causes the unusual WWII non-hour
> transitions to stand out. Also, it's easier to visually identify the
> daylight-saving transitions via "1" vs nothing, than to scan through a
> column saying "isdst=1" vs "isdst=0". In contrast:
>
> America/Phoenix
> Initially:           -07:28:18 standard LMT
> 1883-11-18 19:00:00Z -07:00:00 standard MST
> 1918-03-31 09:00:00Z -06:00:00 daylight MDT
> 1918-10-27 08:00:00Z -07:00:00 standard MST
> 1919-03-30 09:00:00Z -06:00:00 daylight MDT
> 1919-10-26 08:00:00Z -07:00:00 standard MST
> 1942-02-09 09:00:00Z -06:00:00 daylight MWT
> 1944-01-01 06:01:00Z -07:00:00 standard MST
> 1944-04-01 07:01:00Z -06:00:00 daylight MWT
> 1944-10-01 06:01:00Z -07:00:00 standard MST
> 1967-04-30 09:00:08Z -06:00:00 daylight MDT
> 1967-10-29 08:00:00Z -07:00:00 standard MST
>
> Although this conveys the same information, it's harder to catch
> anomalies, as the nicely-aligned columns and data tend to blur into each
> other. For example, it's hard to spot the error that I deliberately
> introduced into the penultimate line of that data, whereas the same error
> would have been much easier to see in zdump -i format.
>