[tz] Proposal: validation text file with releases

Sat Jul 11 10:35:44 UTC 2015

Background: I'm the primary developer for Noda Time <http://nodatime.org>, which
consumes the tz data. I'm currently refactoring the code that does this, and
I've come across some code (originally ported from Joda Time) where I now
understand what it's doing, but not exactly why.

For a little while now, the Noda Time source repo has included a text file
<https://github.com/nodatime/nodatime/blob/master/src/NodaTime.Test/TestData/tzdb-dump.txt>
containing a dump of every transition (up to 2100, at the moment) for
every time zone. It looks like this, picking just one zone as an example:

Zone: Africa/Maseru
LMT: [StartOfTime, 1892-02-07T22:08:00Z) +01:52 (+00)
SAST: [1892-02-07T22:08:00Z, 1903-02-28T22:30:00Z) +01:30 (+00)
SAST: [1903-02-28T22:30:00Z, 1942-09-20T00:00:00Z) +02 (+00)
SAST: [1942-09-20T00:00:00Z, 1943-03-20T23:00:00Z) +03 (+01)
SAST: [1943-03-20T23:00:00Z, 1943-09-19T00:00:00Z) +02 (+00)
SAST: [1943-09-19T00:00:00Z, 1944-03-18T23:00:00Z) +03 (+01)
SAST: [1944-03-18T23:00:00Z, EndOfTime) +02 (+00)

I use this file for confidence when refactoring my time zone handling code
- if the new code comes up with the same set of transitions as the old
code, it's probably okay. (This is just one line of defence, of course -
there are unit tests, though not as many as I'd like.)
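
To make that check concrete, here is a rough sketch in Python (chosen purely
for illustration; it is not how Noda Time does it). It assumes the layout shown
above: a "Zone:" header followed by one interval per line. It also assumes that
the first offset on each line is the total offset from UTC and the
parenthesised one is the daylight saving component. The names ZoneInterval,
parse_dump and changed_zones are made up for this example.

```python
import re
from typing import NamedTuple


class ZoneInterval(NamedTuple):
    name: str     # abbreviation, e.g. "SAST"
    start: str    # inclusive start instant, or "StartOfTime"
    end: str      # exclusive end instant, or "EndOfTime"
    offset: str   # assumed: total offset from UTC
    savings: str  # assumed: daylight saving part of that offset


# Matches e.g. "SAST: [1892-02-07T22:08:00Z, 1903-02-28T22:30:00Z) +01:30 (+00)"
INTERVAL_RE = re.compile(
    r"^(?P<name>[^:]+): \[(?P<start>[^,]+), (?P<end>[^)]+)\) "
    r"(?P<offset>\S+) \((?P<savings>[^)]+)\)$"
)


def parse_dump(text):
    """Return {zone id: [ZoneInterval, ...]} for a dump in the format above."""
    zones = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("Zone: "):
            current = zones.setdefault(line[len("Zone: "):], [])
            continue
        match = INTERVAL_RE.match(line)
        if match is None or current is None:
            raise ValueError(f"unrecognised line: {line!r}")
        current.append(ZoneInterval(**match.groupdict()))
    return zones


def changed_zones(old_text, new_text):
    """List the zone ids whose intervals differ between two dumps."""
    old, new = parse_dump(old_text), parse_dump(new_text)
    return sorted(z for z in old.keys() | new.keys() if old.get(z) != new.get(z))
```

An empty result from changed_zones(dump_before, dump_after) is the "probably
okay" signal; anything else names the zones to look at.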

It strikes me that having a similar file (I'm not wedded to the format, but
it should have all the same information, one way or another) released
alongside the main data files would be really handy for *all* implementors
- it would be a good way of validating consistency across multiple
platforms, with the release data being canonical. For any platforms which
didn't want to actually consume the rules as rules, but just wanted a list
of transitions, it could even effectively replace their use of the data.

One other benefit: diffing the dump between two releases would make it
clear what had changed in *effect*, rather than just in terms of rules.
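
As a sketch of what that diff could look like, the following Python uses the
standard difflib module to report, per zone, which lines changed between two
dumps. It assumes both dumps use the "Zone:" header plus one line per interval
layout shown earlier; split_by_zone and diff_releases are hypothetical names,
not part of any existing tool.

```python
import difflib


def split_by_zone(dump_text):
    """Group the non-blank lines of a dump under their "Zone:" header."""
    zones = {}
    current = None
    for line in dump_text.splitlines():
        if line.startswith("Zone: "):
            current = zones.setdefault(line[len("Zone: "):].strip(), [])
        elif line.strip() and current is not None:
            current.append(line.strip())
    return zones


def diff_releases(old_text, new_text):
    """Return {zone id: unified diff lines} for zones whose dump changed."""
    old, new = split_by_zone(old_text), split_by_zone(new_text)
    report = {}
    for zone in sorted(old.keys() | new.keys()):
        before, after = old.get(zone, []), new.get(zone, [])
        if before != after:
            # n=0 drops context lines, so only the changed intervals appear.
            report[zone] = list(difflib.unified_diff(
                before, after, fromfile="old", tofile="new", lineterm="", n=0))
    return report
```

Zones that were added or removed between releases show up as entirely "+" or
entirely "-" lines, and zones with no effective change are not listed at all.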

One sticking point is size. The current file for Noda Time is about 4 MB,
although it zips down to about 300 KB. Some thoughts around this:

   - We wouldn't need to distribute it in the same file as the data - just
   as we have separate data and code files, there could be a "textdump" file
   or whatever we'd want to call it. These could be retroactively generated
   for previous releases, too.
   - As you can see, there's redundancy in the format above, in that it's a
   list of "zone intervals" (as I call them in Noda Time) rather than a list
   of transitions - the end of each interval is always the start of the next
   interval.
   - For zones which settle into an infinite daylight saving pattern, I
   currently generate from the start of time to 2100 (and then a single zone
   interval for the end of time as Noda Time understands it; we'd need to work
   out what form that would take, if any). If we decided that "year of release
   + 30 years" was enough, that would cut down the size considerably.
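
To illustrate the last two points, here is a minimal Python sketch that turns
a zone's intervals into an initial state plus a list of transitions, and
optionally stops after a cut-off year. It assumes each interval is a
(name, start, end, offset, savings) tuple of strings as in the dump above;
to_transitions and last_year are invented for this example, and the output
shape is only one possibility, not a proposed format.

```python
from datetime import datetime


def to_transitions(intervals, last_year=None):
    """Collapse contiguous intervals into (initial state, transitions).

    Each transition is (instant, name, offset, savings): the state that
    applies from that instant onwards. Transitions after last_year are
    dropped when a cut-off is given.
    """
    if not intervals:
        return None, []
    # The redundancy noted above: each end must equal the next start.
    for prev, nxt in zip(intervals, intervals[1:]):
        if prev[2] != nxt[1]:
            raise ValueError(f"gap or overlap between {prev[2]} and {nxt[1]}")
    first = intervals[0]
    initial = (first[0], first[3], first[4])
    transitions = []
    for name, start, _end, offset, savings in intervals[1:]:
        if last_year is not None:
            year = datetime.strptime(start, "%Y-%m-%dT%H:%M:%SZ").year
            if year > last_year:
                break
        transitions.append((start, name, offset, savings))
    return initial, transitions
```

For the Africa/Maseru example this gives one initial state (LMT) and six
transitions in place of seven intervals, and something like last_year=2045
would be the "year of release + 30 years" cut-off for a 2015 release. Note
that a truncated list no longer says anything about what happens after the
cut-off, which is part of what would need deciding.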

Any thoughts? If the feeling is broadly positive, the next step would be to
nail down the text format, then find a willing victim/volunteer to write
the C code. (You really don't want me writing C...)

Jon