[tz] Proposal: validation text file with releases

Wed Apr 27 15:52:36 UTC 2016

Thank you very much for this service, Jon.  It is extremely valuable to me for validating my own IANA timezone database parser:

https://github.com/HowardHinnant/date

I do not know, but I suspect, that this would also be valuable for validating both tzcode and tzdata.  Any time that multiple independently developed pieces of software come up with exactly the same answer, that is generally a very good sign.  If a validation file such as this revealed a difference between tzcode and Noda Time (or my own parser), the bug would not necessarily lie with Noda Time (or my own parser).  A latent bug in tzcode might be revealed this way.

And this is also a very convenient way to check the differences between sequential versions of tzdata, to ensure that the intended changes are actually the changes seen in the list of transitions.
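
As a rough illustration of that kind of check, here is a minimal Python sketch that reports which zones differ between two dumps. It assumes both dumps use exactly the zone-interval line format shown in Jon's original proposal quoted below; the names parse_dump and changed_zones are hypothetical and are not part of tzvalidate, Noda Time, or my date library.

```python
import re

# Assumed line shape, copied from the sample in the quoted proposal:
#   NAME: [start, end) offset (savings)
INTERVAL = re.compile(r"^(\S+): \[(\S+), (\S+)\) (\S+) \((\S+)\)$")


def parse_dump(text):
    """Map each zone ID to its list of (name, start, end, offset, savings)."""
    zones = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Zone: "):
            # A "Zone:" header starts a new list of intervals.
            current = line[len("Zone: "):]
            zones[current] = []
            continue
        match = INTERVAL.match(line)
        if match and current is not None:
            zones[current].append(match.groups())
    return zones


def changed_zones(old_text, new_text):
    """Return the sorted zone IDs whose interval lists differ or are missing."""
    old, new = parse_dump(old_text), parse_dump(new_text)
    return sorted(z for z in old.keys() | new.keys() if old.get(z) != new.get(z))
```

Running changed_zones over the dumps for two consecutive tzdata releases would list only the zones whose effective transitions changed, which can then be compared with the changes announced for that release.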

I consider creating this validation file, and checking it against Jon’s independently created validation file, a critical test for every new version of the database:

https://github.com/HowardHinnant/date/blob/master/test/tz_test/validate.cpp

And I would be in favor of bundling such a validation file with either the tzdata or tzcode releases, or as a third release alongside these two.

Thanks again, Jon.

Howard

On Apr 27, 2016, at 10:56 AM, Jon Skeet <skeet at pobox.com> wrote:
> 
> For anyone still interested in this, I've now moved the data to http://nodatime.github.io/tzvalidate/ and created a Travis job which lets me update it mostly-automatically. (When there's a new TZDB release, I need to build the Noda Time data file, push that, then manually trigger a Travis build for tzvalidate.)
> 
> Of course, if there were any appetite for building and distributing this along with tzcode and tzdata, that would be even better :)
> 
> Jon
> 
> 
> On 11 July 2015 at 11:35, Jon Skeet <skeet at pobox.com> wrote:
>> Background: I'm the primary developer for Noda Time, which consumes the tz data. I'm currently refactoring the code to do this... and I've come across some code (originally ported from Joda Time) which I now understand in terms of what it's doing, but not exactly why.
>> 
>> For a little while now, the Noda Time source repo has included a text file containing a dump of every transition (up to 2100, at the moment) for every time zone. It looks like this, picking just one example:
>> Zone: Africa/Maseru
>> LMT: [StartOfTime, 1892-02-07T22:08:00Z) +01:52 (+00)
>> SAST: [1892-02-07T22:08:00Z, 1903-02-28T22:30:00Z) +01:30 (+00)
>> SAST: [1903-02-28T22:30:00Z, 1942-09-20T00:00:00Z) +02 (+00)
>> SAST: [1942-09-20T00:00:00Z, 1943-03-20T23:00:00Z) +03 (+01)
>> SAST: [1943-03-20T23:00:00Z, 1943-09-19T00:00:00Z) +02 (+00)
>> SAST: [1943-09-19T00:00:00Z, 1944-03-18T23:00:00Z) +03 (+01)
>> SAST: [1944-03-18T23:00:00Z, EndOfTime) +02 (+00)
>> 
>> I use this file for confidence when refactoring my time zone handling code - if the new code comes up with the same set of transitions as the old code, it's probably okay. (This is just one line of defence, of course - there are unit tests, though not as many as I'd like.)
>> 
>> It strikes me that having a similar file (I'm not wedded to the format, but it should have all the same information, one way or another) released alongside the main data files would be really handy for all implementors - it would be a good way of validating consistency across multiple platforms, with the release data being canonical. For any platforms which didn't want to actually consume the rules as rules, but just wanted a list of transitions, it could even effectively replace their use of the data.
>> 
>> One other benefit: diffing the dump between two releases would make it clear what had changed in effect, rather than just in terms of rules.
>> 
>> One sticking point is size. The current file for Noda Time is about 4MB, although it zips down to about 300K. Some thoughts around this:
>> 	• We wouldn't need to distribute it in the same file as the data - just as we have data and code files, there could be a "textdump" file or whatever we'd want to call it. These could be retroactively generated for previous releases, too.
>> 	• As you can see, there's redundancy in the format above, in that it's a list of "zone intervals" (as I call them in Noda Time) rather than a list of transitions - the end of each interval is always the start of the next interval.
>> 	• For zones which settle into an infinite daylight saving pattern, I currently generate from the start of time to 2100 (and then a single zone interval for the end of time as Noda Time understands it; we'd need to work out what form that would take, if any). If we decided that "year of release + 30 years" was enough, that would cut down the size considerably.
>> Any thoughts? If the feeling is broadly positive, the next step would be to nail down the text format, then find a willing victim/volunteer to write the C code. (You really don't want me writing C...)
>> 
>> Jon
>> 
> 
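
On the redundancy Jon mentions in the quoted proposal (the end of each zone interval is always the start of the next), a minimal Python sketch of collapsing intervals into a list of transitions might look like the following. It assumes the exact line format of the Africa/Maseru sample above; the function name intervals_to_transitions is hypothetical and does not come from Noda Time or tzvalidate.

```python
import re

# Assumed line shape, copied from the quoted sample:
#   NAME: [start, end) offset (savings)
INTERVAL = re.compile(r"^(\S+): \[(\S+), (\S+)\) (\S+) \((\S+)\)$")


def intervals_to_transitions(lines):
    """Collapse contiguous zone intervals into (instant, name, offset, savings).

    Only the start of each interval is kept. The end is dropped because it
    must equal the start of the following interval, which is checked here.
    """
    transitions = []
    previous_end = None
    for line in lines:
        match = INTERVAL.match(line.strip())
        if not match:
            continue  # skips "Zone:" headers and blank lines
        name, start, end, offset, savings = match.groups()
        if previous_end is not None and start != previous_end:
            raise ValueError(f"gap or overlap before {start}")
        transitions.append((start, name, offset, savings))
        previous_end = end
    return transitions
```

Storing one instant per line instead of two would remove the duplicated timestamps, and the contiguity check doubles as a sanity test on the dump itself.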
