[tz] Proposal: validation text file with releases

Sat Jul 18 19:40:52 UTC 2015

Next update: I've improved the zdump-based generation of the data, and put
the data in the current format for all the tz data releases I can find
(from 1996 onwards) at http://nodatime.org/tzvalidate/

None of this is in any way meant to imply that I'm trying to freeze the
format - I appreciate Paul's point about it being an incomplete
representation of the zic data - but I wanted to get the data, such as it
is, out there already.

Noda Time 2.0 alpha now correctly generates the 2012e, 2013e, 2014e and
2015e data. Before checking it against the rest of the zdump output files,
I want to do a bit of work to make it easier to consume the source data
directly as a tgz.

Jon

On 14 July 2015 at 21:12, Jon Skeet <skeet at pobox.com> wrote:

> I've expanded this a bit - we now have implementations for:
>
>    - Joda Time
>    - Noda Time
>    - Java 7 (well, Java pre-8)
>    - Java 8
>    - ICU4J
>    - zdump
>    - Ruby's tzinfo gem
>
> I'd really appreciate any input at this point. There are still a few
> issues with the data collection - it's not the pristine file diff we'd like
> to end up with - but it's enough to highlight some discrepancies, which
> I'll probably write up as a blog post and cc here. I think the fact that it
> *is* showing up these differences is evidence that this could provide a
> lot of value with the support of the rest of the community (and with a
> better implementation of my zdump munging - ideally something in zic
> itself, I suspect). Who do I need to persuade? (Paul, I guess...)
>
> Jon
>
>
> On 13 July 2015 at 21:43, Jon Skeet <skeet at pobox.com> wrote:
>
>> Okay, I've created
>> https://github.com/nodatime/tzvalidate
>>
>> It allows you (well, someone who's got everything set up...) to compare
>> and contrast:
>>
>>    - Joda Time
>>    - Noda Time
>>    - Java 8
>>    - zdump
>>
>> Only Joda Time and Noda Time allow (and in fact require) a data version
>> to be specified. Obviously in order to compare data meaningfully, one has
>> to be using the same data in all places. That's the next thing to look
>> at... but they're all using the same output format, and the results are
>> already interesting in terms of some unexpected discrepancies. I haven't
>> had a chance to look into them yet.
>>
>> Jon
>>
>>
>> On 13 July 2015 at 16:06, Jon Skeet <skeet at pobox.com> wrote:
>>
>>> Given that I've already found discrepancies (see "Discrepancies in time
>>> zone data interpretation") I'm going to go ahead and hack on this in purely
>>> pragmatic (read: short term) ways. I'll create a GitHub repo just for this
>>> purpose and dump code in there - this is explicitly with the aim of
>>> encouraging a more permanent solution by proving value.
>>>
>>> Will post another message here when there's something worth looking at -
>>> I'll be initially looking at zdump output, Joda Time, standard Java, and
>>> Noda Time. Contributions from others for other languages/platforms will be
>>> very welcome.
>>>
>>> Jon
>>>
>>>
>>> On 13 July 2015 at 14:46, Stephen Colebourne <scolebourne at joda.org>
>>> wrote:
>>>
>>>> FWIW, I think such a format would be very useful. Effectively, it is a
>>>> unit test for others to confirm that they interpret the rules the same
>>>> way as intended.
>>>>
>>>> It is similar to what I produced when trying to demonstrate the amount
>>>> of change being caused by apparently "minor" changes to the data:
>>>> https://github.com/jodastephen/tzdiff/commits/master
>>>>
>>>> Any output of this type should indeed just consist of a simple text
>>>> file with ISO-8601 format timestamps.
>>>>
>>>> Stephen
>>>>
>>>>
>>>>
>>>> On 11 July 2015 at 11:35, Jon Skeet <skeet at pobox.com> wrote:
>>>> > Background: I'm the primary developer for Noda Time which consumes
>>>> > the tz data. I'm currently refactoring the code to do this... and
>>>> > I've come across some code (originally ported from Joda Time) which
>>>> > I now understand in terms of what it's doing, but not exactly why.
>>>> >
>>>> > For a little while now, the Noda Time source repo has included a
>>>> > text dump file, containing a text dump of every transition (up to
>>>> > 2100, at the moment) for every time zone. It looks like this,
>>>> > picking just one example:
>>>> >
>>>> > Zone: Africa/Maseru
>>>> > LMT: [StartOfTime, 1892-02-07T22:08:00Z) +01:52 (+00)
>>>> > SAST: [1892-02-07T22:08:00Z, 1903-02-28T22:30:00Z) +01:30 (+00)
>>>> > SAST: [1903-02-28T22:30:00Z, 1942-09-20T00:00:00Z) +02 (+00)
>>>> > SAST: [1942-09-20T00:00:00Z, 1943-03-20T23:00:00Z) +03 (+01)
>>>> > SAST: [1943-03-20T23:00:00Z, 1943-09-19T00:00:00Z) +02 (+00)
>>>> > SAST: [1943-09-19T00:00:00Z, 1944-03-18T23:00:00Z) +03 (+01)
>>>> > SAST: [1944-03-18T23:00:00Z, EndOfTime) +02 (+00)
>>>> >
>>>> > I use this file for confidence when refactoring my time zone
>>>> > handling code - if the new code comes up with the same set of
>>>> > transitions as the old code, it's probably okay. (This is just one
>>>> > line of defence, of course - there are unit tests, though not as
>>>> > many as I'd like.)
>>>> >
>>>> > It strikes me that having a similar file (I'm not wedded to the
>>>> > format, but it should have all the same information, one way or
>>>> > another) released alongside the main data files would be really
>>>> > handy for all implementors - it would be a good way of validating
>>>> > consistency across multiple platforms, with the release data being
>>>> > canonical. For any platforms which didn't want to actually consume
>>>> > the rules as rules, but just wanted a list of transitions, it could
>>>> > even effectively replace their use of the data.
>>>> >
>>>> > One other benefit: diffing the dump between two releases would make
>>>> > it clear what had changed in effect, rather than just in terms of
>>>> > rules.
>>>> >
>>>> > One sticking point is size. The current file for Noda Time is about
>>>> > 4MB, although it zips down to about 300K. Some thoughts around this:
>>>> >
>>>> >    - We wouldn't need to distribute it in the same file as the data -
>>>> >    just as we have data and code files, there could be a "textdump"
>>>> >    file or whatever we'd want to call it. These could be
>>>> >    retroactively generated for previous releases, too.
>>>> >    - As you can see, there's redundancy in the format above, in that
>>>> >    it's a list of "zone intervals" (as I call them in Noda Time)
>>>> >    rather than a list of transitions - the end of each interval is
>>>> >    always the start of the next interval.
>>>> >    - For zones which settle into an infinite daylight saving pattern,
>>>> >    I currently generate from the start of time to 2100 (and then a
>>>> >    single zone interval for the end of time as Noda Time understands
>>>> >    it; we'd need to work out what form that would take, if any). If
>>>> >    we decided that "year of release + 30 years" was enough, that
>>>> >    would cut down the size considerably.
>>>> >
>>>> > Any thoughts? If the feeling is broadly positive, the next step
>>>> > would be to nail down the text format, then find a willing
>>>> > victim/volunteer to write the C code. (You really don't want me
>>>> > writing C...)
>>>> >
>>>> > Jon
>>>> >
>>>>
>>>
>>>
>>
>
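
For anyone wanting to experiment with the zone-interval dump quoted in the
original message, here is a minimal Python sketch of how it could be read and
sanity-checked. It is not part of tzvalidate or Noda Time: it assumes only the
line layout of the Africa/Maseru example, and the names parse_zone and
to_transitions are hypothetical, chosen for illustration. It also shows the
redundancy mentioned in the thread: since each interval ends where the next
begins, the list collapses to one entry per transition.

```python
import re

# Assumed layout, taken from the Africa/Maseru example in the thread:
#   SAST: [1892-02-07T22:08:00Z, 1903-02-28T22:30:00Z) +01:30 (+00)
INTERVAL = re.compile(
    r"^(?P<name>\S+): \[(?P<start>[^,]+), (?P<end>[^)]+)\) "
    r"(?P<offset>\S+) \((?P<savings>[^)]+)\)$"
)


def parse_zone(lines):
    """Parse one 'Zone:' block into (zone_id, list of interval dicts)."""
    zone_id = None
    intervals = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("Zone: "):
            zone_id = line[len("Zone: "):]
            continue
        match = INTERVAL.match(line)
        if match is None:
            raise ValueError(f"unrecognised line: {line!r}")
        intervals.append(match.groupdict())
    return zone_id, intervals


def to_transitions(intervals):
    """Collapse contiguous intervals into (instant, name, offset, savings).

    Each tuple describes what applies from that instant onwards. Raises
    ValueError if one interval's end is not the next interval's start.
    """
    for prev, cur in zip(intervals, intervals[1:]):
        if prev["end"] != cur["start"]:
            raise ValueError(f"gap between {prev['end']} and {cur['start']}")
    return [(i["start"], i["name"], i["offset"], i["savings"]) for i in intervals]


# Sample data copied from the original message.
SAMPLE = """\
Zone: Africa/Maseru
LMT: [StartOfTime, 1892-02-07T22:08:00Z) +01:52 (+00)
SAST: [1892-02-07T22:08:00Z, 1903-02-28T22:30:00Z) +01:30 (+00)
SAST: [1903-02-28T22:30:00Z, 1942-09-20T00:00:00Z) +02 (+00)
SAST: [1942-09-20T00:00:00Z, 1943-03-20T23:00:00Z) +03 (+01)
SAST: [1943-03-20T23:00:00Z, 1943-09-19T00:00:00Z) +02 (+00)
SAST: [1943-09-19T00:00:00Z, 1944-03-18T23:00:00Z) +03 (+01)
SAST: [1944-03-18T23:00:00Z, EndOfTime) +02 (+00)
"""

zone_id, intervals = parse_zone(SAMPLE.splitlines())
transitions = to_transitions(intervals)
for instant, name, offset, savings in transitions:
    print(zone_id, instant, name, offset, savings)
```

Timestamps are kept as strings so that the StartOfTime and EndOfTime markers
need no special handling; a real consumer would presumably map them to
whatever its own platform uses for the extremes.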