[tz] What is the data product? (was: Fractional seconds in zic input)

Tue Feb 6 13:07:37 UTC 2018

I agree that the data is key, and by that I mean the distributed 'zic input
data' (e.g., "southamerica"). However, I disagree strongly about:

> no-one should really be overly concerned with the format in which we
> write it

When a data file format is in very widespread use, changes to it are
extremely painful for downstream clients. I've had plenty of experience
of this with Unicode, BCP47, and CLDR, and with comparable internal format
changes at the companies I've worked at. Seemingly trivial changes have a
way of screwing up lots of programs and millions of people.

*If the TZDB were not important, arbitrary changes would not matter.* But
it is a crucial part of the world's software stack; its very importance
cries out for stability. (As a trivial counterexample for "no-one should
really be overly concerned with the format", try changing the character set
of the files to EBCDIC and see how many squawks you get from users.)

Now, there are ways to both expand the format and retain stability. Here
are a couple of ways to do that.

A. Bifurcate the data

   1. *Core.* Always make available a set of data files in the current
   format. No changes to support "advanced" features like SAVE<0, fractional
   digits, etc. No splitting IDs because of advanced features either.
   2. *Advanced.* The format of data files can change "with no concern", in
   order to support "advanced" features.

One way to make this practical is to always have a program that generates
the core data by filtering the advanced data. It is important, however,
that such a program strictly minimize the textual changes to the core, so
that diffing produces changes on the order of what is seen now for updates
to country rules.
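
To make the filtering idea concrete, here is a minimal sketch in Python. It
is only an illustration under stated assumptions: the sole "advanced" feature
handled is fractional seconds, the fraction is truncated rather than rounded,
a '#' is assumed always to start a comment, and the names to_core_line and
to_core are hypothetical rather than part of any tz tool. The point it tries
to show is that everything except the offending token passes through byte for
byte, so a diff of the core files stays small.

```python
import re

# Hypothetical pattern: a time token such as 0:00.0000001 or 2:30:15.5.
# Group 1 is the part the current ("core") format can represent.
_FRACTION = re.compile(r'(?<![\w.])(\d{1,2}(?::\d{2}){1,2})\.\d+')


def to_core_line(line: str) -> str:
    """Drop fractional seconds from the data part of one line.

    Text after the first '#' is treated as a comment and left untouched
    (an assumption of this sketch), as is all whitespace.
    """
    data, hash_mark, comment = line.partition('#')
    return _FRACTION.sub(r'\1', data) + hash_mark + comment


def to_core(text: str) -> str:
    """Filter a whole "advanced" file into its "core" equivalent."""
    return ''.join(to_core_line(l) for l in text.splitlines(keepends=True))
```

A real filter would also have to decide what to do with features that cannot
simply be truncated away, such as SAVE<0; this sketch does not attempt that.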

B. Add conditionals

Another way is to have just one set of files, but have well-defined
"conditionals" to enable new features. Here is an example, just for
illustration:

# @ IF FRACTIONAL
# @ Rule Arg 2007 only - Dec 30 0:00.0000001 1:00 S
# @ ELSE
Rule Arg 2007 only - Dec 30 0:00 1:00 S
# @ END

The key to having that work is that older implementations will just ignore
the # @... lines, and newer implementations that want to support the
features can use them.
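
As a rough sketch of what a newer implementation might do with such lines,
the Python below expands the conditionals before normal parsing. The "# @"
syntax is only the illustrative one from the example above, and the function
name expand_conditionals and the idea of passing a set of feature names are
assumptions of this sketch, not an existing zic interface.

```python
def expand_conditionals(lines, features):
    """Yield the lines a feature-aware reader would go on to parse.

    lines    -- iterable of input lines, each ending in a newline
    features -- set of feature names this reader supports (hypothetical)
    """
    prefix = "# @ "
    state = None      # None: outside a block; "if" or "else": inside one
    enabled = False   # whether the current block's feature is supported
    for line in lines:
        if line.startswith(prefix):
            body = line[len(prefix):]
            words = body.split()
            if state is None and len(words) == 2 and words[0] == "IF":
                state, enabled = "if", words[1] in features
            elif state == "if" and words == ["ELSE"]:
                state = "else"
            elif state is not None and words == ["END"]:
                state = None
            elif state == "if" and enabled:
                yield body   # un-comment the feature-specific line
            # any other "# @" line does not apply here and is skipped
        elif state == "else" and enabled:
            continue         # fallback superseded by the feature line
        else:
            yield line
```

Called with {"FRACTIONAL"} on the five-line example, this yields only the
fractional Rule line; called with an empty set, it yields only the plain Rule
line, which is also all that an older reader sees once it discards comments.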

Mark

On Tue, Feb 6, 2018 at 12:14 PM, Robert Elz <kre at munnari.oz.au> wrote:

>     Date:        Mon, 05 Feb 2018 23:57:59 -0500
>     From:        scs at eskimo.com (Steve Summit)
>     Message-ID:  <2018Feb05.2357.scs.0001 at quinine.home>
>
>   | In the beginning I would have thought that the project's product
>   | was the database *and* the reference code,
>
> The reference code is a side issue, useful to show how to use the
> data, and to assist in verifying its correctness, but it is the data that
> matters.  Note: the data, not the format in which it is expressed, that's
> an even smaller side issue.
>
> The part of all of this that is difficult (aside from attempting to make
> the code work on a zillion different systems in a sane way) is actually
> collecting and, as best as is possible, verifying the data.
>
> That is all that really matters.   All the rest is just frills, and as
> anyone is free to take the collected data and write it down in whatever
> format they like, no-one should really be overly concerned with the format
> in which we write it, nor how often that changes.
>
> Getting as much data as possible, as accurately as we can determine
> it (down to tiny fractions of a second, when it matters) is all that is
> really important.  When the format cannot represent the data, we change
> the format, never compromise the data.  The format can also change just
> because something new happens to be more convenient.
>
> kre
>
>