[tz] Formal models and the need for storing original source data

Sun Jun 21 00:05:34 UTC 2020

>
> ---------- Forwarded message ----------
> From: Robert Elz <kre at munnari.oz.au>
> To: Lester Caine <lester at lsces.uk>
> Cc: tz at iana.org
> Bcc:
> Date: Wed, 03 Jun 2020 19:31:43 +0700
> Subject: Re: [tz] How good is pre-1970 time zone history represented in TZ
> database?
>     Date:        Wed, 3 Jun 2020 12:22:32 +0100
>     From:        Lester Caine <lester at lsces.uk>
>     Message-ID:  <3576438c-e0bd-7188-2153-43d80e9054b8 at lsces.uk>
>
> There are two kinds of
> changes that may be made to a zone.   One affects only future timestamps,
> and is the common old garden variety "government interference" or however
> you think of it.  These are far and away the most common changes.  While
> when we have insufficient notification of a change, it may be applied
> retrospectively, when that has happened, everyone tends to be very aware
> that there were bad time conversions for a while.   In any case, old
> historic stored data isn't affected at all by this kind of change, it
> is as valid (or not) after the change as it was before.
>

You can also have data recorded in the past which refers to the future, so
even a change that affects only future timestamps can still invalidate
existing stored data.

> The second kind of change is a correction to historic data.  This happens
> when we discover an error in what was present (and these days, almost only
> ever affects pre-1970 timestamps).
>
> In those, if someone had stored the UTC converted form of some local
> timestamp, then after the correction they wouldn't get back the data that
> was originally used to produce it.
>
> The problem there is having discarded the original data instead of
> retaining it.   Always retain the original source data.   Then by all
> means, when computing, convert timestamps from their various local values
> to UTC so they can be more easily correctly ordered (or whatever) but use
> those converted values only for transient computations.   Store the
> original.  Always.
>
> If that is done correctly, then after a correction to old data, the
> results might be different than they were before - but that's only because
> they were wrong before, and (hopefully) better after the fix.
>

I agree 100%.  I've been struggling for years to really get my head wrapped
around this stuff, and finally decided to create a formal model using
abstract data types.  I'm finding that approach to be quite helpful, so I
wrote it up in a couple of essays that folks on this list might find
interesting.  In fact, the folks on this list may well be the *only* people
who would find them interesting.  :-)

The first is aimed at the general programmer, but it introduces the
fundamental concepts:
https://drive.google.com/file/d/1WntAyhIawYtbL2k3fPE71cM9EFkVCJGQ/view?usp=sharing

The second is aimed at time geeks and goes much deeper into the
underlying theory:
https://drive.google.com/file/d/1aOj9YeDFUST0lQFXZiUsCSmxjnzlbIe4/view?usp=sharing

These are still in draft form, but I hope others find some value in them.
(Teaser: the model makes no reference to years, days, hours, or seconds.
How's that for abstract?)

What is rather interesting (and reassuring) is that the formal model ends
up telling us pretty much what this thread has been saying:  store original
data.  However, by looking at things in terms of abstract data types, you
can easily see *why* you need to do that without having to resort to
specific examples, and many seemingly unrelated cases can be seen to be
special cases of the same underlying phenomenon.  The model also points you
to the very specific places in your design and code where you have to watch
out for problems.
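
To make the "store original data" point concrete, here is a minimal Python
sketch.  It is not taken from the essays.  The zone name "Example/City", the
two rule tables, and the helpers to_utc() and to_local() are all
hypothetical, standing in for two tzdata releases where the second corrects
a historic offset in the first.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical rule tables: zone name -> UTC offset in minutes.
# RULES_V2 stands in for a later release that corrects RULES_V1.
RULES_V1 = {"Example/City": 60}
RULES_V2 = {"Example/City": 90}


@dataclass(frozen=True)
class StoredEvent:
    """What gets persisted: the original reading plus its zone."""
    local: datetime  # naive wall-clock time, exactly as recorded
    zone: str        # something that can be mapped to a time zone


def to_utc(event, rules):
    """Derive UTC on demand; the result is for transient use only."""
    offset = timedelta(minutes=rules[event.zone])
    return (event.local - offset).replace(tzinfo=timezone.utc)


def to_local(utc, zone, rules):
    """Map a UTC instant back to naive wall-clock time in the zone."""
    return (utc + timedelta(minutes=rules[zone])).replace(tzinfo=None)


event = StoredEvent(datetime(1950, 6, 1, 12, 0), "Example/City")

# With the original stored, a rule correction just changes the derived value.
utc_v1 = to_utc(event, RULES_V1)
utc_v2 = to_utc(event, RULES_V2)

# Had only utc_v1 been stored, the corrected rules no longer give back
# the wall-clock time that was actually recorded.
lost = to_local(utc_v1, event.zone, RULES_V2)
print(event.local, utc_v1, utc_v2, lost)
```

The stored record never changes; only the derived UTC value does.  The last
line shows the failure mode from the quoted text: converting the old stored
UTC value back under the corrected rules yields a different local time than
the one originally recorded.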

> The only time it makes sense to store timestamps in other than the original
> form is when we *know* that the conversion is correct (and hence, no later
> correction will change it).   For users of tzdata that really only applies
> to post-1970 timestamps.
>

When exactly do we *know* anything, I mean really for sure?

Imagine yourself back in 1966.  UTC has been chugging along
nicely for five years, providing a well-behaved predictable time
standard.  We know it's never going to change.

Now it's 1967.  Oops, the name just changed.  Well, that doesn't
really count, a rose by any other name ...

Now it's 1972.  Oops again, we just added a leap second, and
there are more on the way.  Sorry unix time_t, your fundamental
assumption was just revoked -- and so for the next 50 years, nearly
every computer on the planet will stubbornly refuse to accept the
existence of leap seconds.
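
As a small illustration of that refusal, the following Python sketch looks
at the leap second inserted at the end of 2016-12-31.  The variable names
are mine, and the leap-second count is hard-coded as an assumption rather
than read from any table.

```python
from datetime import datetime, timezone

# The last second before and the first second after the leap second
# 2016-12-31 23:59:60 UTC.
before = datetime(2016, 12, 31, 23, 59, 59, tzinfo=timezone.utc)
after = datetime(2017, 1, 1, 0, 0, 0, tzinfo=timezone.utc)

# POSIX-style timestamps count every day as exactly 86400 seconds.
posix_delta = int(after.timestamp()) - int(before.timestamp())

# Hard-coded assumption for this sketch: one leap second was inserted
# between the two instants, so the elapsed SI time is one second longer.
LEAP_SECONDS_INSERTED = 1
si_delta = posix_delta + LEAP_SECONDS_INSERTED

# Python's datetime cannot even represent the leap second itself.
try:
    datetime(2016, 12, 31, 23, 59, 60, tzinfo=timezone.utc)
    leap_second_representable = True
except ValueError:
    leap_second_representable = False

print(posix_delta, si_delta, leap_second_representable)
```

The timestamp difference says one second elapsed, although two SI seconds
actually did.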

Never say never.

> ---------- Forwarded message ----------
> From: Robert Elz <kre at munnari.oz.au>
> To: Lester Caine <lester at lsces.uk>
> Cc: tz at iana.org
> Bcc:
> Date: Wed, 03 Jun 2020 23:48:42 +0700
> Subject: Re: [tz] How good is pre-1970 time zone history represented in TZ
> database?
>     Date:        Wed, 3 Jun 2020 16:02:45 +0100
>     From:        Lester Caine <lester at lsces.uk>
>     Message-ID:  <3b7f0c78-4dd7-0f2e-a8e3-2b24401e7e1c at lsces.uk>
>
>   | I came into this 20 years ago
>
> I've been involved with it for longer than that - back to my first
> unix experience, in '76, where the US tz rules were compiled into the
> code, and most people in AU simply adjusted their computer's clock
> (their offset from UTC) 4 times a year (when the US switched summer time
> on and off, and when AU did - and yes, that meant that the generated GMT
> timestamps were wrong, most of the time).   From that (a bit later) I was
> responsible for the mess that existed until ado invented tzdata (and yes,
> I mean the 2nd arg to gettimeofday()).
>

You make me look like a newbie.  I first got into this seriously in 1997
while updating legacy systems to handle Y2K.

> From all of this I have learned that time is hard.   Really hard.
>
> Many people believe that since they learned to tell the time when
> they were 4 or 5 years old, and have been doing it ever since, they
> know all there is to know.   That's sad...
>

See the discussion of seduction and polar bears in the first essay
linked above.

>   | now while working with a data archive
>   | which has now been simply dumped because we had no idea what rules were
>   | used to produce the normalised data.
>
> That's a pity, but sometimes past mistakes simply come back to bite,
> and sometimes bite hard.   Note that the error there was normalising the
> data, if that hadn't been done, none of the rest of it would matter,
> you'd now have the original data and could manipulate it however seems
> best, for now, regardless of what anyone did with it decades ago (and if
> you get it all wrong, future generations could cope, because they'd also
> still have the original data, and can fix any errors).
>
>   | Nowadays yes it does make sense to
>   | store both an original time and a normalised time,
>
> No, it doesn't, just the original, plus ...
>
>   | along with a location,
>
> yes, something which can be mapped into a timezone - and as accurate
> a location as possible.
>
>   | and a record of which version of rules was used to do the
>   | normalization.
>
> Don't care about that, since the result won't be being saved.
>
>   | Add to that a flag that indicates if the UTC time is fixed!
>
> If the UTC time is the authoritative one, that is what is stored.
> No need for extra flags.   Just the authoritative time - the one
> which defines whatever it is that is being recorded.
>

I second everything Robert Elz said here.

 - Michael Kenniston