[tz] OpenJDK/CLDR/ICU/Joda issues with Ireland change

Wed Jan 24 16:19:55 UTC 2018

  From: Stephen Colebourne <scolebourne at joda.org>
  Date: Tue, 23 Jan 2018 18:42:29 +0000
  Subject: Re: [tz] OpenJDK/CLDR/ICU/Joda issues with Ireland change

  | Java time-zone data is updated using the tzupdater tool
  |     [URL omitted here]
  | This will update the tzdb data, but not the CLDR-driven data that
  | drives the text.

That is most probably a mistake - the two should be linked, as it
is entirely possible that a zone might change its names (regardless
of issues of when transitions occur, or what, if anything, is
regarded as the "standard" time).

  | Were the change to proceed, anyone running tzupdater
  | with the Ireland change would invert the meaning of inDaylightTime()
  | and access the wrong array element in the CLDR-driven data - a bug.

Yes, it would be, and CLDR or java (whichever has the issue, or both)
should fix it.  And fix it soon.

  | And code changes don't help, as we'll see below.

Of course code changes help - there's a bug, fixing the bug will fix that.

And also of course, for people who don't update, the bug will continue to
appear - as for any other bug, or security vulnerability that is found and
fixed.  Nothing that we can do about that.  People who won't, or can't,
update get screwed by all kinds of things.

  | There is no possible fix to Java, as this is primarily an issue
  | between CLDR and TZDB. The two have a subtle API linkage which has
  | perhaps never been clearly spelled out here.

Yes, they do, that ought to be obvious - the linkage is not (or should
not be) subtle - it should be obvious.

  | CLDR provides textual names for time-zones, as an array [winter,
  | summer].

That itself is a bug.   It assumes there are just two (not counting
the "generic" name, mentioned in a later message from Yoshito Umaoka, which
is probably the most useful of the three anyway) - and there is
no guarantee that will remain (or even always has been) true.

There is nothing to stop some locality (probably one at a high latitude)
from deciding that they should advance the clocks in early spring, and
then advance them further in early-mid summer, returning to the intermediate
(or some other) value in late summer, and then to the original in late
autumn (or fall if autumn happens to be called that in the relevant
location).  What's more, they could give 4 different names to the 3 (or 4)
different offsets, perhaps "winter time" "spring time" "summer time" and
"autumn time" with 4 different abbreviations.

There could even be a mid winter fallback of even more, just as there
could be a mid summer skip forward of more.

Calling any of those offsets "standard" and the others something
different is really nonsense, though the jurisdiction (and people)
might pick that label - but when they do, we should all remember that
it is just another name.  One offset is not more blessed than any other
because it happens to be labelled as the "standard" time.  It might
be different if we defined "standard time" to be the nearest "natural"
offset based on lines of longitude - but with what resolution?  And how
would you apply that to China or India?   So we don't do that.  No-one does.

CLDR (and its clients) needs to be able to represent all this.  Tzdb can.
CLDR must also handle places which (given the durations of the two periods
that are common these days) decide that "standard" time should be the one
that applies for longer each year, and so should be the time in summer, and
in winter the clocks should be set backwards some number of minutes for a
few months, so it does not remain dark quite so late in the mornings
("darkness saving time" - aka DST).

  | As a much larger project with considerable history the order
  | of that array is not going to change.

More than that needs to change, the order is not, or should not be,
material.

Just accept it - the design is broken, and must be fixed.

  | (I'm using winter and summer for CLDR for this email to aid clarity,
  | they refer to them as standard and daylight).

Either way exposes the broken assumption that there are just two.

  | TZDB provides the offsets, SAVE values and a short text string. This
  | text string - GMT/IST or IST/GMT - is not directly linkable to the
  | data CLDR provides.

It probably should be.  That string, accompanied by the relevant time
(and perhaps the offset, though that is less needed, or useful),
should be the key to the translated strings.   But not as
indexes into an array, that's just plain stupid.  As database
keys (for "database" in the general sense, not implying anything SQL
based or similar).
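
To make "keys, not array indexes" concrete, here is a minimal sketch.
Everything in it is invented for illustration: the zone "Example/Zone",
the abbreviation "ABC", the display strings, and the helper names
(NAME_TABLE, localized_name).  It is not how CLDR stores anything; it only
shows a lookup keyed by zone, locale and abbreviation, with the instant
selecting among entries that have a validity period.

```python
from datetime import datetime, timezone
from typing import NamedTuple, Optional


class NameEntry(NamedTuple):
    valid_from: datetime   # inclusive, UTC
    valid_until: datetime  # exclusive, UTC
    display_name: str


def utc(year: int, month: int = 1, day: int = 1) -> datetime:
    return datetime(year, month, day, tzinfo=timezone.utc)


# Hypothetical data: the same abbreviation maps to different display
# names in different periods, so the key must include the time.
NAME_TABLE = {
    ("Example/Zone", "en", "ABC"): [
        NameEntry(utc(1970), utc(2000), "Old Example Time"),
        NameEntry(utc(2000), utc(9999), "New Example Time"),
    ],
}


def localized_name(zone: str, locale: str, abbreviation: str,
                   instant: datetime) -> Optional[str]:
    """Return the display name valid at `instant`, or None if unknown."""
    for entry in NAME_TABLE.get((zone, locale, abbreviation), []):
        if entry.valid_from <= instant < entry.valid_until:
            return entry.display_name
    return None
```

Nothing here cares how many names a zone has, or which of them anyone
chooses to call "standard".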

Alternatively, perhaps localized zoneinfo files should be used
instead, built from a modified zic, which embeds the localized
names (for some particular locality) with the raw data (probably
in a similar way to, or perhaps instead of, how the abbreviations are
handled now).

That would mean one set of zoneinfo files for each locality an
installation wants to support, but zoneinfo files are not really
all that big (and adding a few extra strings to them would not
make much difference) so this should not be seen as too much of
a drawback - then CLDR users would simply use those files instead
of the normal ones (if those even continue to exist on the system)
for all purposes.

This would obviously handle the problem of the two being updated
independently fairly easily.

It does mean that if the "normal" files continue to exist, because
CLDR and older applications both exist on the system, then the two
sets would need to be updated together.  This should not be a problem,
the update of one is simply not made available until both are ready.
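
That release rule is simple enough to sketch.  The names below (DataSet,
releasable) and the version labels are made up for the example; the only
point is that neither set is published unless both exist and carry the
same tzdb version.

```python
from typing import NamedTuple, Optional


class DataSet(NamedTuple):
    kind: str        # "zoneinfo" or "names" (labels invented here)
    tz_version: str  # the tzdb release the set was built from


def releasable(zoneinfo: Optional[DataSet],
               names: Optional[DataSet]) -> Optional[str]:
    """Return the version to publish, or None if the pair is not ready."""
    if zoneinfo is None or names is None:
        return None                      # one half is still missing
    if zoneinfo.tz_version != names.tz_version:
        return None                      # built from different releases
    return zoneinfo.tz_version


# "2018x" and "2018y" are placeholder version labels, not real releases.
ready = releasable(DataSet("zoneinfo", "2018x"), DataSet("names", "2018x"))
stale = releasable(DataSet("zoneinfo", "2018y"), DataSet("names", "2018x"))
```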

  | Although it may seem that you can use the text
  | from TZDB as a key to lookup the correct value in CLDR, I know from
  | painful experience that approach fails (as the TZDB text varies over
  | time,

Yes, and when it does the CLDR strings ("translations" into local formats)
   [ translations in quotes as I know that is not exactly what they are ]
may need to change as well.  There are multiple reasons why the TZDB names
might change, some are, frankly, silly, but others represent real changes
in what the local users call their times.  In some cases the CLDR strings
may have already matched local expectations, and nothing needs to alter,
but in others the locals' name might have changed (in their language,
as well as in English) and the CLDR strings need to be updated (augmented).

This is why the CLDR data should really be updated (if required) and
(always) transmitted whenever the tzdb (zoneinfo) data changes.

  | has the same text in winter and summer, or isn't even text).

I have no idea what the latter means - they are all text (we do not
define zone abbreviations as random binary), unless you mean the +04 types,
which are text, just text containing digits and +/- signs, rather than
only letters.

But you're right the "sometimes the same" (which is actually a very
sane choice) means that you cannot use the abbreviation alone to map.
However, the name, and the time to which it is being applied, is
enough (and perhaps, to avoid running that time through localtime()
or its equivalent again just to get the offset, the offset should be a
param as well.  We know localtime() must have been run already, or the
data currently used would not be available.)

  | Thus, the only reliable way to pick which piece of CLDR data is needed
  | is from the offsets.

Not even that alone, as the same offset can have different names during
different periods.   That (unlike some of my potential scenarios) has
actually been observed in the past, and CLDR needs to handle that as well.

It is simply untrue, and incorrect, to assume that if (in locality X)
times at offset N are called ABC and times at offset M are called DEF
today, then that was true last year.   The old and the new names need
to be available and applied to the appropriate times.  This is true just
as it is true that CLDR data is needed for more than calendaring
applications - what matters is not only when the next
meeting is scheduled (with the day and month, and timezone names
converted to the local correct forms).

  | For 20 years, this has been done in a simple and straightforward way -
  | if (raw-offset != actual-offset) then CLDR uses summer text and array
  | element 1.

So, for 20 years there has been a latent bug.   If for 20 years there
has been a latent bug that allows a security breach, are you going to
simply say "it has been there too long, we can't fix it now" ?

Really?

It makes no difference how old it is, a bug is a bug, and needs to be fixed.
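
For anyone who wants to see the latent bug rather than argue about it,
here is a small sketch of the quoted test.  NAMES and pick_name are
invented stand-ins, not CLDR or Java code, and the strings are
placeholders.  The clock offsets are identical in both encodings; only
the value called the "raw" offset differs.

```python
from datetime import timedelta

# Placeholder strings standing in for a two-element [winter, summer] array.
NAMES = ["winter name", "summer name"]


def pick_name(raw_offset: timedelta, actual_offset: timedelta) -> str:
    """The quoted heuristic: any difference from the raw offset is 'summer'."""
    return NAMES[1] if raw_offset != actual_offset else NAMES[0]


ZERO = timedelta(0)
HOUR = timedelta(hours=1)

# Positive SAVE encoding: raw offset 0, summer is raw + 1h.
old_winter = pick_name(raw_offset=ZERO, actual_offset=ZERO)
old_summer = pick_name(raw_offset=ZERO, actual_offset=HOUR)

# Negative SAVE encoding (the Ireland change): raw offset +1h,
# winter is raw - 1h.  Same clocks, different bookkeeping.
new_winter = pick_name(raw_offset=HOUR, actual_offset=ZERO)
new_summer = pick_name(raw_offset=HOUR, actual_offset=HOUR)

print(old_winter, "|", old_summer)  # winter name | summer name
print(new_winter, "|", new_summer)  # summer name | winter name
```

The second line of output is the inversion being complained about.  The
heuristic was only ever correct while the data happened to be written
one particular way.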

  | This provides the necessary glue to link the two projects:

It is the wrong glue.

  | TZDB has always had the raw and actual offsets

What on earth is the "raw offset"?

I somehow suspect that you (and perhaps CLDR in general) are reading
too much into the tzdb source files.

99.9999% of people (not being zic) should really be ignoring those
files, and everything they contain (the remaining percentage are the
people who maintain the data - all 10 or 20 or so of them in the world).

Everything else should be based upon the zoneinfo output files from
zic - and that has no notion of a "raw" offset at all, all that exists,
and all that you can ever assume, is that for some period of time
(of indefinite length, starting at arbitrary and often unpredictable
instants) a particular timezone will be at some offset from UTC.
It might also be associated with some name (in reality, many are not;
as Paul keeps pointing out, many of the abbreviated names that tzdb
contains were purely invented for tzdb, because the (US centric) UNIX
API/ABI required them - some of those are the ones being turned into
numeric offsets represented as text strings - it makes no difference
in the zone concerned, as there the time is just "the time", it has
no other name; we really should have no abbreviation at all, and CLDR
should have no translation of it).
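
As a rough model of that view (not the actual zoneinfo file layout, and
with an entirely hypothetical zone), a list of periods and a search is
all a consumer needs.  The data below uses the four-name scenario from
earlier in this message; the abbreviations WT, SPT, SUT and AT and the
helper names are invented.

```python
from bisect import bisect_right
from datetime import datetime, timedelta, timezone


def utc(year: int, month: int, day: int) -> datetime:
    return datetime(year, month, day, tzinfo=timezone.utc)


# Hypothetical zone: (start of period in UTC, offset from UTC, abbreviation).
# There is no "raw" offset and no standard/daylight flag, just periods.
PERIODS = [
    (utc(2018, 1, 1), timedelta(hours=1), "WT"),
    (utc(2018, 3, 1), timedelta(hours=2), "SPT"),
    (utc(2018, 6, 1), timedelta(hours=3), "SUT"),
    (utc(2018, 9, 1), timedelta(hours=2), "AT"),
    (utc(2018, 12, 1), timedelta(hours=1), "WT"),
]
STARTS = [start for start, _, _ in PERIODS]


def lookup(instant: datetime):
    """Return (offset, abbreviation) for the period containing `instant`."""
    index = bisect_right(STARTS, instant) - 1
    if index < 0:
        raise ValueError("instant precedes the first recorded period")
    _, offset, abbreviation = PERIODS[index]
    return offset, abbreviation
```

Note that lookup(utc(2018, 4, 15)) and lookup(utc(2018, 10, 1)) return
the same offset with different abbreviations, which is exactly the case
that neither a two-element array nor an offset comparison can represent.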

  | the same in winter and different in summer,

Once upon a time, the world was always flat, everyone knew that,
the pope even proclaimed it...

  | so this has always worked.

The latent bug was not exposed.  That is not "worked", it is
rather "managed to survive".

  | The Ireland proposal breaks this, with (raw-offset != actual-offset)
  | meaning winter, instead of summer. It is fair for TZDB to complain
  | that CLDR is inflexible with its definitions, but the reality is that
  | this was and is the only way to connect two separately developed
  | projects (where API stability is vital).

Nonsense.   It was just someone's idea of something they thought
would work, and which seemed to - but it was based upon unfounded
(and incorrect) assumptions about the nature of civil time, and how
it can be expected to work.

  | In order for TZDB and CLDR to co-exist, it is *required* that the raw
  | offset equals the actual offset in winter,

No, it is *required* for CLDR to be fixed.  What is happening now is
obviously incorrect.

  | This isn't a change that can be delayed for a year.

Oh good, so we can make it now?

  | This interpretation of inSummerTime() relies on positive SAVE values,

So, fix it.  It is broken.

  | is part of the public API of TZDB just as much as the source code file
  | format is.

If that's all, then we have no problem, as the source file format
should not be regarded as part of anything except the method by which
we happen to represent the data before zic converts it to zoneinfo.

The source format has changed, and will change again - that is guaranteed.

The zoneinfo format (in binary form, or converted to text) is designed
to be immune to all of the shenanigans that go on, and really is
what everyone should be using.   If anyone believes that they need
the source files for anything other than feeding to zic (or some
equivalent program for systems that cannot run it, if there are any)
then that almost guarantees that they are making some unsustainable
assumptions, which will, one day, be proven false.

We (of course) attempt to remain backward compatible, but as legislatures
(and the people under their governance) do weirder and weirder things, we
are likely to find that the current language is incapable of expressing
what needs to be expressed, and it will be extended.

I know there are others that read it, but this should be treated in
a similar way to the way that compilers treat programming language
specifications - when the language is extended (as happens to all that
are not dead) the compilers all need to be updated to deal with it.
Similarly, when tax legislation is amended (about the only thing that
changes even more frequently, and for less rational reasons, than
timezones) the accountants, and the software they use, need to be
updated to deal with that.

Updates/changes are simply a fact of life, there is nothing that is
guaranteed (not really even death or taxes) that we can promise will
never change.  Hopefully the zoneinfo format will not need much - though it
already has changed when 64 bit time support was added, and might need
more, if people dealing with 2038 issues find some innovative way to allow
32 bit timestamps to keep working, in some fashion, beyond 2038 in order
to retain compat with old databases that cannot be updated easily.

Everyone needs to remain aware of this.  Sticking our heads in the
sand and proclaiming "it always worked in the past, it must be made
to continue working in the future" is, frankly, absurd.

kre

ps: I am sure apologies will be needed, I have tried to find and
correct all my typos, but right now, my e-mail environment is
horribly challenged, and I have no way to rationally do the spell or
grammar checks I normally would (well sometimes) attempt.  So,
consider that for any unfound mistakes, apologies are tendered.