[tz] Tzdb and the Sunshine Protection Act

Fri Mar 3 22:47:52 UTC 2023

    Date:        Fri, 3 Mar 2023 10:11:12 -0800
    From:        Paul Eggert <eggert at cs.ucla.edu>
    Message-ID:  <5f1bb439-5100-7c0d-fc27-32ab2e5fee08 at cs.ucla.edu>

  | Thanks, I didn't know that. In other words, in the current POSIX 
  | standard TZ='EST5EDT' has unspecified behavior,

I think so, unless I missed something (I looked hard to find where it
said what happens, and found nothing, but not finding doesn't actually
mean not present, I might have been looking under the wrong bush).

Then the general rule is that if nothing is specified, it is unspecified
(the latter in the POSIX technical sense).   Note not undefined, something
semi-rational has to happen (not a core dump), the standard just doesn't
say what.

  | whereas in the draft 
  | next POSIX standard TZ='EST5EDT' has partly-specified behavior in that 
  | the implementation must only shuttle back and forth between standard 
  | time and DST via some schedule.

Yes, act as if the rule is there, somewhere, hidden and invisible,
and (ideally) actually specifying something sane.   The way tzcode/tzdata
handles this, as an (effective) alias for America/New_York is fine,
nothing says that the implementation defined rule needs to be as limited
as the POSIX string would be.

Since it is implementation defined (will be) the implementation needs to
say somewhere what happens, then its users will know if using a TZ spec
like that, with no attached rule, but with summer time specified to happen,
is adequate for their needs.   If so it makes sense to use it, as the
implementation is more likely to adapt to changes in the regulations than
the user's .profile file.

  | If I understand things correctly, the draft allows for more than two 
  | transitions per year, e.g., one for Ramadan and another for summer as 
  | Morocco used to do. (Or is this really required? could an implementation 
  | use permanent standard time? or permanent DST? it's not clear from the 
  | text you quoted.)

Nothing in POSIX (aside from the POSIX TZ string definition) defines what
a timezone can be, or what the rules are (C defines even less).   That's
hardly surprising, those things are defined by various national (or similar)
legislative or administrative bodies, and are completely out of POSIX's
sphere of influence.   The US Dept of Commerce (is it?) doesn't care if
it conforms to POSIX or not, it isn't seeking the magic badge of approval.

All that matters is that somehow there be a mechanism that will convert
a time_t into a struct tm, in some specified timezone, according to the
rules of that timezone, as set by whoever.   And vice versa.
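
As a rough illustration of that round trip (a sketch only, not how tzcode
or any libc does it), here it is in Python.   The fixed +07:00 zone is an
assumption standing in for "some specified timezone"; a real zone needs
rules, not a constant offset, and the helper names are made up.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a constant +07:00 offset stands in for a real timezone.
ZONE = timezone(timedelta(hours=7), "+07")

def to_broken_down(t, zone=ZONE):
    """time_t -> struct tm analogue (a time.struct_time) in the given zone."""
    return datetime.fromtimestamp(t, zone).timetuple()

def to_time_t(year, month, day, hour, minute, second, zone=ZONE):
    """Broken-down local time in the given zone -> time_t."""
    local = datetime(year, month, day, hour, minute, second, tzinfo=zone)
    return int(local.timestamp())

t = to_time_t(2023, 3, 4, 3, 52, 30)
tm = to_broken_down(t)
print(t, tm.tm_year, tm.tm_mon, tm.tm_mday, tm.tm_hour, tm.tm_min, tm.tm_sec)
```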

The closest POSIX comes (C has none of this) is that specification of
the TZ string format, which allows simple cases to be specified for
times for which the current rules are adequate (provided the needed
rules are simple enough).

Eg: someone's legislation might state that summer time begins at 01:00
on some particular Sunday (say the last Sunday of some month), with the
time skipping forward to 02:00.   That's all simple enough, and exactly
the kind of thing we're used to.   But let's suppose the legislation also
says "If the last Sunday of the month is the last day of the month, summer
time will instead start at 03:00 which becomes 04:00".   The rules in a
POSIX TZ string cannot handle that.   In tzdata (as I understand it) we
can't handle it either - other than by manually inserting a one off rule
every time the last Sunday of the month when summer time starts happens
to be the last day of that month, and then reverting to a LastSun 01:00
rule for the following years, until it happens again.   Once that's done,
everything will be fine, but there's nothing automated about it.
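
To see how often such a one-off rule would be needed, the made-up
legislation can be sketched in Python.   March is an assumption (the
example only says "some month"), and the function names are hypothetical.

```python
import calendar
from datetime import date

MONTH = 3  # hypothetical: the example above only says "some month"

def last_sunday(year, month=MONTH):
    """Date of the last Sunday of the given month."""
    last_day = calendar.monthrange(year, month)[1]
    d = date(year, month, last_day)
    # date.weekday(): Monday == 0 ... Sunday == 6
    return d.replace(day=last_day - (d.weekday() + 1) % 7)

def start_hour(year, month=MONTH):
    """01:00 normally, 03:00 when the last Sunday is the last day of the month."""
    last_day = calendar.monthrange(year, month)[1]
    return 3 if last_sunday(year, month).day == last_day else 1

# Years that would need a manually inserted one-off rule.
exceptions = [y for y in range(2013, 2031) if start_hour(y) == 3]
print(exceptions)
```

With March assumed, the exception turns up every five or six years, often
enough that someone has to remember to add the rule by hand.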

  | That could lead to problems, as Internet RFC 8536 relies on POSIX TZ 
  | format,

If it relies upon it by reference, then it should probably start being
updated to specify whatever it needs itself.   Just in case.   But there's
no hurry.  That would be a good thing to do in any case, relying upon
someone else not making a change which might break your usage doesn't
seem like the right thing to do to me.

Do note that it will certainly be at least a decade, more likely 2 or 3
decades, and perhaps even more, before this would actually happen - the
format needs to be marked obsolete first, and even that hasn't happened
yet.   If it doesn't happen in the next standard (expected next year now,
POSIX-2024 perhaps ... Issue 8 certainly) then it won't until (at least)
Issue 9, which (my guess would be) won't happen until the mid 2030's
at the earliest (more likely late 2030's).   Then (possibly) after having
been marked obsolete in Issue 9, it might be removed in the next one
(Issue 10, 2050's sometime perhaps) or the one after (Issue 11, late 2060's) ...

There is LOTS of time to get everything else in place before anything
changes here (and I am still just speculating that it ever will - it
simply seems like a logical future step to me).

  | and the format is embedded in the TZif files interpreted by tzcode

That's harmless - removing it from POSIX doesn't mean it must stop
working, even less that its use in TZif files needs to end.   If anything
just the contrary, if things are no longer constrained by the POSIX spec,
and if there's a need, that format could be extended to handle more than
two transitions per year, or rules that are more complex than the ones
POSIX allows to be specified (of course, you could do that anyway: update
the RFC, and TZif files don't need to be constrained by POSIX regardless
of what happens).

  | and by lots of other downstream code.

What kind?   I doubt that anything other than tzset() and related
stuff ever parses a TZ string contents, though I guess someone might
have written a TZ string -> what it means converter, to help users
get that right.

  | For example, on the Ubuntu 
  | workstation I'm typing this message on, /usr/share/zoneinfo/Europe/Paris 
  | contains the string 'CET-1CEST,M3.5.0,M10.5.0/3' and glibc uses this 
  | string to process future time stamps.

That's fine - but you don't need the definition in POSIX to do that.
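
For reference, the rule part of that string can be interpreted in a few
lines.   This is a Python sketch under stated assumptions: only the Mm.w.d
form with an optional whole-hour "/time" is handled, the default time is
taken to be 02:00, and the function names are invented for illustration.

```python
import calendar
from datetime import datetime

def parse_m_rule(rule):
    """Split 'Mm.w.d[/hh]' into (month, week, weekday, hour)."""
    spec, _, time_part = rule.partition("/")
    month, week, weekday = (int(x) for x in spec[1:].split("."))
    hour = int(time_part) if time_part else 2  # assumed default of 02:00
    return month, week, weekday, hour

def transition(year, rule):
    """Local wall-clock time at which the rule fires in the given year."""
    month, week, weekday, hour = parse_m_rule(rule)
    # The rule counts Sunday as day 0; datetime.weekday() counts Monday as 0.
    first = (datetime(year, month, 1).weekday() + 1) % 7
    day = 1 + (weekday - first) % 7 + 7 * (week - 1)
    days_in_month = calendar.monthrange(year, month)[1]
    while day > days_in_month:  # week 5 means "the last such day of the month"
        day -= 7
    return datetime(year, month, day, hour)

std_dst, start, end = "CET-1CEST,M3.5.0,M10.5.0/3".split(",")
print(std_dst, transition(2023, start), transition(2023, end))
```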

All that removing it does is tell users that they cannot necessarily
expect a string of that format to work.   It still might, as what will
be left, with that gone, is "If the TZ value starts with a ':' what it
means is implementation defined, if it doesn't then, by magic means
unspecified here (possibly the IANA tzdata database), the implementation
will discover from the value a means to convert between time_t and local
time".   If the implementation wants to keep parsing old style POSIX
strings (and for backwards compat for any users who use them, most will
I would guess), that is just fine.

As far as we're concerned here, nothing needs to change at all.

  | I suppose if POSIX stops specifying strings like this, we could move the 
  | spec to the successor of RFC 8536. But what would be the point?

As long as it remains in POSIX, users can keep insisting upon their
right to use those strings, and implementations (even ones not based
upon tzcode, which have no use for that nonsense at all) have to keep
supporting it.   This is exactly the same rationale by which lots of other
ancient crud has been retired from the standard over time.   POSIX
used to specify uucp ... it doesn't any more.   That doesn't mean that
an implementation cannot continue to support that, it just means that
users can't complain about a POSIX violation if an implementation chooses
not to.

Same here with POSIX TZ strings.

  | Every 
  | tzcode-like implementation would still need to parse such strings, and 
  | there seems little point to deprecating the exposure of that parser to 
  | the user.

For tzcode, perhaps - but nothing anywhere requires that only tzcode be
used to provide the translation service.   A different implementation of
a similar service, in a world where POSIX no longer specifies its TZ
string format, would not need to parse those things.  Why would it?

  | So it would make sense to keep it in POSIX, to support those use cases.

No, it doesn't.   It might make sense to keep it in the implementation
to support those cases, it doesn't need to be in the standard for that
to happen.

Yes, but you are misinterpreting what "std" is.   That is not the abbreviation
(or tzname, or whatever one wants to call it), it is the field of the TZ
string in which that name is specified.   If there are no quoting chars,
then it turns out the two are the same, the contents of the field (provided
it meets the other requirements) is the abbreviation.   If it is quoted, then
the charset restrictions are relaxed (not just alpha chars).

I read the sentence you quoted, as meaning "the abbreviations extracted
from the std and dst fields shall not include...".   It does say that
std and dst must be at least 3 bytes, but that is earlier, before it
starts on the format of those fields (which is irrelevant if they aren't
at least 3 bytes long).   That (at least 3 byte) std field might start
with a '<' and end with a '>' in which case the tzname (abbreviation,
whatever) is what is between (assuming correct syntax).

Note the length limit is (or will be) in bytes, not characters - at least
in the version that is coming, where an effort has been made to be more
careful about the difference between a byte and a character, and to use the
intended word in each case, not just assume they are the same thing.   Much
of old POSIX used the words interchangeably, preferring
"character" when text was being discussed, and "byte" when the contents
were arbitrary - so malloc(n) allocated space for n bytes, strlen(p)
returned the number of characters in p.   No more.

In my reading of this, the "std" string in the format is '<' 'Z' '>' which
is 3 bytes (all ASCII in this case).
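
That reading can be made concrete with a small Python sketch.   It encodes
the interpretation argued for here (field length counted with the angle
brackets included), which is an assumption, not settled POSIX; split_std
is a hypothetical helper, not a tzcode function.

```python
import re

# Quoted form: '<', then alphanumerics, '+' or '-', then '>'.
# Unquoted form: alphabetic characters only.
_STD = re.compile(r"<([A-Za-z0-9+-]+)>|([A-Za-z]+)")

def split_std(tz):
    """Return (std field, abbreviation, rest of string) for a POSIX-style TZ."""
    m = _STD.match(tz)
    if m is None:
        raise ValueError("no std field found in %r" % tz)
    field = m.group(0)
    abbreviation = m.group(1) if m.group(1) is not None else m.group(2)
    return field, abbreviation, tz[m.end():]

for tz in ("<Z>0", "<+07>-7", "EST5EDT"):
    field, abbr, rest = split_std(tz)
    # The length is counted in bytes of the field, not of the abbreviation.
    print(tz, field, len(field.encode("ascii")), abbr, rest)
```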

Beyond that, even if I'm wrong, POSIX has (for ages) permitted
implementation defined TZ specifications, beginning with ':' - those
specify no rules on the length, or character set, or existence, of
tzname abbreviations the way that the POSIX TZ string does.
Implementations are free to support anything they like, and users are
free to set TZ to any such string supported by the implementation.
Application code needs to learn to deal with that - pandering to broken
assumptions by insisting on pseudo-rules just because there's some
(unjustified) belief that this is the way it is supposed to be, doesn't
really help anyone.

  | This is not just a standard-lawyer quibble. Real-world software breaks 
  | if you set TZ='<Z>0'.

I consider glibc broken in that case.   On NetBSD, which uses tzcode
(in this area, probably much less modified than what glibc uses), I get:

jacaranda$ TZ='<Z>0' date
Fri Mar  3 20:45:25 Z 2023

That's as it should be.   What did glibc (or perhaps Ubuntu) do to things
to break that?   And why?

What does a pure (as distributed) tzcode version do in this case?

  | since POSIX doesn't say what to do with nonconforming strings like '<Z>0'.

Only if that is indeed non-conforming.   I can see another POSIX bug report
coming up, this area clearly needs more clarification.

Note that even should it be decided that this is indeed non-conforming,
an implementation can certainly support
	TZ=':<Z>0'
or even just
	TZ=:Z0
and set the abbreviation to Z and the offset to 0, and POSIX has no
rule against that at all.   Application code needs to learn to deal
with it.   "I've never seen that happen, so it must not be possible"
is a common, but bogus, argument.

Implementations are not required to support that, so applications cannot
depend upon using it - but implementations are allowed to support it,
and users of that implementation are allowed to use it, applications
running on that implementation must be able to deal with the consequences.

  | I suppose you're right about that, if it's merely an issue of conforming 
  | to POSIX, That is, in theory TZ='Europe/Paris' can use whatever time 
  | zone abbreviation we like (including the empty string, or a string 
  | containing newlines :-).

Yes, if it were merely a conformance issue... - though I haven't
checked to see if POSIX decided to impose any rules on what is allowed
in the tm_zone field of a struct tm.   That might limit things, if there
are any restrictions there.

glibc has obviously decided the empty string is OK, as that is what
the example you showed uses:

   $ TZ='<Z>0' date
   Fri Mar  3 17:49:16  2023

notice the two spaces between "16" and "2023".   The abbreviation
is inserted between those, and is clearly empty in this case.

  | Still, I hesitate to depart from the POSIX form, as too much software 
  | expects it.

We already made (and forced) a change, by sticking in +07 type abbreviations,
which are not the 3 or more alpha chars that used to be the norm (and even
longer ago, exactly 3 alpha chars, always).

What I get locally (now) (and I'd much prefer if ICT came back)

jacaranda$ date
Sat Mar  4 03:52:30 +07 2023

which is nothing like what used to be the normal format.   Applications
need to learn to tell the difference between what is guaranteed (in this
area, almost nothing) and what is commonly seen (which is irrelevant, unless
some code wants to optimise for that case, which would be reasonable).
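
For instance, code that picks the abbreviation out of date(1) output in
the default format shown in this message can avoid assuming "3 or more
alpha chars".   A Python sketch; the regular expression and function name
are illustrative assumptions, and only that one layout is handled.

```python
import re

# Matches the layout seen in this message: "Sat Mar  4 03:52:30 +07 2023".
# The abbreviation is whatever sits between the time and the year; it may
# be empty, may be short, and need not be alphabetic.
_DATE = re.compile(
    r"^(?P<wday>\w{3}) (?P<mon>\w{3}) (?P<mday>[ \d]\d) "
    r"(?P<time>\d\d:\d\d:\d\d) (?P<abbr>.*) (?P<year>\d{4})$"
)

def zone_abbreviation(line):
    """Extract the zone abbreviation, making no assumption about its form."""
    m = _DATE.match(line.strip())
    if m is None:
        raise ValueError("unrecognised date output: %r" % line)
    return m.group("abbr")

for line in ("Sat Mar  4 03:52:30 +07 2023",
             "Fri Mar  3 20:45:25 Z 2023",
             "Fri Mar  3 17:49:16  2023"):
    print(repr(zone_abbreviation(line)))
```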

  | I would not be surprised if we 
  | encountered similar problems with time zone abbreviations containing 
  | less than 3 characters,

I'd expect even more problems if the name doesn't appear at all.
But Ubuntu seems to be surviving that, so I suspect it would survive
shorter than 3 byte abbreviations as well.

  | for reasons similar to why Ubuntu 'date' does 
  | not do what you want with TZ='<Z>0' or with TZ='<ET>4'.

You didn't ever say what those reasons are, other than some desire to
conform to something I don't believe POSIX actually requires.  To be a
conforming POSIX TZ string, just perhaps, but nothing else gives any
guarantees, and no-one is required to (and few people do) use those strings.
Most code runs without TZ set at all - in that case it is clear that
there's no 3 byte limit on anything, as the spec for parsing TZ cannot
be relevant if there is no TZ set anywhere to parse.

What happens using glibc with TZ='<A>1' ?   (nb: I'm not sure that A,
as in the US Military timezone designated 'A', really is -0100, it might
be +0100, or something else entirely, that doesn't matter here)

What I see is:

jacaranda$ TZ='<A>1' date ; TZ='<Z>0' date
Fri Mar  3 19:58:25 A 2023
Fri Mar  3 20:58:25 Z 2023

What does Ubuntu (glibc) do in that case?

Again, what I see is almost certainly just what tzcode does, and it is almost
certainly correct.

I certainly see no application conformance benefit in the Ubuntu behaviour
you described - at least in the NetBSD (tzcode?) version, there is an
abbreviation present, it might be shorter than some software might expect
but it isn't absent.   glibc obviously isn't treating the TZ string as
garbage or we'd get something like:

jacaranda$ TZ=/---+99 date
Fri Mar  3 21:03:02 GMT 2023

with the fallback to GMT (or UTC perhaps) when the TZ string specifies
nothing meaningful at all.   It isn't doing that.   Since it did
seem to give UTC time (the '0') I'm assuming that it parsed the string,
and then simply decided to break things, because someone believes (incorrectly)
that POSIX requires otherwise (leading to unspecified behaviour, so you can
do what you like - but in that case, doing the reasonable thing, rather
than the vindictive one, seems more beneficial to me).

Further, as tzdata (or other implementations using the newly added TZ
specification type) and ':' TZ specs have no limits imposed on them like
those the POSIX TZ string imposes, applications need to deal with whatever
comes from them as well.

kre