[tz] Pre-1970 data

Stephen Colebourne scolebourne at joda.org
Mon Oct 18 13:07:39 UTC 2021


How can this project successfully handle pre-1970 data? Why have I
focussed so much on IDs and their value to tzdb?

Intro
-------
In previous threads I've focussed on IDs, and laid out different ways
to describe the groups of IDs we have. For now, let's focus solely on
the two main groups:
- region IDs, which represent abstract regions where clocks have been
the same since 1970
- non-region IDs, which represent locations where tzdb has, at some
point, added an ID
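
To make the definition of a region concrete, here is a minimal sketch of
how IDs collapse into regions once only post-1970 behaviour is compared.
The offset tables are simplified assumptions for illustration (year
granularity, no DST) and are not real tzdb data; the helper names
signature_since and group_ids are hypothetical.

```python
from collections import defaultdict

# Simplified, illustrative histories: (year the offset takes effect,
# UTC offset in seconds). DST and sub-year precision are ignored, so this
# is an assumption-laden toy table, not a substitute for tzdb.
HISTORY = {
    "Africa/Abidjan": [(1800, -968), (1912, 0)],
    "Atlantic/Reykjavik": [(1800, -5280), (1908, -3600), (1968, 0)],
    "Europe/Berlin": [(1800, 3208), (1893, 3600)],
}


def signature_since(history, cutoff_year):
    """Return the offsets in force from cutoff_year onwards, as a tuple."""
    signature = []
    for index, (start, offset) in enumerate(history):
        is_last = index + 1 == len(history)
        end = None if is_last else history[index + 1][0]
        if end is not None and end <= cutoff_year:
            continue  # this period ended before the cutoff, so ignore it
        signature.append((max(start, cutoff_year), offset))
    return tuple(signature)


def group_ids(histories, cutoff_year):
    """Group IDs whose clocks agree from cutoff_year onwards."""
    groups = defaultdict(list)
    for zone_id, history in sorted(histories.items()):
        groups[signature_since(history, cutoff_year)].append(zone_id)
    return sorted(groups.values())


print(group_ids(HISTORY, 1970))
# [['Africa/Abidjan', 'Atlantic/Reykjavik'], ['Europe/Berlin']]
print(group_ids(HISTORY, 1900))
# [['Africa/Abidjan'], ['Atlantic/Reykjavik'], ['Europe/Berlin']]
```

With a 1970 cutoff the two IDs fall into one region; move the cutoff
earlier and they separate again, which is why the choice of 1970 matters
so much to which locations keep their own data.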

The key observation is that the segregation between these two groups
of IDs *did not exist* until around 2014. It was only in 2018 that the
ISO country rule was removed. It has only been since 2014 or so that
IDs have been merged. What tzdb previously offered was a set of IDs,
based on a simple rule - "ID as needed for post-1970 data, with at
least one per ISO country". Full history was available for each of
these (whether accurate or not). What has happened since is a split,
where no split previously existed. A split which favours some
locations over others.

The recent mailing list debates merely represent the culmination of
this favouritism. The irony is that the merges are being done in the
name of equity and fairness, when the outcome has actually been the
exact opposite - picking favourites, and denigrating everywhere else.
That the approach to picking favourites is according to a standardised
largest city rule isn't really that relevant here - it is the outcome
that is unfair, not the process.

If region IDs were of different appearance (eg. numeric or textually
different) then the issues would not have arisen. The mistake was
taking a fully functional and fully integral set of IDs, and
bifurcating it into two groups. The split was actually a huge change
in the policy of tzdb, which has been added drip by drip, rather than
something that was ever fully appreciated up front.

FWIW, it is clear to me that there is an aspect of imposing a
US-centric timezone system on other parts of the world. The recent
tzdb approach of focussing entirely on timezone regions makes perfect
sense for the US, where region boundaries do not follow state lines,
and ordinary members of the public need to be aware of whether they
are in US/Mountain or US/Central. This simply isn't the timezone model
in many other parts of the world. In places like Europe and Asia, the
timezone is driven primarily by the country you live in - an ordinary
member of the public in Iceland is never going to associate with some
abstract timezone region stretching down the Atlantic that is not
named, not legally defined and is little more than a random outcome
based on tzdb's choice of 1970. Even in somewhere like Norway, an
ordinary member of the public will understand that although they
follow CET, their timezone is actually driven by their Government in
Oslo. The brilliance of the original rule - "ID as needed for
post-1970 data, with at least one per ISO country" - was that it
seamlessly handled *both* models of timezone in one unified set of
IDs. Removal of the ISO country part has completely destabilised that
balance.
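
As a rough illustration of the "at least one per ISO country" part of the
rule, the sketch below checks which countries are left without an ID that
carries its own history. The country table is a hypothetical excerpt in the
spirit of tzdb's zone.tab, the region_only set is an assumed outcome of
merging rather than a statement about any particular release, and the
function name is invented for this example.

```python
# Hypothetical excerpt of a country -> IDs table (ISO 3166 codes as keys).
COUNTRY_IDS = {
    "CI": ["Africa/Abidjan"],
    "DE": ["Europe/Berlin"],
    "IS": ["Atlantic/Reykjavik"],
    "NO": ["Europe/Oslo"],
}


def countries_without_own_id(country_ids, full_history_ids):
    """Return ISO codes for which no listed ID carries its own full history."""
    return sorted(
        code
        for code, ids in country_ids.items()
        if not any(zone_id in full_history_ids for zone_id in ids)
    )


# Original rule: every ID in the table keeps its own history.
all_ids = {"Africa/Abidjan", "Europe/Berlin", "Atlantic/Reykjavik", "Europe/Oslo"}
# Assumed region-only outcome: only the region representatives keep history.
region_only = {"Africa/Abidjan", "Europe/Berlin"}

print(countries_without_own_id(COUNTRY_IDS, all_ids))      # []
print(countries_without_own_id(COUNTRY_IDS, region_only))  # ['IS', 'NO']
```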

As a constraint to this thread, tzdb really needs to offer one
standard view of data, not command line flags that allow different
views. If downstream projects end up with different views of the data,
it makes tzdb a much less reliable source. (tzdb can be packaged in
different ways on the same machine; for example, it is undesirable for
Postgres' internal tzdb and the OS tzdb to diverge for the same
version). Given this, what needs to be nailed down is what is the
default data set that tzdb publishes - there isn't really much point
in talking about compile time flags, or that the contents of backzone
could be used by someone.


Let's look at seven options for pre-1970 data:

Pre-1970 data for regions only
-----------------------------------------
1) Pre-1970 data for regions only
- Despite looking identical to other IDs, region IDs are treated as
special/favoured
- Pre-1970 data for non-region IDs is of no importance whatsoever,
thus most get pre-1970 data from another country/continent
- The split between region and non-region locations is fully completed
- The US-centric timezone model is dominant
- An end user in Iceland is supposed to use Africa/Abidjan;
Atlantic/Reykjavik is treated as a historical mistake of tzdb, kept
around only for backwards compatibility

Pre-1970 data for most IDs
------------------------------------
2) Pre-1970 data for each ID meeting the rule "ID as needed for
post-1970 data, with at least one per ISO country"
- The split between region and non-region locations is healed
- High quality data from places like Iceland and Norway is retained,
but low quality data from elsewhere is restored
- The pre-1970 data is simply viewed as the best available data for
each location, which can be improved over time
- Most end-users in Iceland would expect tzdb to provide pre-1970 data
from Iceland, not the Ivory Coast

3) Pre-1970 data for all IDs that currently exist except true aliases
- This is very similar to #2, but would include something like
Montreal which was effectively mistakenly added to tzdb
- This doesn't seem as desirable as #2

4) Pre-1970 data for any ID where the pre-1970 data is high quality
- Subjective on quality, which doesn't seem like a great idea
- It does avoid bringing bad quality data back into the main tzdb distribution
- On balance, #2 has fewer places for debate to arise

5) Pre-1970 data for all IDs that currently exist except true aliases,
plus a *new* set of IDs representing regions
- As per #2, but adding IDs like "Region/12345" or "Region/Berlin"
- Regions should contain post-1970 data only, as the region is by
definition only meaningful post-1970
- It is not entirely clear what this solves over just going with #2
- This option is connected to Russ' proposals [1], although he
suggests a bigger split between timekeeping data and ID naming
- Perhaps it makes sense if the new region IDs were internal and not
normally seen by end users?

Remove pre-1970 data from general use
-------------------------------------------------------
6) No pre-1970 data whatsoever, all IDs are post-1970 only
- Project policy is that tzdb is focussed on post-1970 data only
- Paul repeatedly tells us that pre-1970 data is unreliable, and
people shouldn't use it
- Pre-1970 data would not be deleted, but would not be available in
most downstreams
- The split between region and non-region locations is effectively healed
- End users lose access to pre-1970 data, which is particularly
notable in some locations where that data is reliable, eg London
- It is unknown at this point what the user impact is of removing
pre-1970 data from major financial/business centres (no major location
has yet been merged)

7) No existing IDs get pre-1970 data, but a *new* set of IDs are
created containing it
- As per #6, the existing IDs get no pre-1970 data and the split
between region and non-region locations is healed
- New IDs, such as "Historic/Europe/London" or
"Historic/Africa/Abidjan" get created for each of the main IDs
- The historic IDs include pre-1970 data, the standard IDs do not
- There is a mechanical transformation between the historic and non-historic IDs
- Whether downstreams do or do not include the historic IDs is their
choice, potentially based on available space
- The data provided by the standard non-historic IDs remains the same
whether the downstream includes the historic IDs or not (a Good Thing)
- Users have to deliberately opt-in to get pre-1970 data, which might
make them think about the accuracy issue
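
The mechanical transformation in option 7 could be as simple as adding or
stripping a prefix. This is only a sketch under assumptions: the
"Historic/" prefix is taken from the examples above, while the function
names and the pick_id opt-in logic are hypothetical and not part of any
existing tzdb tooling.

```python
HISTORIC_PREFIX = "Historic/"  # prefix borrowed from the option 7 examples


def to_historic(zone_id):
    """Map a standard ID to its historic counterpart (idempotent)."""
    if zone_id.startswith(HISTORIC_PREFIX):
        return zone_id
    return HISTORIC_PREFIX + zone_id


def to_standard(zone_id):
    """Map a historic ID back to its standard ID (idempotent)."""
    if zone_id.startswith(HISTORIC_PREFIX):
        return zone_id[len(HISTORIC_PREFIX):]
    return zone_id


def pick_id(zone_id, want_pre_1970, shipped_ids):
    """Hypothetical opt-in: use the historic ID only if requested and shipped."""
    historic = to_historic(zone_id)
    if want_pre_1970 and historic in shipped_ids:
        return historic
    return to_standard(zone_id)


# A downstream that chose to ship the historic IDs for London only.
shipped = {"Europe/London", "Historic/Europe/London", "Africa/Abidjan"}
print(pick_id("Europe/London", True, shipped))    # Historic/Europe/London
print(pick_id("Europe/London", False, shipped))   # Europe/London
print(pick_id("Africa/Abidjan", True, shipped))   # Africa/Abidjan
```

Because the standard ID never changes meaning, a downstream that omits the
historic IDs simply falls back to the standard ones.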

Notes
--------
- I'm not discussing adding new IDs simply to represent locations
whose clocks differ only before 1970. I don't personally think that is
a job for tzdb, and even if it is, it is a job for a different thread.
- I'm not discussing what value is returned prior to 1970 when
pre-1970 data is removed. That would be a job for a different thread.
- Link vs Zone is not important for this discussion.

Summary
-------------
After a few weeks of thinking, these are the options I've come up
with. Feel free to suggest another option or variant if you think I've
missed anything obvious.

I believe it was a disaster that the brilliant "ID as needed for
post-1970 data, with at least one per ISO country" rule was removed.
It has created needless division, bifurcating a unified set of IDs into
region and non-region IDs and creating backwards compatibility issues
in the data of many locations. IMO, there are two basic models of
timezones in the world, and moving tzdb from one that supported both
to one that only supports the US-centric model is simply a mistake
that needs correcting.

As such, my preference would be to adopt option 2. Option 6 or 7 could
work, and might be the best choices if we had a clean slate or if
there was some hidden pressure that the list is unaware of to remove
pre-1970 data, but they are risky options given we do not know the
impact on end-users in major financial/business centres.

Stephen

[1] https://mm.icann.org/pipermail/tz/2021-September/030518.html

