[tz] Pre-1970 data

Tue Nov 2 07:09:26 UTC 2021

Thanks Brian and Adhemar for your thoughts. Does anyone else want to
chime in on the best way to move forward?
Thanks,
Stephen

On Fri, 22 Oct 2021 at 00:59, Brian Park <brian at xparks.net> wrote:
>
> This discussion seems to have settled down, so here are my thoughts:
>
> 1) I would like to commend Paul Eggert for handling this debate as graciously as possible under the circumstances.
>
> 2) I would also like to commend Stephen Colebourne for his persistence in raising the problem of losing the pre-1970 data, and the problem of merging zone IDs from seemingly unrelated regions and political entities. (Although I think the discussion about replacing the TZDB Coordinator was perhaps not helpful, since no one else seemed to want the job.)
>
> 3) In an ideal world, I think the decision to use or not use the pre-1970 data ought to be made by the end-user or the end-developer, not by the timezone library maintainer or the OS maintainer. Even if some of the pre-1970 data is "low quality", if that information is not available anywhere else, TZ DB should make it readily accessible to the end-users.
>
> 4) As I recall, Paul gave 2 main reasons for moving the pre-1970 data to backzone and combining post-1970 IDs: (a) fairness, and (b) the maintenance burden.  (a) The fairness argument has not made sense to me, even after seeing it multiple times. Personally, I think it is ok that some countries have better data than others, as long as it was not caused by malicious intent. (b) The maintenance burden argument is more compelling. If Paul Eggert is the only one willing to maintain this data, and if he finds it burdensome, then it is what it is.
>
> 5) With regard to the specific options listed in Stephen's email, I find Option 2 to be compelling ("ID as needed for post-1970 data, with at least one per ISO country"), because I think most end-users and end-developers understand timezones in this way: their country's political system determines the rules for the timezone(s) in their country. It seems to me that the concept of timezones is inherently a political creation, not a technical one. Various posts on this list about how we should "avoid politics" have not made sense to me.
>
> 6) I also find Option 7 to be interesting ("No existing IDs get pre-1970 data, but a *new* set of IDs are created containing it"). I offer a slight variation: What if we placed the "Historic" part at the end of the ID path, such as "Europe/London/Historic" and "Africa/Abidjan/Historic"? Then a timezone library that does not have the Historic pre-1970 database installed can fall back to the closest matching timezone, defaulting to "Europe/London" and "Africa/Abidjan" instead.
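
The fallback described in point 6 could look roughly like the sketch below. It is only an illustration under stated assumptions: the "/Historic" suffix comes from the examples in the message, while the function name `resolve_zone_id` and the idea of passing the installed IDs as a set are hypothetical and not part of tzdb or any existing library.

```python
from typing import AbstractSet

HISTORIC_SUFFIX = "/Historic"  # suffix taken from the examples in point 6


def resolve_zone_id(requested: str, installed: AbstractSet[str]) -> str:
    """Return the ID to load: the requested one if installed, else its base ID.

    Hypothetical helper: `installed` stands for whatever set of IDs a
    library happens to ship with.
    """
    if requested in installed:
        return requested
    if requested.endswith(HISTORIC_SUFFIX):
        # Drop the trailing "/Historic" to find the closest matching ID.
        base = requested[: -len(HISTORIC_SUFFIX)]
        if base in installed:
            return base
    raise KeyError(f"unknown time zone ID: {requested}")


# A library shipped without the historic database falls back to the base ID.
without_historic = {"Europe/London", "Africa/Abidjan"}
print(resolve_zone_id("Europe/London/Historic", without_historic))
```

With the historic database installed, the same call would return "Europe/London/Historic" unchanged, so callers would not need to know which data set is present.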
>
> 7) As a maintainer of an independent timezone library, I would like to request that the "API" into the TZDB project be the raw files themselves (e.g. africa, europe, northamerica, etc), instead of the TZif files or the Makefile. My library uses its own TZDB parser, and its own binary representation instead of TZif, and does not use zic, zdump, or the provided Makefile. I believe there are other major 3rd party libraries which have their own parsers and binary representation formats: Joda-Time, Java java.time, C++20/Hinnant date, and Noda Time.
>
> 8) If the only way for end-users to have access to the pre-1970 data is through a fork of TZDB, then it is not ideal, but I don't think it's the end of the world. Different libraries may choose to use different databases, and users will have to deal with mismatching timezone identifiers and differing DST transition rules. But it seems that end-users and end-developers are forced to deal with those issues right now anyway, since different libraries are packaged with different versions of the TZDB, and different OSes have different update schedules.
>
> Brian
>
> On Mon, Oct 18, 2021 at 6:08 AM Stephen Colebourne via tz <tz at iana.org> wrote:
>>
>> How can this project successfully handle pre-1970 data? Why have I
>> focussed so much on IDs and their value to tzdb?
>>
>> Intro
>> -------
>> In previous threads I've focussed on IDs, and laid out different ways
>> to describe the groups of IDs we have. For now, let's focus solely on
>> the two main groups:
>> - region IDs, which represent abstract regions where clocks have been
>> the same since 1970
>> - non-region IDs, which represent locations where tzdb has, at some
>> point, added an ID
>>
>> The key observation is that the segregation between these two groups
>> of IDs *did not exist* until around 2014. It was only in 2018 that the
>> ISO country rule was removed. It has only been since 2014 or so that
>> IDs have been merged. What tzdb previously offered was a set of IDs,
>> based on a simple rule - "ID as needed for post-1970 data, with at
>> least one per ISO country". Full history was available for each of
>> these (whether accurate or not). What has happened since is a split,
>> where no split previously existed. A split which favours some
>> locations over others.
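
One way to read the difference between the old rule and a regions-only grouping is as two different grouping keys. The sketch below is an assumption-laden illustration, not how tzdb is built: the `ZONES` table, its behaviour labels, and both function names are hypothetical, and the alphabetical tie-break stands in for tzdb's actual choice of representative ID.

```python
# Hypothetical, simplified input: ID -> (ISO country code, label for the
# post-1970 clock behaviour). The labels are illustrative, not tzdb data.
ZONES = {
    "Africa/Abidjan": ("CI", "UTC+0, no DST"),
    "Atlantic/Reykjavik": ("IS", "UTC+0, no DST"),
    "Europe/Berlin": ("DE", "CET/CEST"),
    "Europe/Oslo": ("NO", "CET/CEST"),
}


def ids_by_country_rule(zones):
    """Keep one ID per distinct post-1970 behaviour *within each country*."""
    chosen = {}
    for zone_id in sorted(zones):
        country, behaviour = zones[zone_id]
        chosen.setdefault((country, behaviour), zone_id)
    return sorted(chosen.values())


def ids_by_region_only(zones):
    """Keep one ID per distinct post-1970 behaviour, ignoring countries."""
    chosen = {}
    for zone_id in sorted(zones):
        _country, behaviour = zones[zone_id]
        # Alphabetical order is an arbitrary tie-break for this sketch.
        chosen.setdefault(behaviour, zone_id)
    return sorted(chosen.values())


print(ids_by_country_rule(ZONES))  # every country keeps its own ID
print(ids_by_region_only(ZONES))   # countries sharing clocks share one ID
```

Under the first grouping every ISO country in the table keeps at least one ID; under the second, countries whose clocks agree since 1970 collapse into a single representative.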
>>
>> The recent mailing list debates merely represent the culmination of
>> this favouritism. The irony is that the merges are being done in the
>> name of equity and fairness, when the outcome has actually been the
>> exact opposite - picking favourites, and denigrating everywhere else.
>> That the approach to picking favourites is according to a standardised
>> largest city rule isn't really that relevant here - it is the outcome
>> that is unfair, not the process.
>>
>> If region IDs were of different appearance (eg. numeric or textually
>> different) then the issues would not have arisen. The mistake was
>> taking a fully functional and fully integral set of IDs, and
>> bifurcating it into two groups. The split was actually a huge change
>> in the policy of tzdb, which has been added drip by drip, rather than
>> something that was ever fully appreciated up front.
>>
>> FWIW, it is clear to me that there is an aspect of imposing a
>> US-centric timezone system on other parts of the world. The recent
>> tzdb approach of focussing entirely on timezone regions makes perfect
>> sense for the US, where region boundaries do not follow state lines,
>> and ordinary members of the public need to be aware of whether they
>> are in US/Mountain or US/Central. This simply isn't the timezone model
>> in many other parts of the world. In places like Europe and Asia, the
>> timezone is driven primarily by the country you live in - an ordinary
>> member of the public in Iceland is never going to associate with some
>> abstract timezone region stretching down the Atlantic that is not
>> named, not legally defined and is little more than a random outcome
>> based on tzdb's choice of 1970. Even in somewhere like Norway, an
>> ordinary member of the public will understand that although they
>> follow CET, their timezone is actually driven by their Government in
>> Oslo. The brilliance of the original rule - "ID as needed for
>> post-1970 data, with at least one per ISO country" - was that it
>> seamlessly handled *both* models of timezone in one unified set of
>> IDs. Removal of the ISO country part has completely destabilised that
>> balance.
>>
>> As a constraint to this thread, tzdb really needs to offer one
>> standard view of data, not command line flags that allow different
>> views. If downstream projects end up with different views of the data,
>> it makes tzdb a much less reliable source. (tzdb can be packaged in
>> different ways on the same machine; for example, it is undesirable for
>> Postgres' internal tzdb and the OS tzdb to diverge for the same
>> version.) Given this, what needs to be nailed down is what is the
>> default data set that tzdb publishes - there isn't really much point
>> in talking about compile time flags, or that the contents of backzone
>> could be used by someone.
>>
>>
>> Let's look at seven options for pre-1970 data:
>>
>> Pre-1970 data for regions only
>> -----------------------------------------
>> 1) Pre-1970 data for regions only
>> - Despite looking identical to other IDs, region IDs are treated as
>> special/favoured
>> - Pre-1970 data for non-region IDs is of no importance whatsoever,
>> thus most get pre-1970 data from another country/continent
>> - The split between region and non-region locations is fully completed;
>> the US-centric timezone model is dominant
>> - An end user in Iceland is supposed to use Africa/Abidjan;
>> Atlantic/Reykjavik is treated as a historical mistake of tzdb, kept
>> around only for backwards compatibility
>>
>> Pre-1970 data for most IDs
>> ------------------------------------
>> 2) Pre-1970 data for each ID meeting the rule "ID as needed for
>> post-1970 data, with at least one per ISO country"
>> - The split between region and non-region locations is healed
>> - High quality data from places like Iceland and Norway is retained,
>> but low quality data from elsewhere is restored
>> - The pre-1970 data is simply viewed as the best available data for
>> each location, which can be improved over time
>> - Most end-users in Iceland would expect tzdb to provide pre-1970 data
>> from Iceland, not the Ivory Coast
>>
>> 3) Pre-1970 data for all IDs that currently exist except true aliases
>> - This is very similar to #2, but would include something like
>> Montreal which was effectively mistakenly added to tzdb
>> - This doesn't seem as desirable as #2
>>
>> 4) Pre-1970 data for any ID where the pre-1970 data is high quality
>> - Subjective on quality, which doesn't seem like a great idea
>> - It does avoid bringing bad quality data back into the main tzdb distribution
>> - On balance, #2 has fewer places for debate to arise
>>
>> 5) Pre-1970 data for all IDs that currently exist except true aliases,
>> plus a *new* set of IDs representing regions
>> - As per #2, but adding IDs like "Region/12345" or "Region/Berlin"
>> - Regions should contain post-1970 data only, as the region is by
>> definition only meaningful post-1970
>> - It is not entirely clear what this solves over just going with #2
>> - This option is connected to Russ' proposals [1], although he
>> suggests a bigger split between timekeeping data and ID naming
>> - Perhaps it makes sense if the new region IDs were internal and not
>> normally seen by end users?
>>
>> Remove pre-1970 data from general use
>> -------------------------------------------------------
>> 6) No pre-1970 data whatsoever, all IDs are post-1970 only
>> - Project policy is that tzdb is focussed on post-1970 data only
>> - Paul repeatedly tells us that pre-1970 data is unreliable, and
>> people shouldn't use it
>> - Pre-1970 data would not be deleted, but would not be available in
>> most downstreams
>> - The split between region and non-region locations is effectively healed
>> - End users lose access to pre-1970 data, which is particularly
>> notable in some locations where that data is reliable, eg London
>> - It is unknown at this point what the user impact is of removing
>> pre-1970 data from major financial/business centres (no major location
>> has yet been merged)
>>
>> 7) No existing IDs get pre-1970 data, but a *new* set of IDs are
>> created containing it
>> - As per #6, the existing IDs get no pre-1970 data and the split
>> between region and non-region locations is healed
>> - New IDs, such as "Historic/Europe/London" or
>> "Historic/Africa/Abidjan" get created for each of the main IDs
>> - The historic IDs include pre-1970 data, the standard IDs do not
>> - There is a mechanical transformation between the historic and non-historic IDs
>> - Whether downstreams do or do not include the historic IDs is their
>> choice, potentially based on available space
>> - The data provided by the standard non-historic IDs remains the same
>> whether the downstream includes the historic IDs or not (a Good Thing)
>> - Users have to deliberately opt-in to get pre-1970 data, which might
>> make them think about the accuracy issue
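
The "mechanical transformation" in option 7 could be as small as adding or stripping a fixed prefix. The following is a minimal sketch under that assumption: the "Historic/" prefix is taken from the examples in the option, while the function names `to_historic` and `to_standard` are hypothetical.

```python
HISTORIC_PREFIX = "Historic/"  # prefix taken from the examples in option 7


def to_historic(zone_id: str) -> str:
    """Map a standard ID to its historic counterpart (idempotent)."""
    if zone_id.startswith(HISTORIC_PREFIX):
        return zone_id
    return HISTORIC_PREFIX + zone_id


def to_standard(zone_id: str) -> str:
    """Map a historic ID back to the standard ID (idempotent)."""
    if zone_id.startswith(HISTORIC_PREFIX):
        return zone_id[len(HISTORIC_PREFIX):]
    return zone_id


print(to_historic("Europe/London"))
print(to_standard("Historic/Africa/Abidjan"))
```

Because the mapping is a pure string operation, a downstream that omits the historic IDs could still recognise one and decide for itself whether to reject it or serve the standard ID instead.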
>>
>> Notes
>> --------
>> - I'm not discussing adding new IDs simply to represent locations
>> whose clocks differ only before 1970. I don't personally think that is
>> a job for tzdb, and even if it is, it is a job for a different thread.
>> - I'm not discussing what value is returned prior to 1970 when
>> pre-1970 data is removed. That would be a job for a different thread.
>> - Link vs Zone is not important for this discussion.
>>
>> Summary
>> -------------
>> After a few weeks of thinking, these are the options I've come up
>> with. Feel free to suggest another option or variant if you think I've
>> missed anything obvious.
>>
>> I believe that it was a disaster that the brilliant "ID as needed for
>> post-1970 data, with at least one per ISO country" rule was removed.
>> It has created needless division, bifurcating a unified set of IDs into
>> region and non-region IDs and creating backwards compatibility issues
>> in the data of many locations. IMO, there are two basic models of
>> timezones in the world, and moving tzdb from one that supported both
>> to one that only supports the US-centric model is simply a mistake
>> that needs correcting.
>>
>> As such, my preference would be to adopt option 2. Option 6 or 7 could
>> work, and might be the best choices if we had a clean slate or if
>> there was some hidden pressure that the list is unaware of to remove
>> pre-1970 data, but they are risky options given we do not know the
>> impact on end-users in major financial/business centres.
>>
>> Stephen
>>
>> [1] https://mm.icann.org/pipermail/tz/2021-September/030518.html