[tz] the tzdb information schema

Fri Mar 25 20:36:27 UTC 2022

    Some recent difficulties with tzdb updates have led me to consider
    how they could have been avoided.

    • The fact that two timezones agree since 1970 is not represented
      independently from the data of each timezone. This leads to
      "update anomalies" well-known from database theory.
         For instance, before Africa/Khartoum switched to UTC + 02 h in 2017,
         Africa/Juba was a link (in the file africa) to Africa/Khartoum.
         After the switch, new Zone data for Africa/Juba had to appear in
         the file africa. Actually, Zone data for Africa/Juba had survived
         in the file backzone but this was overlooked, so that we had
         Zone data for Africa/Juba in both files africa and backzone
         from 2017c until 2018b, and these Zone data did not even agree.
         And all this happened for Africa/Juba even though civil time
         in Juba was not changed at all at the time.
      It appears that this error would not have happened if all African
      Zone data had been kept in the file africa (rather than in africa
      and in backzone) -- it was not the Zone data of Africa/Juba that
      had changed, but only the fact that Africa/Juba was merged with
      Africa/Khartoum because their Zone data agreed since 1970.
      Unfortunately, this fact was not (and still is not) representable
      as an independently changeable item in tzdb.

    • Most of the Zone data currently contained in the file backzone have
      in earlier versions been stored in the "continental" files (africa,
      antarctica, asia, australasia, europe, northamerica, southamerica),
      and the reason why they are in backzone is that their Zone data
      agree with other Zone data since 1970. And since this may well
      change in the future, we have to keep these Zone data in backzone
      current. The file backzone is an integral part of the tzdb data,
      not just a container for additional data of lesser quality.
         The quality of any Zone data (and any Rule data) in tzdb should
         always be "to the best of our current knowledge" -- it just does
         not make sense to keep Zone data in tzdb that are not updated
         when we acquire relevant new information for them.
      Thus we should get rid of Zone data for Argentina/Rosario etc.
      (or else update them); keeping data that are known to be wrong
      is not only useless, it is an invitation for consequential errors.

    • Letting derived data (such as whether two timezones should be merged
      because they agree since 1970) determine the storage location
      of the Zone data for a timezone not only implies unnecessary
      data moves upon data updates, it may also disrupt commentary text
      and its references. Easily understandable comments (such as which
      facts were deduced from which document) are crucial for later updates,
      where the effect of a newly found document has to be determined,
      often after several years, and likely by a different contributor.

    • The fact that two timezones agree since 1970 has nothing to do
      with the fact that some timezones have changed their names, with the
      old names being kept as Links to the new names. Currently, however,
      Links representing one or both facts are kept in the same file backward,
      and cannot be distinguished. This leads to information loss and update
      anomalies:
         Currently, the file backward has a Link from Africa/Asmera to
         Africa/Nairobi -- the information that Asmera is an
         outdated spelling of Asmara can only be found in the file
         backzone.
         The name America/Virgin had been replaced by America/St_Thomas
         in version 95k, and this fact could be seen in the file backward
         until 2021a when this information was lost. It reappeared
         (in backzone) only in version 2021c.
      Keeping one type of information (spelling changes) in different
      locations (files backward or backzone) depending on an independent
      condition (that may even change over time) certainly causes
      unnecessary maintenance effort.
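
    To illustrate the last point, here is a minimal Python sketch of how
    the two kinds of Links could be told apart. It is not a proposal for
    the actual tzdb file syntax: the names LinkKind, Link and resolve are
    hypothetical, and the merge of Africa/Asmara with Africa/Nairobi is
    inferred from the example above rather than copied from any file.

```python
from dataclasses import dataclass
from enum import Enum


class LinkKind(Enum):
    RENAME = "rename"  # old name or old spelling of the same timezone
    MERGE = "merge"    # different timezone whose data agree since 1970


@dataclass(frozen=True)
class Link:
    alias: str
    target: str
    kind: LinkKind


# Hypothetical records built from the examples in the text above.
LINKS = [
    Link("Africa/Asmera", "Africa/Asmara", LinkKind.RENAME),
    Link("Africa/Asmara", "Africa/Nairobi", LinkKind.MERGE),
    Link("America/Virgin", "America/St_Thomas", LinkKind.RENAME),
]


def resolve(name, links, kinds):
    """Follow links of the given kinds until no further link applies."""
    table = {link.alias: link.target for link in links if link.kind in kinds}
    seen = set()  # guards against accidental cycles in the link table
    while name in table and name not in seen:
        seen.add(name)
        name = table[name]
    return name


# Following only renames yields the current spelling of the name.
current_name = resolve("Africa/Asmera", LINKS, {LinkKind.RENAME})

# Following both kinds yields the timezone whose data are actually used.
data_source = resolve("Africa/Asmera", LINKS, set(LinkKind))
```

    With such a split, dropping or adding a MERGE record (because two
    timezones stop or start agreeing since 1970) would leave every RENAME
    record, and hence the spelling history, untouched.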

    While some of these points may sound like theoretical claims for
    normal forms as taught in computer science, my point here is only
    practical simplicity: each basic fact to be recorded in tzdb should
    have its obvious place where it is stored and where it can be looked
    up and updated; and updates of independent facts should be possible
    without mutual side effects. This appears to be a necessity for a
    collaborative project.
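
    As a rough sketch of what "no mutual side effects" could mean for the
    Juba/Khartoum case, consider the following Python model. Everything in
    it is an assumption made for illustration: the class TzStore, its
    method names, and the placeholder strings standing in for real Zone
    data are hypothetical and do not describe the existing tzdb tools.

```python
from dataclasses import dataclass, field


@dataclass
class TzStore:
    # Fact 1: the full recorded history of each timezone
    # (opaque placeholder strings stand in for real Zone data).
    zone_data: dict = field(default_factory=dict)
    # Fact 2: which timezones are currently merged into another one
    # because their data agree since 1970.
    merged_into: dict = field(default_factory=dict)

    def data_in_use(self, name):
        """Return the data a 'since 1970' build would use for name."""
        return self.zone_data[self.merged_into.get(name, name)]

    def update_zone(self, name, data):
        self.zone_data[name] = data        # touches fact 1 only

    def unmerge(self, name):
        self.merged_into.pop(name, None)   # touches fact 2 only


store = TzStore(
    zone_data={"Africa/Khartoum": "khartoum-v1", "Africa/Juba": "juba-v1"},
    merged_into={"Africa/Juba": "Africa/Khartoum"},
)
before = store.data_in_use("Africa/Juba")

# Khartoum changes its civil time: update its data and drop the merge.
store.update_zone("Africa/Khartoum", "khartoum-v2")
store.unmerge("Africa/Juba")
after = store.data_in_use("Africa/Juba")
```

    In this model the data for Africa/Juba are never moved or re-entered
    when the merge is dropped; only the merge record changes, so there is
    no second copy that could be overlooked or disagree.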

    Last Saturday, Paul Eggert very nicely summarized the history and
    some of the guiding principles of tzdb. It is largely due to his
    immense work on the maintenance and evolution of tzdb that the
    tzdb system was such a tremendous success in its first 30 years.

    As a means for the success over the next 30 years, I propose a
    simplification of the tzdb schema, so as to avoid the update anomalies
    described above, and thus decrease the maintenance burden, currently
    mainly shouldered by Paul. The information schema used in the fork
    produced by Stephen Colebourne is already much simpler, and it is
    apparently what is needed by several power users contributing
    to the widespread success of tzdb.

    Michael Deckers.