[tz] the tzdb information schema
Michael H Deckers
michael.h.deckers at googlemail.com
Fri Mar 25 20:36:27 UTC 2022
Some recent difficulties with tzdb updates have led me to consider
how they could have been avoided.
• The fact that two timezones agree since 1970 is not represented
independently from the data of each timezone. This leads to
"update anomalies" well known from database theory.
For instance, before Africa/Khartoum switched to UTC + 02 h in 2017,
Africa/Juba was a link (in the file africa) to Africa/Khartoum.
After the switch, new Zone data for Africa/Juba had to appear in
the file africa. Actually, Zone data for Africa/Juba had survived
in the file backzone but this was overlooked, so that we had
Zone data for Africa/Juba in both files africa and backzone
from 2017c until 2018b, and these Zone data did not even agree.
And all this happened for Africa/Juba even though civil time
in Juba was not changed at all at the time.
It appears that this error would not have happened if all African
Zone data had been kept in the file africa (rather than in africa
and in backzone) -- it was not the Zone data of Africa/Juba that
had changed, but only the fact that Africa/Juba was merged with
Africa/Khartoum because their Zone data agreed since 1970.
Unfortunately, this fact was not (and still is not) representable
as an independently changeable item in tzdb.
• Most of the Zone data currently contained in the file backzone were
stored in earlier versions in the "continental" files (africa,
antarctica, asia, australasia, europe, northamerica, southamerica),
and the reason why they are in backzone is that their Zone data
agree with other Zone data since 1970. And since this may well
change in the future, we have to keep these Zone data in backzone
current. The file backzone is an integral part of the tzdb data,
not just a container for additional data of lesser quality.
The quality of any Zone data (and any Rule data) in tzdb should
always be "to the best of our current knowledge" -- it just does
not make sense to keep Zone data in tzdb that are not updated
when we acquire relevant new information for them.
Thus we should get rid of Zone data for Argentina/Rosario etc
(or else update them); keeping data that are known to be wrong
is not only useless, it is an invitation for consequential errors.
• Letting derived data (such as whether two timezones should be merged
because they agree since 1970) decide about the storage location
of the Zone data for a timezone not only implies unnecessary
data moves upon data updates, it may also disrupt commentary text
and its references. Easily understandable comments (such as which
facts were deduced from which document) are crucial for later updates,
where the effect of a newly found document has to be determined,
often after several years, and likely by a different contributor.
• The fact that two timezones agree since 1970 has nothing to do
with the fact that some timezones have changed their names, with the
old names being kept as Links to the new names. Currently, however,
Links representing one or both facts are kept in the same file backward
and cannot be distinguished. This leads to information loss and update
anomalies:
Currently, the file backward has a Link from Africa/Asmera to
Africa/Nairobi -- the information that Asmera is an outdated
spelling of Asmara can only be found in the file backzone.
The name America/Virgin was replaced by America/St_Thomas
in version 95k, and this fact could be seen in the file backward
until 2021a, when this information was lost. It reappeared
(in backzone) only in version 2021c.
Keeping one type of information (spelling changes) in different
locations (files backward or backzone) depending on an independent
condition (that may even change over time) certainly causes
unnecessary maintenance effort.
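
To make the distinction concrete, here is a minimal sketch (in Python, not part of tzdb) of how name changes and "agree since 1970" merges could be recorded as two independent sets of facts, with the Links now found in the file backward derived from them. The record types Rename and MergeSince1970 and the function derive_links are hypothetical names invented for this illustration; the sample entries use the Asmera/Asmara example from above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rename:
    """Hypothetical record: `old` is an outdated name for timezone `new`."""
    old: str
    new: str


@dataclass(frozen=True)
class MergeSince1970:
    """Hypothetical record: `member` agrees with `target` since 1970."""
    member: str
    target: str


def derive_links(renames, merges):
    """Derive {name: final target} from the two independent fact sets."""
    rename_map = {r.old: r.new for r in renames}
    merge_map = {m.member: m.target for m in merges}
    links = {}
    for name in set(rename_map) | set(merge_map):
        target = name
        seen = {name}
        while True:
            # Resolve a rename first, then a merge, until nothing applies.
            nxt = rename_map.get(target) or merge_map.get(target)
            if nxt is None:
                break
            if nxt in seen:
                raise ValueError(f"cycle in link facts at {nxt}")
            seen.add(nxt)
            target = nxt
        links[name] = target
    return links


# Illustrative facts only.
renames = [Rename("Africa/Asmera", "Africa/Asmara")]
merges = [MergeSince1970("Africa/Asmara", "Africa/Nairobi")]

links = derive_links(renames, merges)
links_without_merge = derive_links(renames, [])
```

With both facts present, Africa/Asmera resolves to Africa/Nairobi; if the merge fact is withdrawn, the same rename fact alone still yields Africa/Asmera -> Africa/Asmara, so the spelling change is neither lost nor moved.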
While some of these points may sound like theoretical claims for
normal forms as taught in computer science, my point here is only
practical simplicity: each basic fact to be recorded in tzdb should
have its obvious place where it is stored and where it can be looked
up and updated; and updates of independent facts should be possible
without mutual side effects. This appears to be a necessity for a
collaborative project.
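
As a rough illustration of "one obvious place per fact", the following Python sketch reports Zone names that are defined in more than one source file, which is the kind of duplication described above for Africa/Juba. The functions zone_names and duplicated_zones are hypothetical helpers, the parsing is deliberately simplified (it only looks at lines whose first field is Zone), and the file contents are invented placeholders, not real tzdb data.

```python
from collections import defaultdict


def zone_names(text):
    """Return the names on 'Zone NAME ...' lines, ignoring '#' comments."""
    names = []
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == "Zone":
            names.append(fields[1])
    return names


def duplicated_zones(files):
    """Map each Zone name defined in more than one file to those files."""
    where = defaultdict(list)
    for filename, text in files.items():
        for name in zone_names(text):
            where[name].append(filename)
    return {name: fs for name, fs in where.items() if len(fs) > 1}


# Placeholder contents, invented for this example.
files = {
    "continent": (
        "# a comment line\n"
        "Zone Demo/Alpha 2:00 - XXT\n"
        "Zone Demo/Beta 3:00 - YYT\n"
    ),
    "backzone": (
        "Zone Demo/Alpha 2:00 - XXT 1990\n"
        "\t\t3:00 - YYT\n"
    ),
}

duplicates = duplicated_zones(files)
```

Here duplicates names Demo/Alpha together with both files that define it; a check of this kind could be run before a release, assuming one wanted to enforce a single definition per Zone.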
Last Saturday, Paul Eggert very nicely summarized the history and
some of the guiding principles of tzdb. It is largely due to his
immense work on the maintenance and evolution of tzdb that the
tzdb system was such a tremendous success in its first 30 years.
As a means for the success over the next 30 years, I propose a
simplification of the tzdb schema, so as to avoid the update anomalies
described above, and thus decrease the maintenance burden, currently
mainly shouldered by Paul. The information schema used in the fork
produced by Stephen Colebourne is already much simpler, and it is
apparently what is needed by several power users contributing
to the widespread success of tzdb.
Michael Deckers.