[tz] winnowing stats

Mon Sep 2 22:59:53 UTC 2013

Here are some stats on how much difference winnowing makes to the dataset.
Baseline: the source files define 441 distinct zones.  Winnowing then
reduces the number of distinct zones thus (threshold year on the left,
remaining zone count on the right):

	0000    440
	1880    440
	1890    439
	1900    439
	1910    437
	1920    435
	1930    435
	1940    434
	1950    425
	1960    423
	1970    417
	1980    391
	1990    365
	2000    339
	2010    313
	2020    305

The reduction by one zone for a threshold year of 0000 comes from
Pacific/Johnston, a US minor outlying island, which is defined with data
identical to the "HST" zone.  (This is contrary to the usual practice of
using LMT for the first segment of geographical zones.)  As the two
tzfiles are byte-for-byte identical, tzwinnow will merge them regardless
of date thresholds.
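
To illustrate the byte-for-byte case only, here is a minimal Python
sketch that groups files with identical contents by hashing them.  It is
not how tzwinnow is implemented; the function name
group_identical_files and the use of SHA-256 are my own assumptions,
chosen purely for illustration.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def group_identical_files(root):
    """Return lists of relative paths under root whose contents are identical.

    Hypothetical helper for illustration; not part of tzwinnow.
    """
    root = Path(root)
    groups = defaultdict(list)
    # Sort so that the order of names within each group is deterministic.
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path.relative_to(root).as_posix())
    # Only groups with more than one member are candidates for merging.
    return [names for names in groups.values() if len(names) > 1]
```

Pointed at a directory of compiled tzfiles, this would be expected to
report Pacific/Johnston and HST in the same group, along with any other
files that happen to be identical (including existing links).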

The results for a threshold in the future (305 zones at 2020) set a
lower limit on the size to which the database can currently be reduced
by this mechanism.  Relative to the present full database of 441 zones,
that's a reduction of about 31%, which is only a modest gain.  Using a
threshold later than 1970 for installation purposes will probably only
be attractive in a few more decades' time, when there have been many
more zone splits arising from contemporary activity.

A threshold later than 1970 is currently much more valuable for
tzselect purposes.  Here it's saving human cognitive load rather than
storage space.  The reduction in the number of zones is concentrated
disproportionately in the countries that have the greatest complexity.
This is particularly noticeable with the US zones, the list of which is
quite unwieldy in unwinnowed form.

Over the period where we attempt complete coverage (1970 to today),
the rate at which the number of distinct zones changes is amazingly
consistent: each ten-year advance of the threshold from 1970 to 2010
removes exactly 26 zones.  A similar pattern emerges when winnowing
with a varying upper date limit, showing that there's a long-term roughly
constant rate of zone churn.  The number of zones distinct within a single
decade is also roughly constant, around 330.  The number distinct within
a single year hovers around 300, possibly showing a slight rising trend,
but I suspect that's an artifact of incomplete data.  This suggests that
a strategy of winnowing with a moving threshold that remains N years ago
will produce a roughly constant zone count.  By contrast, the number
of zones differing at any time post-1970 (currently 417), or from any
other fixed threshold, can grow without bound, and it looks like it will.
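
The per-decade figure can be checked directly against the table of
zone counts.  The following is a small Python sketch of that
arithmetic; the counts are copied from the table, while the names
counts and decade_drops are just illustrative.

```python
# Distinct zones remaining after winnowing, keyed by threshold year
# (values copied from the table in this message).
counts = {
    0: 440, 1880: 440, 1890: 439, 1900: 439, 1910: 437, 1920: 435,
    1930: 435, 1940: 434, 1950: 425, 1960: 423, 1970: 417, 1980: 391,
    1990: 365, 2000: 339, 2010: 313, 2020: 305,
}


def decade_drops(counts, start, end):
    """Return {(year, year + 10): zones lost} for thresholds start..end."""
    return {
        (year, year + 10): counts[year] - counts[year + 10]
        for year in range(start, end, 10)
    }


drops = decade_drops(counts, 1970, 2010)
for (first, second), lost in drops.items():
    print(f"{first} -> {second}: {lost} zones")
```

Each of the four steps from 1970 to 2010 comes out at 26.  The step
from 2010 to 2020 is only 8, as expected for a decade that has barely
begun.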


