[tz] Gripe about compressed rule set names in tzdata.zi
tgl at sss.pgh.pa.us
Fri May 11 18:15:56 UTC 2018
Paul Eggert <eggert at cs.ucla.edu> writes:
> We could make tzdata.zi more readable along the lines that you suggest,
> presumably as a runtime option to zishrink.awk and a corresponding
> Makefile macro to let builders select whether they want tzdata.zi to be
> smaller and less-readable, or larger and more-readable.
Hm, I'm inclined to think that that's overkill. With the patch I propose
below, the size of tzdata.zi for 2018e grows from 106908 to 108256 bytes,
or a 1.26% increase; it doesn't seem worth complicating builders' lives
still more to offer an option to avoid that.
What I did was just to hash the input ruleset names, remove collisions
through an open-chaining adjustment, and generate new names by converting
the hashes back to strings. The hash function is pretty trivial, but
I'm not sure it's worth working harder (or practical to do anything more
interesting in awk, anyway). I get three collisions on the current set:
Collision between Russia and Algeria at hash 641
Collision between NZ and SA at hash 96
Collision between Ecuador and Iraq at hash 2403
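For anyone who wants to see the shape of the scheme without reading awk, here is a minimal sketch in Python. The hash function (toy_hash), the step-to-the-next-free-slot collision handling, and all the function names are assumptions made for illustration; they are not taken from the attached patch, so this will not reproduce the three collisions listed above.

```python
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase  # 52 letters only
SPACE = len(ALPHABET) ** 2  # 2704 possible two-letter names


def toy_hash(name):
    """Hypothetical stand-in for the patch's hash: a cheap polynomial hash."""
    h = 0
    for ch in name:
        h = (h * 31 + ord(ch)) % SPACE
    return h


def to_abbrev(h):
    """Convert a hash value in [0, SPACE) back to a two-letter string."""
    return ALPHABET[h // len(ALPHABET)] + ALPHABET[h % len(ALPHABET)]


def abbreviate(names):
    """Assign each distinct name a unique two-letter abbreviation.

    On a collision, step to the next free slot (assumed strategy), so
    only the later of two colliding names is displaced.
    """
    if len(set(names)) > SPACE:
        raise ValueError("more names than available abbreviations")
    taken = {}   # slot -> name that owns it
    result = {}  # name -> abbreviation
    for name in names:
        if name in result:
            continue  # already assigned
        h = toy_hash(name)
        while h in taken:
            h = (h + 1) % SPACE
        taken[h] = name
        result[name] = to_abbrev(h)
    return result
```

The point of the design is that a name's abbreviation depends only on its own hash, plus whatever earlier names happened to land on the same slot, rather than on its position in the full list of names.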
Given that we're trying to map 134 names into a space of 2704 hash values,
some collisions are practically inevitable, and so I doubt we'd do better
with a different hash.
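As a rough check on that, the usual birthday-problem estimate can be worked out directly. This sketch assumes an ideal hash that spreads names uniformly over the 2704 values, which a trivial hash only approximates; the function names are made up for the example.

```python
from math import comb, prod


def expected_colliding_pairs(n, m):
    """Expected number of pairs sharing a hash, for n names and m slots."""
    return comb(n, 2) / m


def prob_no_collision(n, m):
    """Probability that n uniformly hashed names all land on distinct slots."""
    return prod((m - i) / m for i in range(n))


print(expected_colliding_pairs(134, 52 ** 2))  # about 3.3
print(prob_no_collision(134, 52 ** 2))         # a few percent
```

So about three colliding pairs is what an ideal hash would be expected to give at this load, and a collision-free assignment would be a matter of luck.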
Note that I'm mapping the names into upper and lower case letters only.
We could reduce the probability of a collision a little by also using
punctuation as the current code does, but I think that that's not actually
a good design: if the ruleset syntax is ever expanded to make punctuation
have some other meaning, the existing compression rule is going to cause
forward-compatibility problems. Still, if you're convinced that that will
never happen, the attached patch can easily be adjusted to restore the
larger output alphabet.
With this approach, I estimate that there's at most about a 5% chance of
a new ruleset name causing one existing ruleset's abbreviation to change,
and a very small chance of it affecting more than one existing ruleset.
So that's a great deal better than the existing way as far as the
stability of the tzdata.zi representation goes.
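The 5% figure is just the occupancy of the hash space: a new name can only disturb an existing abbreviation if its hash lands on a slot that is already in use. A quick sketch of the arithmetic, assuming a uniform hash and ignoring any slight clustering from collision resolution (the function name is hypothetical):

```python
def collision_chance(existing, space):
    """Chance that one new, uniformly hashed name hits an occupied slot."""
    return existing / space


p_hit = collision_chance(134, 52 ** 2)
print(f"{p_hit:.2%}")  # 4.96%
```

This is an upper bound on the chance of an existing abbreviation changing, since a hit only matters if the new name is the one that keeps the slot.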
regards, tom lane
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 2133 bytes