[tz] Gripe about compressed rule set names in tzdata.zi

Mon May 14 01:54:13 UTC 2018

Tom Lane wrote:

> I doubt we'd do better with a different hash.

We can do a bit better; the attached patches use a hash that shrinks the size
of tzdata.zi by about 0.5% compared to the method used in 2018e. This hash
should also avoid needless churn during updates.

> if the ruleset syntax is ever expanded to make punctuation
> have some other meaning, the existing compression rule is going to cause
> forward-compatibility problems.

Good point. The data entries are already using some punctuation characters as 
Rule names and so these characters are fair game, but we should reserve some of 
the never-used characters. The attached proposed patches reserve the characters 
in "!$%&'()*,/:;<=>?@[\]^`{|}~", unless quoted. (However, this restriction is 
not enforced by zic in the attached patches.)
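
As a rough illustration only (zic does not do this in the attached patches), a
check for the reserved set might look like the following C sketch. The name
has_reserved_char is hypothetical, and the sketch assumes the caller passes a
name that was written unquoted.

```c
#include <string.h>

/* The reserved set quoted in this message.  */
static const char reserved_chars[] = "!$%&'()*,/:;<=>?@[\\]^`{|}~";

/* Return 1 if the unquoted NAME contains a reserved character, else 0.
   Hypothetical helper; not part of zic.  */
static int
has_reserved_char(const char *name)
{
  return strpbrk(name, reserved_chars) != NULL;
}
```

For example, has_reserved_char("GB-Eire") yields 0, whereas
has_reserved_char("a:b") yields 1.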

The attached patches also require Rule names to begin with a character that is
not a digit, -, +, or white space. zic already rejected the empty string,
although this was not documented. A Rule name starting with one of these
characters was ambiguous, so I added a check for this to zic.

If anybody uses unusual Rule names, now's a good time to speak up.
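
For anyone who would rather not read the awk, here is a minimal C sketch of the
fallback used in patch 3/4 for names that lack a hand-assigned abbreviation:
take the first two letters and fail on a collision. The table layout and the
names record_abbrev and abbreviate are hypothetical; zishrink.awk itself uses an
awk associative array and exits when it finds a collision.

```c
#include <stdio.h>
#include <string.h>

#define MAX_ABBREVS 512  /* arbitrary limit for this sketch */

/* Hypothetical table of abbreviations already handed out.  */
struct abbrev { char key[3]; const char *name; };
static struct abbrev used[MAX_ABBREVS];
static int n_used;

/* Record KEY (at most two bytes) as the abbreviation of NAME.
   Return 0 on success, -1 if KEY is taken or the table is full.  */
static int
record_abbrev(const char *key, const char *name)
{
  for (int i = 0; i < n_used; i++)
    if (strcmp(used[i].key, key) == 0)
      return -1;
  if (n_used == MAX_ABBREVS)
    return -1;
  snprintf(used[n_used].key, sizeof used[n_used].key, "%.2s", key);
  used[n_used].name = name;
  n_used++;
  return 0;
}

/* Store the first two bytes of NAME in OUT and record the result.
   The caller is assumed to cache results, so each NAME arrives once.  */
static int
abbreviate(const char *name, char out[3])
{
  snprintf(out, 3, "%.2s", name);
  return record_abbrev(out, name);
}
```

With an empty table, abbreviating "Denver" records "De", and a later attempt to
abbreviate "Detroit" reports a collision, which is why the patch assigns "Dt"
to Detroit by hand.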
-------------- next part --------------
From dca171f57aa35b53780cd16caa43bcdb0d4a1b40 Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert at cs.ucla.edu>
Date: Sun, 13 May 2018 14:05:04 -0700
Subject: [PROPOSED 1/4] Constrain Rule names

* NEWS: Mention this.
* zic.8 (DESCRIPTION): Say that Rule names must start with a character
that is neither "-" nor "+" nor a digit; this avoids ambiguity with
integer offsets and disallows empty names.  Say also that unquoted
names should not contain !$%&'()*,/:;<=>?@[\]^`{|}~ to allow for future
extensions.  (Possibility of future extensions noted by Tom Lane.)
---
 NEWS  |  8 ++++++++
 zic.8 | 10 +++++++++-
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/NEWS b/NEWS
index 05db756..56adc24 100644
--- a/NEWS
+++ b/NEWS
@@ -17,6 +17,14 @@ Unreleased, experimental changes
     observed DST in 1942/79, not 1961/80, and there were several
     errors for transition times and dates.  (Thanks to P Chan.)
 
+  Changes to documentation
+
+    New restrictions: A Rule name must start with a character that
+    is neither an ASCII digit nor "-" nor "+", and an unquoted name
+    should not use characters in the set "!$%&'()*,/:;<=>?@[\]^`{|}~".
+    The latter restriction makes room for future extensions (a
+    possibility noted by Tom Lane).
+
   Changes to build procedure
 
     New 'make' target 'rearguard_tarballs' to build the rearguard
diff --git a/zic.8 b/zic.8
index d105b24..5494efb 100644
--- a/zic.8
+++ b/zic.8
@@ -178,7 +178,15 @@ Rule	US	1967	1973	\*-	Apr	lastSun	2:00s	1:00d	D
 The fields that make up a rule line are:
 .TP "\w'LETTER/S'u"
 .B NAME
-Gives the (arbitrary) name of the set of rules this rule is part of.
+Gives the name of the set of rules this rule is part of.
+The name must start with a character that is neither
+an ASCII digit nor
+.q \*-
+nor
+.q + .
+To allow for future extensions,
+an unquoted name should not contain characters from the set
+.q !$%&'()*,/:;<=>?@[\e]^`{|}~ .
 .TP
 .B FROM
 Gives the first year in which the rule applies.
-- 
2.7.4

-------------- next part --------------
From e4960431d463b4ab1d1a06a4dd55bea7cbccecc1 Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert at cs.ucla.edu>
Date: Sun, 13 May 2018 18:21:09 -0700
Subject: [PROPOSED 2/4] Check for invalid rule name initials
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* zic.c (inrule): Check that rule names do not begin with
now-forbidden characters.  The rule about what unquoted rule names
can contain is harder to check, so don’t bother with that now.
---
 zic.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/zic.c b/zic.c
index 31f1092..0c4c384 100644
--- a/zic.c
+++ b/zic.c
@@ -1278,8 +1278,13 @@ inrule(char **fields, int nfields)
 		error(_("wrong number of fields on Rule line"));
 		return;
 	}
-	if (*fields[RF_NAME] == '\0') {
-		error(_("nameless rule"));
+	switch (*fields[RF_NAME]) {
+	  case '\0':
+	  case ' ': case '\f': case '\n': case '\r': case '\t': case '\v':
+	  case '+': case '-':
+	  case '0': case '1': case '2': case '3': case '4':
+	  case '5': case '6': case '7': case '8': case '9':
+		error(_("Invalid rule name \"%s\""), fields[RF_NAME]);
 		return;
 	}
 	r.r_filename = filename;
-- 
2.7.4

-------------- next part --------------
From f73e9da343cd48a7ab094a1a5b2ab19ae8fb08d9 Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert at cs.ucla.edu>
Date: Sun, 13 May 2018 14:15:46 -0700
Subject: [PROPOSED 3/4] Stabilize rule name abbreviations
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Problem reported by Tom Lane in:
https://mm.icann.org/pipermail/tz/2018-May/026469.html
Instead of Lane’s simple proposal, use a more-complex hash that
shortens the overall output of zishrink.awk and generates output
that is easier for humans to remember.
* NEWS: Mention this.
* zishrink.awk (record_hash, prehash_rule_names): New functions.
(gen_rule_name): New arg NAME.  All uses changed.
Use a simple mnemonic: the first two letters.
Check for collisions by calling record_hash.
(BEGIN): Initialize hash table.
---
 NEWS         |   3 ++
 zishrink.awk | 159 +++++++++++++++++++++++++++++++++++++++++++++++++++--------
 2 files changed, 142 insertions(+), 20 deletions(-)

diff --git a/NEWS b/NEWS
index 56adc24..a91f00a 100644
--- a/NEWS
+++ b/NEWS
@@ -32,6 +32,9 @@ Unreleased, experimental changes
     if you want to build the rearguard tarball.  (Problem reported by
     Deborah Goldsmith.)
 
+    tzdata.zi is now more stable from release to release.  (Problem
+    noted by Tom Lane.)  It is also a bit shorter.
+
 
 Release 2018e - 2018-05-01 23:42:51 -0700
 
diff --git a/zishrink.awk b/zishrink.awk
index d617644..21c71c0 100644
--- a/zishrink.awk
+++ b/zishrink.awk
@@ -6,28 +6,146 @@
 # 'zic' should treat this script's output as if it were identical to
 # this script's input.
 
+# Record a hash N for the new name NAME, checking for collisions.
 
-# Return a new rule name.
-# N_RULE_NAMES keeps track of how many rule names have been generated.
+function record_hash(n, name)
+{
+  if (used_hashes[n]) {
+    printf "# ! collision: %s %s\n", used_hashes[n], name
+    exit 1
+  }
+  used_hashes[n] = name
+}
+
+# Return a shortened rule name representing NAME,
+# and record this relationship to the hash table.
+
+function gen_rule_name(name, n)
+{
+  # Use a simple mnemonic: the first two letters.
+  n = substr(name, 1, 2)
+  record_hash(n, name)
+  # printf "# %s = %s\n", n, name
+  return n
+}
 
-function gen_rule_name(alphabet, base, rule_name, n, digit)
+function prehash_rule_names(name)
 {
-  alphabet = ""
-  alphabet = alphabet "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
-  alphabet = alphabet "abcdefghijklmnopqrstuvwxyz"
-  alphabet = alphabet "!$%&'()*+,./:;<=>?@[\\]^_`{|}~"
-  base = length(alphabet)
-  rule_name = ""
-  n = n_rule_names++
-
-  do {
-    n -= rule_name && n <= base
-    digit = n % base
-    rule_name = substr(alphabet, digit + 1, 1) rule_name
-    n = (n - digit) / base
-  } while (n);
-
-  return rule_name
+  # Rule names are not part of the tzdb API, so substitute shorter
+  # ones.  Shortening them consistently from one release to the next
+  # simplifies comparison of the output.  That being said, the
+  # 1-letter names below are not standardized in any way, and can
+  # change arbitrarily from one release to the next, as the main goal
+  # here is compression not comparison.
+
+  # Abbreviating these rule names to one letter saved the most space
+  # circa 2018e.
+  rule["Arg"] = "A"
+  rule["Brazil"] = "B"
+  rule["Canada"] = "C"
+  rule["Denmark"] = "D"
+  rule["EU"] = "E"
+  rule["France"] = "F"
+  rule["GB-Eire"] = "G"
+  rule["Halifax"] = "H"
+  rule["Italy"] = "I"
+  rule["Jordan"] = "J"
+  rule["Egypt"] = "K" # "Kemet" in ancient Egyptian
+  rule["Libya"] = "L"
+  rule["Morocco"] = "M"
+  rule["Neth"] = "N"
+  rule["Poland"] = "O" # arbitrary
+  rule["Palestine"] = "P"
+  rule["Cuba"] = "Q" # Its start sounds like "Q".
+  rule["Russia"] = "R"
+  rule["Syria"] = "S"
+  rule["Turkey"] = "T"
+  rule["Uruguay"] = "U"
+  rule["Vincennes"] = "V"
+  rule["Winn"] = "W"
+  rule["Mongol"] = "X" # arbitrary
+  rule["NT_YK"] = "Y"
+  rule["Zion"] = "Z"
+  rule["Austria"] = "a"
+  rule["Belgium"] = "b"
+  rule["C-Eur"] = "c"
+  rule["Algeria"] = "d" # country code DZ
+  rule["E-Eur"] = "e"
+  rule["Taiwan"] = "f" # Formosa
+  rule["Greece"] = "g"
+  rule["Hungary"] = "h"
+  rule["Iran"] = "i"
+  rule["StJohns"] = "j"
+  rule["Chatham"] = "k" # arbitrary
+  rule["Lebanon"] = "l"
+  rule["Mexico"] = "m"
+  rule["Tunisia"] = "n" # country code TN
+  rule["Moncton"] = "o" # arbitrary
+  rule["Port"] = "p"
+  rule["Albania"] = "q"
+  rule["Regina"] = "r"
+  rule["Spain"] = "s"
+  rule["Toronto"] = "t"
+  rule["US"] = "u"
+  rule["Louisville"] = "v" # ville
+  rule["Iceland"] = "w" # arbitrary
+  rule["Chile"] = "x" # arbitrary
+  rule["Para"] = "y" # country code PY
+  rule["Romania"] = "z" # arbitrary
+  rule["Macau"] = "_" # arbitrary
+
+  # Use ISO 3166 alpha-2 country codes for remaining names that are countries.
+  # This is more systematic, and avoids collisions (e.g., Malta and Moldova).
+  rule["Armenia"] = "AM"
+  rule["Aus"] = "AU"
+  rule["Azer"] = "AZ"
+  rule["Barb"] = "BB"
+  rule["Dhaka"] = "BD"
+  rule["Bulg"] = "BG"
+  rule["Bahamas"] = "BS"
+  rule["Belize"] = "BZ"
+  rule["Swiss"] = "CH"
+  rule["Cook"] = "CK"
+  rule["PRC"] = "CN"
+  rule["Cyprus"] = "CY"
+  rule["Czech"] = "CZ"
+  rule["Germany"] = "DE"
+  rule["DR"] = "DO"
+  rule["Ecuador"] = "EC"
+  rule["Finland"] = "FI"
+  rule["Fiji"] = "FJ"
+  rule["Falk"] = "FK"
+  rule["Ghana"] = "GH"
+  rule["Guat"] = "GT"
+  rule["Hond"] = "HN"
+  rule["Haiti"] = "HT"
+  rule["Eire"] = "IE"
+  rule["Iraq"] = "IQ"
+  rule["Japan"] = "JP"
+  rule["Kyrgyz"] = "KG"
+  rule["ROK"] = "KR"
+  rule["Latvia"] = "LV"
+  rule["Lux"] = "LX"
+  rule["Moldova"] = "MD"
+  rule["Malta"] = "MT"
+  rule["Mauritius"] = "MU"
+  rule["Namibia"] = "NA"
+  rule["Nic"] = "NI"
+  rule["Norway"] = "NO"
+  rule["Peru"] = "PE"
+  rule["Phil"] = "PH"
+  rule["Pakistan"] = "PK"
+  rule["Sudan"] = "SD"
+  rule["Salv"] = "SV"
+  rule["Tonga"] = "TO"
+  rule["Vanuatu"] = "VU"
+
+  # Avoid collisions.
+  rule["Detroit"] = "Dt" # De = Denver
+
+  for (name in rule) {
+    record_hash(rule[name], name)
+  }
 }
 
 # Process an input line and save it for later output.
@@ -106,7 +224,7 @@ function process_input_line(line, field, end, i, n, startdef)
   i = field[1] == "Z" ? 4 : field[1] == "Li" ? 0 : 2
   if (i && field[i] ~ /^[^-+0-9]/) {
     if (!rule[field[i]])
-      rule[field[i]] = gen_rule_name()
+      rule[field[i]] = gen_rule_name(field[i])
     field[i] = rule[field[i]]
   }
 
@@ -146,6 +264,7 @@ function output_saved_lines(i)
 BEGIN {
   print "# version", version
   print "# This zic input file is in the public domain."
+  prehash_rule_names()
 }
 
 /^[\t ]*[^#\t ]/ {
-- 
2.7.4

-------------- next part --------------
From b9191cb16bf13fc8dcb9b14d0ecca8d662b48ee9 Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert at cs.ucla.edu>
Date: Sun, 13 May 2018 17:47:25 -0700
Subject: [PROPOSED 4/4] Shrink SpainAfrica away
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* europe (Africa/Ceuta): Add no-op line that clarifies when Ceuta
reportedly used Morocco rules; this helps zishrink.awk.
* zishrink.awk (process_input_line): Omit SpainAfrica rules,
as they duplicate Morocco’s.
---
 europe       | 1 +
 zishrink.awk | 5 +++++
 2 files changed, 6 insertions(+)

diff --git a/europe b/europe
index 6994ed8..ed5ebb3 100644
--- a/europe
+++ b/europe
@@ -3424,6 +3424,7 @@ Zone	Africa/Ceuta	-0:21:16 -	LMT	1900 Dec 31 23:38:44
 			 0:00	1:00	WEST	1918 Oct  7 23:00
 			 0:00	-	WET	1924
 			 0:00	Spain	WE%sT	1929
+			 0:00	-	WET	1967 # Help zishrink.awk.
 			 0:00 SpainAfrica WE%sT	1984 Mar 16
 			 1:00	-	CET	1986
 			 1:00	EU	CE%sT
diff --git a/zishrink.awk b/zishrink.awk
index 21c71c0..9f07f0c 100644
--- a/zishrink.awk
+++ b/zishrink.awk
@@ -172,6 +172,11 @@ function process_input_line(line, field, end, i, n, startdef)
     if (line ~ /^R /) return
     line = substr(line, 1, RSTART) substr(line, RSTART + 5)
   }
+  # Replace SpainAfrica rules with Morocco, as they are duplicates.
+  if (match(line, / SpainAfrica /)) {
+    if (line ~ /^R /) return
+    line = substr(line, 1, RSTART) "Morocco" substr(line, RSTART + RLENGTH - 1)
+  }
 
   # Abbreviate times.
   while (match(line, /[: ]0+[0-9]/))
-- 
2.7.4