[tz] Non-ASCII outside comments?

Thu Jun 26 21:28:31 UTC 2014

Guy Harris wrote:
> But perhaps the documentation should indicate that:

Thanks, I gave that a shot with the attached patch.
-------------- next part --------------
From 100a7709130ef43afaf17debc637db846e12efbd Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert at cs.ucla.edu>
Date: Thu, 26 Jun 2014 14:23:53 -0700
Subject: [PATCH] * zic.8, NEWS: Document character encoding issues better.

(Thanks to Guy Harris for reporting the problem.)
---
 NEWS  |  5 +++--
 zic.8 | 18 +++++++++++++++++-
 2 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/NEWS b/NEWS
index e87e2b7..2983cc0 100644
--- a/NEWS
+++ b/NEWS
@@ -28,8 +28,9 @@ Unreleased, experimental changes
 
     Documentation and commentary now prefer UTF-8 to US-ASCII,
     allowing the use of proper accents in foreign words and names.
-    Code and data have not changed because of this.
-    (Thanks to Garrett Wollman and Ian Abbott for helping to debug this.)
+    Code and data have not changed because of this.  (Thanks to
+    Garrett Wollman, Ian Abbott, and Guy Harris for helping to debug
+    this.)
 
     Non-HTML documentation and commentary now use plain-text URLs instead of
     HTML insertions, and are more consistent about bracketing URLs when they
diff --git a/zic.8 b/zic.8
index e22e6cd..2a1d29e 100644
--- a/zic.8
+++ b/zic.8
@@ -113,7 +113,7 @@ before 1970 or after the start of 2038.
 A time zone abbreviation has fewer than 3 characters.
 POSIX requires at least 3.
 .PP
-An output file name contains a byte that is not an ASCII letter, digit,
+An output file name contains a byte that is not an ASCII letter,
 .q "-" ,
 .q "/" ,
 or
@@ -135,8 +135,24 @@ rather than
 .B yearistype
 when checking year types (see below).
 .PP
+Input files should be text files, that is, they should be a series of
+zero or more lines, each ending in a newline byte and containing at
+most 511 bytes, and without any NUL bytes.  The input text's encoding
+is typically UTF-8 or ASCII; it should have a unibyte representation
+for the POSIX Portable Character Set (PPCS)
+<http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap06.html>
+and the encoding's non-unibyte characters should consist entirely of
+non-PPCS bytes.  Non-PPCS characters typically occur only in comments:
+although output file names and time zone abbreviations can contain
+nearly any character, other software will work better if these are
+limited to the restricted syntax described under the
+.B \-v
+option.
+.PP
 Input lines are made up of fields.
 Fields are separated from one another by one or more white space characters.
+The white space characters are space, form feed, carriage return, newline,
+tab, and vertical tab.
 Leading and trailing white space on input lines is ignored.
 An unquoted sharp character (#) in the input introduces a comment which extends
 to the end of the line the sharp character appears on.
-- 
1.9.1
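
The text-file rules and the white space set that the zic.8 hunk above documents can be checked mechanically. What follows is only a rough sketch, not code from zic itself. The names check_zic_input and split_fields are hypothetical. The sketch assumes the 511-byte limit counts the terminating newline, which is one possible reading of the wording. The field splitter ignores quoting and '#' comments.

```python
# Hypothetical helpers, not part of zic: they mirror the input rules
# described in the zic.8 patch text.

# Assumption: the 511-byte limit includes the terminating newline byte.
MAX_LINE_BYTES = 511

# Space, form feed, carriage return, newline, tab, vertical tab.
ZIC_WHITE_SPACE = b" \f\r\n\t\v"


def check_zic_input(data: bytes) -> list[str]:
    """Return a list of problems found in data; the list is empty if none."""
    problems = []
    if b"\0" in data:
        problems.append("contains a NUL byte")
    lines = data.split(b"\n")
    if lines[-1] == b"":
        lines.pop()  # data was empty or ended with a newline
    else:
        problems.append("last line does not end in a newline")
    for number, line in enumerate(lines, start=1):
        if len(line) + 1 > MAX_LINE_BYTES:  # +1 for the newline byte
            problems.append(f"line {number} exceeds {MAX_LINE_BYTES} bytes")
    return problems


def split_fields(line: bytes) -> list[bytes]:
    """Split one line into fields on runs of white space.

    Simplification: quoting and '#' comments are not handled.
    """
    fields = []
    current = bytearray()
    for byte in line:
        if byte in ZIC_WHITE_SPACE:
            if current:
                fields.append(bytes(current))
                current.clear()
        else:
            current.append(byte)
    if current:
        fields.append(bytes(current))
    return fields
```

Working on bytes rather than decoded text matches the patch's wording, which states the limits in bytes and only asks that the encoding represent each PPCS character as a single byte.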