[tz] Non-ASCII outside comments?
eggert at cs.ucla.edu
Thu Jun 26 21:28:31 UTC 2014
Guy Harris wrote:
> But perhaps the documentation should indicate that:
Thanks, I gave that a shot with the attached patch.
-------------- next part --------------
From 100a7709130ef43afaf17debc637db846e12efbd Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert at cs.ucla.edu>
Date: Thu, 26 Jun 2014 14:23:53 -0700
Subject: [PATCH] * zic.8, NEWS: Document character encoding issues better.
(Thanks to Guy Harris for reporting the problem.)
NEWS | 5 +++--
zic.8 | 18 +++++++++++++++++-
2 files changed, 20 insertions(+), 3 deletions(-)
diff --git a/NEWS b/NEWS
index e87e2b7..2983cc0 100644
@@ -28,8 +28,9 @@ Unreleased, experimental changes
Documentation and commentary now prefer UTF-8 to US-ASCII,
allowing the use of proper accents in foreign words and names.
- Code and data have not changed because of this.
- (Thanks to Garrett Wollman and Ian Abbott for helping to debug this.)
+ Code and data have not changed because of this. (Thanks to
+ Garrett Wollman, Ian Abbott, and Guy Harris for helping to debug
+ this.)
Non-HTML documentation and commentary now use plain-text URLs instead of
HTML insertions, and are more consistent about bracketing URLs when they
diff --git a/zic.8 b/zic.8
index e22e6cd..2a1d29e 100644
@@ -113,7 +113,7 @@ before 1970 or after the start of 2038.
A time zone abbreviation has fewer than 3 characters.
POSIX requires at least 3.
-An output file name contains a byte that is not an ASCII letter, digit,
+An output file name contains a byte that is not an ASCII letter,
.q "-" ,
.q "/" ,
@@ -135,8 +135,24 @@ rather than
when checking year types (see below).
+Input files should be text files, that is, they should be a series of
+zero or more lines, each ending in a newline byte and containing at
+most 511 bytes, and without any NUL bytes. The input text's encoding
+is typically UTF-8 or ASCII; it should have a unibyte representation
+for the POSIX Portable Character Set (PPCS)
+and the encoding's non-unibyte characters should consist entirely of
+non-PPCS bytes. Non-PPCS characters typically occur only in comments:
+although output file names and time zone abbreviations can contain
+nearly any character, other software will work better if these are
+limited to the restricted syntax described under the
Input lines are made up of fields.
Fields are separated from one another by one or more white space characters.
+The white space characters are space, form feed, carriage return, newline,
+tab, and vertical tab.
Leading and trailing white space on input lines is ignored.
An unquoted sharp character (#) in the input introduces a comment which extends
to the end of the line the sharp character appears on.
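
For readers of the archive, the input rules documented by the zic.8 hunk above can be sketched in a few lines of Python. This is only an illustration, not code from zic.c: the names check_text_file, split_fields, MAX_LINE_BYTES and WHITE_SPACE are hypothetical, the 511-byte limit is assumed to count the terminating newline, and double quotes are assumed to be the quoting character that makes a sharp character or white space part of a field.

```python
# Illustrative sketch only; helper names are hypothetical, not taken from zic.c.

# Limit quoted in the patched zic.8 text. Assumption: it includes the newline.
MAX_LINE_BYTES = 511

# White space bytes listed in the patch: space, form feed, carriage return,
# newline, tab, and vertical tab.
WHITE_SPACE = b" \f\r\n\t\v"


def check_text_file(data: bytes) -> list[str]:
    """Return a list of ways in which data fails the 'text file' description."""
    problems = []
    if b"\0" in data:
        problems.append("contains a NUL byte")
    lines = data.split(b"\n")
    # Whatever follows the final newline; empty for a well-formed file.
    tail = lines.pop()
    if tail:
        problems.append("last line does not end in a newline")
    for number, line in enumerate(lines, start=1):
        # +1 accounts for the newline byte removed by split().
        if len(line) + 1 > MAX_LINE_BYTES:
            problems.append(f"line {number} is longer than {MAX_LINE_BYTES} bytes")
    return problems


def split_fields(line: bytes) -> list[bytes]:
    """Split one input line into fields, dropping an unquoted '#' comment."""
    fields = []
    current = bytearray()
    in_field = False
    in_quotes = False
    for byte in line:
        ch = bytes([byte])
        if ch == b'"':
            # Assumption: double quotes toggle quoting and are not kept.
            in_quotes = not in_quotes
            in_field = True
        elif in_quotes:
            current += ch
        elif ch == b"#":
            break  # the comment runs to the end of the line
        elif ch in WHITE_SPACE:
            if in_field:
                fields.append(bytes(current))
                current.clear()
                in_field = False
        else:
            current += ch
            in_field = True
    if in_field:
        fields.append(bytes(current))
    return fields
```

Working on bytes rather than decoded text follows the patch's point that the encoding only needs single-byte representations for the portable character set, so a UTF-8 comment passes through without being interpreted.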