[ietf-charsets] [art] Fwd: [IANA #1297322] IANA characters-sets US-ASCII entry incorrect

Tue Dec 26 00:42:19 UTC 2023

Hello Steffen, others,

On 2023-12-21 04:54, Steffen Nurpmeso wrote:
> [i take iana completely off]

For the record, the settings of the ietf-charsets at iana.org mailing 
list have been changed so that mail from non-subscribers is moderated 
rather than simply discarded. IANA will see how long they can keep this 
up (it depends on the amount of spam they get).

> Martin J. Dürst wrote in
>   <6a5eaaab-0c5e-4751-b007-090ce9b35a27 at it.aoyama.ac.jp>:
>   |To everybody interested in the recent discussion on the character set
>   |registry, in particular the (absence of the) entry "ASCII".
>   |
>   |Many thanks to IANA, and in particular Sabrina Tanamal, for digging up
>   |the relevant correspondence from 10 and 20 years ago.
> 
> Is it normal that bugtracker participants are not included in
> follow-up communication?  I have seen nothing.  Hm.  Only
> wondering.  But i agree, of course.

That mail from Sabrina was only sent to me. That's why I forwarded it.

>   |Please accept my apology for not remembering this correspondence and
>   |therewith seriously confusing the discussion.
>   |
>   |My summary based on this new information is as follows:
>   |
>   |- Ned Freed sent a request to IANA in February 2003 concerning the
>   |   entry for "US-ASCII" and related aliases in the Character Set
>   |   registry, requesting (among else) the removal of the alias "ASCII".
>   |   The request for this removal was based on the fact that RFC 2046 says
>   |   'The character set name "ASCII" is reserved and must not be used for
>   |   any purpose.'.
>   |
>   |- When IANA was moving their registries from .txt to .xml, this request
>   |   was rediscovered and acted upon. Both Ned and me agreed with the
>   |   removal of "ASCII". We decided that there was no need to inform
>   |   the ietf-charset mailing list, which in hindsight was probably a
>   |   mistake (not the least because it would have had the potential to
>   |   shorten the current discussion by quite a bit).
>   |
>   |Given the fact that RFC 2046 clearly says that 'The character set name
>   |"ASCII" is reserved and must not be used for any purpose.', I think that
>   |the only choice is to leave the registry as it is.
>   |
>   |<charset reviewer hat on>
>   |I'm of course ready to reevaluate this and adding this label back in if
>   |anybody is able to come up with really strong and convincing arguments
>   |to do so.
>   |<charset reviewer hat off>
> 
> So i would ask to please do this.

To put the conclusion first, I'm sorry, but you haven't convinced me. As 
John Klensin puts it in a very recent mail, you would have to create an 
internet-draft and gain IETF consensus on it to update RFC 2046.

> The term "ASCII" is in use for longer than my lifespan, and
> i cannot understand why a paragraph in some MIME-related RFC
> overrules a generic, omnipresent standard that was established in
> 1968.

[Copying your last sentence here, because this simplifies my answer.]

 > But again, if it would be only about MIME, i would be fine, and
 > end just with this rhyme.

My understanding is that this is indeed only or mostly about MIME. I 
included 'mostly', because the registry also covers MIB, which is 
certainly not MIME, and there may be other IETF or external standards 
referring to it.

Other standards groups and software (OSes, libraries, languages, and so 
on) are a different matter.

As an example, you mention the character encoding conversion library 
iconv below. One version on one of my systems (iconv (Ubuntu GLIBC 
2.35-0ubuntu3.1) 2.35) produces, with `iconv -l`, a list of 1179 labels 
for encodings. That includes many labels that are easy to recognize as 
aliases, such as 8859_1, ISO-8859-1*, ISO8859-1, ISO88591, 
ISO_8859-1*, ISO-8859-1:1987*. Only the ones marked with a * (plus some 
others that I'm sure iconv also listed but that I didn't have the time 
to check) are in the IANA registry.

Overall, as mentioned above, my iconv version listed 1179 labels, and 
the IANA registry lists 887 unique names/preferred MIME names/aliases. 
There is quite some overlap, but there are 626 labels listed by iconv 
that are not in the IANA registry, and there are 334 aliases in the IANA 
registry that are not listed as labels by iconv. I haven't done an 
analysis of how many actual encodings these numbers cover. I'd have to 
dive into the iconv source code, and I'd rather avoid doing this because 
I think it doesn't really contribute much to this discussion.
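
For anyone who wants to repeat this kind of comparison, here is a 
minimal sketch in Python. It is not the script I used; the function 
names and the two sample lists are made up for illustration, the samples 
are tiny hand-picked subsets rather than the real lists, and it assumes 
that labels should be compared case-insensitively. The last line only 
checks that the totals quoted above are consistent: both subtractions 
give 553 shared labels.

```python
def normalize(label: str) -> str:
    """Normalize a label for comparison (assumption: case-insensitive)."""
    return label.strip().upper()


def compare_labels(iconv_labels, iana_labels):
    """Return the shared labels and the labels unique to each list."""
    iconv_set = {normalize(l) for l in iconv_labels if l.strip()}
    iana_set = {normalize(l) for l in iana_labels if l.strip()}
    return {
        "common": iconv_set & iana_set,
        "only_iconv": iconv_set - iana_set,
        "only_iana": iana_set - iconv_set,
    }


# Hypothetical miniature inputs, NOT the real `iconv -l` output or registry.
iconv_sample = ["8859_1", "ISO-8859-1", "ISO8859-1", "ISO88591", "ISO_8859-1"]
iana_sample = ["ISO-8859-1", "ISO_8859-1", "US-ASCII"]

result = compare_labels(iconv_sample, iana_sample)
for key in ("common", "only_iconv", "only_iana"):
    print(key, sorted(result[key]))

# Consistency check on the totals quoted in the mail:
# iconv total minus iconv-only equals registry total minus registry-only.
assert 1179 - 626 == 887 - 334 == 553
```

With real data, one list would come from the output of `iconv -l` and 
the other from the names and aliases in the registry file; how to split 
those files into labels is left out here.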

> I want to point to standards like POSIX from the OpenGroup, and to
> other billions of distributed packages, including manuals, which
> refer to the name ASCII as a shorter synonym (and not more) to the
> standard ANSI_X3.4-1968, and its further line that goes to
> ANSI_X3.4-1986, ISO_646.irv:1991, ISO646-US etc.

That's just fine. The IETF cannot and doesn't dictate how other 
standards organizations or software producers write their documents.

> It is something different if the _IANA character set database_
> manages _(preferred) MIME alias names_ in conjunction with
> character sets, and it is absolutely fine if some MIME RFC gives
> or takes on specific aliases.  But ASCII is plain forever meaning
> what it means, since 1968.

Apparently not. John Klensin gave a very recent example. At least in the 
1990s, "ASCII" must have been used in many ways, or we wouldn't have the 
very specific sentence in RFC 2046. I can easily imagine people having 
said "this file is ASCII, not EBCDIC", and the reason such a phrase is 
rare today is only that EBCDIC is rare today.

> Maybe i understand the IANA character set database wrong?
> I would understand this decision of yours in respect to the
> "preferred MIME name".
> 
> But Let me for example point to the really great ones from Bell
> Labs, who gained the U.S. "National Medal of Technology" not for
> nitpicking or overengineering, and for example, five days before
> i was born, on 1972-06-20, during Research V1 development, created
> the file u9.s with content as
> 
> +       bisb    200,r1 / if entry is less than 0 add 128 to ASCII
> +                      / code for char to be output
> 
> +       bit     $1,r1 / is bit 0 of ASCII char = 1 (char = lf)

The IETF isn't Bell Labs. And in the 1972 timeframe in the US, character 
encodings for e.g. European languages were not really a concern at all.

> For UNIX v3 we already have a manual page ascii(7):
> 
>    CommitDate: 1973-02-13 15:37:00 -0500
> 
>        Research V3 development
>        Work on file man/man7/ascii.7
> 
>        Co-Authored-By: Dennis Ritchie <dmr at research.uucp>
>        Synthesized-from: v3
>    ---
>     man/man7/ascii.7 | 37 +++++++++++++++++++++++++++++++++++++
>     1 file changed, 37 insertions(+)
> 
>    diff --git a/man/man7/ascii.7 b/man/man7/ascii.7
>    new file mode 100644
>    index 0000000000..ee8839580d
>    --- /dev/null
>    +++ b/man/man7/ascii.7
>    @@ -0,0 +1,37 @@
>    +.pa 1
>    +.he 'ASCII (VII)'6/12/72'ASCII (VII)'
>    +.ti 0
>    +NAME           ascii  --  map of ASCII character set
>    +.sp
>    +.ti 0
>    +SYNOPSIS       c__ /_____________
>    +.sp
>    +.ti 0
>    +DESCRIPTION    a____
>    +is a map of the ASCII character set, to be printed as needed.
>    +It contains:
>    +.in -16
>    +.nf
>    +
>    +|000 nul|001 soh|002 stx|003 etx|004 eot|005 enq|006 ack|007 bel|
>    +|010 bs |011 ht |012 nl |013 vt |014 np |015 cr |016 so |017 si |
>    +|020 dle|021 dc1|022 dc2|023 dc3|024 dc4|025 nak|026 syn|027 etb|
>    +|030 can|031 em |032 sub|033 esc|034 fs |035 gs |036 rs |037 us |
>    +|040 sp |041  ! |042  " |043  # |044  $ |045  % |046  & |047  ' |
>    +|050  ( |051  ) |052  * |053  + |054  , |055  - |056  . |057  / |
>    +|060  0 |061  1 |062  2 |063  3 |064  4 |065  5 |066  6 |067  7 |
>    +|070  8 |071  9 |072  : |073  ; |074  < |075  = |076  > |077  ? |
>    +|100  @ |101  A |102  B |103  C |104  D |105  E |106  F |107  G |
>    +|110  H |111  I |112  J |113  K |114  L |115  M |116  N |117  O |
>    +|120  P |121  Q |122  R |123  S |124  T |125  U |126  V |127  W |
>    +|130  X |131  Y |132  Z |133  [ |134  \\ |135  ] |136  ^ |137  _ |
>    +|140  ` |141  a |142  b |143  c |144  d |145  e |146  f |147  g |
>    +|150  h |151  i |152  j |153  k |154  l |155  m |156  n |157  o |
>    +|160  p |161  q |162  r |163  s |164  t |165  u |166  v |167  w |
>    +|170  x |171  y |172  z |173  { |174  | |175  } |176  ~ |177 del|
>    +
>    +.fi
>    +.in +16
>    +.sp
>    +.ti 0
>    +FILES          found in /usr/pub
> 
> This is after Vinton Cerf's RFC 20 from 1969-10-16.
> But shows the deep penetration of the standard and the name.

In 1973, yes. Later, the name penetrated even deeper, into areas where, 
strictly speaking, it was not appropriate. And that led to the 
sentence in RFC 2046.

> So letting aside that "US-ASCII" appears twice in the current DB
> variant, as a name and an alias, which is outstanding in this DB,
> and surely an editor error.

There are quite a few entries where a name appears as an alias. If it 
doesn't help, it definitely doesn't hurt either.

> The name ASCII is well-known and cannot be deteriorated by some
> nitpicking in a MIME related RFC that only a rather small set of
> people in some dark corner of the internet know about.
> 
>   |For data labeled with charset=ASCII, the correct interpretation is to
>   |ignore the charset parameter because of an undefined parameter value.
> 
> See -- in my eyes this has no relation whatsoever with application
> code in the wild, as at least on all UNIX systems like Linux etc
> which use the iconv(3) series of functions, as well as any
> scripting language which uses that under the hood, you call
> iconv_open(charset-1, charset-2), and iconv does everything for
> you.  This is an unfortunate side-effect of the closed-box iconv
> interface, which is capable of everything under the hood, but does
> not expose this to users.  (Ie name normalization.)
> 
> Anyhow iconv(3) will, in reality, treat this ASCII as US-ASCII,
> unless it uses automatic synchronization with the IANA DB.
> It seems the most widely used implementations do not do that,
> luckily.  And should not.
> 
>   |The implementation would then fall back to the default, which in case of
>   |email and text/plain is "US-ASCII". The overall result is the same as an
>   |"ASCII" alias in the registry.
> 
> I would *instead* think that most software implements some kind of
> name normalization, and silently treat ASCII as the MIME name
> US-ASCII.

Please note that software is free to implement things however it wants, 
as long as the results are the same. And they are the same here.
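
To make the equivalence concrete, here is a small Python sketch of the 
two approaches for text/plain in email. Everything in it (the function 
names, the three-entry stand-in for the registry, the one-entry local 
alias table) is hypothetical and only for illustration; it is not taken 
from any actual mail client.

```python
# Tiny stand-in for the registry; the real one is far larger.
REGISTERED = {"US-ASCII", "ISO-8859-1", "UTF-8"}

# Default for text/plain in email when no usable charset is given.
DEFAULT_CHARSET = "US-ASCII"

# Hypothetical local alias table, as some software might keep one.
LOCAL_ALIASES = {"ASCII": "US-ASCII"}


def effective_charset_fallback(param):
    """Ignore an undefined charset value and fall back to the default."""
    if param is None:
        return DEFAULT_CHARSET
    label = param.strip().upper()
    return label if label in REGISTERED else DEFAULT_CHARSET


def effective_charset_alias(param):
    """Normalize via the local alias table first, then do the same lookup."""
    if param is None:
        return DEFAULT_CHARSET
    label = param.strip().upper()
    label = LOCAL_ALIASES.get(label, label)
    return label if label in REGISTERED else DEFAULT_CHARSET


# Both approaches end up with the same charset for every input tried here.
for value in ("ASCII", "ascii", "US-ASCII", "utf-8", None):
    assert effective_charset_fallback(value) == effective_charset_alias(value)
    print(repr(value), "->", effective_charset_fallback(value))
```

For charset=ASCII, the first function gets to "US-ASCII" by ignoring the 
parameter, the second by treating "ASCII" as a local alias. The data is 
decoded the same way in both cases.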

> ..so, for example number one terminal mutt(1) does exactly that,
> it funnily misses the notorious :1986 variant that the MIME RFC
> points to so deliberately.. but it knows 1968.
> Black panther fist here.
> 
> I am stunned to detect that number two, alpine, University of
> Washington (in the past, but now "a fellow in private") does it
> differently, and only maps from ASCII to US-ASCII for outcome of
> the function nl_langinfo().  It seems to have been rewritten for
> a very late version (the last maybe even before becoming
> a private project) to be UTF-8 centric, says the changelog.
> 
> But again, if it would be only about MIME, i would be fine, and
> end just with this rhyme.

Well, it's indeed (mostly) only about MIME.

Happy holidays to everybody,   Martin.

> --steffen