[ietf-charsets] [art] Fwd: [IANA #1297322] IANA characters-sets US-ASCII entry incorrect

Tue Dec 26 02:06:29 UTC 2023

Oh!  This is an interesting coincidence across so many thousands
of kilometres.

Martin J. Dürst wrote in
 <b7b75ef4-3e3d-47c8-bdb5-c6d3d4376042 at it.aoyama.ac.jp>:
 |On 2023-12-21 04:54, Steffen Nurpmeso wrote:
 |> [i take iana completely off]
 |
 |For the record, the ietf-charsets at iana.org mailing list setting has been 
 |changed so that mail from non-subscribers gets moderated, not just 
 |discarded. IANA will see how long they can do that (depends on the 
 |amount of spam they will get).

I would not have thought this thread would become so long.
I am not a character set expert -- in fact this year i have read
a Unicode "minutes" document, and it was a terribly dense document
written for people with a tremendous background of tracked items
and discussions in forums and meetings, none of which i took
part in.
(The full truth is that i personally have always been jealous of
people who can speak and write many languages, those cultural
achievements, especially so if that covers Asian languages, where
mature minds express art and being in calligraphic craftsmanship.
But this is off-topic, of course.)

She did contact me thereafter, too.
I was only a bit distressed because any email to iana at iana.org
opened a new ticket, if i understood that correctly.

  ...
 |>|<charset reviewer hat on>
 |>|I'm of course ready to reevaluate this and adding this label back in if
 |>|anybody is able to come up with really strong and convincing arguments
 |>|to do so.
 |>|<charset reviewer hat off>
 |> 
 |> So i would ask to please do this.
 |
 |To put the conclusion first, I'm sorry, but you haven't convinced me. As 
 |John Klensin puts it in a very recent mail, you would have to create an 
 |internet-draft and gain IETF consensus on it to update RFC 2046.

But, i am still astonished, isn't the IANA character set database
a database of character sets as such?  My 2011 version reads

  These are the official names for character sets that may be used
  in the Internet and may be referred to in Internet
  documentation.  These names are expressed in ANSI_X3.4-1968
  which is commonly called US-ASCII or simply ASCII.

It is totally understandable if the "preferred MIME name" is set
in the spirit of RFC 2046, which defines MIME (media types).
But the "name" should refer to a "real" standard, and a given
Name: should not have an identical Alias:.  At least no other
entry has one.  (In the meantime, however, i saw that ISO-8859-10
also does not use the real standard name as Name:, but the
overwhelming majority of entries still do.)

 |> The term "ASCII" is in use for longer than my lifespan, and
 |> i cannot understand why a paragraph in some MIME-related RFC
 |> overrules a generic, omnipresent standard that was established in
 |> 1968.
 |
 |[Copying your last sentence here, because this simplifies my answer.]
 |
 |> But again, if it would be only about MIME, i would be fine, and
 |> end just with this rhyme.
 |
 |My understanding is that this is indeed only or mostly about MIME. I 
 |included 'mostly', because the registry also covers MIB, which is 
 |certainly not MIME, and there may be other IETF or external standards 
 |referring to it.
 |
 |This is different from other standards groups or software (OSes, 
 |libraries, languages, and up).
 |
 |As an example, you mention the character encoding conversion library 
 |iconv below. One version on one of my systems (iconv (Ubuntu GLIBC 
 |2.35-0ubuntu3.1) 2.35), with `iconv -l`, produces a long list of overall 
 |1179 labels for encodings. That includes a lot of (what is easy to guess 
 |are) aliases such as 8859_1, ISO-8859-1*, ISO8859-1, ISO88591, 
 |ISO_8859-1*, ISO-8859-1:1987*. Only the ones with a * (plus some others 
 |that I'm sure iconv also listed but didn't have the time to check) are 
 |in the IANA registry.

Yes, there seem to have been a lot of variants in the iconv
implementations in the wild; by hearsay Solaris was especially
known for this.  The ISO variants without separators were BSD
specific ~twenty+ years ago, if i recall this correctly.
So users want to see the character set name they know.
(However, in practice, and internally, character set names are
normalized in order to reduce the set really in use.  I did that,
and i think in 2011 i was briefly a bystander when the python3
language, then not yet completed, discussed the normalization
of character set names.  And when that variable python string type
was implemented.  Maybe they are listening here: Martin von Löwis
implemented this string, Victor Stinner was talking about the
normalization algorithm.  It was a bit different, but still...
Anyhow iso8859-1, iso_8859-1, iso-8859-1, it all boils down to
"iso 8859 1".)
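
The normalization described above could be sketched roughly as
follows.  This is only an illustration under the assumption that
case, punctuation and letter/digit boundaries are the only
differences to remove; it is neither the python3 algorithm nor the
one used in any particular MUA, and the function name
normalize_charset_name is made up for this example.

```python
import re


def normalize_charset_name(name: str) -> str:
    """Lower-case the name and keep only its letter runs and digit runs."""
    # Punctuation such as "-", "_" or ":" only separates tokens and is dropped;
    # a letter/digit boundary also starts a new token ("iso8859" -> "iso", "8859").
    tokens = re.findall(r"[a-z]+|[0-9]+", name.lower())
    return " ".join(tokens)


if __name__ == "__main__":
    for label in ("iso8859-1", "iso_8859-1", "ISO-8859-1"):
        print(label, "->", normalize_charset_name(label))
```

A label without any separator between the digits, such as ISO88591,
would come out as "iso 88591" with this sketch, so a real
implementation still needs an alias table on top of it.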

 |Overall, as mentioned above, my iconv version listed 1179 labels, and 
 |the IANA registry lists 887 unique names/preferred MIME names/aliases. 
 |There is quite some overlap, but there are 626 labels listed by iconv 
 |that are not in the IANA registry, and there are 334 aliases in the IANA 
 |registry that are not listed as labels by iconv. I haven't done an 
 |analysis of how many actual encodings these numbers cover. I'd have to 
 |dive into the iconv source code, and I'd rather avoid doing this because 
 |I think it doesn't really contribute much to this discussion.

I would avoid that if i were you, yes.
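
For what it is worth, the overlap figures quoted above are plain set
arithmetic and are consistent with each other: both sides leave 553
shared labels.  A small sketch, assuming labels are compared
case-insensitively (how the quoted numbers were actually produced is
not stated), with a tiny made-up sample rather than real
`iconv -l` output:

```python
def compare_labels(iconv_labels, iana_labels):
    """Split two label collections into shared and one-sided sets."""
    # Assumption: comparison is case-insensitive; nothing else is normalized.
    iconv_set = {label.lower() for label in iconv_labels}
    iana_set = {label.lower() for label in iana_labels}
    return {
        "shared": iconv_set & iana_set,
        "only_iconv": iconv_set - iana_set,
        "only_iana": iana_set - iconv_set,
    }


# Consistency check of the totals quoted in the mail.
ICONV_TOTAL, IANA_TOTAL = 1179, 887
ONLY_ICONV, ONLY_IANA = 626, 334
shared_seen_from_iconv = ICONV_TOTAL - ONLY_ICONV
shared_seen_from_iana = IANA_TOTAL - ONLY_IANA

# Illustrative sample only; "x-example-registry-only" is a hypothetical label.
iconv_sample = ["8859_1", "ISO-8859-1", "ISO8859-1", "ISO88591", "ISO_8859-1"]
iana_sample = ["ISO-8859-1", "ISO_8859-1", "x-example-registry-only"]
result = compare_labels(iconv_sample, iana_sample)

if __name__ == "__main__":
    print(shared_seen_from_iconv, shared_seen_from_iana)
    print(sorted(result["shared"]))
```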

 |> I want to point to standards like POSIX from the OpenGroup, and to
 |> other billions of distributed packages, including manuals, which
 |> refer to the name ASCII as a shorter synonym (and not more) to the
 |> standard ANSI_X3.4-1968, and its further line that goes to
 |> ANSI_X3.4-1986, ISO_646.irv:1991, ISO646-US etc.
 |
 |That's just fine. The IETF cannot and doesn't dictate how other 
 |standards organizations or software producers write their documents.

Sure, but ASCII has an official standard name, and the first such
name was ANSI_X3.4-1968, right?  These names do not come from
nowhere, they are not plain IETF-only inventions?

 |> It is something different if the _IANA character set database_
 |> manages _(preferred) MIME alias names_ in conjunction with
 |> character sets, and it is absolutely fine if some MIME RFC gives
 |> or takes on specific aliases.  But ASCII is plain forever meaning
 |> what it means, since 1968.
 |
 |Apparently not. John Klensin gave a very recent example. At least in the 
 |90ies, "ASCII" must have been used in many ways, or we wouldn't have the 
 |very specific sentence in RFC 2046. I can easily imagine people having 
 |said "this file is ASCII, not EBCDIC", and the reason such a phrase is 
 |rare today is only due to the fact that EBCDIC is rare today.

It says

   NOTE: RFC 821 explicitly specifies "ASCII", and references an earlier
   version of the American Standard.

Whereas they go for the -1986 version.
They refer to "[other] national" character sets derived from "ISO
646", like ISO-2022, which come over 7-bit clean; this is a real
problem in practice, one that only the next release of the MUA
i maintain will address really correctly, despite it being 45
years old, soon 46, half of that time with MIME support.  Sure.
They say

   The defined charset values are:

    (1)   US-ASCII -- as defined in ANSI X3.4-1986 [US-ASCII].

So this is that.  But this is the usage for MIME.
And, as the thread has shown, ASCII was in use already in the
wild, and still is sometimes.  In practice MIME capable mail
programs simply use iconv(3), unless they normalize the name
themselves.  It is normalized here or there, but it is better if
you can identify it yourself, so as to treat it as the plain
US-ASCII of RFC 5322 / RFC 2046.
As iconv does not help you, even though it knows internally, that
is the only way to get it right, right?

 |> Maybe i understand the IANA character set database wrong?
 ...
 |The IETF isn't Bell Labs. And in the 1972 timeframe in the US, character 
 |encodings for e.g. European languages were not really a concern at all.
 ...
 |> This is after Vint Cerf's RFC 20 from 1969-10-16.
 |> But shows the deep penetration of the standard and the name.
 |
 |In 1973, yes. Later, that name penetrated even deeper, into areas where 
 |it was very strictly speaking not appropriate. And that led to the 
 |sentence in RFC 2046.

But it refers to other "national and application-oriented versions
of ISO 646", by no means to the name "ASCII"?

 |> So letting aside that "US-ASCII" appears twice in the current DB
 |> variant, as a name and an alias, which is outstanding in this DB,
 |> and surely an editor error.
 |
 |There are quite a few entries where a name appears as an alias. If it 
 |doesn't help, it definitely doesn't hurt either.

Really?  I found no other duplicate when i last looked.
(In the 2011 data.)
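
A check of this kind could be sketched as below.  It assumes the
plain-text registry layout with one "Name:" line per entry followed
by "Alias:" lines (and "Alias: None" where there are none); the
sample text and its EXAMPLE-* labels are invented for illustration
and are not taken from the registry.

```python
def parse_entries(text):
    """Parse "Name:"/"Alias:" lines into a list of (name, aliases) pairs."""
    entries = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        fields = value.split()
        if not sep or not fields:
            continue
        label = fields[0]  # ignore trailing references such as "[RFC1345]"
        key = key.strip().lower()
        if key == "name":
            entries.append((label, []))
        elif key == "alias" and entries and label.lower() != "none":
            entries[-1][1].append(label)
    return entries


def find_self_aliases(text):
    """Return the names that are repeated among their own aliases."""
    return [
        name
        for name, aliases in parse_entries(text)
        if name.lower() in {alias.lower() for alias in aliases}
    ]


# Hypothetical sample in the assumed layout.
SAMPLE = """\
Name: EXAMPLE-ONE                      [hypothetical]
Alias: example1
Alias: EXAMPLE-ONE

Name: EXAMPLE_TWO:1999
Alias: None
"""

if __name__ == "__main__":
    print(find_self_aliases(SAMPLE))
```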

  ...
 |Please note that software is free to implement stuff the way they want, 
 |as long as the results are the same. And they are the same here.

Hmmm, i do not think you are right.  No, i think you are wrong.
You give the reason with ISO-2022-JP yourself.
If i cannot realize that "ASCII" is indeed "US-ASCII", i have to
use iconv(3) (or a similar mechanism) to convert the data.
If i do so, this conversion as such may fail, depending on the
data.  Not the iconv_open(3), as you surely would now say "if that
fails, just treat it as the default character set" (which i only
should do if the data is 7-bit clean), but an EILSEQ during
conversion.
If i know the data is US-ASCII, possibly no such conversion is
ever performed.  (Depending on whatever operation there is to do.)
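
The control flow argued for above, sketched in Python rather than
with iconv(3).  codecs.lookup() failing stands in for iconv_open(3)
failing, and UnicodeDecodeError stands in for EILSEQ; the label set
is taken from the names mentioned in this thread, and to_text and
its return values are made up for the illustration.

```python
import codecs

# Labels this thread treats as names for the same 7-bit character set.
ASCII_LABELS = {
    "us-ascii", "ascii", "ansi_x3.4-1968", "ansi_x3.4-1986",
    "iso_646.irv:1991", "iso646-us",
}


def is_7bit_clean(data: bytes) -> bool:
    return all(byte < 0x80 for byte in data)


def to_text(label: str, data: bytes):
    """Return (path, text); path tells which branch handled the data."""
    if label.lower() in ASCII_LABELS and is_7bit_clean(data):
        # Known to be US-ASCII: no converter is opened at all.
        return ("passthrough", data.decode("ascii"))
    try:
        codec = codecs.lookup(label)  # analogue of iconv_open(3)
    except LookupError:
        if is_7bit_clean(data):
            # Unknown label: fall back to the default only for 7-bit data.
            return ("fallback", data.decode("ascii"))
        raise
    # May raise UnicodeDecodeError, the analogue of EILSEQ during conversion.
    return ("converted", codec.decode(data)[0])


if __name__ == "__main__":
    print(to_text("ASCII", b"hello"))
    print(to_text("iso-8859-1", b"caf\xe9"))
```

Python's own codec registry happens to know several of these labels
already, so the sketch shows the order of the decisions and not the
behaviour of any particular iconv.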

Ciao,

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)
|
| Only in December: lightful Dubai COP28 Narendra Modi quote:
|  A small part of humanity has ruthlessly exploited nature.
|  But the entire humanity is bearing the cost of it,
|  especially the inhabitants of the Global South.
|  The selfishness of a few will lead the world into darkness,
|  not just for themselves but for the entire world.
|  [Christians might think of Revelation 11:18
|    The nations were angry, and your wrath has come[.]
|    [.]for destroying those who destroy the earth.
|   But i find the above more kind, and much friendlier]