[council] New gTLD "validation" problems...

Thomas Roessler roessler at does-not-exist.org
Thu Aug 14 18:05:46 UTC 2003

I did a little Google search for e-mail validation tools and tricks,
and checked for assumptions on what TLDs look like.  (Searches:
"email validation javascript", "email valid javascript", "email
validation asp", and so on.)

Looking at the results, there is an awful lot of bad -- sometimes
horrible -- code (in JavaScript, VBScript, Perl, and friends) and
regular expressions in circulation.

For a sample dirty dozen, see the link list at the end of this
e-mail.  All of these are from either the Google top ten or top
twenty for some of the searches I did.

The most typical mistakes include assuming that:

- TLDs are 2-N characters long, with N ranging anywhere from 3 to 6.
  (In fact, 3 and 6 seem to be the most common upper bounds assumed;
  perhaps the most absurd case had 4, with a comment explicitly
  referencing .info...)

- It is a good idea to have a hard-coded list of TLDs.  Such lists
  frequently *include* the current set of new gTLDs, so this is
  good news for the current new gTLD operators, and really bad news
  for the next round.

  (In one case, there was at least a comment referencing ICANN and
  the need to update -- but, of course, these JavaScript code
  snippets are the kind of stuff which gets deployed and forgotten,
  so that comment is worthless.)

Remarkably, most of the code I looked at accepted arbitrary
two-letter TLDs; just one (probably not very popular) exception
would only accept ".tv" and ".us".

In general terms, I'd suggest that any advisory the GNSO may
initiate on the topic of acceptance problems with respect to new
TLDs should generally take up the basic theme that the root zone is
a dynamic thing, and that operators and programmers should not make
unwarranted assumptions about what's in there.

Besides the kinds of programming errors mentioned above, this brings
up two more dangerous practices:

1. Downloading a copy of the root zone, installing that on a
   resolver running bind, practically turning that resolver into a
   root server.  If the root zone copy isn't updated regularly,
   things will break -- not just when new TLDs are added, but also
   when existing TLDs migrate to different servers.  (What's the
   transition plan for .org, again?) I have no idea how common this
   kind of setup actually is.
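The kind of setup I mean looks roughly like this in a BIND
named.conf -- a sketch of the dangerous practice, not a
recommendation (the master address is one of the root servers that
has historically permitted zone transfers):

```
// named.conf fragment: slaving the root zone locally.
// If this copy stops being refreshed, the resolver keeps answering
// from stale data -- old NS sets for migrated TLDs, and no new TLDs
// at all.
zone "." {
        type slave;
        masters { 192.5.5.241; };   // f.root-servers.net
        file "root.zone";
};
```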

2. Using fake TLDs for local networks.  It's not uncommon to just
   use a random, unused TLD for machines on an intranet; these host
   names aren't supposed to be seen on the Internet.  Of course,
   it's extremely easy to screw up this kind of setup, and to
   inadvertently create a "local" collision with a future TLD.
   Fixing setups like this might get quite costly.
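A trivial (hypothetical) example of how such a collision arises --
nothing more than a hosts file or an internal zone using a made-up
TLD:

```
# /etc/hosts on an intranet machine, using an invented TLD:
10.0.0.5    fileserver.corp

# If ".corp" is ever delegated in the root, names under the real
# .corp will silently resolve to local machines here, and mail or
# web traffic for them may never leave the intranet.
```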

At the same time, all this indicates that the "visibility" problems
for new gTLDs will persist for quite some time.

The dirty dozen address validators:


Thomas Roessler			      <roessler at does-not-exist.org>
