[gnso-rds-pdp-wg] Apologies, and some reflections on requirements

Sat Jun 25 02:03:33 UTC 2016

Dear colleagues,

Apologies first.  I'm not going to be in Helsinki.  I'm in the middle
of a move from NH back to Toronto, and it turns out that my movers'
understanding of, "I need to leave on $date," entails arranging things
such that goods will arrive after $date.  Alas, in this case the goods
arrive Monday.  I will attempt to follow the ICANN meetings remotely
next week, but I expect it will be tricky.

I have been deeply dissatisfied with the way the work is going, and I
believe it is because I see a mismatch between what we are trying to do and
the kind of system we are trying to do it to.  In particular, I think
we are trying to treat the RDS as a single monolithic system, and
attempting to build "requirements" that match that assumption.  Here
is an effort to sketch why I think that.  I didn't have time to write
a short note, &c. &c.  Sorry this is long.

Since the very introduction of the competitive-registrar model (and
arguably before that), the RDS has been a distributed database.  It is
far less successful than the other distributed database we all know
and love -- DNS -- but it is nevertheless distributed.

The distribution comes from different parties having various parts of
the data.  In so-called "thin" registries, this was always the case.
The registry has names and nameservers and, since the invention of
registrars, knows who the registrar is.  But if you wanted to know
certain kinds of data, you had to ask the registrar in question.
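
To make the thin-registry split concrete, here is a minimal sketch.  It
assumes a toy in-memory model: the domain, the registrar identifier, and
every record below are made up for illustration and do not describe any
real registry's data.  The point is only that a complete answer needs two
lookups against two different holders of data.

```python
# Toy model of a "thin" registry; every name and record here is made up.
REGISTRY = {
    "example.com": {
        "nameservers": ["ns1.example.net", "ns2.example.net"],
        "registrar": "registrar-a",
    },
}

# Each registrar holds the registrant data for the names it sponsors.
REGISTRARS = {
    "registrar-a": {
        "example.com": {"registrant": "Example Registrant", "country": "CA"},
    },
}


def lookup(domain):
    """Answer a query by asking the registry, then the sponsoring registrar."""
    thin = REGISTRY[domain]  # what the registry itself knows
    held_by_registrar = REGISTRARS[thin["registrar"]][domain]
    return {**thin, **held_by_registrar}  # combined view for the asker


record = lookup("example.com")
print(record["registrar"], record["registrant"])
```

Neither dictionary alone can answer the whole question, which is the
sense in which the data is distributed.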

Because in (say) 1999-2001 nobody had anything better than the
whois/rwhois/whois++ protocol(s) to deliver this kind of data, a whole
bunch of bad compromises got enshrined in policy.  First, we continued
to use whois and its descendants (anything on port 43) as the model
for all of this.  The plain fact is that whois was obsolete nearly at
birth.  It's a terrible protocol, and should be taken behind the ice
house and put out of its misery.

Second, in order to "fix up" whois, clients were created all over the
Internet that built in a bunch of assumptions about whom to ask for
what data.  The consequence of this was that clients routinely got bad
data as they queried the wrong server.  Old registrar data hung around
even after a transfer.  When I worked on the org transition from
Verisign to PIR in 2003 (?), it took a long time before whois clients
stopped asking Verisign about org data.  And so on.

Third, in an attempt to hack around the above technical flaws in an
already-obsolete protocol, "thick whois" gained popularity in possibly
the worst possible arrangement known to data science.  Instead of
insisting that registries hold the data and that registrars and
everyone else treat the registry data as The Truth, we created "thick"
whois in registries _without allowing registrars to stop their
service_.  Any half-competent database person will tell you that
storing "the same data" in two places that don't have tight
connections is an excellent way to create data inconsistency, but is
not a good way to arrive at the truth.  (Latterly, as though
illustrating the tendency of people to double down on bad ideas, there
have been suggestions that ICANN should run the One Giant RDS of the
Universe and hold all the data in a central place.  What could
possibly go wrong?)
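
The inconsistency problem can be sketched in a few lines.  This is an
illustration under stated assumptions, not a model of any real registry
or registrar: both "copies" are plain dictionaries, the domain and email
addresses are invented, and the missing synchronization step is the
whole point.

```python
# Two independent copies of "the same" contact record; all values are made up.
registrar_copy = {"example.org": {"email": "old@example.org"}}
registry_copy = {"example.org": {"email": "old@example.org"}}


def update_at_registrar(domain, email):
    """The registrant updates the registrar; nothing pushes it to the registry."""
    registrar_copy[domain]["email"] = email


def disagreements():
    """Return the domains on which the two copies no longer agree."""
    return sorted(
        d for d in registrar_copy
        if registrar_copy[d] != registry_copy.get(d)
    )


update_at_registrar("example.org", "new@example.org")
print(disagreements())
```

After one update the two services give different answers for the same
name, and a client has no way to tell which one is right.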

The thread running through this history of error is the idea that the
RDS is one system.  But like the DNS, it only appears to be one
system.  It's actually a "distributed database", where in this case
the distribution is separable on organization lines.  That is,
registries -- including ICANN, who can be thought of in this case as
both the registry and registrar for the root zone -- have some data.
Registrars have some other data.  Resellers and privacy/proxy services
have yet other data.  In many cases, the data does not need to be
shared across these organizational lines to make it queryable by humans.

The reason that isn't clear to most of us is that whois -- the RDS
we use today -- _was_ designed as a monolithic system.  It was
designed that way because back when it was created -- RFC 812 is from
_1982_! -- the database _was_ a monolithic database.  Whois (the
protocol and the client program) continues to have all the
deficiencies for distributed use that you might expect of a program or
protocol designed to talk to exactly one authoritative service.
Whois++ and rwhois attempted to graft on to this basic protocol some
distributed operation, but the graft didn't really take and the
ornamental shrub now looks like a weed.

People have nevertheless internalized the whois-based thinking, which
is why we keep asking things like, "What data should be collected?"
In a distributed system like this, that's barely interesting, for the
commercial interests in this case all militate against collecting data
that nobody needs for any function.  Instead, we should ask what data
should be collected _by different actors_.  This implicitly involves
describing what those actors are doing to require the data.

The nice thing, of course, is that protocol designers have done _a
lot_ of this work for us, when they were working on RDAP.  They did
this because they were trying to come up with use cases for the
protocol, which finally did away with the monolithic-system thinking
of whois and offers us a protocol designed precisely to work in the
distributed-database environment that is the actual registration
system.  That we even still have a work step that involves evaluating
what protocol we're going to use for all this makes me a little ill.
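
As a rough illustration of how RDAP accommodates distribution, the
sketch below shows a client working out whom to ask instead of having
that knowledge baked in.  It assumes bootstrap data in the shape
described by RFC 7484 (a list of services, each pairing a list of TLDs
with a list of base URLs) and the "domain/<name>" query path from RFC
7482.  The TLDs and URLs are hypothetical, the function names are mine,
and matching on the last label only is a simplification.

```python
# Hypothetical bootstrap data in the shape of the RDAP bootstrap file
# (RFC 7484): each service is [list of TLDs, list of base URLs].
BOOTSTRAP = {
    "services": [
        [["example"], ["https://rdap.registry.example/"]],
        [["test", "invalid"], ["https://rdap.other-registry.example/"]],
    ]
}


def rdap_base_url(domain, bootstrap=BOOTSTRAP):
    """Find the RDAP base URL listed for the domain's TLD (last label only)."""
    tld = domain.rstrip(".").rsplit(".", 1)[-1].lower()
    for tlds, urls in bootstrap["services"]:
        if tld in tlds:
            return urls[0]  # base URLs carry a trailing "/"
    raise LookupError(f"no RDAP service known for .{tld}")


def rdap_domain_url(domain):
    """Build the domain query URL using the 'domain/<name>' path segment."""
    return rdap_base_url(domain) + "domain/" + domain.rstrip(".").lower()


print(rdap_domain_url("foo.example"))
```

The table of who is responsible for what is data the client fetches, not
an assumption compiled into it, which is the contrast with the old whois
clients described earlier.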

It seems to me that we can just say that we have to embrace the
distributed-database fact.  First, it's a fact of how registration
actually works now.  If we don't agree with that, I think we should
give up.  Second, it's consistent with how every single other thing on
the Internet that has not crashed and burned works.  The Internet
cannot scale depending on monolithic systems.  And nobody has the
power to impose one anyway.

Once we have done that, there are still important policy issues about
what data ought to be collected by anyone, under what conditions they
might reveal it to someone else (and who that someone else is), and so
on.  But there are empirical tests for whether some of the answers
people are proposing really match the distributed nature of the
system.  If they don't, we can close off those avenues of inquiry,
because they'll never be productive.

Best regards,

A

-- 
Andrew Sullivan
ajs at anvilwalrusden.com