[gnso-rds-pdp-wg] Five models of RDS (was Re: Apologies, and some reflections on requirements)

Andrew Sullivan ajs at anvilwalrusden.com
Thu Jun 30 16:34:58 UTC 2016


Hi,

Reading further in the thread, I realise that perhaps not everything
that I said was perfectly clear.  I'll respond to some other mails
downthread, but before I do that I want to make sure we're all talking
about the same thing.  Some of this explanation is in the background
material we have, but its relation to what I'm talking about obviously
isn't.  So here's some more explanation.  This may be
a little tedious for those already familiar with the history, but it
seems better to lay this out in more detail so that it's clear what
we're talking about.

When I respond to other mails, I'll refer to these "Model I" through
"Model V" descriptions, because these models are what I was thinking
about when I wrote my earlier note.

On Fri, Jun 24, 2016 at 10:03:33PM -0400, Andrew Sullivan wrote:

> Since the very introduction of the competitive-registrar model (and
> arguably before that), the RDS has been a distributed database.  It is
> far less successful than the other distributed database we all know
> and love -- DNS -- but it is nevertheless distributed.

I have a feeling that people have conflated the "distributed" and
"federated" discussion with some other points I was trying to make
(and indeed, I'm not sure I was totally clear).  So let me try to walk
through a bunch of different ways that any RDS can work.

I know that some people find diagrams easier, so I've tried to create
some.  I'm sorry I'm so bad at diagrams.  Since email is a lousy way
to inline diagrams, I'll make some references to some external
diagrams I've shared.

MODEL I

Consider
<https://docs.google.com/drawings/d/1KCLGP8pClFTLoGp1uqVF-ao80NRP3QAxkkies5h3NDQ/edit?usp=sharing>.
This is an approximate picture of the very earliest registration and
RD systems.  You registered a name with the NIC.  The NIC also
prepared the NICNAME directory, which was originally literally a
document, published on paper.  Oral histories tell me that the very
earliest version of the "store" was a piece of paper kept in Jon
Postel's pocket, but how true that is I don't know (I was certainly
not involved at the time, since I was in senior elementary school and
didn't have a connection to the Internet).

There appear to have been small iterations on this model, but the
central features are that, for any given registry (or registry
operator, in fact), there is a single point of registration and a
single source of data for any registration data service.  I've called this
Model I.  To be clear, in this model there _could_ be multiple data
sources for all registration data, because (for example) some country
code TLDs ran their registries separately using this model.  In that
iteration of the model, whois clients had built-in lists of whois
servers to consult.

MODEL II

The big change in the early whois evolution was the addition of rwhois
and friends, which corresponded to the period in which registrars were
added to the registration systems.  I drew a diagram at
<https://docs.google.com/drawings/d/1keKN1__qoboMQ2vEmsY6bnnUpLktQmF1GtjJ8hrMSoI/edit?usp=sharing>.
This picture is the "thin registry" model.  I've called it Model II.
There are several features to note here.

First, notice that the registration path involves a convolution.  The
registrant sends data to the registrar.  The registrar has a bunch of
needs for data, some of which are necessary for the business processes
of the registrar and some of which are necessary for the functioning
of the domain name.  The registrar stores some or all of that data in
a local registrar store.  Also, such data as is needed for the
registry is either passed through to the registry (the dotted line) or
else copied from the registrar store to the registry (the solid line),
using whatever protocol the registry used (some sort of
registry-registrar protocol, or else EPP).  The registry then also
stores some data -- the data for registration (which is usually just
the name, the registrar, and the name servers and necessary glue if
provided).

Second, in order to support this mode of working, whois had to change.
The naïve use of a whois query (whois $domainname) required some
adjustments.  A client needs to know which server to start querying
for a given domain name.  (In general, even today, this is a static
list compiled into the whois client, though many clients allow you to
specify an alternate.  In the earliest days, this feature was not
widely deployed, which is how so much stale data "leaked out" --
people would query the wrong server, and get its answer.)  When the
registry whois server replied, it would provide a referral to the
registrar's whois.  Then the client would additionally query the
registrar's whois, and get data like the name and contact of the
registrant and so on.  The client would assemble this somehow, and
then present it to the user.  Note that this last step might not be
using the whois protocol -- in particular, most of the "web whois"
systems are basically a user interface via http(s) to a port 43 whois
client.
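The flow above can be sketched in code.  This is a minimal
illustration, not any real client: the server names and response text
are invented, and the transport is passed in as a plain function so
the two-step lookup can be shown (and exercised) without generating
real port 43 traffic.

```python
# Sketch of the Model II lookup flow: ask the registry's whois server,
# follow the referral it returns, then ask the registrar's whois
# server and assemble the two answers.  Server names and response
# fields here are made up for illustration.

def whois_query(server, name):
    """Real transport per RFC 3912: open TCP to port 43, send the
    query line, read until the server closes the connection."""
    import socket
    with socket.create_connection((server, 43), timeout=10) as sock:
        sock.sendall(name.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")

def model_ii_lookup(name, registry_server, query=whois_query):
    """Two-step lookup: registry first, then the registrar it refers to."""
    registry_answer = query(registry_server, name)
    referral = None
    for line in registry_answer.splitlines():
        # The referral is just another line of text that the client
        # must recognise by convention -- there is no data format.
        if line.lower().startswith("registrar whois server:"):
            referral = line.split(":", 1)[1].strip()
            break
    if referral is None:
        return registry_answer
    registrar_answer = query(referral, name)
    # "Assemble this somehow": here, naive concatenation.
    return registry_answer + "\n" + registrar_answer
```

Injecting the transport also makes the "assembly" step visible: the
client, not either server, is responsible for producing the combined
answer the user sees.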

As an aside, it's worth noting that the "referral" part of this system
didn't work that well, mostly because of the whois protocol.  Whois
was designed to be read by humans, so the protocol is dead stupid:
connect to port 43, send a string, and receive a bunch of strings
back.  There is no data format whatsoever, and therefore when you
wanted to communicate something machine-readable in the data that came
back (like, for instance, "here's a referral"), there was no reliable
way to do it.  Instead, the client had to parse the response and
figure out which parts were supposed to be instructions to the client
instead of something the human should see.  This is perhaps the
least-reliable way to make a protocol ever, and it didn't work very
well.  The many efforts to standardise whois output formats within
ICANN have all been gross hacks on top of this basic problem: if you
make the output format sufficiently consistent, machines get better at
processing that output.  (This is why rwhois got better over time --
see below.)
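To make the parsing problem concrete, here is a toy example of the
kind of heuristic a client accumulates.  The labels below are invented
stand-ins for the many real variants; the point is that recognising a
referral means guessing at free-form text.

```python
# Toy illustration of the parsing problem: whois output has no data
# format, so the same fact appears under different labels (and
# layouts) depending on the server.  Clients end up with piles of
# per-format heuristics like this; the labels are invented examples.

REFERRAL_LABELS = (
    "registrar whois server:",   # one style
    "whois server:",             # another
    "refer:",                    # yet another
)

def find_referral(response):
    for line in response.splitlines():
        stripped = line.strip().lower()
        for label in REFERRAL_LABELS:
            if stripped.startswith(label):
                return line.strip()[len(label):].strip()
    return None  # no recognisable referral -- the client just gives up
```

Every server whose output doesn't match an existing heuristic silently
falls into the `None` case, which is exactly the unreliability
described above.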

Third, because a client can initiate a query directly against a whois
server, the client could ask a registrar that is no longer the
registrar of record for a name about that name.  This can result in "wrong"
data, because the registrar might have lost the registration in the
past and have old data hanging around.  This was at one time a major
problem, and an awful lot of policy that exists today is quite
obviously an attempt to solve this problem (which is as a matter of
fact no longer really that interesting -- the technical reality has
made some of that policy obsolete, but people continue to insist on
solving a problem they had in 1999).

Note that this model is the beginning of what I was describing as a
"distributed system".  Data comes from multiple sources, and it is
assembled by the end client into a whole that answers the initial
query.

Finally, note that I've drawn this with common data stores for
registration and whois at the registry and registrar.  In
implementation, the actual database systems that the whois servers use
might be different than the one the registration servers use, and
there might even be some transformation (one system I worked on, for
instance, precompiled whois answers so that the whois server went
faster).  But the point is that the data store is operated by the same
operator as the registration database (i.e. the operator of the RDS is
the operator of the server that accepts the incoming data).

MODEL III

Partly in response to the unreliability of rwhois, and partly I
suspect because people didn't really trust registrars to do their job,
many registries (some for contractual reasons) adopted a "thick whois"
model.  This is just a patch on top of Model II:
<https://docs.google.com/drawings/d/183F855BODDVIt0IriSGU35EyoaLYpLQMUxAk4RsNo-Y/edit?usp=sharing>.
The basic system works exactly the same way, but when the client
queries the registry whois server it gets a complete response instead
of having to ask the registrar for more data.  For reasons I never
fully understood, registrars still had to maintain a whois server for
these registries, so it remains possible to ask a registrar about a
domain name (the orange lines in the diagram).  Just as in Model II,
the registrar will respond with what it has, which might not be
correct.  But you're far less likely to ask the registrar unless
you're deliberately querying the wrong place.

This is the model that most contracted registries are using today.  It
is slightly less distributed than Model II because the registries
provide the data from a central point for that registry.  It's still
distributed in that each registry answers for its area of authority.

MODEL IV

I've illustrated the basic approach that I think most of us involved
in specifying RDAP had in mind in
<https://docs.google.com/drawings/d/1HftBWzxA4DGwa0-PMCrpxXkK0Nwpm8ZS8Rsql6VtkJQ/edit?usp=sharing>.
This is fundamentally a modification of Model II, in that it is a
distributed query and distributed store system.  RDAP does some things
differently, however.

First, its output is JSON documents, which can be used reliably in
multiple different ways (including being parsed by a machine or
formatted for presentation to humans).
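For contrast with the whois scraping above, here is a cut-down,
illustrative RDAP domain response in the general shape the RDAP JSON
specification defines (a real answer carries many more members, and
the values here are invented).  Because it's JSON, a client extracts
fields by key instead of guessing at text.

```python
import json

# Cut-down, illustrative RDAP domain response.  The structure follows
# the RDAP JSON response format in broad strokes; the values are
# invented for this example.
sample = json.loads("""
{
  "objectClassName": "domain",
  "ldhName": "example.tld",
  "status": ["active"],
  "events": [
    {"eventAction": "registration", "eventDate": "1999-06-01T00:00:00Z"}
  ]
}
""")

def registration_date(rdap_domain):
    """Pull the registration event date out by key -- no scraping."""
    for event in rdap_domain.get("events", []):
        if event.get("eventAction") == "registration":
            return event.get("eventDate")
    return None
```

The same document can be rendered for a human or consumed by a
machine, which is the "multiple different ways" point above.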

Second, there's a bootstrap mechanism in which, to get started, you
query IANA.  Note that some whois clients had already started to do
this for Model II, but it was nowhere specified.  Now it's just part
of how things work.
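The bootstrap step looks roughly like this.  IANA publishes a
bootstrap registry for DNS (fetched over https from
data.iana.org/rdap/dns.json); a client downloads it once and maps each
queried name's TLD to an RDAP base URL.  The registry contents below
are invented for illustration; only the "services" shape reflects the
real format.

```python
# Sketch of RDAP bootstrapping: map a domain's TLD to an RDAP base
# URL using IANA's bootstrap registry.  The entries below are made up;
# a real client would fetch and cache the actual registry from IANA.

BOOTSTRAP = {
    "services": [
        [["tld"],        ["https://rdap.registry.example/rdap/"]],
        [["aaa", "bbb"], ["https://rdap.other.example/"]],
    ]
}

def rdap_base_url(domain, bootstrap=BOOTSTRAP):
    tld = domain.rstrip(".").rsplit(".", 1)[-1].lower()
    for tlds, urls in bootstrap["services"]:
        if tld in tlds:
            return urls[0]
    return None  # no RDAP service known for this TLD

# The actual query is then plain https, e.g.:
#   GET <base-url>domain/example.tld
```

Compare this with the static server lists compiled into whois clients:
the mapping is now published data, not client-side configuration.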

Finally, RDAP has built in the idea that the client could be
authenticated.  This automatically means that different responses can
be returned depending on the credentials the client presents.

Note that the protocol for this is all https, so there's no more port
43 traffic arising from this.  Note also that you can do the same sort
of thing with a "thick registry" model -- the registry doesn't send a
referral in that case.  This model is exactly as distributed as Model
II or III, depending on which approach we take.

MODEL V

Model V is a little harder to illustrate, because it's not clear
(especially in our discussions) what protocols it'd use.
Nevertheless, I've sketched it at
<https://docs.google.com/drawings/d/1c3G3guMO7-IFtm-D1FIdEXXP9_ww8Op4Ul9N1ypTG2U/edit?usp=sharing>.
The key thing here is to notice that the model takes a bunch of data
from disparate sources, federates it (somehow) into a single data
store, and then offers the federated RDS to clients that query from
the Internet.  There are some interesting consequences of this.

First, note that this is the only model in which the answers to an RDS
query come from a party that is not directly responsible for
collecting some data.  Model III went partway toward that by having
registries hold data that is really only relevant to registrars.
Model V takes this all the way to its conclusion, in that the data
store backing the RDS is operated by someone who doesn't collect any
of that data from the original source.  

Second, in order to make this happen, a federation process of some
kind needs to be designed, written, and tested.
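One way to picture that federation process -- purely hypothetical,
since no such design exists -- is a merge of record feeds from many
sources into one central store, with some conflict rule.  Everything
in this sketch (record shape, "most recently updated wins") is an
assumption, and choosing such rules is itself part of the design work
mentioned above.

```python
# Hypothetical sketch of the Model V federation step: take record
# feeds from many sources (registries, registrars) and merge them into
# one central store, keyed by domain name.  Conflicts between sources
# are resolved here by "most recently updated wins" -- one of the many
# policy questions a real federation process would have to settle.

def federate(feeds):
    """feeds: iterable of lists of records; each record is a dict
    with at least 'domain' and 'updated' (ISO 8601 string)."""
    store = {}
    for feed in feeds:
        for record in feed:
            key = record["domain"].lower()
            current = store.get(key)
            # ISO 8601 timestamps compare correctly as strings.
            if current is None or record["updated"] > current["updated"]:
                store[key] = record
    return store
```

Note that once this runs, queries are answered from `store` alone:
the party operating it never collected any of the underlying data,
which is the defining feature of this model.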

Presumably the RDS could use RDAP or whois for its protocols, or we
could specify that we need something else.  I believe that the IETF
will only support RDAP for this use, so if we decide to specify that
some other protocol is needed then I guess we'll have to find someone
to develop that specification before we can proceed.

Finally, this model is as monolithic as the system can get: it
basically re-invents the Whois model of the pre-registrar period, when
the data in the WHOIS was small enough that it could be put into a
mimeographed booklet.  

I hope this is helpful.

A

-- 
Andrew Sullivan
ajs at anvilwalrusden.com


