[gnso-rds-pdp-wg] Some reasoning about non-contact-data (was Re: key concepts: say "contact data" when that is what we mean)

Wed Dec 7 16:13:21 UTC 2016

Hi,

First, let me start by agreeing very strongly with Greg that we can
make some big gains by distinguishing between what he calls "thin
data" (and what I think of as "name data" -- that is, data about the
name _as such_) and "contact data" (what I think of as "registration
data" -- that is, data about the registration: who did it, whom you
can contact in the event of trouble, how to do so, and so on).

In the interests of pushing forward along those lines, I'd like to
take the position in this mail that the first class of these is one
"tranche", and that each field in it can be considered on its own.
Below I consider each such field and the arguments for and against
completely unauthenticated, public access to it.  I'm not actually
sure I agree with Greg that it is not PII and noncontroversial, but I
certainly agree that it is _less_ PII and way less controversial.

On Wed, Dec 07, 2016 at 02:55:00PM +0000, Greg Aaron wrote:

> called the THIN DATA.  This is the basic data about a domain name
> registration:

> the domain name,

For an RDS query about a domain name, this is the primary key by which
the data must be fetched.  Therefore, this is a necessary condition
for an RDS at all.  It must be included.

Consequently, if someone disagrees that this is required data, that is
someone who thinks we should not have an RDS.  We should then have the
discussion about whether we should have an RDS at all.  

> the sponsoring registrar name

For an RDS query about a domain name, this data is helpful to humans
who are trying to track down the registration of the domain name.
Since the point of an RDS is to allow someone who needs certain data
about a registration to find that data, there may be an argument that
being able to find out the source of the registration could be
important.  So, that is a reason to include the data.

In addition, in a distributed system (so, for instance, if we reverted
to thin registries, which a technology like RDAP makes easy), it is
necessary to get a referral to the _authoritative_ source of the data,
and since the data actually comes from registrars rather than
registries, getting the sponsoring registrar is needed.  (Whether the
name is what's necessary for that is a different question; see below.)

One could argue that this data should not be included because it is
extraneous and could be looked up another way.  One could argue that
this data should not be included because it gives those who wish to do
unauthorized transfers additional information in service of that
transfer.  Registrars could argue that they don't want their domains
under management leaked (because this would allow people to harvest
numbers and profile registrar operations).

> and ID,

The arguments for and against here are the same as for the registrar
name, except for the human consumption part.  This ID is much
preferable for automatic handling of the data.

> the domain's status(es) ,

One could argue that this data needs to be public because one needs to
know whether a name ought to be working on the Internet.

One could argue that this data should not be included because most of
it is not directly relevant to whether a domain name ought to be
working.  (For instance, whether an update is pending is not
necessarily relevant to whether a name ought to resolve on the
Internet right now.)  Moreover, one could argue that at least some
status values radiate information about what a registrant may have
done, and also potentially support attempts to game the registration
system to obtain a domain name contrary to the interests of the
previous registrant.

> created- [date]

One can argue that this data needs to be public in order that one
can understand whether the domain name one is querying about is in
fact the name that is registered.  For instance, if I want to know
about example.com that was registered in 1998, and I get a response
about a name created in 2017, then that tells me that the domain I am
naming is not in fact the same name as the one that is currently
registered.  (A way to think about this is that the RDS is an
atemporal database, but we often ask things that have an implicit
temporal reference.)

The counter-argument is that the above use is an indirect way of
achieving a unique key, and the correct response would be to use
unique IDs (perhaps Repository Object IDs or ROIDs) to uniquely
identify the domain name rather than proxying by date.
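
To make the two approaches concrete, here is a minimal Python sketch
of both checks.  The record type, its field names, and the ROID values
are made up for illustration and do not come from any real registry or
RDS response format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DomainRecord:
    """Hypothetical slice of an RDS response (field names are assumptions)."""
    name: str
    roid: str      # illustrative object ID, not a real registry value
    created: date


def same_registration_by_date(expected_created: date, record: DomainRecord) -> bool:
    # Indirect check: the creation date stands in for a unique key.
    return record.created == expected_created


def same_registration_by_roid(expected_roid: str, record: DomainRecord) -> bool:
    # Direct check: compare the object identifier itself.
    return record.roid == expected_roid


# The registration someone remembers, and what a query returns today.
remembered = DomainRecord("example.com", "D1234-EXAMPLE", date(1998, 3, 1))
current = DomainRecord("example.com", "D9876-EXAMPLE", date(2017, 1, 15))

# Both checks report that the name has been registered anew.
date_says_same = same_registration_by_date(remembered.created, current)
roid_says_same = same_registration_by_roid(remembered.roid, current)
```

Both checks detect the re-registration here; the difference is that
the ROID check would not require the creation date to be published.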

> updated- [date]

One can argue that this data needs to be public in order to aid in
troubleshooting: if a name worked an hour ago and one can see the
updated timestamp as having happened within that window, then the
troubleshooter may infer that the update may be a factor in the
failure.

The counter-argument is that this data radiates information about
actions taken on a domain name, and therefore could be used as part of
an analysis that yields PII even if it is not PII itself.
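
The troubleshooting inference in the argument for publication can be
sketched as a simple window check.  This is only an illustration: the
function name and the idea of a recorded "last known good" time are
assumptions, not part of any RDS specification.

```python
from datetime import datetime, timezone


def update_may_explain_failure(updated: datetime,
                               last_known_good: datetime,
                               failure_seen: datetime) -> bool:
    """Return True when the update falls between the last time the name
    was known to work and the time the failure was observed."""
    return last_known_good <= updated <= failure_seen


# The name worked at 12:00 UTC and was failing by 13:00 UTC.
worked_at = datetime(2016, 12, 7, 12, 0, tzinfo=timezone.utc)
failed_at = datetime(2016, 12, 7, 13, 0, tzinfo=timezone.utc)

recent_update = datetime(2016, 12, 7, 12, 30, tzinfo=timezone.utc)
old_update = datetime(2016, 11, 1, 9, 0, tzinfo=timezone.utc)

suspect_recent = update_may_explain_failure(recent_update, worked_at, failed_at)
suspect_old = update_may_explain_failure(old_update, worked_at, failed_at)
```

A True result only suggests the update as a possible factor; it does
not show that the update caused the failure.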

> expiration date[s]

One can argue that this data needs to be public in order to understand
whether there is an operational threat to ongoing operations.

One can argue that this data should not be public because it does
not directly aid Internet operations, and can provide help to those
who would attempt to game the registration system to "take over" a
domain.

> nameservers.

One can argue that this data needs to be public because it helps in
troubleshooting failures: if a domain is not working and the DNS and
registration data do not match, that may be the source of the problem.
Follow-on efforts might include finding the gap between the name
servers and registration system, waiting for the propagation time of
the registry to pass, or whatever.  Moreover, in principle if the
registry and DNS are in harmony this data is already public, so there
is no harm in including it in another public repository.
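
The mismatch check described above can be sketched as a set
comparison.  This is a minimal sketch that assumes both name server
lists have already been fetched by some other means; the function
names and sample host names are illustrative only.

```python
def normalize(host: str) -> str:
    # Host names compare case-insensitively; drop any trailing root dot.
    return host.strip().rstrip(".").lower()


def compare_nameservers(registration_ns, dns_ns):
    """Report name servers present on one side but not the other."""
    reg = {normalize(h) for h in registration_ns}
    dns = {normalize(h) for h in dns_ns}
    return {
        "only_in_registration": sorted(reg - dns),
        "only_in_dns": sorted(dns - reg),
    }


# Sample inputs: one list from registration data, one from the DNS.
from_registration = ["NS1.EXAMPLE.NET", "ns2.example.net"]
from_dns = ["ns1.example.net.", "ns3.example.net."]

diff = compare_nameservers(from_registration, from_dns)
in_harmony = not diff["only_in_registration"] and not diff["only_in_dns"]
```

An empty result on both sides corresponds to the registry and DNS
being in harmony; anything else points at where the gap lies.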

It is hard for me to come up with an argument why this should not be
public except for the case where someone thinks the RDDS is a bad idea
in general.

I hope this outline helps in narrowing the discussion about these data
elements.

Best regards,

A

-- 
Andrew Sullivan
ajs at anvilwalrusden.com