[gtld-tech] [Regops] Search Engines Indexing RDAP Server Content
John R. Levine
johnl at iecc.com
Fri Jan 29 16:30:42 UTC 2016
> So I saw a tweet from Gavin Brown (@GavinBrown) that describes how one particular search engine has indexed the RDAP server of a gTLD registry operator:
>
> https://twitter.com/GavinBrown/status/692718904058191872
>
> This is all the more reason to work on a client authentication
> specification that includes support for varying responses based on
> client identity and authorization. I've been working on such a
> specification and welcome feedback on the approach:
I don't see what the problem is. If you set up an HTTP server that
returns interlinked data, search engines will find it and index it. All
the information RDAP returns to an unauthenticated query is presumably
public, so what's the harm in making it easier to find?
But anyway, if you don't want them to do that, there are plenty of ways
to keep them out.
The easiest is to publish a /robots.txt file. All legit search engines
will stay out if that's what it says.
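
For illustration, a keep-everyone-out robots.txt is just two lines. Here's
a minimal Python sketch that holds that file as a string and uses the
standard urllib.robotparser module to check how a well-behaved crawler
would read it. The host rdap.example and the helper name is_allowed are
made up for the example.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that asks every crawler to stay out of the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # rdap.example is a placeholder host, not a real RDAP server.
    print(is_allowed(ROBOTS_TXT, "Googlebot",
                     "https://rdap.example/domain/example.com"))
```

This only affects crawlers that choose to honor robots.txt; it does
nothing about ones that ignore it.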
Another is to look at the User-Agent string the client sends. Google's
contains Googlebot, Bing's bingbot, Yahoo's Slurp. It's easy enough to
find a list of common spider names. If the agent is a spider, tell it to
go away or redirect it to a help page.
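
A rough Python sketch of that check, independent of any particular web
framework. The token list is deliberately short and illustrative, and
HELP_URL and the function names are assumptions for the example, not
part of any RDAP server.

```python
# Illustrative tokens only; a real deployment would use a longer,
# maintained list of spider names.
SPIDER_TOKENS = ("googlebot", "bingbot")

# Hypothetical help page that spiders get redirected to.
HELP_URL = "/help.html"


def is_spider(user_agent):
    """True if the User-Agent header contains a known spider token."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in SPIDER_TOKENS)


def route(user_agent):
    """Return (status, headers): redirect spiders, pass everyone else."""
    if is_spider(user_agent):
        return 302, {"Location": HELP_URL}
    return 200, {}
```

Returning 403 instead of the redirect would be the "tell it to go away"
variant.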
Another is to look at the Accept: header. An RDAP client should ask for a
JSON media type. For a client that asks for HTML or anything else, return
an HTML version with a robots meta tag in the head saying NOINDEX and
NOFOLLOW.
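
Something along these lines, again as a framework-free Python sketch.
application/rdap+json is the RDAP media type; accepting plain
application/json as well, the wording of the HTML page, and the function
names are my assumptions for the example.

```python
# Media types treated as "an RDAP client asking for JSON".
RDAP_TYPES = {"application/rdap+json", "application/json"}

# Placeholder HTML page carrying the robots meta tag.
NOINDEX_HTML = (
    "<html><head>"
    '<meta name="robots" content="noindex, nofollow">'
    "<title>RDAP</title></head>"
    "<body>This is an RDAP server; please use an RDAP client.</body></html>"
)


def wants_json(accept):
    """True if the Accept header lists one of the JSON media types."""
    for item in (accept or "").split(","):
        # Drop parameters such as ";q=0.9" before comparing.
        media_type = item.split(";", 1)[0].strip().lower()
        if media_type in RDAP_TYPES:
            return True
    return False


def choose_response(accept, rdap_json):
    """Return (content_type, body) based on the Accept header."""
    if wants_json(accept):
        return "application/rdap+json", rdap_json
    # HTML or anything else gets the noindex, nofollow page.
    return "text/html", NOINDEX_HTML
```

Note that a missing Accept header or a bare */* falls into the
"anything else" case here and gets the HTML page.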
The big search engines spider at low speed from hosts all over the world
to avoid overloading the sites they index. You're not going to keep them
out via authentication without also keeping out everyone else who doesn't
have a password. I don't think that's a good idea.
R's,
John