[RSSAC Caucus] [Ext] Re: [RSS Metrics WP] INPUT REQUESTED: Performance Thresholds for Root Servers and Root Server System

Sat Sep 28 16:12:26 UTC 2019

On 9/27/19 5:04 AM, Shinta Sato wrote:

> As I've looked into each definition of the measurement and aggregation of
> the RSS related metrics to find out the thresholds, I came up with the
> strong thought that these definitions need to be corrected or redefined.
> 
> Current definitions of the RSS related metrics are just a gathering of the
> outcome of the RSO metrics.  However, what we want to describe is how
> RSS as a whole looks in the measurement.
> 
> For example, about the RSS Availability, once any of the responses time
> out, the availability of the RSS as a whole will not be 100%.  This is
> not true, since RSS as a whole at the time of the measurement is available
> if any of the 13 RSOs respond to the queries.  Gathering the result of
> this determination for one day would work for the availability of RSS
> through the day.
> 
> For the RSS Response Latency, each RSO may have a different strategy
> for the deployment of its anycast locations, and those will cover the
> whole world collectively.  The response latency for RSS as a whole seen
> from a certain vantage point at a certain time cannot be described by the
> median of the responses.
> It is much better to select the minimum response time or perhaps the 10th
> percentile or such.

These are very germane issues with the RSS Metrics. My attempt at a summary (that could be wrong) is that there are two views of what the RSS metrics are summarizing: the average for the RSS, or the minimum for the RSS. The document currently is aimed at the average, but that may not make sense for the reasons that Sato-san raises above.

Saying that for a particular day, the RSS Availability was 99.97% sounds like the RSS was down for 0.03% of the time, which is clearly wrong. Similarly, saying that for a particular day, the RSS Response Latency was 50 milliseconds sounds like that was what was typically seen, but that too is clearly wrong because nearly all resolvers will have homed in on the RSO with the best connectivity for each resolver, and the latency they see is likely to be well below the average.
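
To make the gap between the two views concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the numbers are invented, only four hypothetical RSOs are shown instead of 13, and the function names are mine, not definitions from the document. It just contrasts "aggregate every RSO response" with "the RSS answered if any RSO answered" and "what the best RSO delivered".

```python
from statistics import median

# Hypothetical data from one vantage point. Each row is one measurement
# interval; each entry is one RSO's response latency in milliseconds,
# or None for a timeout. Values are invented for illustration.
samples = [
    [12.0, 80.0, None, 45.0],
    [11.0, 85.0, 150.0, None],
    [None, 90.0, 140.0, 50.0],
    [13.0, None, 160.0, 40.0],
]

def pooled_availability(rows):
    """'Average' view: fraction of all individual RSO queries answered."""
    answered = sum(1 for row in rows for x in row if x is not None)
    total = sum(len(row) for row in rows)
    return answered / total

def any_rso_availability(rows):
    """'Minimum' view: an interval counts as up if any RSO answered."""
    up = sum(1 for row in rows if any(x is not None for x in row))
    return up / len(rows)

def pooled_median_latency(rows):
    """'Average' view: median over every answered query from every RSO."""
    return median(x for row in rows for x in row if x is not None)

def best_rso_median_latency(rows):
    """'Minimum' view: median of the fastest answer in each interval."""
    best = [
        min(x for x in row if x is not None)
        for row in rows
        if any(x is not None for x in row)
    ]
    return median(best)

print(pooled_availability(samples))      # 0.75
print(any_rso_availability(samples))     # 1.0
print(pooled_median_latency(samples))    # 65.0
print(best_rso_median_latency(samples))  # 12.5
```

With these made-up inputs, the pooled view reports 75% availability and a 65 ms median even though some RSO answered in every interval and the fastest answer was never slower than 50 ms. Real data would of course differ; the point is only that the two aggregations can disagree sharply.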

I'm not sure of the best way forward from here. We could put more explicit wording in the document about the use of averages for the RSS metrics, and the downside of using such averages, but that wording will probably be missed by most people seeing the published numbers. On the other hand, I don't think we can extrapolate more realistic RSS metrics from the RSO data we are collecting because we don't have vantage points at every resolver on the Internet, or even at a statistically relevant sample of them.

--Paul Hoffman