[RSSAC Caucus] Work Party: Statistical Prediction of RSS Failure

Sat Nov 28 01:23:58 UTC 2020

> On Nov 10, 2020, at 12:37 AM, Steve Crocker <steve at shinkuro.com> wrote:
> 
> Fred,
> 
> Thanks for this.  Taking your last points first,
> 	• Yes, I'd like to participate in this work party.

Thanks.

> 	• Your note and the OCTO report are focused on early warning based on observation of anomalies in the statistics associated with the operation of the root server.  I interpret
> 
> "brainstorm what a failure of the RSS might look like and investigate that - what does a systemic failure look like? How would one detect such a failure?"
> 
> more broadly.  Examples include classes of failure that may be abrupt due to common mode failure or classes of failure that are gradual a la the classic frog in a slowly boiling pot.  Further, a failure might have a purely technical cause, e.g. overload or exploit of a zero-day vulnerability, or it might be due to financial or political weakness.  Since your note is primarily focused on analysis of observable behavior of the existing root server system, it seems to me brainstorming on this should be organized separately.

It may be. As much as anything, I want to know what might serve as a technical basis for an early warning system for the RSS.

> With respect to establishing an early warning system, it seems to me the right place to start is:
> 	• Specify the most likely parameters to measure.  Quite obviously this starts with total traffic headed to root servers, but there are undoubtedly others.  For example, in your description of the Cloudflare outage, the proportion of traffic handled by Cloudflare started to drop off.  Thus, it would seem reasonable to compare traffic loads across the root servers and note the overall balance.  I'm sure there are other, more subtle measurements that will be useful.
> 	• Characterize normal operation
> 	• Choose plausible thresholds for the measurements of the parameters.
> 	• Identify past exceptional events that we would have wanted an early warning system to flag.
> 	• Check to see whether a system conforming to the first three bullets would have detected the exceptional events.
> The above is a starting point.  With the results of that in hand, it should be possible to think more deeply about the problem.

Well, we're on the same page, and I'm glad of that. We seem to have different viewpoints, though, on the degree to which we actually know the triggering parameters; you assume that I know them and can specify them, and I actually don't. If you know what parameters one might measure, I'm listening. I, personally, don't know what parameters to include or what thresholds to apply, and so am wondering whether the Caucus might be interested in helping to identify those.
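
To make the first three bullets above concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the per-server daily query counts are invented, the server names "a", "b", and "c" are placeholders, and the 7-day window and 3-sigma rule are arbitrary choices rather than recommended thresholds. Loading real RSSAC002 files is deliberately left out.

```python
from statistics import mean, stdev

def traffic_shares(daily_counts):
    """Convert {server: [queries per day]} into {server: [share of total per day]}."""
    servers = list(daily_counts)
    n_days = len(daily_counts[servers[0]])
    totals = [sum(daily_counts[s][d] for s in servers) for d in range(n_days)]
    return {s: [daily_counts[s][d] / totals[d] for d in range(n_days)]
            for s in servers}

def flag_anomalies(series, window=7, k=3.0):
    """Return the days whose value is more than k standard deviations
    away from the mean of the preceding `window` days."""
    flagged = []
    for d in range(window, len(series)):
        baseline = series[d - window:d]      # "normal operation" = trailing window
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(series[d] - mu) > k * sigma:
            flagged.append(d)
    return flagged

# Hypothetical data: three servers, ten days; server "b" loses traffic on day 8.
counts = {
    "a": [100, 102, 98, 101, 99, 100, 102, 98, 100, 100],
    "b": [100, 99, 101, 100, 102, 98, 100, 101, 40, 35],
    "c": [100, 101, 99, 98, 100, 101, 99, 100, 130, 135],
}

shares = traffic_shares(counts)
for server, series in shares.items():
    print(server, flag_anomalies(series))
```

Because shares sum to one, a drop at one server shows up as a rise at the others, so a real system would need some way to tell the failing server from the ones absorbing its load.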

I can tell you that the RSOs have an idea of how overloading of resources might be detected, in terms of thresholds on computing systems and bandwidth, and respond to overloads (detected, for example, during DDoS attacks) by augmenting routing or adding servers. But that doesn't constitute an RSS-wide measurement; it's more a response to a local issue. And for my favorite RSO, a server gets placed somewhere more commonly because a potential host contacts us than because we are looking for a host in an area.

My thought is that RSSAC002 was written, at least in part, to provide data that might help with this question. I'm proposing that someone download and study it.

> A few additional comments.
> 	• I'd like to understand the Cloudflare outage.  Can you share the material you have on that?

The material is, I'm afraid, no longer available. It could be recreated, though, by doing as I suggest. Download the recorded data, and inspect it for burbles.
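
As a rough illustration of what "inspect it for burbles" could mean in practice, the sketch below splits a daily series into stable and transition runs, which is the pattern described in the original note quoted further down. Everything here is an assumption: the series is invented, and the 10% day-over-day tolerance is an arbitrary cut-off, not a value taken from any RSSAC document.

```python
from itertools import groupby

def label_days(series, tolerance=0.10):
    """Label each day 'stable' or 'transition' by its relative change
    from the previous day."""
    labels = ["stable"]  # the first day has no predecessor; assumed stable
    for prev, cur in zip(series, series[1:]):
        change = abs(cur - prev) / prev if prev else float("inf")
        labels.append("transition" if change > tolerance else "stable")
    return labels

def segments(labels):
    """Collapse consecutive identical labels into (label, first_day, last_day)."""
    runs = []
    day = 0
    for label, group in groupby(labels):
        length = len(list(group))
        runs.append((label, day, day + length - 1))
        day += length
    return runs

# Hypothetical daily request counts for one server across an outage.
series = [100, 101, 99, 100, 60, 20, 21, 20, 19, 20, 60, 100, 99, 101]

for run in segments(label_days(series)):
    print(run)
```

On this invented series the output is five runs: stable, transition, stable, transition, stable. Real data is noisier, so a smoothed series or a proper change-point method would probably be needed.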

> 	• In setting thresholds for a warning system, it's possible to err in two directions.  If the thresholds are set too tightly, there will be false alarms.  If the thresholds are not set tightly enough, significant events will be missed.  I think we'd want the thresholds to be on the tight side with a tolerable number of false alarms.
> 	• The earlier root scaling study was done under great pressure, a situation I was partly responsible for.  The study team was conscientious but there wasn't enough time to really cover the topic.  Also, at the time, there were concerns that hundreds of thousands or even millions of new TLDs would be added abruptly to the root zone.  As it turned out, a relatively modest number of new TLDs were added to the root.  Moreover, the root zone expanded by a larger factor when DNSSEC was introduced. So far as I know, neither of these changes had much impact on the operation of any of the root servers.
> 	• A different concern that seemed to me to permeate the prior root scaling study was a concern among SSAC and RSSAC members that there wasn't a well defined path for communicating an operational concern, i.e. for raising an alarm and having it acted on.  I hope we're well past that concern.

As far as I know, we are.
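
On the earlier point about thresholds erring in two directions, a small sketch may help show the trade-off. It assumes we already have some anomaly score per day and a hand-made list of which days were real events; both lists below are invented for illustration, and the candidate thresholds are arbitrary.

```python
def evaluate_threshold(scores, is_event, threshold):
    """Count false alarms and missed events when an alarm is raised
    on every day whose score exceeds `threshold`."""
    false_alarms = sum(1 for s, e in zip(scores, is_event)
                       if s > threshold and not e)
    missed = sum(1 for s, e in zip(scores, is_event)
                 if s <= threshold and e)
    return false_alarms, missed

# Hypothetical anomaly scores and ground truth for ten days.
scores = [0.2, 0.5, 1.1, 0.3, 2.8, 3.5, 0.9, 3.0, 4.2, 0.4]
is_event = [False, False, False, False, True, True, False, False, True, False]

for threshold in (1.0, 2.0, 3.0):
    false_alarms, missed = evaluate_threshold(scores, is_event, threshold)
    print(f"threshold={threshold}: false alarms={false_alarms}, missed={missed}")
```

With these made-up numbers, the tightest threshold gives two false alarms and no misses, the loosest gives no false alarms and one miss, and the middle one sits between them. Steve's stated preference (tight, with a tolerable number of false alarms) corresponds to accepting the first kind of error in order to avoid the second.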

> Thanks,
> 
> Steve
> 
> 
> On Tue, Nov 10, 2020 at 2:10 AM Fred Baker <fred at isc.org> wrote:
> Brad and I are looking at the question implied by the public call for comments at "Recommendations for Early Warning for Root Zone Scaling" <https://www.icann.org/public-comments/recommendations-early-warning-root-scaling-2020-10-05-en>. My sense is that we don't know how to detect the onset of a potential problem. We wonder whether the Caucus might help us out - looking at extant statistics and other data to see if there is something that might be used as a triggering condition. RSSAC raised this question in RSSAC031, and SSAC raised it in SSAC100. We have a number of years' worth of RSSAC002 data (and, if it's useful, there is also DITL data) to review in https://github.com/rssac-caucus/rssac002-data, or reachable from https://root-servers.org.
> 
> A proposed approach, at least to investigating the question, would be to:
>  - start from https://github.com/rssac-caucus/rssac002-data
>  - download available RSSAC002 data (there should be data for most RSOs for several years)
>  - observe statistics around past burbles
>  - if something jumps out, investigate it further and document it
> 
> I could imagine this being a post-doc's paper, published somewhere, but I do want the caucus to be able to see it pre-publication, with any necessary confidentiality provisions (as in, if you send a paper to this list and you need to keep it confidential, please say so).
> 
> What we might learn is that RSSAC002 data doesn't address the issue as it stands, but that it might if some new statistic were added to it, or we might find the answer by looking at RSSAC047 data, or at that plus some new statistic added to it. The obvious next step there would be to describe, prototype, and characterize the indicated data.
> 
> For the record, I have done things like this myself in the past. Cloudflare started providing an anycast service to ISC in April of a few years ago, and the following August took a ten day outage for reasons I don't recall. I downloaded the indicated statistics and stuffed them into an Excel spreadsheet, from which I derived a graphic. In the graphic, I was able to observe:
> - a stable period before the outage
> - a transition period when the outage started, during which request/response traffic moved to other servers
> - a stable period during the outage
> - a transition period when the outage ended, during which request/response traffic moved back
> - a stable period after the outage.
> 
> The question would be whether we could look at several events and see if there is some identifiable statistical behavior that consistently predicts an outage.
> 
> Another discussion, a little more in the direction of pins with angels partying on them, would be to brainstorm what a failure of the RSS might look like and investigate that - what does a systemic failure look like? How would one detect such a failure?
> 
> Let's make this a work party, in the sense that I'm the work party shepherd and interested people can also take part. If you want to be an investigator in the project, please reply to this email.
> _______________________________________________
> rssac-caucus mailing list
> rssac-caucus at icann.org
> https://mm.icann.org/mailman/listinfo/rssac-caucus