[RSSAC Caucus] Work Party: Statistical Prediction of RSS Failure

Fri Nov 13 14:10:41 UTC 2020

Greetings colleagues,

Yes, I will take part in this.

Regards
Ignatius

Ignatius G.K. Nkrumah (BSc. IT/SE) 
Director & Senior Developer | Imovicon Pty Ltd 

  +27 (0) 800 014 778           +27 (0) 87 551 3264   ext. 201 
  ignatius at imovicon.com
  Fancourt Office Park, Building 4, 1st Floor Left G. 
     Northumberland And Felstead Road Northriding, 
     Roodepoort, 2169, South Africa
  www.imovicon.com 


-----Original Message-----
From: rssac-caucus <rssac-caucus-bounces at icann.org> On Behalf Of Fred Baker
Sent: Thursday, 12 November 2020 23:08
To: Steve Crocker <steve at shinkuro.com>
Cc: RSSAC Caucus <rssac-caucus at icann.org>
Subject: Re: [RSSAC Caucus] Work Party: Statistical Prediction of RSS Failure

> On Nov 10, 2020, at 12:37 AM, Steve Crocker <steve at shinkuro.com> wrote:
> 
> Fred,
> 
> Thanks for this.  Taking your last points first,
> 	• Yes, I'd like to participate in this work party.

Thanks

> 	• Your note and the OCTO report are focused on early warning based on observation of anomalies in the statistics associated with the operation of the root server.  I interpret
> 
> "brainstorm what a failure of the RSS might look like and investigate that - what does a systemic failure look like? How would one detect such a failure?"
> 
> more broadly.  Examples include classes of failure that may be abrupt due to common mode failure or classes of failure that are gradual a la the classic frog in a slowly boiling pot.  Further, a failure might have a purely technical cause, e.g. overload or exploit of a zero-day vulnerability, or it might be due to financial or political weakness.  Since your note is primarily focused on analysis of observable behavior of the existing root server system, it seems to me brainstorming on this should be organized separately.

> With respect to establishing an early warning system, it seems to me the right place to start is:
> 	• Specify the most likely parameters to measure.

Well, yes, but... We are in the process of voting on this (and therefore haven't filed it with OCTO yet), but we are preparing to make this statement on the OCTO comment on the Early Warning System: https://docs.google.com/document/d/1BirFjVpz3A1byiTTlhktOdpwTc6hrw4igU-aK5jKOIU/edit?pli=1

The third paragraph is particularly relevant. We have a large amount of statistical data available, which I pointed to in my note the other day. However, we don't know what the most likely parameters are. Hence, the fundamental request in my email was "could someone smarter than me take a look at the data we have and recommend parameters to look at?" Obvious parameters might include the traffic volume measurements in RSSAC002, but it's possible that we don't have a metric on exactly the critical parameter, which is why I said "well, if someone suggested such a metric and made an argument for it...".

> Quite obviously this starts with total traffic headed to root servers, but there are undoubtedly others.  For example, in your description of the Cloudflare outage, the proportion of traffic handled by Cloudflare started to drop off.  Thus, it would seem reasonable to compare traffic loads across the root servers and note the overall balance.  I'm sure there are other, more subtle measurements that will be useful.
> 	• Characterize normal operation
> 	• Choose plausible thresholds for the measurements of the parameters.
> 	• Identify past exceptional events that we would have wanted an early warning system to flag.
> 	• Check to see whether a system conforming to the first three bullets would have detected the exceptional events.
> The above is a starting point.  With the results of that in hand, it should be possible to think more deeply about the problem.

Yes, and I would hope that a researcher studying this would do exactly that. I can generally say that traffic to and from the RSOs tends to be pretty stable over any given 24-hour period.
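
To make the four steps quoted above concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the daily counts are invented rather than taken from RSSAC002, the "normal" model is just a mean and standard deviation, and the function names and the multiplier k are hypothetical. A real study would need to justify each of those choices.

```python
from statistics import mean, stdev

def baseline(values):
    """Characterize normal operation as (mean, standard deviation)."""
    return mean(values), stdev(values)

def flag_anomalies(series, normal, k=3.0):
    """Return indices in series that lie more than k standard deviations
    from the mean of the 'normal' reference window."""
    mu, sigma = baseline(normal)
    return [i for i, v in enumerate(series) if abs(v - mu) > k * sigma]

def backtest(flagged, events):
    """events maps a name to an inclusive (start, end) index range.
    Returns name -> True if any flagged index falls inside the range."""
    return {
        name: any(start <= i <= end for i in flagged)
        for name, (start, end) in events.items()
    }

# Invented daily query counts (arbitrary units), NOT real RSSAC002 data.
normal = [10.0, 10.2, 9.9, 10.1, 9.8, 10.0, 10.3, 9.7]
series = normal + [10.1, 14.5, 15.0, 10.0]

# A hypothetical past event we would have wanted flagged.
events = {"hypothetical-surge": (9, 10)}

flagged = flag_anomalies(series, normal, k=3.0)
detected = backtest(flagged, events)
false_alarms = [
    i for i in flagged
    if not any(start <= i <= end for start, end in events.values())
]

print(flagged, detected, false_alarms)
```

With k=3.0 the invented surge is flagged and nothing else is. Lowering k makes the detector more sensitive and starts flagging ordinary day-to-day variation in the reference window, which is the false-alarm side of the trade-off.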

> A few additional comments.
> 	• I'd like to understand the Cloudflare outage.  Can you share the material you have on that?

The data is long since deleted. Sorry. In any event, my recollection of the graph was as I described - almost flat, a transition, almost flat, a transition, and almost flat. In the statistics online, Cloudflare isn't separately called out; what changed was F Root's proportion of the total traffic. So it isn't quite as simple as it perhaps should be.
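
Since the original data is gone, the following is only a sketch of the kind of calculation involved, in Python, with invented numbers. It computes each letter's proportion of total traffic per day and reports the days on which one letter's proportion moves by more than a chosen amount. The letters, counts, function names, and the 0.02 cut-off are all assumptions, not a reconstruction of the actual event.

```python
def shares(day):
    """day maps a root server letter to its query count.
    Returns letter -> fraction of that day's total."""
    total = sum(day.values())
    return {letter: count / total for letter, count in day.items()}

def share_shifts(days, letter, min_change=0.02):
    """Return indices of days on which the letter's share of total
    traffic differs from the previous day by more than min_change."""
    series = [shares(d)[letter] for d in days]
    return [
        i for i in range(1, len(series))
        if abs(series[i] - series[i - 1]) > min_change
    ]

# Invented counts for three letters, NOT real data.
days = [
    {"e": 50, "f": 30, "k": 20},  # flat
    {"e": 50, "f": 30, "k": 20},  # flat
    {"e": 55, "f": 20, "k": 25},  # transition: traffic moves away from f
    {"e": 55, "f": 20, "k": 25},  # flat at the new level
    {"e": 50, "f": 30, "k": 20},  # transition back
]

print(share_shifts(days, "f"))
```

On this invented input the two transitions show up at indices 2 and 4, with flat stretches between them, which is the shape I remember from the graph. Working with proportions rather than raw counts keeps a global rise or fall in traffic from looking like a shift between servers.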

> 	• In setting thresholds for a warning system, it's possible to err in two directions.  If the thresholds are set too tightly, there will be false alarms.  If the thresholds are not set tightly enough, significant events will be missed.  I think we'd want the thresholds to be on the tight side with a tolerable number of false alarms.
> 	• The earlier root scaling study was done under great pressure, a situation I was partly responsible for.  The study team was conscientious but there wasn't enough time to really cover the topic.  Also, at the time, there were concerns that hundreds of thousands or even millions of new TLDs would be added abruptly to the root zone.  As it turned out, a relatively modest number of new TLDs were added to the root.  Moreover, the root zone expanded by a larger factor when DNSSEC was introduced. So far as I know, neither of these changes had much impact on the operation of any of the root servers.
> 	• A different concern that seemed to me to permeate the prior root scaling study was a concern among SSAC and RSSAC members that there wasn't a well defined path for communicating an operational concern, i.e. for raising an alarm and having it acted on.  I hope we're well past that concern.
> Thanks,
> 
> Steve
> 
> 
> On Tue, Nov 10, 2020 at 2:10 AM Fred Baker <fred at isc.org> wrote:
> Brad and I are looking at the question implied by the public call for comments at "Recommendations for Early Warning for Root Zone Scaling" <https://www.icann.org/public-comments/recommendations-early-warning-root-scaling-2020-10-05-en>. My sense is that we don't know how to detect the onset of a potential problem. We wonder whether the Caucus might help us out - looking at extant statistics and other data to see if there is something that might be used as a triggering condition. RSSAC raised this question in RSSAC031, and SSAC raised it in SSAC100. We have a number of years' worth of RSSAC002 data (and, if it's useful, there is also DITL data) to review in https://github.com/rssac-caucus/rssac002-data, or reachable from https://root-servers.org.
> 
> A proposed approach, at least to investigating the question, would be to:
>  - start from https://github.com/rssac-caucus/rssac002-data
>  - download available RSSAC002 data (there should be data for most RSOs for several years)
>  - observe statistics around past burbles
>  - if something jumps out, investigate it further and document it
> 
> I could imagine this being a post-doc's paper, published somewhere, but I do want the caucus to be able to see it pre-publication, with any necessary confidentiality provisions (as in, if you send a paper to this list and you need to keep it confidential, please say so).
> 
> What we might learn is that RSSAC002 data as it stands doesn't address the issue but would if some new statistic were added to it, or that the answer lies in RSSAC047 data, possibly with some new statistic added to that. The obvious next step there would be to describe, prototype, and characterize the indicated data.
> 
> For the record, I have done things like this myself in the past. Cloudflare started providing an anycast service to ISC in April of a few years ago, and the following August took a ten day outage for reasons I don't recall. I downloaded the indicated statistics and stuffed them into an Excel spreadsheet, from which I derived a graphic. In the graphic, I was able to observe:
> - a stable period before the outage
> - a transition period when the outage started, during which request/response traffic moved to other servers
> - a stable period during the outage
> - a transition period when the outage ended, during which request/response traffic moved back
> - a stable period after the outage.
> 
> The question would be whether we could look at several events and see if there is some identifiable statistical behavior that consistently predicts an outage.
> 
> Another discussion, a little more in the direction of pins with angels partying on them, would be to brainstorm what a failure of the RSS might look like and investigate that - what does a systemic failure look like? How would one detect such a failure?
> 
> Let's make this a work party, in the sense that I'm the work party shepherd and interested people can also be part. If you want to be an investigator in the project, please reply to this email.
> _______________________________________________
> rssac-caucus mailing list
> rssac-caucus at icann.org
> https://mm.icann.org/mailman/listinfo/rssac-caucus
> 
> _______________________________________________
> By submitting your personal data, you consent to the processing of your personal data for purposes of subscribing to this mailing list in accordance with the ICANN Privacy Policy (https://www.icann.org/privacy/policy) and the website Terms of Service (https://www.icann.org/privacy/tos). You can visit the Mailman link above to change your membership status or configuration, including unsubscribing, setting digest-style delivery or disabling delivery altogether (e.g., for a vacation), and so on.
