[RSSAC Caucus] Work Party: Statistical Prediction of RSS Failure

Steve Crocker steve at shinkuro.com
Tue Nov 10 08:37:24 UTC 2020


Fred,

Thanks for this.  Taking your last points first,

   1. Yes, I'd like to participate in this work party.

   2. Your note and the OCTO report are focused on early warning based on
   observation of anomalies in the statistics associated with the operation of
   the root servers.  I interpret

   "brainstorm what a failure of the RSS might look like and investigate
   that - what does a systemic failure look like? How would one detect such a
   failure?"

   more broadly.  Examples include classes of failure that may be abrupt
   due to a common mode failure, or classes of failure that are gradual, à la
   the classic frog in a slowly boiling pot.  Further, a failure might have a
   purely technical cause, e.g. overload or exploitation of a zero-day
   vulnerability, or it might be due to financial or political weakness.
   Since your note is primarily focused on analysis of observable behavior of
   the existing root server system, it seems to me brainstorming on this
   should be organized separately.

With respect to establishing an early warning system, it seems to me the
right place to start is:

   - Specify the most likely parameters to measure.  Quite obviously this
   starts with total traffic headed to root servers, but there are undoubtedly
   others.  For example, in your description of the Cloudflare outage, the
   proportion of traffic handled by Cloudflare started to drop off.  Thus, it
   would seem reasonable to compare traffic loads across the root servers and
   note the overall balance.  I'm sure there are other, more subtle
   measurements that will be useful.
   - Characterize normal operation for each measured parameter.
   - Choose plausible thresholds for the measurements of the parameters.
   - Identify past exceptional events that we would have wanted an early
   warning system to flag.
   - Check to see whether a system conforming to the first three bullets
   would have detected the exceptional events.

The above is a starting point.  With the results of that in hand, it should
be possible to think more deeply about the problem.
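
To make the first three bullets concrete, here is a minimal sketch of such a
check, in Python, over an RSSAC002-style table of daily traffic volumes.  The
data layout, window length, and threshold values are illustrative assumptions
on my part, not proposals.

    import pandas as pd

    def flag_anomalies(traffic, window=28, k=3.0):
        """traffic: DataFrame indexed by date, one query-count column per root server letter."""
        # Characterize normal operation: a trailing rolling mean/std per letter,
        # shifted so each day is judged only against the days before it.
        mean = traffic.rolling(window).mean().shift(1)
        std = traffic.rolling(window).std().shift(1)
        z = (traffic - mean) / std

        # Compare traffic loads across the root servers: each letter's share of
        # the total, and how far that share has drifted from its recent baseline.
        share = traffic.div(traffic.sum(axis=1), axis=0)
        share_drift = share - share.rolling(window).mean().shift(1)

        # Plausible (purely illustrative) thresholds: k standard deviations on a
        # letter's own traffic, or a two-percentage-point shift in its share.
        return (z.abs() > k) | (share_drift.abs() > 0.02)

    # Last two bullets: check whether the rule fires on dates of past
    # exceptional events (a list we would have to assemble by hand).
    # events = [...]                      # dates of known exceptional events
    # print(flag_anomalies(traffic).loc[events].any(axis=1))

The point is only the shape of the exercise: establish a baseline, measure
departures from it, and then backtest against events we already know about.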

A few additional comments.

   - I'd like to understand the Cloudflare outage.  Can you share the
   material you have on that?
   - In setting thresholds for a warning system, it's possible to err in
   two directions.  If the thresholds are set too tightly, there will be false
   alarms.  If they are not set tightly enough, significant events will be
   missed.  I think we'd want the thresholds to be on the tight side, with a
   tolerable number of false alarms; a small sketch of that trade-off follows
   this list.
   - The earlier root scaling study was done under great pressure, a
   situation I was partly responsible for.  The study team was conscientious
   but there wasn't enough time to really cover the topic.  Also, at the time,
   there were concerns that hundreds of thousands or even millions of new TLDs
   would be added abruptly to the root zone.  As it turned out, a relatively
   modest number of new TLDs were added to the root.  Moreover, the root zone
   expanded by a larger factor when DNSSEC was introduced. So far as I know,
   neither of these changes had much impact on the operation of any of the
   root servers.
   - A different issue that seemed to me to permeate the prior root
   scaling study was a concern among SSAC and RSSAC members that there wasn't
   a well-defined path for communicating an operational concern, i.e. for
   raising an alarm and having it acted on.  I hope we're well past that
   concern.
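
On the threshold point above, here is a rough way to explore the trade-off,
again only a sketch: assume a hand-labelled list of past exceptional dates and
the flag_anomalies() function from the earlier sketch, sweep the tightness
parameter, and count false alarms against missed events.

    import pandas as pd

    def threshold_tradeoff(traffic, event_dates, ks=(2.0, 2.5, 3.0, 4.0)):
        """Sweep the tightness parameter k and tally false alarms vs. missed events."""
        events = pd.to_datetime(event_dates)
        for k in ks:
            flagged = flag_anomalies(traffic, k=k).any(axis=1)
            alarm_days = flagged[flagged].index
            false_alarms = sum(day not in events for day in alarm_days)
            misses = sum(event not in alarm_days for event in events)
            print(f"k={k}: {false_alarms} false alarms, {misses} missed events")

Erring on the tight side means accepting a nonzero false-alarm count in
exchange for few or no misses.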

Thanks,

Steve


On Tue, Nov 10, 2020 at 2:10 AM Fred Baker <fred at isc.org> wrote:

> Brad and I are looking at the question implied by the public call for
> comments at "Recommendations for Early Warning for Root Zone Scaling" <
> https://www.icann.org/public-comments/recommendations-early-warning-root-scaling-2020-10-05-en>.
> My sense is that we don't know how to detect the onset of a potential
> problem. We wonder whether the Caucus might help us out - looking at extant
> statistics and other data to see if there is something that might be used
> as a triggering condition. RSSAC raised this question in RSSAC031, and SSAC
> raised it in SSAC100. We have a number of years' worth of RSSAC002 data
> (and, if it's useful, there is also DITL data) to review in
> https://github.com/rssac-caucus/rssac002-data, or reachable from
> https://root-servers.org.
>
> A proposed approach, at least to investigating the question, would be to:
>  - start from https://github.com/rssac-caucus/rssac002-data
>  - download available RSSAC002 data (there should be data for most RSOs
> for several years)
>  - observe statistics around past burbles
>  - if something jumps out, investigate it further and document it
>
> I could imagine this being a post-doc's paper, published somewhere, but I
> do want the caucus to be able to see it pre-publication, with any necessary
> confidentiality provisions (as in, if you send a paper to this list and you
> need to keep it confidential, please say so).
>
> What we might learn is that RSSAC002 data doesn't address the issue, but
> that it might with some new statistic added to it, or that RSSAC047 data,
> perhaps also with some new statistic added, would. The obvious
> question there would be to describe, prototype, and characterize the
> indicated data.
>
> For the record, I have done things like this myself in the past.
> Cloudflare started providing an anycast service to ISC in April of a few
> years ago, and the following August took a ten-day outage for reasons I
> don't recall. I downloaded the indicated statistics and stuffed them into
> an Excel spreadsheet, from which I derived a graphic. In the graphic, I was
> able to observe:
> - a stable period before the outage
> - a transition period when the outage started, during which
> request/response traffic moved to other servers
> - a stable period during the outage
> - a transition period when the outage ended, during which request/response
> traffic moved back
> - a stable period after the outage.
>
> The question would be whether we could look at several events and see if
> there is some identifiable statistical behavior that consistently predicts
> an outage.
>
> Another discussion, a little more in the direction of pins with angels
> partying on them, would be to brainstorm what a failure of the RSS might
> look like and investigate that - what does a systemic failure look like?
> How would one detect such a failure?
>
> Let's make this a work party, in the sense that I'm the work party
> shepherd and interested people can also be part. If you want to be an
> investigator in the project, please reply to this email.

