[ICANN-CSC] [Ext] Re: July 2023 IANA Naming Function Performance Report

Kim Davies kim.davies at iana.org
Mon Aug 14 17:00:13 UTC 2023


Hi Rick,

Thinking back to the original discussions in the design team that first created the SLAs, one of the original drivers was to ensure 100% coverage of IANA time in the metrics even if it meant capturing too much. One sentiment expressed in the original discussion was if there was any time unaccounted for, it would tempt IANA to sneak in delays in that unmeasured window. This is why we have some very small measurement windows for things that take milliseconds, on the notion that if it wasn’t measured then that would create a loophole. (To be clear, I don’t think IANA is incentivized to make use of such loopholes if they exist!)

For the technical checks, applying the principle more faithfully would likely break individual test runs into hundreds of constituent pieces, measuring fractions of a second of “IANA time” spent creating and dispatching network queries, followed by customer time spent sending back a response to each query. There was a practical realization that tracking the ebb and flow of all these tests was too complex, so it all ended up being counted as “IANA time”, on the notion that the overall performance metric should be generous enough to cater for customer-induced delays, but also on the basis that IANA would be motivated to work out how to optimize further to compensate for customer delays.

I think the failure against these metrics is most apparent in the “retest” metric, because by definition retests only happen in cases where the TLDs are already having tech check problems and the test run is being retried. It is much harder to cancel out the aberrations with timely tests, as happens with the “first” tech check run.

I don’t know the right answer to the question on how to evolve it. I will say that since 2016 we have realized significant performance gains in conducting these tests, so having the tests measured this way has been a driver for meaningful improvement. All the quick wins have been implemented though. While there are still some optimization possibilities we see, it is diminishing returns and further optimizations potentially make our code a lot more complex and potentially brittle.

Another factor is that late last year we started consulting with the technical community on how to evolve the technical checks to meet modern requirements (the current tests were specified in 2007). The general direction we are hearing is that IANA should be testing for more things. Assuming the number of tests grows, you can only see the duration for the pathological cases going up, not down.

kim


From: Rick Wilhelm <Rwilhelm at PIR.org>
Date: Thursday, August 10, 2023 at 7:57 AM
To: Kim Davies <kim.davies at iana.org>, Amy Creamer <amy.creamer at iana.org>, Bart Boswinkel via ICANN-CSC <icann-csc at icann.org>
Subject: [Ext] Re: [ICANN-CSC] July 2023 IANA Naming Function Performance Report

Thanks Kim, for the detail about the test.

I think that the point that I’m trying to make is that the IANA SLAs are there to measure the effectiveness of the IANA function and things that are within its responsibility.

In this case, the timer on the SLA is inclusive of time that is outside of the IANA demarcation of responsibility. And thus, as we’ve seen, IANA can be doing a perfectly good job (even doing a fair bit of parallelism) and still miss the SLA because of non-responsiveness of the server on the other end.

I’m wondering if there is some sort of a factor that can/should be added to account for latency that is not due to IANA?

Thx
Rick


From: Kim Davies <kim.davies at iana.org>
Date: Wednesday, August 9, 2023 at 3:41 PM
To: Rick Wilhelm <Rwilhelm at PIR.org>, Amy Creamer <amy.creamer at iana.org>, Bart Boswinkel via ICANN-CSC <icann-csc at icann.org>
Subject: [EXTERNAL] Re: [ICANN-CSC] July 2023 IANA Naming Function Performance Report
Hi Rick, Hi all,

Generally speaking, a lack of network response from one or more nameservers has a compounding effect across a test run in our systems. Assuming that most nameservers have both an IPv4 and IPv6 address, this means that we would send four queries (we test each IP address via both TCP and UDP), and then we retry them 3 times before giving up for each query. The current timeout is set to 5 seconds, so this means a minimum of 60 seconds of test time eaten up for each unresponsive nameserver per sub-test. There are efficiencies through parallel execution of the tests, but that is offset by the fact we query for different kinds of records throughout a test run (SOA, DNSKEY, NS, A/AAAA, RD-bit set). In more pathological cases, if there are multiple nameservers all in the same network that are unreachable, it can multiply the test time further.
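
The arithmetic behind the 60-second figure above can be sketched roughly as follows. This is only an illustration, not IANA's test code: the function and constant names are hypothetical, it reads "retry them 3 times" as three attempts in total per query (the reading that yields 60 seconds), and it assumes the waits happen one after another with no parallelism.

```python
# Figures taken from the message above; all names are illustrative only.
ADDRESS_FAMILIES = ("IPv4", "IPv6")   # one address of each kind per nameserver
TRANSPORTS = ("UDP", "TCP")           # each address is tested over both
ATTEMPTS_PER_QUERY = 3                # assumption: 3 attempts in total
TIMEOUT_SECONDS = 5                   # current timeout per attempt


def worst_case_wait(unresponsive_nameservers: int = 1, sub_tests: int = 1) -> int:
    """Seconds spent waiting if every query times out and nothing overlaps."""
    queries_per_nameserver = len(ADDRESS_FAMILIES) * len(TRANSPORTS)  # 4
    per_nameserver = queries_per_nameserver * ATTEMPTS_PER_QUERY * TIMEOUT_SECONDS
    return per_nameserver * unresponsive_nameservers * sub_tests


print(worst_case_wait())       # one nameserver, one sub-test
print(worst_case_wait(2, 5))   # two nameservers, five kinds of record queried
```

With one unresponsive nameserver and one sub-test this gives the 60 seconds quoted above. Under the same serial assumption, two unresponsive nameservers across five sub-tests would already reach the 10-minute threshold; parallel execution reduces the real figure.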

I looked at one test run from the individual case that caused us to exceed our SLAs last month and there were a total of 316 DNS queries sent that were not responded to throughout the course of that one test run, which took 14 minutes and 40 seconds to complete.
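
As a rough cross-check of those numbers, the sketch below compares the observed duration with what the same unanswered queries would cost if each waited out the full 5-second timeout one after another. The serial model and the variable names are assumptions made for illustration only.

```python
UNANSWERED_QUERIES = 316            # from the test run described above
TIMEOUT_SECONDS = 5                 # current per-query timeout
OBSERVED_SECONDS = 14 * 60 + 40     # 14 minutes 40 seconds

# Assumption: every unanswered query waited the full timeout, strictly in series.
serial_wait_seconds = UNANSWERED_QUERIES * TIMEOUT_SECONDS

print(serial_wait_seconds, OBSERVED_SECONDS)
print(serial_wait_seconds > OBSERVED_SECONDS)
```

The serial total (1580 seconds) is larger than the observed 880 seconds, which is consistent with some of the waiting overlapping through parallel execution while still exceeding the 10-minute threshold.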

Thanks for the reference to Spec 10, it seems the pertinent standard is <1500ms response for TCP and <500ms response for UDP, for 95% of queries. Since our lookups are a one-shot real-time blocking operation, as opposed to passive ongoing tests done around the clock for gTLD SLA monitoring, it is a bit of a different proposition. Also I understand ICANN does SLA monitoring for multiple sites and aggregates the performance, which is not something we do in IANA today.

Happy to discuss this further.

kim



From: ICANN-CSC <icann-csc-bounces at icann.org> on behalf of Rick Wilhelm via ICANN-CSC <icann-csc at icann.org>
Reply-To: Rick Wilhelm <Rwilhelm at PIR.org>
Date: Wednesday, August 9, 2023 at 4:38 AM
To: Amy Creamer <amy.creamer at iana.org>, Bart Boswinkel via ICANN-CSC <icann-csc at icann.org>
Subject: Re: [ICANN-CSC] July 2023 IANA Naming Function Performance Report

Amy, et al,

Thanks for sending over the report. This might be a better topic for discussion at the next meeting than for the list, but I’ll try to frame it coherently:

Regarding the missed SLA:

From what I can gather in reading the footnote, it seems that during the execution of the “One request (that) exceeded the technical check threshold of 10 minutes”, operations happened normally (i.e. kicked off normally, experienced no IANA-induced interruptions, etc), and the extra time was spent waiting.

It seems to me that the “technical check (retest)” should be designed with a DNS query timeout value that is sufficiently short such that waiting for the timeout(s) to expire does not cause the SLA to be violated.

I’m not familiar with the exact details of the test design, but I’d point folks to the Base Registry Agreement, Specification 10 (https://itp.cdn.icann.org/en/files/registry-agreements/base-registry-agreement-30-04-2023-en.html); search for “will be considered unanswered” for examples of language that contemplates non-responsive services.

Happy to discuss.

Thanks
Rick




From: ICANN-CSC <icann-csc-bounces at icann.org> on behalf of Amy Creamer via ICANN-CSC <icann-csc at icann.org>
Date: Tuesday, August 8, 2023 at 1:39 PM
To: Bart Boswinkel via ICANN-CSC <icann-csc at icann.org>
Subject: [EXTERNAL] [ICANN-CSC] July 2023 IANA Naming Function Performance Report
Dear CSC,

Please find attached the IANA Naming Function Performance report for July 2023. During the month of July, we met 98.3% of the SLA thresholds. The shortfall was due to missing the following SLA:

Technical Check (Retest) - Routine (Technical): One change request had nameservers that were unreachable within the technical check threshold of 10 minutes.  This exception relates to time spent waiting for nameserver responses, i.e. time waiting to timeout, multiplied by retries.

We look forward to answering any questions you may have about the report.


Regards,

Amy Creamer
Director of Operations, IANA Services
Email: amy.creamer at iana.org
Phone: +1-424-537-8917