[NCAP-Discuss] Root Cause Analysis Reports - Final Call for Comments

Casey Deccio casey at deccio.net
Tue Aug 9 16:37:31 UTC 2022


Hi Matt,

Thanks so much for going through the new section.  Let me mention a few high-level items in response, after which I'll address your concerns below, sometimes by referring to the high-level items.


1. The paper referred to by this report ("Fourteen Years in the Life..." - mentioned in the footnote) has different goals than the report itself.  It also has different goals and methodologies than the de Vries paper.  I will speak to them in separate items below, but please understand that some of them cannot be compared apples-to-apples.


2. The dataset used in the paper was not fully described in the report.  That was not intentional but an oversight: I was trying to extract only the information pertinent to the report, without essentially re-writing the paper, and in the process I missed some things.  One critical piece that I missed was this: for each querying IP address, we saved only (up to) 13 queries.  So, for example, when we mention that we tested *all* qnames for more than one label, we were referring not to all queries, but only to the ones in the sample.  And, as mentioned in the report, we required a minimum of five queries with labels to make this determination.
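
To make the sampling concrete, here is a rough sketch (Python; this is not our actual analysis code, and the names and the choice to keep the first 13 queries seen are illustrative assumptions) of retaining up to 13 queries per querying IP address:

    from collections import defaultdict

    MAX_SAMPLE = 13                      # queries retained per querying IP
    samples = defaultdict(list)          # source IP -> sampled qnames

    def observe(src_ip, qname):
        """Record a query, capping the per-IP sample at MAX_SAMPLE."""
        if len(samples[src_ip]) < MAX_SAMPLE:
            samples[src_ip].append(qname)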


3. We followed previous work very carefully.  I'll pull out a few snippets from the de Vries paper [1] to illustrate:

A. As part of describing their various qmin fingerprints, they included the following (as with all snippets, see the entire section for context):

"The third type, with variations (#3 ), is closer to the reference algorithm, but ... always [uses] the A query type instead of the NS
type as suggested by the reference algorithm."

B. After describing the fingerprints, they included the following observation/comment:

"Besides the specific signatures seen in Table 2, there are many variations of type #3. This indicates that not only do different resolvers implement different algorithms, but they also appear to be configurable or change over time (e.g. a new version changes the behavior). In total we see 20 different signatures, many of which only from one specific resolver. Interestingly, we did not observe the reference algorithm from any resolver."
(Emphasis on the last sentence.)

C. In the description of the methodology for their passive analysis applied to root servers, they wrote the following:

"For the rest of this section, and following the observations made in Section 3, we count queries as minimized if the query contains only ... 1 label (at K-Root)."

D. When they applied ground truth (i.e., known qname-minimizing resolvers) to queries, they observed the following:

"In Figure 3 we see that qmin-enabled resolvers send a median of 97% of queries classified as minimized, whereas resolvers that have not enabled this feature send only 12% of their queries classified as minimized."

E. Finally, the high-level results of their observations of minimized queries at the root:

"At K-Root we also observe an increase from 44% to 48% [from 2017 to 2018] in queries for domain names with only one label."

(Note the careful wording: "domain names with only one label.")

A few salient points about the snippets I have included above:

First, the reference implementation was not observed by de Vries, et al.  Second, they counted qname-minimized *queries* in their analysis at the root, not *resolvers*.  Other parts of their work *did* count resolvers, but I suspect that they kept to query observations here because, without ground truth in the root server analysis, resolver observations were not possible.  Third, we followed the same methodology as they did at the root (only one label, type agnostic).  One difference was that they were looking at queries independently, whereas we were looking to classify resolvers based on a sample of queries.

Their measurements provided guidance in that regard.  According to their observations, qname-minimizing resolvers sent a median of 97% minimized queries; in other words, for half of those resolvers, 3% or fewer of their queries were not minimized.  Using the sample of per-resolver queries that we had obtained, we required every query in the sample to have only one label to classify the resolver as qname-minimizing.  In our study, for a resolver with 13 queries, if exactly one of the queries was *not* minimized, the percentage was 12/13 = 92%.  Thus, we classified resolvers with 92% or fewer minimized queries as non-qname-minimizing.  Could we have been a little less strict to include a few more?  Sure.  But this is a matter of probabilities, so in an effort to reduce false positives and not inflate the count, we wanted our heuristic to provide more of an approximate lower bound.

All this explains the gap between their numbers and ours: we were counting queries by resolvers that we classified as qname-minimizing, using our heuristic approach, and they were counting single-label queries at the root.
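
For anyone who wants the heuristic spelled out, here is a rough sketch (Python; again, not our actual code, and the function names and the "undetermined" outcome for fewer than five sampled queries are my shorthand for the minimum-five requirement) of the per-resolver classification described above, applied to the per-IP samples from item 2:

    def is_minimized(qname):
        """A query counts as minimized if its qname has only one label
        (type agnostic, per the criterion de Vries et al. used at K-Root)."""
        return len(qname.rstrip(".").split(".")) == 1

    def classify_resolver(sampled_qnames):
        """Classify a resolver IP from its sample of (up to 13) saved qnames.
        At least five sampled queries are required to make a determination,
        and every query in the sample must be minimized; otherwise the
        resolver is counted as non-qname-minimizing (an approximate lower
        bound that favors fewer false positives)."""
        if len(sampled_qnames) < 5:
            return "undetermined"
        if all(is_minimized(q) for q in sampled_qnames):
            return "qname-minimizing"
        return "non-qname-minimizing"

    # Example: 13 sampled queries, exactly one of which is not minimized,
    # gives 12/13 = 92% minimized, so the resolver is classified as
    # non-qname-minimizing under this heuristic.
    sample = ["com."] * 12 + ["www.example.com."]
    print(classify_resolver(sample))     # -> non-qname-minimizing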

Measurement is not an exact science, but there are guidelines.  Ground truth should be sought where it can be.  Where it cannot be obtained, reasonable heuristics should be used.  Indicators can be used to understand data, even if they are not a replacement for ground truth.  But they should not be taken as definitive.


4. The goal of section 7.3 was to see whether the *trends* of non-qname-minimizing resolvers matched those of the entire data set.  While our classification is imperfect, it is a reasonable heuristic, and it is based on prior contributions in the field, as explained.

[1] https://www.nlnetlabs.nl/downloads/publications/devries2019.pdf


Now...

> On Aug 8, 2022, at 5:23 PM, Thomas, Matthew <mthomas at verisign.com> wrote:
>  
> Section 7.3.1
>  
> The criteria for classifying IPs to be qname minimizing is not motivated and supported by any data or evidence. Why is the five-query threshold appropriate? Why is the 100% single label query restriction appropriate? Given what we know about recursive resolvers it seems the former is too little, and the latter is too much. 

The criteria are explained more fully above (see items 2 and 3), but in short, the 100% requirement applies to a *sample* of queries, and the approach is compared to previous work on the subject.

> There is no ground truth knowledge of what a QNM resolver should look like. To determine qname minimizing behavior, a baseline needs to be established. Profiling known implementations of qname minimizing resolvers would provide an appropriate set of heuristics (Per Warren’s comments). Currently, the heuristics used provide no assurance this is an accurate selection of QNM resolvers. It should be clearly stated why this new selection criteria doesn’t rely on existing knowledge [1], which already shows a variety of differences in QNM behavior from recursive resolvers that are not captured in this measurement.

I'm not exactly sure what you're saying.  The de Vries paper covers many facets of qname minimization, and they are very careful to distinguish which parts can be applied and compared elsewhere--and how they apply them.  I've summarized some of those points in the introductory text of this email.  And again: 1) we used the same methodology for determining minimized queries in passive analysis as they did, which was based on the findings from their active analysis; 2) we applied the analysis of minimized queries to resolvers using metrics also from their paper; and 3) our technique is a heuristic.

> The selection criteria ignores a more selective QNM criteria defined in the RFC such as the Qtype (e.g., A and NS) and excludes multiple implementations of QNM techniques (e.g., nonce second level labels, underscore labels, asterisk labels, etc.).

See introductory text, parts 3A and 3B in particular.

> Qname minimization measurements that are broadened to the ASN are overly vague and do not reflect an accurate portrayal. A single IP that exhibits this five-query selection criteria could potentially include thousands of other querying sources that are not QNM adherent within that ASN. Furthermore, measurements at the IP level, from a name collision measurement perspective, are also misrepresentative. Rapid expansion at the root via IPv6 via a few ISPs shows why such a measurement is biased and non-representative. What matters is the total leakage rate.

Sorry, I'm not sure what you are getting at here.  If you are referring to the longitudinal measurement plot, it was intended to show the percentage of ASNs over time with at least one qname-minimizing resolver, as a deployment trend.  Nothing more, nothing less.

> The percentage of queries increasing then subsequently decreasing during the span of 2018-2021 needs to take into consideration exogenous factors.  Things such as the impacts of Chromium queries and their subsequent reduction. Especially given the selection criteria used and the properties of Chromium queries. 


Perhaps - but that is not within the scope of this report.  Some of that discussion takes place in the "Fourteen Years..." paper.

> The drop in collisions in 2015 should be expected due to caching behavior and it is unfair to compare NXD query rates to positive referral query rates.

This is, of course, unrelated to the qname minimization analysis.  But it is an interesting hypothesis that could be tested.

> A large public recursive resolver that implemented QNM, which would not be captured via the “no query with a qname having more than one label” criteria, is excluded from this analysis and represents upwards of nearly 7% of all A root traffic – which is a significant gap when the 7.3.1 figure shows only 14% of total queries are QNM in 2021 (e.g., a 50% measurement gap based on the selection criteria used).


First, please see items 1 and 2 in the introductory text.  Second, please recall that the goal of this analysis was to identify minimizing vs. non-minimizing *resolvers*; it is independent of query count.

With regard to the large public resolver, I'm sorry, but this is very vague.  Without any documentation and/or empirical analysis to go on, I have nothing with which to re-assess or improve my analysis.

>  
> Section 7.3.2
>  
> This section makes a measurement that is not motivated by any data or evidence that an IP address will persist at the RSS over multiple years.  While the assumptions made in this section might “greatly simplify the data and the analysis”, it does not provide any reasoning or rationale as to why it is appropriate to do so.

Please remember that 1) we are interested in *trends* over time and 2) we only need samples--not complete data--to get those trends.  The samples are taken from the resolvers identified as non-qname-minimizing in 2021.  The process and the ultimate sample sizes are well documented in the text and the table.

With regard to "reasoning or rationale" - it is the sample data we *have*.  Perhaps other data sets could have been used, I see no reason why this sample is insufficient.  If you see bias that I should be aware of, which would affect the results of the specific goals I set out to accomplish, please share.  I do not.  Finally, with regard to "evidence that an IP address will persist at the RSS over multiple years", that data is *clearly* set out in the table, which also shows sample sizes for 2018 through 2021.

Thanks,
Casey