[NCAP-Discuss] Last call for A Perspective Study of DNS Queries for Non-Existent Top-Level Domains

Casey Deccio casey at deccio.net
Tue Jan 25 06:25:21 UTC 2022


Dear all,

I have taken the time to study the “Perspective” document, as well as the document “Case Study of Collision Strings”, which is also being produced by NCAP in connection with Study 2.  I appreciate all the time and effort that has gone into the analysis contained in “A Perspective Study of DNS Queries for Non-Existent Top-Level Domains”.  I know that it has required no small effort.

Nonetheless, I have fundamental concerns about the analysis contained in “Perspective”, and I also do not agree with the conclusions that are drawn from the analysis.  Additionally, I find the analysis and conclusions in “Perspective” to be at odds with those contained in “Case Study”.  Finally, I believe my concerns to be substantial enough that they cannot be corrected with minor edits, and I *do not* support the document moving forward.  I herein detail my concerns.

Sincerely,
Casey

 

Summary:
Concern 1: Analysis based on biased sample of querying IP addresses.
Concern 2: Sample data refined to support the conclusion.
Concern 3: Analysis based on biased sample of non-existent TLDs.
Concern 4:  TLDs considered without QNAME context.
Concern 5: Query count used as comparison between recursive server and root servers.
Concern 6: Unique IP addresses used as comparison between recursive server and root servers.
Concern 7: Disagrees with findings from “Case Study of Collision Strings”.


Details:


**Concern 1: Analysis based on biased sample of querying IP addresses.

The sample on which the analysis and conclusions are based is selected exclusively by proportion of queries observed during the collection period.  Specifically, fewer than 1% (0.67% or 115K) of the 17M IP addresses (15.51M IPv4 and 1.56M IPv6) observed across the 2020 DITL are considered for the analysis—those producing the most queries (90% of the DITL data); that excludes 99% of IP addresses from analysis.  Because the set of “top talker” IP addresses is selected based only on the volume of traffic, it is severely biased and is not necessarily representative of resolvers world-wide.  Those that query most—for whatever reasons—are the loudest, and without further examination, it’s hard to even know why. The concern is not even just whether or not it is okay to exclude non-top-talkers, but whether top-talkers are themselves an appropriate representation.  Other metrics that could be used to quantify network representation for the selection process and/or analysis of top-talkers are missing from the analysis, including IP prefix (e.g., /16, /24, /48, /64), ASN, and even IP version.  See also Concern 7 for more.

The analysis in Annex 2 is very interesting, but does not, by itself, resolve this concern.  The annex provides some very helpful lists of top queries from the few-queries resulting in NXDOMAIN responses, and there are some comparisons of the percentage of queries resulting in NXDOMAIN responses for the total given number of queries, but even those are difficult to assess without a full behavioral analysis.
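
To make the selection effect concrete, here is a minimal sketch of volume-based top-talker selection followed by a prefix-level view of the same sample.  It is not the method used in "Perspective"; the function names, the /24 and /48 prefix lengths, and the toy query counts (documentation address ranges) are all assumptions chosen for illustration.

```python
import ipaddress

def top_talkers(query_counts, share=0.90):
    """Return the heaviest sources that together cover `share` of all queries."""
    total = sum(query_counts.values())
    selected, covered = [], 0
    ranked = sorted(query_counts.items(), key=lambda kv: kv[1], reverse=True)
    for ip, count in ranked:
        if covered >= share * total:
            break
        selected.append(ip)
        covered += count
    return selected

def prefix_of(ip, v4_len=24, v6_len=48):
    """Map an address to its covering prefix (assumed lengths: /24 and /48)."""
    addr = ipaddress.ip_address(ip)
    length = v4_len if addr.version == 4 else v6_len
    return str(ipaddress.ip_network(f"{ip}/{length}", strict=False))

# Hypothetical toy data: two very loud sources in one /24, three quiet ones elsewhere.
counts = {
    "192.0.2.1": 5000,
    "192.0.2.2": 4000,
    "198.51.100.7": 50,
    "203.0.113.9": 30,
    "2001:db8:1::53": 20,
}

talkers = top_talkers(counts)
talker_prefixes = {prefix_of(ip) for ip in talkers}
all_prefixes = {prefix_of(ip) for ip in counts}
print(f"{len(talkers)} of {len(counts)} addresses selected")
print(f"{len(talker_prefixes)} of {len(all_prefixes)} prefixes represented")
```

In this toy data the selected sources cover well over 90% of queries, yet they represent only one of the four prefixes and no IPv6 source at all.  Whether the real DITL sample behaves this way is exactly what a prefix, ASN, and IP-version breakdown would show.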

 
**Concern 2: Sample data refined to support the conclusion.

While the original sample is already of questionable representation (less than 1% of IP addresses observed, based solely on query volume), that dataset is further refined, according to the following text from the document:

 “On average, each RSI observed 96% of the top talkers that account for 90% of total traffic.  That percentage drops to 94% when using the 95th percentile top talkers. Based on these findings, only the 90th percentile top talkers were used for the remaining measurements in this study.”

If the objective of the analysis is to quantify the overlap of observed query data across the root servers, and to ultimately determine whether the queries observed at one server are representative of the queries observed across all samples, then refinement of sampled IP addresses to support that conclusion is inappropriate.

 
**Concern 3: Analysis based on biased sample of non-existent TLDs.

The queries for non-existent TLDs, which result in NXDOMAIN responses at the root servers, are compared across the root servers, to see how well they are represented.  However, like observed IP addresses (Concern 1), the non-existent TLDs are limited to those corresponding to the most queries observed—both the top 10,000 and the top 1,000.  This is independent of querying IP address, ASN, and other aggregating features, which would help better understand the diversity of the queries for each non-existent TLD.  For example, it might be that the non-existent TLDs most queried for come from a small pool of IP addresses or networks, and others are being excluded simply because they are outside that sample.

 
**Concern 4:  TLDs considered without QNAME context.

While comparisons are made to measure the representativeness of non-existent TLDs, one primary feature missing from the analysis is the QNAME.  In all cases, the non-existent TLD is considered in isolation, yet QNAME context is shown in the analysis to be a significant contributor to quantifying name collision potential (see Concern 7).

 
**Concern 5: Query count used as comparison between recursive server and root servers.

Because of (negative) caching at recursive servers, it is expected that queries observed at the root servers for a given non-existent TLD will be fewer than those at a recursive resolver for that same non-existent TLD.  It is this very caching behavior that makes the comparison of query count for a given non-existent TLD, as observed by the root servers vs. a recursive resolver, an apples-to-oranges comparison.  Yet the analysis includes a comparison of the top 1,000 non-existent TLDs, ranked by query count.  Thus, no meaningful conclusions can be drawn from this comparison.
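
The effect of negative caching on query counts can be sketched with a toy model.  This is an illustration under stated assumptions, not a model of any real resolver: a single resolver, one cache entry per non-existent TLD, a hypothetical negative TTL of 3600 seconds, and invented client query times.

```python
def root_queries(client_query_times, negative_ttl):
    """Count the queries a caching resolver forwards to the root for one TLD.

    A query is forwarded only when no unexpired negative cache entry exists.
    """
    forwarded = 0
    cached_until = float("-inf")
    for t in sorted(client_query_times):
        if t >= cached_until:
            forwarded += 1
            cached_until = t + negative_ttl
    return forwarded

ttl = 3600  # hypothetical negative TTL, in seconds

# Hypothetical TLD "busy": 1000 client queries packed into one hour.
busy = [i * 3.6 for i in range(1000)]
# Hypothetical TLD "sparse": 10 client queries, one per day.
sparse = [i * 86400 for i in range(10)]

for name, times in (("busy", busy), ("sparse", sparse)):
    print(name, "resolver sees", len(times), "root sees", root_queries(times, ttl))
```

Under these assumptions the resolver ranks "busy" far above "sparse" (1000 queries against 10), while the root sees the opposite order (1 query against 10).  A ranking by query count therefore measures different things at the two vantage points.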

 
**Concern 6: Unique IP addresses used as comparison between recursive server and root servers.

Study 2 includes source diversity when comparing the query counts for non-existent TLDs.  There is certainly more value in investigating IP source diversity when considering the query counts for non-existent TLDs than considering query counts alone (Concern 5).  However, it is expected that recursive resolvers serve a very different client base than authoritative servers, specifically the root servers.  Whereas the former might expect queries from stub resolvers, the latter might expect queries from recursive resolvers.  In such a case, analyzing client IP addresses independently of one another leaves significant meaningful context out, such as the diversity of IP prefixes or ASNs from which queries arrive.  A large number of IP addresses from the same IP prefix or ASN might be responsible for the queries associated with several “top” non-existent TLDs, excluding non-existent TLDs that might have non-trivial presence but do not have the top IP address diversity.  See also Concern 7.
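
As a rough illustration of how the ranking can change with the aggregation level, the sketch below counts distinct source addresses and distinct /24 prefixes per non-existent TLD.  The TLD labels, the IPv4-only input, the /24 prefix length, and the observations themselves are hypothetical; an ASN-level view would need an external address-to-ASN mapping that is not shown here.

```python
from collections import defaultdict
import ipaddress

def diversity(observations, prefix_len=24):
    """Map each TLD to (distinct source IPs, distinct IPv4 prefixes)."""
    ips = defaultdict(set)
    prefixes = defaultdict(set)
    for tld, src in observations:
        ips[tld].add(src)
        # strict=False lets a host address stand in for its covering network.
        prefixes[tld].add(ipaddress.ip_network(f"{src}/{prefix_len}", strict=False))
    return {tld: (len(ips[tld]), len(prefixes[tld])) for tld in ips}

# Hypothetical observations: "alpha" is queried by 50 hosts in one /24,
# "beta" by 3 hosts in three unrelated /24s.
observations = [("alpha", f"192.0.2.{i}") for i in range(1, 51)] + [
    ("beta", "198.51.100.10"),
    ("beta", "203.0.113.20"),
    ("beta", "192.0.2.200"),
]

result = diversity(observations)
for tld, (n_ips, n_prefixes) in sorted(result.items()):
    print(tld, "IPs:", n_ips, "prefixes:", n_prefixes)
```

By address count "alpha" looks far more diverse than "beta"; by prefix count the order reverses.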


**Concern 7: Disagrees with findings from “Case Study of Collision Strings”.

The document “Case Study of Collision Strings”, also written in connection with NCAP Study 2, contains the following findings:

1.     “A relatively small number of origin ASNs account for the vast majority of query traffic for .CORP, .HOME, and .MAIL. In all cases roughly 200 ASNs make up nearly 90% of the volume” (section 4.1.5).

2.     “Label analysis provides a unique observational context into the underlying systems, networks, and protocols inducing leakage of DNS queries to the global DNS ecosystem. Understanding the diversity of labels can help provide a sense of how broadly disseminated the leakage is throughout the DNS” (section 4.2.1).

3.     “The .CORP SLDs seen at both A and J (approximately 16 thousand) is almost equal to those seen at A-root alone, but J-root sees over 30,000 .CORP SLDs that A-root does not see” (section 4.3.1).

4.     “Across all names studied, while A and J saw much in common, there was a non-negligible amount of uniqueness to each view. For example, A and J each saw queries from the same 5717 originating ASNs, but J saw 2477 ASNs that A didn't see and A saw 901 that didn't see” (section 4.3.2).

5.     “A more intensive and thorough analysis would include other root server vantage points to minimize potential bias in the A and J catchments” (section 5.2).

6.     “Additional measurement from large recursive resolvers would also help elucidate any behaviors masked by negative caching and the population of stub resolvers” (section 5.2).


These findings emphasize the following points, which are at odds with the "Perspective" document:

-       Including ASN (and IP prefix) in an analysis can make a significant difference in the overall diversity associated with observed queries.

-       There is significance in the context provided by the QNAME, not only in measuring diversity, but also in query representativeness across root servers.

-       Root servers, even just A and J, have a non-negligible amount of uniqueness that is not captured, or even addressed, in this document.

-       Multiple root servers together provide a broader perspective on potential name collisions than any single one.

-       The population of stub resolvers should be considered in the analysis of large recursive resolvers.
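
For a sense of scale, the ASN figures quoted in finding 4 can be turned into overlap ratios.  The arithmetic below uses only the three quoted numbers and assumes that the 901 ASNs are those seen by A but not by J; the variable names are illustrative.

```python
# Figures quoted from "Case Study of Collision Strings", section 4.3.2.
common = 5717   # ASNs seen by both A and J
j_only = 2477   # ASNs seen by J but not by A
a_only = 901    # ASNs seen by A but (assumed) not by J

a_total = common + a_only
j_total = common + j_only
union = common + j_only + a_only

jaccard = common / union               # shared ASNs over all ASNs seen
j_missed_by_a = j_only / j_total       # share of J's ASNs invisible at A
a_missed_by_j = a_only / a_total       # share of A's ASNs invisible at J

print(f"A saw {a_total} ASNs, J saw {j_total}, together {union}")
print(f"Jaccard overlap: {jaccard:.1%}")
print(f"J's ASNs not seen by A: {j_missed_by_a:.1%}")
print(f"A's ASNs not seen by J: {a_missed_by_j:.1%}")
```

With these numbers, roughly 63% of all observed ASNs are common to both servers, about 30% of J's ASNs are not seen at A, and about 14% of A's ASNs are not seen at J.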



> On Jan 24, 2022, at 2:43 PM, Thomas, Matthew via NCAP-Discuss <ncap-discuss at icann.org> wrote:
> 
> NCAP DG,
> As set during our last meeting on 19 January 2022, we pushed the start of the public comment period for “A Perspective Study of DNS Queries for Non-Existent Top-Level Domains” to this Thursday, 27 January 2022, in order to accommodate some last minute questions. Additionally, as previously announced, today ends the comment period for the release of this document.
> Attached is the FINAL DRAFT version of “A Perspective Study of DNS Queries for Non-Existent Top-Level Domains”. If you have any objections to this document being released for public comment please reply to this message on the list. The objection period will close at the end of our weekly meeting on Wednesday, 26 January 2022. Comments that do not substantially change our stated conclusions will be captured and considered after the public comment period when we will be reviewing all public comments received.
> A view only version of the document is here: https://docs.google.com/document/d/1lMoicMusWGB0u3Jpqe0CB7kJHw_gR-alIXZErjxNEws/view#
> Matt Thomas
> <Last Call A Perspective Study of DNS Queries for Non-Existent Top-Level Domains.pdf>
