[NCAP-Discuss] Last call for A Perspective Study of DNS Queries for Non-Existent Top-Level Domains

Wed Jan 26 17:55:09 UTC 2022

Dear NCAP DG,

Casey - Thank you for the comments.

First, the context and genesis of these two documents need to be reestablished. The CORP, HOME, and MAIL case studies were conducted to directly address one of the Board questions regarding those strings that necessitated the creation of the NCAP DG. During those studies, the group established a set of Critical Diagnostic Measurements for analyzing and assessing name collisions within a specific non-existent TLD. These include query volume, network diversity (/24, ASN, etc.), second level domain diversity, and specific query label prefixes (WPAD, DNS-SD, etc.).

Parallel to that research, the DG had numerous conversations as to how new TLD applicants can be better informed about potential name collision issues prior to their application. The 2012 round relied on DITL data, which is typically only an annual event, and the data is not generally easily accessed/analyzed/published for or by applicants. Looking at alternative ways to get more data to applicants in a more digestible form, NCAP talked to ITHI about updating their name collision measurements and talked about how ICANN’s L-root could be potentially well positioned to extend their current name collision measurements – which they actually already have[1].

But as stated in the comments, and as is well known, both the ITHI and the L-root name collision data are only a subset of the RSS and DNS ecosystem. To that end, the second study, “Perspective”, was performed to at least get a rough measurement of how representative name collision traffic is at various parts of the DNS. This study was conducted to fulfill the NCAP Study 2 scope and to answer Board questions in a prudent way, such that it helps support and provide rough guardrails about how complete top-N non-existent TLD lists (ranked either by volume or by source diversity) would be for applicants. In other words, it aims to understand what caveats applicants should know about when looking at these types of lists (e.g. if the applicant’s string is seen on these top-N lists, it definitely has some name collision issues; however, if the applicant’s string is not seen, it doesn’t guarantee there are no name collisions, given the constrained ).

The major theme of the comments seems to center on two primary concerns: (1) using a threshold limit to identify top talkers, and (2) PRR and RSS data measurements not being an apples-to-apples comparison.

The first concern focuses on the measurements established in “Perspective” using the 90th percentile of total traffic to identify source IP addresses to compare between RSIs. This is a threshold based on query volume, but it constitutes 90% of total traffic. It is unclear as to how something that constitutes 90% of total traffic is a “biased subset”.
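
To make the threshold concrete, here is a minimal sketch of selecting top talkers by cumulative share of traffic. It is not the code used in the study; the function name `top_talkers` and the toy query counts are hypothetical, and it assumes per-source query counts are already available as a simple mapping.

```python
def top_talkers(counts, share=0.90):
    """Return the smallest set of sources, taken in descending order of
    query volume, whose combined queries reach `share` of all traffic,
    together with the fraction of traffic they actually cover."""
    total = sum(counts.values())
    selected, running = [], 0
    for ip, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        if running >= share * total:
            break
        selected.append(ip)
        running += n
    return selected, running / total


# Hypothetical toy data: two heavy sources and two light ones.
counts = {"192.0.2.1": 800, "192.0.2.2": 150, "192.0.2.3": 30, "192.0.2.4": 20}
ips, covered = top_talkers(counts)
```

In this toy case half of the sources are selected and they cover 95% of the queries, which illustrates both sides of the discussion: the selected set is a small share of sources but a large share of traffic.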

An analysis was presented in Annex 2 that further examines the long tail of low-querying IP addresses, identifies a different threshold level based on behavioral data, and reassesses the entire RSI-to-RSI similarity measurement. Its findings continue to support that any RSI is largely representative of what is happening at the RSS, and they support the goal of this work: to recommend that top-N lists be published by sources with the caveats previously mentioned and identified in this work.

This reexamination showed that most queries from low-querying source IP addresses were root priming queries, RFC 8145 trust anchor signals, queries for names under the qq.com domain, queries for delegated TLDs, and a sprinkle of DNS-SD queries with strange encodings that accounted for the vast majority of NXDomain requests (besides RFC 8145 signals). Those queried names are not representative of what the general RSS observes, nor are they relevant to name collisions or the scope of the NCAP DG.

Further examination of those low-querying source IP addresses shows that they behave very differently than ~92% of the source IPs contributing to 2020 DITL. They receive a very low proportion of NXDomain responses (sub 10%) and query for a large number of the previously mentioned query names. Further examination of all source IP addresses, based on query volume and response code, clearly depicts a dichotomy in behavior between low-query sources and the sources that make up the majority of RSS traffic (Annex 2, Figures 1 and 2).

Using this new behavioral insight, a new conservative minimum threshold of 1,000 queries per source IP address was selected. This new gating of IP addresses results in the inclusion of the top 1.3M IPs and represents 98% of total 2020 DITL queries. Using this new set of source IPs, RSI similarity measurements were recalculated. The overall pairwise similarity measured 0.86 (compared to the previous 0.96). This data continues to support the goal of this work, which is to establish a general broad stroke statement that top-N lists provided by sources are generally representative, and it is based on data available to / via ICANN org in support of the new gTLD program.
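
As an illustration of the two steps described above (gating sources by a minimum query count, then comparing what each RSI sees), here is a minimal sketch. It is not the study's implementation: this message does not state which similarity metric was used, so Jaccard similarity is an assumption here, and the function names, source labels, and counts are all hypothetical.

```python
from itertools import combinations


def gate_sources(counts, min_queries=1000):
    """Keep sources with at least `min_queries` queries; also return
    the share of total queries that the kept sources represent."""
    kept = {ip for ip, n in counts.items() if n >= min_queries}
    total = sum(counts.values())
    share = sum(counts[ip] for ip in kept) / total
    return kept, share


def mean_pairwise_jaccard(views):
    """Average Jaccard similarity over every pair of per-RSI source sets.
    Jaccard is an assumed metric, used only for illustration."""
    scores = [len(a & b) / len(a | b) for a, b in combinations(views.values(), 2)]
    return sum(scores) / len(scores)


# Hypothetical data: total queries per source, and the gated sources each RSI saw.
counts = {"s1": 5000, "s2": 3000, "s3": 1500, "s4": 400, "s5": 100}
kept, share = gate_sources(counts)
views = {
    "rsi_a": {"s1", "s2", "s3"},
    "rsi_b": {"s1", "s2"},
    "rsi_c": {"s1", "s2", "s3"},
}
similarity = mean_pairwise_jaccard(views)
```

Raising or lowering `min_queries` changes both the share of traffic retained and the resulting similarity score, which is the sensitivity being discussed in this thread.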

It is unclear what a behavioral analysis is, how that analysis would help inform or refine the current measurements, or how such an analysis would avoid using threshold limits. Based on the data presented in Annex 2, further refining of the source IP addresses that constitute only 2% of total traffic will have negligible impact on the results presented. That amount of error can be noted and documented in a residual risk registry. The excluded source IPs, which account for only 2% of total traffic, are not material in terms of total query volume or behavior, and certainly not material for the broad stroke goal of supporting the top-N lists for applicants.

The second concern focuses on the appropriateness of comparing PRR and RSS top non-existent TLDs. Due to PII and other data sensitivity concerns, obtaining PRR data was extremely challenging. Because the PRR was willing to share only very highly aggregated data, query volume and source diversity were the only data it provided. This is the first time PRR data has been available (even at this very coarse level) for name collision analysis purposes. The underlying fact that notable differences are seen and documented is a first within the context of name collisions. That finding alone is important information for the NCAP and its evaluation of how to answer the Board’s questions.

In the spirit of progressing NCAP DG objectives and answering the Board questions, any dissenting views will have an opportunity to be captured.

Matt
[1] https://ithi.research.icann.org/rarends/

From: NCAP-Discuss <ncap-discuss-bounces at icann.org> on behalf of Casey Deccio <casey at deccio.net>
Date: Tuesday, January 25, 2022 at 1:25 AM
To: "ncap-discuss at icann.org" <ncap-discuss at icann.org>
Subject: [EXTERNAL] Re: [NCAP-Discuss] Last call for A Perspective Study of DNS Queries for Non-Existent Top-Level Domains

Dear all,

I have taken the time to study the “Perspective” document, as well as the document “Case Study of Collision Strings”, which is also being produced by NCAP in connection with Study 2.  I appreciate all the time and effort that has gone into the analysis contained in “A Perspective Study of DNS Queries for Non-Existent Top-Level Domains”.  I know that it has required no small effort.

Nonetheless, I have fundamental concerns about the analysis contained in “Perspective”, and I also do not agree with the conclusions that are drawn from the analysis.  Additionally, I find the analysis and conclusions in “Perspective” to be at odds with those contained in “Case Study”.  Finally, I believe my concerns to be substantial enough that they cannot be corrected with minor edits, and I *do not* support the document moving forward.  I herein detail my concerns.

Sincerely,
Casey

Summary:
Concern 1: Analysis based on biased sample of querying IP addresses.
Concern 2: Sample data refined to support the conclusion.
Concern 3: Analysis based on biased sample of non-existent TLDs.
Concern 4:  TLDs considered without QNAME context.
Concern 5: Query count used as comparison between recursive server and root servers.
Concern 6: Unique IP addresses used as comparison between recursive server and root servers.
Concern 7: Disagrees with findings from “Case Study of Collision Strings”.

Details:

Concern 1: Analysis based on biased sample of querying IP addresses.

The sample on which the analysis and conclusions are based is selected exclusively by proportion of queries observed during the collection period.  Specifically, fewer than 1% (0.67% or 115K) of the 17M IP addresses (15.51M IPv4 and 1.56M IPv6) observed across the 2020 DITL are considered for the analysis—those producing the most queries (90% of the DITL data); that excludes 99% of IP addresses from analysis.  Because the set of “top talker” IP addresses is selected based only on the volume of traffic, it is severely biased and is not necessarily representative of resolvers world-wide.  Those that query most—for whatever reasons—are the loudest, and without further examination, it’s hard to even know why. The concern is not even just whether or not it is okay to exclude non-top-talkers, but whether top-talkers are themselves an appropriate representation.  Other metrics that could be used to quantify network representation for the selection process and/or analysis of top-talkers are missing from the analysis, including IP prefix (e.g., /16, /24, /48, /64), ASN, and even IP version.  See also Concern 7 for more.

The analysis in Annex 2 is very interesting, but does not, by itself, resolve this concern.  The annex provides some very helpful lists of top queries from the few-queries resulting in NXDOMAIN responses, and there are some comparisons of the percentage of queries resulting in NXDOMAIN responses for the total given number of queries, but even those are difficult to assess without a full behavioral analysis.

Concern 2: Sample data refined to support the conclusion.

While the original sample is already of questionable representation (less than 1% of IP addresses observed, based solely on query volume), that dataset is further refined, according to the following text (i.e., from the document):

 “On average, each RSI observed 96% of the top talkers that account for 90% of total traffic.  That percentage drops to 94% when using the 95th percentile top talkers. Based on these findings, only the 90th percentile top talkers were used for the remaining measurements in this study.”

If the objective of the analysis is to quantify the overlap of observed query data across the root servers, and to ultimately determine whether the queries observed at one server are representative of the queries observed across all samples, then refinement of sampled IP addresses to support that conclusion is inappropriate.

Concern 3: Analysis based on biased sample of non-existent TLDs.

The queries for non-existent TLDs, which result in NXDOMAIN responses at the root servers, are compared across the root servers, to see how well they are represented.  However, like observed IP addresses (Concern 1), the non-existent TLDs are limited to those corresponding to the most queries observed—both the top 10,000 and the top 1,000.  This is independent of querying IP address, ASN, and other aggregating features, which would help better understand the diversity of the queries for each non-existent TLD.  For example, it might be that the non-existent TLDs most queried for come from a small pool of IP addresses or networks, and others are being excluded simply because they are outside that sample.

Concern 4: TLDs considered without QNAME context.

While comparisons are made to measure the representativeness of non-existent TLDs, one primary feature missing from the analysis is the QNAME.  In all cases, the non-existent TLD is considered in isolation, yet QNAME context is shown in the analysis to be a significant contributor to quantifying name collisions potential (see Concern 7).

Concern 5: Query count used as comparison between recursive server and root servers.

Because of (negative) caching at recursive servers, it is expected that queries observed at the root servers for a given non-existent TLD will be fewer than those at a recursive resolver for that same non-existent TLD.  It is this very caching behavior that makes the comparison of query count for a given non-existent TLD, as observed by the root servers vs. a recursive resolver, an apples-to-oranges comparison.  Yet the analysis includes a comparison of the top 1,000 non-existent TLDs, ranked by query count.  Thus, no meaningful conclusions can be drawn from this comparison.

Concern 6: Unique IP addresses used as comparison between recursive server and root servers.

Study 2 includes source diversity when comparing the query counts for non-existent TLDs.  There is certainly more value in investigating IP source diversity when considering the query counts for non-existent TLDs than considering query counts alone (Concern 5).  However, it is expected that recursive resolvers serve a very different client base than authoritative servers, specifically the root servers.  Whereas the former might expect queries from stub resolvers, the latter might expect queries from recursive resolvers.  In such a case, analyzing client IP addresses independently of one another leaves significant meaningful context out, such as the diversity of IP prefixes or ASNs from which queries arrive.  A large number of IP addresses from the same IP prefix or ASN might be responsible for the queries associated with several “top” non-existent TLDs, excluding non-existent TLDs that might have non-trivial presence but do not have the top IP address diversity.  See also Concern 7.
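
To illustrate the difference between counting addresses and counting prefixes, here is a minimal sketch using Python's standard `ipaddress` module. It is only an illustration under stated assumptions: the /24 and /48 aggregation lengths are examples, the TLD labels and addresses are hypothetical, and ASN aggregation is omitted because it would need an external IP-to-ASN mapping.

```python
import ipaddress


def prefix_of(ip, v4_len=24, v6_len=48):
    """Map an address to its covering prefix (/24 for IPv4, /48 for IPv6)."""
    addr = ipaddress.ip_address(ip)
    length = v4_len if addr.version == 4 else v6_len
    return str(ipaddress.ip_network(f"{ip}/{length}", strict=False))


def diversity(sources_by_tld):
    """For each TLD, return (distinct addresses, distinct prefixes)."""
    result = {}
    for tld, ips in sources_by_tld.items():
        prefixes = {prefix_of(ip) for ip in ips}
        result[tld] = (len(set(ips)), len(prefixes))
    return result


# Hypothetical data: "example-a" has more addresses, all in one /24;
# "example-b" has fewer addresses spread over three prefixes.
sources = {
    "example-a": ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"],
    "example-b": ["192.0.2.9", "198.51.100.7", "2001:db8:1::1"],
}
stats = diversity(sources)
```

Ranked by distinct addresses, "example-a" comes first; ranked by distinct prefixes, "example-b" does. A top-N list built on one measure can therefore omit strings that the other measure would include.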

Concern 7: Disagrees with findings from “Case Study of Collision Strings”.

The document “Case Study of Collision Strings”, also written in connection with NCAP Study 2, contains the following findings:

1.     “A relatively small number of origin ASNs account for the vast majority of query traffic for .CORP, .HOME, and .MAIL. In all cases roughly 200 ASNs make up nearly 90% of the volume” (section 4.1.5).

2.     “Label analysis provides a unique observational context into the underlying systems, networks, and protocols inducing leakage of DNS queries to the global DNS ecosystem. Understanding the diversity of labels can help provide a sense of how broadly disseminated the leakage is throughout the DNS” (section 4.2.1).

3.     “The .CORP SLDs seen at both A and J (approximately 16 thousand) is almost equal to those seen at A-root alone, but J-root sees over 30,000 .CORP SLDs that A-root does not see” (section 4.3.1).

4.     “Across all names studied, while A and J saw much in common, there was a non-negligible amount of uniqueness to each view. For example, A and J each saw queries from the same 5717 originating ASNs, but J saw 2477 ASNs that A didn't see and A saw 901 that [J] didn't see” (section 4.3.2).

5.     “A more intensive and thorough analysis would include other root server vantage points to minimize potential bias in the A and J catchments” (section 5.2).

6.     “Additional measurement from large recursive resolvers would also help elucidate any behaviors masked by negative caching and the population of stub resolvers” (section 5.2).

These findings emphasize the following points, which are at odds with the "Perspective" document:

-       Including ASN (and IP prefix) in an analysis can make a significant difference in the overall diversity associated with observed queries.

-       There is significance in the context provided by the QNAME, not only in measuring diversity, but also in query representativeness across root servers.

-       Root servers—even just A and J—have a non-negligible amount of uniqueness that is not captured—or even addressed in this document.

-       More root servers have a greater perspective of potential name collisions than one.

-       The population of stub resolvers should be considered in the analysis of large recursive resolvers.
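
As a worked example of the A and J overlap quoted in finding 4 above, the ASN counts can be turned into coverage ratios. This is a sketch only: it reads the quoted clause as 901 ASNs seen by A but not by J, and the choice of Jaccard similarity and per-root coverage as summary measures is an assumption, not something either document specifies.

```python
# ASN counts quoted from "Case Study of Collision Strings", section 4.3.2.
both = 5717      # ASNs seen at both A and J
only_j = 2477    # ASNs seen at J but not at A
only_a = 901     # ASNs seen at A but not at J (assumed reading of the quote)

union = both + only_j + only_a   # all ASNs seen at either root
seen_a = both + only_a           # all ASNs seen at A
seen_j = both + only_j           # all ASNs seen at J

jaccard = both / union           # overlap relative to the combined view
coverage_a = seen_a / union      # share of the combined view visible at A
coverage_j = seen_j / union      # share of the combined view visible at J

print(f"union={union} jaccard={jaccard:.3f} "
      f"coverage_a={coverage_a:.3f} coverage_j={coverage_j:.3f}")
```

Under these assumptions, neither root alone sees the full set of originating ASNs, and the two views overlap on roughly 63% of their union.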

On Jan 24, 2022, at 2:43 PM, Thomas, Matthew via NCAP-Discuss <ncap-discuss at icann.org> wrote:

NCAP DG,
As set during our last meeting on 19 January 2022, we pushed the start of the public comment period for “A Perspective Study of DNS Queries for Non-Existent Top-Level Domains” to this Thursday, 27 January 2022, in order to accommodate some last minute questions. Additionally, as previously announced, today ends the comment period for the release of this document.
Attached is the FINAL DRAFT version of “A Perspective Study of DNS Queries for Non-Existent Top-Level Domains”. If you have any objections to this document being released for public comment please reply to this message on the list. The objection period will close at the end of our weekly meeting on Wednesday, 26 January 2022. Comments that do not substantially change our stated conclusions will be captured and considered after the public comment period when we will be reviewing all public comments received.
A view-only version of the document is here: https://docs.google.com/document/d/1lMoicMusWGB0u3Jpqe0CB7kJHw_gR-alIXZErjxNEws/view
Matt Thomas
<Last Call A Perspective Study of DNS Queries for Non-Existent Top-Level Domains.pdf>