[NCAP-Discuss] Last call for A Perspective Study of DNS Queries for Non-Existent Top-Level Domains

James Galvin galvin at elistx.com
Tue Jan 25 19:27:08 UTC 2022


First, I have to say that we all owe a debt of gratitude to Casey for 
his thorough and detailed review of this document.  He raises some good 
questions that should be considered and responded to directly.


TL;DR - The question before this group is whether this document is 
ready to be released for public comment on Thursday, 27 January 2022.  
Speaking as a co-Chair, the question I’m considering is whether the 
Key Findings of this document are at risk, since, if they were, the 
document would not be ready for public comment.  It is my considered 
opinion that the Key Findings are not at risk and that this document 
should be released for public comment on Thursday.  In addition, the 
discussion that Casey has started should continue on the mailing list.

In our 26 January 2022 meeting, absent any substantive objections, we 
will declare it the consensus of the Discussion Group, based on the 
past several months of analysis discussion, that the document is ready 
to be released for public comment.


Long-winded response:

Some of Casey’s concerns are focused on whether we have complete data.  
This is a fair concern because we do not have complete data.  In 
addition, in a few cases we have chosen to set aside some data sets, 
e.g., the 5 root server data sets that were excluded from the root 
server analysis.  This is an ordinary thing to do in data science.  The 
most important thing is to be very clear about the data you are using 
and to note that any conclusions are based only on what you know 
(i.e., you don’t know what you don’t know).  We do this.

We also know that we will never get complete data, and we have not said 
much about this.  So far we have noted that there are legal constraints 
associated with a number of parties sharing the data that does exist.  
For example, although Verisign has used its own data for some detailed 
analysis, for which we are extremely grateful, it has, consistent with 
other root server operators, not made that data generally available to 
others.  Public recursive resolvers present the same issue: only one 
performed some level of analysis for us, and another provided an even 
more limited analysis.

As a result of both of these points, we have not done a complete and 
thorough analysis of all possible data.  Nonetheless, I do believe that 
our conclusions are supported by the data we have.

The first key finding is that analysis of the data at any root server 
identifier is sufficiently representative of the root server system as 
a whole.  Bottom line - there is some subjectivity within this 
statement.  However, statistics provides us with methods to measure the 
quality of such comparisons, and Matt Thomas has done this and 
presented it to us.  It is as good as it can be, given the data we’re 
working with.
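
To give a concrete sense of what such a comparison measures, below is a 
minimal sketch of one way to compare the top non-existent TLDs observed 
at two root server identifiers (set overlap and rank agreement).  The 
counts are hypothetical, and this is only an illustration of the kind 
of measurement involved, not the specific analysis Matt presented.

    from collections import Counter

    # Hypothetical per-RSI query counts for non-existent TLDs; in
    # practice these would come from DITL data for each identifier.
    rsi_a = Counter({"home": 9100, "corp": 7300, "lan": 5100,
                     "local": 4800, "internal": 2200})
    rsi_b = Counter({"home": 8800, "corp": 7900, "local": 5000,
                     "lan": 4300, "dlink": 1900})

    def top_n(counts, n):
        return [tld for tld, _ in counts.most_common(n)]

    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b)

    def spearman(a, b):
        """Rank correlation over the TLDs both top lists contain."""
        shared = set(a) & set(b)
        n = len(shared)
        if n < 2:
            return float("nan")
        rank_a = {t: r for r, t in enumerate(sorted(shared, key=a.index))}
        rank_b = {t: r for r, t in enumerate(sorted(shared, key=b.index))}
        d2 = sum((rank_a[t] - rank_b[t]) ** 2 for t in shared)
        return 1 - (6 * d2) / (n * (n ** 2 - 1))

    top_a, top_b = top_n(rsi_a, 5), top_n(rsi_b, 5)
    print("top-5 overlap (Jaccard):", round(jaccard(top_a, top_b), 2))
    print("rank agreement on shared TLDs:", round(spearman(top_a, top_b), 2))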

Of course, since it is not a “perfect” analysis, there is some 
residual risk.  It is essential that we capture this point and explain 
it in our final work product.  In fact, if you review the final report 
text that is already under development, the point of capturing residual 
risk is already listed there.

The second key finding is that traffic observed at root servers is not 
sufficiently representative of traffic at recursive resolvers.  Frankly, 
this point is self-evident.  We may have incomplete data from a single 
recursive resolver, but it nonetheless proves exactly this point.  
Certainly there might be many things we could learn from a more complete 
analysis and study of a more complete set of data at recursive 
resolvers, but none of that changes the key finding that public 
recursive resolvers see a different DNS infrastructure.

By the way, there is also residual risk here, and it too is captured in 
the final report draft text.  There is much more to say to explain it, 
but it does not change the key finding.

As Casey points out, the implications noted in these key findings are 
subject to discussion in a broader context.  These implications will be 
brought forward to the final work product and discussed within the 
context of the workflow we have developed.

In summary, there are two important things to consider.  First, is the 
data analysis sufficient to support the key findings?  Second, is the 
residual risk a fundamental concern or an ordinary risk management 
question to be considered?

Some may be concerned that they cannot evaluate these issues directly 
themselves.  My suggestion is that we continue the discussion of these 
technical issues on the mailing list.  This will facilitate thoughtful 
and detailed responses and allow everyone the opportunity to share the 
discussion with other experts who can review the details.

See you all tomorrow,

Jim





On 25 Jan 2022, at 11:19, Casey Deccio wrote:

> My apologies that I am responding to my own email.  Someone noted to 
> me that I neglected two very important points.  First, I mentioned 
> that I didn’t agree with the conclusions of the document, but I only 
> provided my critique of the analysis, not the conclusions.  Second, I 
> have not explicitly provided any suggestions for a path forward.  Let 
> me correct that by acting on those suggestions.
>
>
> Conclusions
> ---------------
>
> ** Study 1 Key Observations
> I have no disagreement with this section.  These are an accurate 
> summary of the results of Study 1.
>
>
> ** Study 2 Key Observations
> The following statement is true, but the qualifying factor “top” 
> is based on a comparison (query count and IP address diversity) that 
> is unfair (see Concerns 5 and 6).
>
> “Initial results from one PRR indicate there is a difference in top 
> non-existent TLDs using either query volume or source diversity 
> measurements.”
>
> The following two statements are generalities inferred from the 
> previous statement, and they are not supported by the data, precisely 
> because of Concerns 5 and 6.
>
> “Many non-existent TLDs (roughly 40%) observed at the PRR are not in 
> the top RSIs based on query volume. Nearly 30% observed at the PRR are 
> not in the top RSIs based on source diversity.”
>
> “… name collision strings cannot be measured or assessed properly 
> based on only using data from the RSS.”
>
> I agree and sympathize with the notion that there were heavy 
> constraints of privacy and data aggregation associated with the 
> analysis of the public recursive resolver data, but the comparison 
> made thus far is an unfair comparison.
>
>
> ** Key Findings
>
> The following statement is true, but based on analysis that was 
> performed on highly biased data, specifically less than 1% of IP 
> addresses observed at the root servers and only the top 10,000 and 
> top 1,000 non-existent TLDs (see Concerns 1, 2, 3, 4, and 7):
>
> “Non-existent DNS queries for top querying and top source diversity 
> TLDs appear to be comparable and representative at any RSI.”
>
> (Nit: The sentence reads “Non-existent DNS queries”, but I think 
> what is meant is “DNS queries for non-existent TLDs”.)
>
> The following statement is inconclusive because of the bias in the 
> data that was analyzed.
>
> “PRR data further indicates that there is a very different view of 
> the top non-existent TLDs based both on query volume and source 
> diversity.”
>
> I do not *disagree* with the following statement:
>
> “ICANN, as the operator for the L RSI, is well-positioned to 
> instrument, collect, analyze, and disseminate name collision 
> measurements to subsequent gTLD applicants both prior to submission 
> and during the application review.”
>
> But I feel like the point of the document was to motivate this with 
> “You’ve seen one, you’ve seen them all.”  And the analysis 
> does not support that, at least not as generally as it was stated.
>
> I do not believe that the following statements are supported by the 
> data:
>
> “Name collision traffic observed at the root is not sufficiently 
> representative of traffic received at recursive resolvers to guarantee 
> a complete and or accurate representation of a string’s potential 
> name collision risks and impacts.”
>
> “Name collision strings cannot be measured or assessed properly 
> based on only using data from the RSS. Obtaining an accurate picture 
> of name collision risks can only be obtained via delegation.”
>
> These might well be true, but the analysis in this document does not 
> motivate this (see Concerns 5 and 6).  There are other factors that 
> might contribute to these, some considered in this document (negative 
> caching) and others not (local root and aggressive negative caching).  
> But my point is that the current analysis does not lead me to the 
> conclusions included in the sentences above.
>
>
> Suggestions for Improvement
> ---------------
>
> ** Limit conclusions in “Key Findings”.
>
> Study 1 Key Observations is an example of conclusions that *can* be 
> drawn from the existing analysis.  The step from these to generalities 
> about the representativeness of data in “Key Findings” is where my 
> concerns lie.  If the “Key Findings” related to Study 1 are honed 
> to stay within the scope of the analysis, their impact might be 
> significantly less, but they would give me less concern.
>
>
> ** Revamp Root Server Analysis.
> If the purpose of the document is to come to more general conclusions, 
> such as those previously mentioned, then the analysis needs to be 
> revamped:
> -       Rather than selecting biased data (i.e., top talkers), the 
> data must be representative across IP prefix, ASN, and IP version.  
> Those dimensions are completely missing from the current analysis.
> -       Before any filtering is done, a comprehensive analysis should 
> be performed, with all IP addresses.  Even if the conclusion is that a 
> filter of some sort is appropriate, and that a representative set can 
> be yielded with such filtering, no comparison against the full data is 
> given, and the reader is simply left to make that leap of faith.
> -       Any filtering should not limit the analysis to the IP 
> addresses with the most queries—certainly not the top 1%.  Some 
> filter by query count might be fine, but it should be a low bar, and 
> it should be justified by behavior and representation (IP prefix, 
> ASN, IP version); a rough sketch of such a representation check 
> follows below.
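>
> A minimal sketch of the kind of representation check described above, 
> using hypothetical (source IP, ASN, query count) records rather than 
> the study's actual data:
>
>     import ipaddress
>
>     # Hypothetical (source_ip, asn, query_count) records; real input
>     # would come from DITL captures joined against routing data.
>     records = [
>         ("192.0.2.7",     64500, 120000),
>         ("192.0.2.8",     64500,  90000),
>         ("198.51.100.20", 64501,     40),
>         ("203.0.113.5",   64502,    310),
>         ("2001:db8::1",   64503,     12),
>     ]
>
>     def keep(rec):
>         # Example filter: keep "top talkers" by raw query count.  The
>         # question is what such a filter leaves behind in terms of
>         # prefixes, ASNs, and IP versions.
>         return rec[2] >= 10000
>
>     def prefix(rec):
>         ip = ipaddress.ip_address(rec[0])
>         plen = 24 if ip.version == 4 else 48
>         return ipaddress.ip_network(f"{rec[0]}/{plen}", strict=False)
>
>     dims = {
>         "ASNs": lambda r: r[1],
>         "prefixes": prefix,
>         "IP versions": lambda r: ipaddress.ip_address(r[0]).version,
>     }
>     kept = [r for r in records if keep(r)]
>     for name, key in dims.items():
>         full = {key(r) for r in records}
>         sub = {key(r) for r in kept}
>         print(f"{name}: filter retains {len(sub)} of {len(full)}")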
>
>
> ** Revamp Comparison of Root Server Queries and Public Recursive 
> Resolver Queries
> I understand that there are data constraints within which the 
> recursive data must be analyzed, but there are analyses that *can* be 
> done, even within those constraints.  For example, rather than sorting 
> by top query count and IP address diversity, start with the complete 
> set of non-existent TLDs and (if the data includes it) full QNAME 
> diversity, or at least SLD diversity.  As it is, the comparison of 
> root server data and public recursive resolver data is unfair and 
> therefore does not provide substance.
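>
> A minimal sketch of such a comparison over complete sets, using 
> hypothetical aggregates (non-existent TLD -> count of distinct SLDs 
> observed) for one root identifier and one public recursive resolver; 
> the numbers are invented, not drawn from the study:
>
>     # Hypothetical aggregates: non-existent TLD -> distinct SLDs seen.
>     root_slds = {"home": 4200, "corp": 3100, "lan": 900, "dlink": 75}
>     prr_slds = {"home": 9800, "corp": 2900, "internal": 1200, "lan": 40}
>
>     only_root = set(root_slds) - set(prr_slds)
>     only_prr = set(prr_slds) - set(root_slds)
>     print(f"{len(set(root_slds) | set(prr_slds))} non-existent TLDs in "
>           f"total; {len(only_root)} seen only at the root, "
>           f"{len(only_prr)} seen only at the PRR")
>
>     # SLD-diversity ratio for TLDs seen at both vantage points.
>     for tld in sorted(set(root_slds) & set(prr_slds)):
>         ratio = prr_slds[tld] / root_slds[tld]
>         print(f"{tld}: PRR sees {ratio:.1f}x the SLD diversity seen "
>               f"at the root")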
>
>
> ** Accept Data-Driven Conclusions
> There might be some desirable conclusions that the data simply does 
> not support.  I have little concern with listing conclusions that are 
> not what was anticipated beforehand (i.e., hypotheses).  Even very 
> caveated conclusions coming from an analysis can be enough to inform 
> decisions, if those caveats are considered for what they are worth.  I 
> have great concern, however, with conclusions that are not data-driven.
>
>
>> On Jan 24, 2022, at 11:25 PM, Casey Deccio <casey at deccio.net> wrote:
>>
>> Dear all,
>>
>> I have taken the time to study the “Perspective” document, as 
>> well as the document “Case Study of Collision Strings”, which is 
>> also being produced by NCAP in connection with Study 2.  I appreciate 
>> all the time and effort that has gone into the analysis contained in 
>> “A Perspective Study of DNS Queries for Non-Existent Top-Level 
>> Domains”.  I know that it has required no small effort.
>>
>> Nonetheless, I have fundamental concerns about the analysis contained 
>> in “Perspective”, and I also do not agree with the conclusions 
>> that are drawn from the analysis.  Additionally, I find the analysis 
>> and conclusions in “Perspective” to be at odds with those 
>> contained in “Case Study”.  Finally, I believe my concerns to be 
>> substantial enough that they cannot be corrected with minor edits, 
>> and I *do not* support the document moving forward.  I herein detail 
>> my concerns.
>>
>> Sincerely,
>> Casey
>>
>>
>>
>> Summary:
>> Concern 1: Analysis based on biased sample of querying IP addresses.
>> Concern 2: Sample data refined to support the conclusion.
>> Concern 3: Analysis based on biased sample of non-existent TLDs.
>> Concern 4:  TLDs considered without QNAME context.
>> Concern 5: Query count used as comparison between recursive server 
>> and root servers.
>> Concern 6: Unique IP addresses used as comparison between recursive 
>> server and root servers.
>> Concern 7: Disagrees with findings from “Case Study of Collision 
>> Strings”.
>>
>>
>> Details:
>>
>>
>> **Concern 1: Analysis based on biased sample of querying IP 
>> addresses.
>>
>> The sample on which the analysis and conclusions are based is 
>> selected exclusively by proportion of queries observed during the 
>> collection period.  Specifically, fewer than 1% (0.67% or 115K) of 
>> the 17M IP addresses (15.51M IPv4 and 1.56M IPv6) observed across the 
>> 2020 DITL are considered for the analysis—those producing the most 
>> queries (90% of the DITL data); that excludes 99% of IP addresses 
>> from analysis.  Because the set of “top talker” IP addresses is 
>> selected based only on the volume of traffic, it is severely biased 
>> and is not necessarily representative of resolvers world-wide.  Those 
>> that query most—for whatever reasons—are the loudest, and without 
>> further examination, it’s hard to even know why. The concern is not 
>> even just whether or not it is okay to exclude non-top-talkers, but 
>> whether top-talkers are themselves an appropriate representation.  
>> Other metrics that could be used to quantify network representation 
>> for the selection process and/or analysis of top-talkers are missing 
>> from the analysis, including IP prefix (e.g., /16, /24, /48, /64), 
>> ASN, and even IP version.  See also Concern 7 for more.
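>>
>> (To make the selection mechanics concrete, here is a minimal sketch 
>> of a purely volume-based "top talker" cutoff, with made-up counts; it 
>> is not the study's code, only an illustration of the kind of filter 
>> being questioned.)
>>
>>     from collections import Counter
>>
>>     # Hypothetical per-source-IP query counts; the real input would
>>     # be the DITL data.
>>     queries_per_ip = Counter({
>>         "192.0.2.1": 500000, "192.0.2.2": 300000, "192.0.2.3": 150000,
>>         "198.51.100.1": 30000, "198.51.100.2": 15000,
>>         "203.0.113.1": 4000, "203.0.113.2": 700, "203.0.113.3": 300,
>>     })
>>
>>     def top_talkers(counts, volume_fraction=0.90):
>>         """Smallest set of IPs covering volume_fraction of queries."""
>>         total = sum(counts.values())
>>         selected, running = [], 0
>>         for ip, n in counts.most_common():
>>             if running >= volume_fraction * total:
>>                 break
>>             selected.append(ip)
>>             running += n
>>         return selected
>>
>>     tt = top_talkers(queries_per_ip)
>>     print(f"{len(tt)} of {len(queries_per_ip)} IPs "
>>           f"({len(tt) / len(queries_per_ip):.1%}) account for 90% "
>>           f"of query volume")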
>>
>> The analysis in Annex 2 is very interesting, but does not, by itself, 
>> resolve this concern.  The annex provides some very helpful lists of 
>> top queries from the low-query sources that result in NXDOMAIN 
>> responses, and there are some comparisons of the percentage of 
>> queries resulting in NXDOMAIN responses out of the given total number 
>> of queries, but even those are difficult to assess without a full 
>> behavioral analysis.
>>
>>
>> **Concern 2: Sample data refined to support the conclusion.
>>
>> While the original sample is already of questionable representation 
>> (less than 1% of IP addresses observed, based solely on query 
>> volume), that dataset is further refined, according to the following 
>> text (i.e., from the document):
>>
>>  “On average, each RSI observed 96% of the top talkers that account 
>> for 90% of total traffic.  That percentage drops to 94% when using 
>> the 95th percentile top talkers. Based on these findings, only the 
>> 90th percentile top talkers were used for the remaining measurements 
>> in this study.”
>>
>> If the objective of the analysis is to quantify the overlap of 
>> observed query data across the root servers, and to ultimately 
>> determine whether the queries observed at one server are 
>> representative of the queries observed across all samples, then 
>> refinement of sampled IP addresses to support that conclusion is 
>> inappropriate.
>>
>>
>> **Concern 3: Analysis based on biased sample of non-existent TLDs.
>>
>> The queries for non-existent TLDs, which result in NXDOMAIN responses 
>> at the root servers, are compared across the root servers, to see how 
>> well they are represented.  However, like observed IP addresses 
>> (Concern 1), the non-existent TLDs are limited to those corresponding 
>> to the most queries observed—both the top 10,000 and the top 1,000. 
>>  This is independent of querying IP address, ASN, and other 
>> aggregating features, which would help better understand the 
>> diversity of the queries for each non-existent TLD.  For example, it 
>> might be that the non-existent TLDs most queried for come from a 
>> small pool of IP addresses or networks, and others are being excluded 
>> simply because they are outside that sample.
>>
>>
>> **Concern 4:  TLDs considered without QNAME context.
>>
>> While comparisons are made to measure the representativeness of 
>> non-existent TLDs, one primary feature missing from the analysis is 
>> the QNAME.  In all cases, the non-existent TLD is considered in 
>> isolation, yet QNAME context is shown in the analysis to be a 
>> significant contributor to quantifying name collision potential (see 
>> Concern 7).
>>
>>
>> **Concern 5: Query count used as comparison between recursive server 
>> and root servers.
>>
>> Because of (negative) caching at recursive servers, it is expected 
>> that queries observed at the root servers for a given non-existent 
>> TLD will be fewer than those at a recursive resolver for that same 
>> non-existent TLD.  It is this very caching behavior that makes the 
>> comparison of query count for a given non-existent TLD, as observed 
>> by the root servers vs. a recursive resolver, an apples-to-oranges 
>> comparison.  Yet the analysis includes a comparison of the top 1,000 
>> non-existent TLDs, ranked by query count.  Thus, no meaningful 
>> conclusions can be drawn from this comparison.
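>>
>> (A toy calculation of that asymmetry, with an invented query rate and 
>> negative-cache TTL, purely to illustrate the scale of the effect:)
>>
>>     # Toy model: clients behind one recursive resolver query one
>>     # non-existent TLD at a steady rate; with negative caching, the
>>     # resolver asks the root roughly once per negative-TTL window.
>>     client_queries_per_hour = 3600   # seen by the recursive resolver
>>     negative_ttl_seconds = 900       # example SOA negative TTL
>>
>>     root_queries_per_hour = 3600 / negative_ttl_seconds
>>     factor = client_queries_per_hour / root_queries_per_hour
>>     print(f"resolver sees {client_queries_per_hour} queries/hour; "
>>           f"the root sees about {root_queries_per_hour:.0f}; the root "
>>           f"view understates the resolver view by ~{factor:.0f}x")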
>>
>>
>> **Concern 6: Unique IP addresses used as comparison between recursive 
>> server and root servers.
>>
>> Study 2 includes source diversity when comparing the query counts for 
>> non-existent TLDs.  There is certainly more value in investigating IP 
>> source diversity when considering the query counts for non-existent 
>> TLDs than in considering query counts alone (Concern 5).  However, it 
>> is expected that recursive resolvers serve a very different client 
>> base than authoritative servers, specifically the root servers.  
>> Whereas the former might expect queries from stub resolvers, the 
>> latter might expect queries from recursive resolvers.  In such a case, 
>> analyzing client IP addresses independently of one another leaves 
>> significant meaningful context out, such as the diversity of IP 
>> prefixes or ASNs from which queries arrive.  A large number of IP 
>> addresses from the same IP prefix or ASN might be responsible for the 
>> queries associated with several “top” non-existent TLDs, 
>> excluding non-existent TLDs that might have non-trivial presence but 
>> do not have the top IP address diversity.  See also Concern 7.
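>>
>> (A minimal sketch of what aggregating source diversity by prefix or 
>> ASN can reveal, again with invented observations rather than the 
>> study's data:)
>>
>>     import ipaddress
>>     from collections import defaultdict
>>
>>     # Hypothetical (non-existent TLD, source IP, origin ASN) tuples.
>>     observations = [
>>         ("corp", "192.0.2.1", 64500), ("corp", "192.0.2.2", 64500),
>>         ("corp", "192.0.2.3", 64500), ("corp", "192.0.2.4", 64500),
>>         ("home", "198.51.100.1", 64501), ("home", "203.0.113.9", 64502),
>>     ]
>>
>>     per_tld = defaultdict(lambda: {"ips": set(), "p24s": set(),
>>                                    "asns": set()})
>>     for tld, ip, asn in observations:
>>         per_tld[tld]["ips"].add(ip)
>>         per_tld[tld]["p24s"].add(
>>             ipaddress.ip_network(ip + "/24", strict=False))
>>         per_tld[tld]["asns"].add(asn)
>>
>>     # "corp" looks more diverse by raw IP count, but "home" is more
>>     # diverse once sources are grouped by prefix or ASN.
>>     for tld, d in sorted(per_tld.items()):
>>         print(f"{tld}: {len(d['ips'])} IPs, {len(d['p24s'])} /24s, "
>>               f"{len(d['asns'])} ASNs")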
>>
>>
>> **Concern 7: Disagrees with findings from “Case Study of Collision 
>> Strings”.
>>
>> The document “Case Study of Collision Strings”, also written in 
>> connection with NCAP Study 2, contains the following findings:
>>
>> 1.     “A relatively small number of origin ASNs account for the 
>> vast majority of query traffic for .CORP, .HOME, and .MAIL. In all 
>> cases roughly 200 ASNs make up nearly 90% of the volume” (section 
>> 4.1.5).
>>
>> 2.     “Label analysis provides a unique observational context into 
>> the underlying systems, networks, and protocols inducing leakage of 
>> DNS queries to the global DNS ecosystem. Understanding the diversity 
>> of labels can help provide a sense of how broadly disseminated the 
>> leakage is throughout the DNS” (section 4.2.1).
>>
>> 3.     “The .CORP SLDs seen at both A and J (approximately 16 
>> thousand) is almost equal to those seen at A-root alone, but J-root 
>> sees over 30,000 .CORP SLDs that A-root does not see” (section 
>> 4.3.1).
>>
>> 4.     “Across all names studied, while A and J saw much in common, 
>> there was a non-negligible amount of uniqueness to each view. For 
>> example, A and J each saw queries from the same 5717 originating 
>> ASNs, but J saw 2477 ASNs that A didn't see and A saw 901 ASNs that J 
>> didn't see” (section 4.3.2).
>>
>> 5.     “A more intensive and thorough analysis would include other 
>> root server vantage points to minimize potential bias in the A and J 
>> catchments” (section 5.2).
>>
>> 6.     “Additional measurement from large recursive resolvers would 
>> also help elucidate any behaviors masked by negative caching and the 
>> population of stub resolvers” (section 5.2).
>>
>>
>> These findings emphasize the following points, which are at odds with 
>> the "Perspective" document:
>>
>> -       Including ASN (and IP prefix) in an analysis can make a 
>> significant difference in the overall diversity associated with 
>> observed queries.
>>
>> -       There is significance in the context provided by the QNAME, 
>> not only in measuring diversity, but also in query representativeness 
>> across root servers.
>>
>> -       Root servers—even just A and J—have a non-negligible 
>> amount of uniqueness that is not captured—or even addressed—in 
>> this document.
>>
>> -       More root servers provide a greater perspective on potential 
>> name collisions than any single one.
>>
>> -       The population of stub resolvers should be considered in the 
>> analysis of large recursive resolvers.
>>
>>
>>
>>> On Jan 24, 2022, at 2:43 PM, Thomas, Matthew via NCAP-Discuss 
>>> <ncap-discuss at icann.org> wrote:
>>>
>>> NCAP DG,
>>> As set during our last meeting on 19 January 2022, we pushed the 
>>> start of the public comment period for “A Perspective Study of DNS 
>>> Queries for Non-Existent Top-Level Domains” to this Thursday, 27 
>>> January 2022, in order to accommodate some last minute questions. 
>>> Additionally, as previously announced, today ends the comment period 
>>> for the release of this document.
>>> Attached is the FINAL DRAFT version of “A Perspective Study of DNS 
>>> Queries for Non-Existent Top-Level Domains”. If you have any 
>>> objections to this document being released for public comment please 
>>> reply to this message on the list. The objection period will close 
>>> at the end of our weekly meeting on Wednesday, 26 January 2022. 
>>> Comments that do not substantially change our stated conclusions 
>>> will be captured and considered after the public comment period when 
>>> we will be reviewing all public comments received.
>>> A view-only version of the document is here: 
>>> https://docs.google.com/document/d/1lMoicMusWGB0u3Jpqe0CB7kJHw_gR-alIXZErjxNEws/view#
>>> Matt Thomas
