[NCAP-Discuss] Last call for A Perspective Study of DNS Queries for Non-Existent Top-Level Domains

Casey Deccio casey at deccio.net
Tue Jan 25 21:19:45 UTC 2022


Jim, thanks for the response.

Two comments:

1. My concerns have been grossly misrepresented in your response.  My concerns are not about "incomplete data" but about using only a biased subset of available data to drive conclusions.  The "collisions" document works from the same data set as the "perspective" document (for the root server analysis), and I do not have the same concerns with the "collisions" document because it does not suffer from the same data issues.  Also, the conclusions drawn from the "perspective" document are very different from those of the "collisions" document; the incongruence is glaring.

2. Even in the case where there is incomplete data (e.g., the recursive resolver data), the fact that the set is incomplete is not my concern.  The concern is making comparisons that are unfair or biased with the data that we have, and drawing conclusions from those comparisons.

While I understand that my text was lengthy, I've made some effort to organize my thoughts.  I am happy to answer further questions, but I encourage you and others to read and evaluate those concerns.

Finally, I reference the following statement from your email:

> ...the consensus of the past several months of analysis discussion is that the Discussion Group believes the document is ready to be released for public comment.

I agree that there have been discussions over the past several months, and perhaps absence of objection is consensus.  But having a document in hand for review is a completely different matter than simply discussing it in a meeting.  The document was first posted on Dec 21, right before the holiday season for many of us, and with significant changes having been integrated in the document just in the past week and a half.  So... can there really be consensus with the document at this point?

Casey


> On Jan 25, 2022, at 12:27 PM, James Galvin <galvin at elistx.com> wrote:
> 
> First, I have to say that we all owe a debt of gratitude to Casey for his thorough and detailed review of this document. He raises some good questions that should be considered and responded to directly.
> 
> 
> TL;DR - The question before this group is whether this document is ready to be released for public comment on Thursday, 27 January 2022. Speaking as a co-Chair, the question I’m considering is whether or not the Key Findings of this document are at risk since, if so, the document would not be ready for public comment. It is my considered opinion that the Key Findings are not at risk and that this document should be released for public comment on Thursday. In addition, the discussion that Casey has started should continue on the mailing list.
> 
> In our 26 January 2022 meeting, absent any substantive objections, we will declare that the consensus of the past several months of analysis discussion is that the Discussion Group believes the document is ready to be released for public comment.
> 
> 
> Long-winded response:
> 
> Some of Casey’s concerns are focused on whether we have complete data. This is a fair concern because we don’t have complete data. In addition, in a few cases we have chosen to set aside some data sets, e.g., the 5 root server data sets that were excluded from the root server analysis. This is an ordinary thing to do in data science. The most important thing to do is to be very clear about the data you are using and note that any conclusions are only based on what you know (i.e., you don’t know what you don’t know). We do this.
> 
> We also know that we will never get complete data. We have not said a lot about this. So far we have noted that there are some legal constraints associated with a number of parties sharing the data we do have. In fact, although Verisign has used its data for some detailed analysis, for which we are extremely grateful, consistent with other root server operators, they have not made their data generally available to others. Public Recursive Resolvers have presented the same issue to us; only one did some level of analysis for us, and another did even more limited analysis.
> 
> As a result of both of these points, we have not done a complete and thorough analysis of all possible data. Nonetheless, I do believe that our conclusions are supported by the data we have.
> 
> The first key finding is that analysis of the data at any root server identifier is sufficiently representative of the root server system as a whole. Bottom line - there is some subjectivity within this statement. However, statistics provides us with methods to measure the quality of comparisons, and Matt Thomas has done this and presented it to us. It is as good as it can be, given the data we’re working with.
> 
> Of course, since it is not a “perfect” analysis, there is some residual risk. It is essential we capture this point and explain it in our final work product. In fact, if you review the text that is already under development there, the point of capturing residual risk is already listed.
> 
> The second key finding is that traffic observed at root servers is not sufficiently representative of traffic at recursive resolvers. Frankly, this point is self-evident. We may have incomplete data from a single recursive resolver, but it nonetheless proves exactly this point. Certainly there might be many things we could learn from a more complete analysis and study of a more complete set of data at recursive resolvers, but none of that changes the key finding that public recursive resolvers see a different DNS infrastructure.
> 
> By the way, there is also residual risk here, which is also captured in final report draft text. There’s much more to say here to explain it, but it too does not change the key finding.
> 
> As Casey points out, the implications noted in these key findings are subject to discussion in a broader context. These implications will be brought forward to the final work product and discussed within the context of the workflow we have developed.
> 
> In summary, there are two important things to consider. First, is the data analysis sufficient to support the key findings? Second, is the residual risk a fundamental concern or an ordinary risk management question to be considered?
> 
> Some may be concerned that they cannot evaluate these issues directly themselves. My suggestion is that we continue this discussion of these technical issues on the mailing list. This will facilitate thoughtful and detailed responses, and allow everyone the opportunity to share the details with other experts for review.
> 
> See you all tomorrow,
> 
> Jim
> 
> 
> 
> 
> 
> On 25 Jan 2022, at 11:19, Casey Deccio wrote:
> 
> My apologies that I am responding to my own email.  Someone noted to me that I neglected two very important points.  First, I mentioned that I didn’t agree with the conclusions of the document, but I only provided my critique of the analysis, not the conclusions.  Second, I have not explicitly provided any suggestions for a path forward.  Let me correct that by acting on those suggestions.
>  
> 
> Conclusions
> ---------------
>  
> ** Study 1 Key Observations
> I have no disagreement with this section.  These are an accurate summary of the results of Study 1.
>  
> 
> ** Study 2 Key Observations
> The following statement is true, but the qualifying factor “top” is based on a comparison (query count and IP address diversity) that is unfair (see Concerns 5 and 6).
>  
> “Initial results from one PRR indicate there is a difference in top non-existent TLDs using either query volume or source diversity measurements.”
>  
> The following two statements are generalities inferred from the previous statement, and they are not supported by the data, precisely because of Concerns 5 and 6.
>  
> “Many non-existent TLDs (roughly 40%) observed at the PRR are not in the top RSIs based on query volume. Nearly 30% observed at the PRR are not in the top RSIs based on source diversity.”
>  
> “… name collision strings cannot be measured or assessed properly based on only using data from the RSS.”
>  
> I agree and sympathize with the notion that there were heavy constraints of privacy and data aggregation associated with the analysis of the public recursive resolver data, but the comparison made thus far is an unfair comparison.
>  
>  
> ** Key Findings
>  
> The following statement is true, but based on analysis that was performed on highly biased data, specifically less than 1% of IP addresses observed at the root servers and only the top 10,000 and top 1,000 non-existent TLDs (see Concerns 1, 2, 3, 4, and 7):
>  
> “Non-existent DNS queries for top querying and top source diversity TLDs appear to be comparable and representative at any RSI.”
>  
> (Nit: Sentence reads “Non-existent DNS queries”, but I think what is meant is “DNS queries for non-existent TLDs.”)
>  
> The following statement is inconclusive because of the bias in the data that was analyzed.
>  
> “PRR data further indicates that there is a very different view of the top non-existent TLDs based both on query volume and source diversity.”
>  
> I do not *disagree* with the following statement:
>  
> “ICANN, as the operator for the L RSI, is well-positioned to instrument, collect, analyze, and disseminate name collision measurements to subsequent gTLD applicants both prior to submission and during the application review.”
>  
> But I feel like the point of the document was to motivate this with “You’ve seen one, you’ve seen them all.”  And the analysis does not support that, at least not as generally as it was stated.
>  
> I do not believe that the following statements are supported by the data:
>  
> “Name collision traffic observed at the root is not sufficiently representative of traffic received at recursive resolvers to guarantee a complete and or accurate representation of a string’s potential name collision risks and impacts.”
>  
> “Name collision strings cannot be measured or assessed properly based on only using data from the RSS. Obtaining an accurate picture of name collision risks can only be obtained via delegation.”
>  
> These might well be true, but the analysis in this document does not motivate them (see Concerns 5 and 6).  There are other factors that might contribute to these, some considered in this document (negative caching) and others not (local root and aggressive negative caching).  But my point is that the current analysis does not lead me to the conclusions included in the sentences above.
>  
>  
> Suggestions for Improvement
> ---------------
>  
> ** Limit conclusions in “Key Findings”.
> 
> Study 1 Key Observations is an example of conclusions that *can* be drawn from the existing analysis.  The step from these to generalities of representativeness of data in “Key Findings” is where my concerns lie.  If the “Key Findings” related to Study 1 are honed to stay in scope with the analysis, their impact might be significantly less, but they would give me less concern.
>  
> 
> ** Revamp Root Server Analysis.
> If the purpose of the document is to come to more general conclusions, such as those previously mentioned, then the analysis needs to be revamped:
> -       Rather than selecting biased data (i.e., top talkers), the data must be representative across IP prefix, ASN, and IP version.  Those dimensions are completely missing from the current analysis.
> -       Before any filtering is done, a comprehensive analysis should be performed with all IP addresses.  Even if the conclusion is that a filter of some sort is appropriate, and that such filtering can yield a representative set, no comparison is given, and the reader is simply left to make that leap of faith.
> -       Any filtering should not limit the analysis to the IP addresses with the most queries—certainly not the top 1%.  Some filter by query count might be fine, but it should be a low bar, and it should be justified by behavior and representation (IP prefix, ASN, IP version).
>  
> 
> ** Revamp Comparison of Root Server Queries and Public Recursive Resolver
> I understand that there are data constraints within which the recursive data must be analyzed, but there are analyses that *can* be done, even within those constraints.  For example, rather than sorting by top query count and IP address diversity, start with the complete set of non-existent TLDs and (if the data includes it) full QNAME diversity, or at least SLD diversity.  As it is, the comparison of root server data and public recursive resolver data is unfair and therefore does not provide substance.
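As a purely illustrative sketch of the suggested full-set comparison (the TLD strings below are hypothetical, not drawn from the study's data), starting from the complete sets of non-existent TLDs at two vantage points allows a simple set-overlap measure such as the Jaccard index, with no top-N ranking involved:

```python
# Hypothetical non-existent TLD sets seen at a root server and at a
# public recursive resolver (PRR); the strings are made up for
# illustration only.
root_tlds = {"corp", "home", "mail", "lan", "internal", "localdomain"}
prr_tlds = {"corp", "home", "mail", "local", "dlink", "internal"}

# Jaccard index: size of intersection over size of union.
jaccard = len(root_tlds & prr_tlds) / len(root_tlds | prr_tlds)
print(f"overlap: {jaccard:.2f}")  # → overlap: 0.50
```

The same computation extends to full QNAME or SLD sets when the data permits, and it sidesteps the ranking bias that a top-N comparison introduces.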
>  
>  
> ** Accept Data-Driven Conclusions
> There might be some desirable conclusions that the data simply does not support.  I have little concern with listing conclusions that are not what was anticipated beforehand (i.e., hypotheses).  Even very caveated conclusions coming from an analysis can be enough to inform decisions, if those caveats are considered for what they are worth.  I have great concern, however, with conclusions that are not data-driven.
> 
> 
>> On Jan 24, 2022, at 11:25 PM, Casey Deccio <casey at deccio.net> wrote:
>> 
>> Dear all,
>> 
>> I have taken the time to study the “Perspective” document, as well as the document “Case Study of Collision Strings”, which is also being produced by NCAP in connection with Study 2.  I appreciate all the time and effort that has gone into the analysis contained in “A Perspective Study of DNS Queries for Non-Existent Top-Level Domains”.  I know that it has required no small effort.
>> 
>> Nonetheless, I have fundamental concerns about the analysis contained in “Perspective”, and I also do not agree with the conclusions that are drawn from the analysis.  Additionally, I find the analysis and conclusions in “Perspective” to be at odds with those contained in “Case Study”.  Finally, I believe my concerns to be substantial enough that they cannot be corrected with minor edits, and I *do not* support the document moving forward.  I herein detail my concerns.
>> 
>> Sincerely,
>> Casey
>> 
>>  
>> 
>> Summary:
>> Concern 1: Analysis based on biased sample of querying IP addresses.
>> Concern 2: Sample data refined to support the conclusion.
>> Concern 3: Analysis based on biased sample of non-existent TLDs.
>> Concern 4:  TLDs considered without QNAME context.
>> Concern 5: Query count used as comparison between recursive server and root servers.
>> Concern 6: Unique IP addresses used as comparison between recursive server and root servers.
>> Concern 7: Disagrees with findings from “Case Study of Collision Strings”.
>> 
>> 
>> Details:
>> 
>> 
>> **Concern 1: Analysis based on biased sample of querying IP addresses.
>> 
>> The sample on which the analysis and conclusions are based is selected exclusively by proportion of queries observed during the collection period.  Specifically, fewer than 1% (0.67% or 115K) of the 17M IP addresses (15.51M IPv4 and 1.56M IPv6) observed across the 2020 DITL are considered for the analysis—those producing the most queries (90% of the DITL data); that excludes 99% of IP addresses from analysis.  Because the set of “top talker” IP addresses is selected based only on the volume of traffic, it is severely biased and is not necessarily representative of resolvers world-wide.  Those that query most—for whatever reasons—are the loudest, and without further examination, it’s hard to even know why. The concern is not even just whether or not it is okay to exclude non-top-talkers, but whether top-talkers are themselves an appropriate representation.  Other metrics that could be used to quantify network representation for the selection process and/or analysis of top-talkers are missing from the analysis, including IP prefix (e.g., /16, /24, /48, /64), ASN, and even IP version.  See also Concern 7 for more.
>> 
>> The analysis in Annex 2 is very interesting, but does not, by itself, resolve this concern.  The annex provides some very helpful lists of top queries, from the sources issuing few queries, resulting in NXDOMAIN responses, and there are some comparisons of the percentage of queries resulting in NXDOMAIN responses out of the total number of queries, but even those are difficult to assess without a full behavioral analysis.
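The skew described in Concern 1 can be sketched with toy numbers (the per-IP query counts below are hypothetical and much smaller than the DITL data): a cutoff at 90% of cumulative query volume can retain only a handful of loud sources while excluding nearly all addresses.

```python
from collections import Counter

# Hypothetical per-IP query counts: a few loud sources, many quiet ones.
queries = Counter({f"10.0.0.{i}": 10_000 for i in range(5)})
queries.update({f"192.0.2.{i}": 10 for i in range(200)})

# Select "top talkers": loudest IPs covering 90% of total query volume.
total = sum(queries.values())
top_talkers, covered = [], 0
for ip, count in queries.most_common():
    if covered >= 0.90 * total:
        break
    top_talkers.append(ip)
    covered += count

excluded = len(queries) - len(top_talkers)
print(f"{len(top_talkers)} of {len(queries)} IPs cover "
      f"{covered / total:.0%} of queries; {excluded} IPs excluded")
# → 5 of 205 IPs cover 96% of queries; 200 IPs excluded
```

In this toy case roughly 2% of addresses account for over 90% of traffic, so a volume-based filter silently discards the other 98% regardless of how representative they are by prefix, ASN, or IP version.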
>> 
>>  
>> **Concern 2: Sample data refined to support the conclusion.
>> 
>> While the original sample is already of questionable representativeness (less than 1% of IP addresses observed, selected solely by query volume), that dataset is further refined, according to the following text (i.e., from the document):
>> 
>>  “On average, each RSI observed 96% of the top talkers that account for 90% of total traffic.  That percentage drops to 94% when using the 95th percentile top talkers. Based on these findings, only the 90th percentile top talkers were used for the remaining measurements in this study.”
>> 
>> If the objective of the analysis is to quantify the overlap of observed query data across the root servers, and to ultimately determine whether the queries observed at one server are representative of the queries observed across all samples, then refinement of sampled IP addresses to support that conclusion is inappropriate.
>> 
>>  
>> **Concern 3: Analysis based on biased sample of non-existent TLDs.
>> 
>> The queries for non-existent TLDs, which result in NXDOMAIN responses at the root servers, are compared across the root servers, to see how well they are represented.  However, like observed IP addresses (Concern 1), the non-existent TLDs are limited to those corresponding to the most queries observed—both the top 10,000 and the top 1,000.  This is independent of querying IP address, ASN, and other aggregating features, which would help better understand the diversity of the queries for each non-existent TLD.  For example, it might be that the non-existent TLDs most queried for come from a small pool of IP addresses or networks, and others are being excluded simply because they are outside that sample.
>> 
>>  
>> **Concern 4:  TLDs considered without QNAME context.
>> 
>> While comparisons are made to measure the representativeness of non-existent TLDs, one primary feature missing from the analysis is the QNAME.  In all cases, the non-existent TLD is considered in isolation, yet QNAME context is shown in the analysis to be a significant contributor to quantifying name collisions potential (see Concern 7).
>> 
>>  
>> **Concern 5: Query count used as comparison between recursive server and root servers.
>> 
>> Because of (negative) caching at recursive servers, it is expected that queries observed at the root servers for a given non-existent TLD will be fewer than those at a recursive resolver for that same non-existent TLD.  It is this very caching behavior that makes the comparison of query count for a given non-existent TLD, as observed by the root servers vs. a recursive resolver, an apples-to-oranges comparison.  Yet the analysis includes a comparison of the top 1,000 non-existent TLDs, ranked by query count.  Thus, no meaningful conclusions can be drawn from this comparison.
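The suppression effect of negative caching can be illustrated with a toy model (the negative-cache TTL and client query interval below are assumed round numbers, not measurements): only cache misses at the recursive resolver reach the root, so root-side counts are a small fraction of client-side counts.

```python
# Assumed parameters for illustration only.
NEG_TTL = 900            # negative-cache TTL in seconds (hypothetical)
QUERY_INTERVAL = 10      # one client query every 10 seconds (hypothetical)
DURATION = 86_400        # one day

# Queries seen at the recursive resolver vs. cache misses forwarded
# to the root (at most one per negative-cache TTL window).
client_queries = DURATION // QUERY_INTERVAL
root_queries = DURATION // NEG_TTL

print(f"recursive sees {client_queries} queries, "
      f"root sees {root_queries} ({root_queries / client_queries:.1%})")
# → recursive sees 8640 queries, root sees 96 (1.1%)
```

Even in this simplistic model, the two vantage points differ by two orders of magnitude for the same non-existent TLD, which is why ranking TLDs by raw query count at each vantage point invites an apples-to-oranges comparison.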
>> 
>>  
>> **Concern 6: Unique IP addresses used as comparison between recursive server and root servers.
>> 
>> Study 2 includes source diversity when comparing the query counts for non-existent TLDs.  There is certainly more value in investigating IP source diversity when considering the query counts for non-existent TLDs than in considering query counts alone (Concern 5).  However, it is expected that recursive resolvers serve a very different client base than authoritative servers, specifically the root servers.  Whereas the former might expect queries from stub resolvers, the latter might expect queries from recursive resolvers.  In such a case, analyzing client IP addresses independently of one another leaves out significant meaningful context, such as the diversity of IP prefixes or ASNs from which queries arrive.  A large number of IP addresses from the same IP prefix or ASN might be responsible for the queries associated with several “top” non-existent TLDs, excluding non-existent TLDs that might have non-trivial presence but do not have the top IP address diversity.  See also Concern 7.
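The aggregation point can be sketched with hypothetical addresses: a raw unique-IP count can look diverse while nearly all of the sources sit in a single prefix (the same idea applies to ASN grouping, given a routing table or origin-AS mapping).

```python
from ipaddress import ip_network

# Hypothetical source addresses: 100 IPs from one /24, plus two others.
sources = [f"198.51.100.{i}" for i in range(1, 101)]
sources += ["203.0.113.7", "192.0.2.9"]

# Collapse each source address into its covering /24 prefix.
prefixes = {ip_network(f"{ip}/24", strict=False) for ip in sources}
print(f"{len(sources)} unique IPs, but only {len(prefixes)} unique /24s")
# → 102 unique IPs, but only 3 unique /24s
```

A TLD ranked "top" by 102 unique source IPs here has essentially three networks behind it, which is the kind of context the unique-IP comparison omits.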
>> 
>> 
>> **Concern 7: Disagrees with findings from “Case Study of Collision Strings”.
>> 
>> The document “Case Study of Collision Strings”, also written in connection with NCAP Study 2, contains the following findings:
>> 
>> 1.     “A relatively small number of origin ASNs account for the vast majority of query traffic for .CORP, .HOME, and .MAIL. In all cases roughly 200 ASNs make up nearly 90% of the volume” (section 4.1.5).
>> 
>> 2.     “Label analysis provides a unique observational context into the underlying systems, networks, and protocols inducing leakage of DNS queries to the global DNS ecosystem. Understanding the diversity of labels can help provide a sense of how broadly disseminated the leakage is throughout the DNS” (section 4.2.1).
>> 
>> 3.     “The .CORP SLDs seen at both A and J (approximately 16 thousand) is almost equal to those seen at A-root alone, but J-root sees over 30,000 .CORP SLDs that A-root does not see” (section 4.3.1).
>> 
>> 4.     “Across all names studied, while A and J saw much in common, there was a non-negligible amount of uniqueness to each view. For example, A and J each saw queries from the same 5717 originating ASNs, but J saw 2477 ASNs that A didn't see and A saw 901 that J didn't see” (section 4.3.2).
>> 
>> 5.     “A more intensive and thorough analysis would include other root server vantage points to minimize potential bias in the A and J catchments” (section 5.2).
>> 
>> 6.     “Additional measurement from large recursive resolvers would also help elucidate any behaviors masked by negative caching and the population of stub resolvers” (section 5.2).
>> 
>> 
>> These findings emphasize the following points, which are at odds with the "Perspective" document:
>> 
>> -       Including ASN (and IP prefix) in an analysis can make a significant difference in the overall diversity associated with observed queries.
>> 
>> -       There is significance in the context provided by the QNAME, not only in measuring diversity, but also in query representativeness across root servers.
>> 
>> -       Root servers—even just A and J—have a non-negligible amount of uniqueness that is not captured—or even addressed—in this document.
>> 
>> -       Multiple root servers together provide a greater perspective on potential name collisions than any one alone.
>> 
>> -       The population of stub resolvers should be considered in the analysis of large recursive resolvers.
>> 
>> 
>> 
>>> On Jan 24, 2022, at 2:43 PM, Thomas, Matthew via NCAP-Discuss <ncap-discuss at icann.org> wrote:
>>> 
>>> NCAP DG,
>>> As set during our last meeting on 19 January 2022, we pushed the start of the public comment period for “A Perspective Study of DNS Queries for Non-Existent Top-Level Domains” to this Thursday, 27 January 2022, in order to accommodate some last minute questions. Additionally, as previously announced, today ends the comment period for the release of this document.
>>> Attached is the FINAL DRAFT version of “A Perspective Study of DNS Queries for Non-Existent Top-Level Domains”. If you have any objections to this document being released for public comment please reply to this message on the list. The objection period will close at the end of our weekly meeting on Wednesday, 26 January 2022. Comments that do not substantially change our stated conclusions will be captured and considered after the public comment period when we will be reviewing all public comments received.
>>> A view only version of the document is here: https://docs.google.com/document/d/1lMoicMusWGB0u3Jpqe0CB7kJHw_gR-alIXZErjxNEws/view#
>>> Matt Thomas
>>> <Last Call A Perspective Study of DNS Queries for Non-Existent Top-Level Domains.pdf>_______________________________________________
>>> NCAP-Discuss mailing list
>>> NCAP-Discuss at icann.org <mailto:NCAP-Discuss at icann.org>
>>> https://mm.icann.org/mailman/listinfo/ncap-discuss <https://mm.icann.org/mailman/listinfo/ncap-discuss>
>>> 
>>> _______________________________________________
>>> By submitting your personal data, you consent to the processing of your personal data for purposes of subscribing to this mailing list accordance with the ICANN Privacy Policy (https://www.icann.org/privacy/policy <https://www.icann.org/privacy/policy>) and the website Terms of Service (https://www.icann.org/privacy/tos <https://www.icann.org/privacy/tos>). You can visit the Mailman link above to change your membership status or configuration, including unsubscribing, setting digest-style delivery or disabling delivery altogether (e.g., for a vacation), and so on.
>> 
> 
