[NCAP-Discuss] Last call for A Perspective Study of DNS Queries for Non-Existent Top-Level Domains

Tue Jan 25 19:57:46 UTC 2022

Thank you for your comments, Anne.

Jim

On 25 Jan 2022, at 14:44, Aikman-Scalese, Anne wrote:

> Thanks Jim.  As you know, I am not a technical person, but your 
> observations make sense to me.  Something else I wanted to point out 
> from a procedural standpoint is that a key finding supports the 
> workflow process we have all been discussing for some time now.  That 
> key finding is one that Casey questioned:
>
> “Name collision strings cannot be measured or assessed properly 
> based on only using data from the RSS. Obtaining an accurate picture 
> of name collision risks can only be obtained via delegation.”
>
> To my mind, the language in bold above is fairly critical to the 
> analysis in relation to the recommendation the DG will be making to 
> the SSAC (and possibly then flowing through to the ICANN Board in 
> response to its questions to the SSAC.)
>
> I believe that over many DG meetings and discussions, there has been a 
> general consensus on the view that delegation will be necessary in the 
> workflow in order for the ICANN Board to determine Name Collision Risk 
> Assessment.   (This step, as I understand it, is not delegation in the 
> sense of a contract having been awarded and a permanent (or 
> semi-permanent) delegation to the root.  Rather it is an initial test 
> to be supervised by the Technical Review Committee we are 
> recommending, with results to be provided to the Board for further 
> determination.)
>
> Anne
>
>
>
>
> Anne E. Aikman-Scalese
>
> Of Counsel
>
>
>
> AAikman at lewisroca.com
>
> D. 520.629.4428
>
>
>
>
> From: NCAP-Discuss <ncap-discuss-bounces at icann.org> On Behalf Of James 
> Galvin
> Sent: Tuesday, January 25, 2022 12:27 PM
> To: Casey Deccio <casey at deccio.net>
> Cc: ncap-discuss at icann.org
> Subject: Re: [NCAP-Discuss] Last call for A Perspective Study of DNS 
> Queries for Non-Existent Top-Level Domains
>
>
> First, I have to say that we all owe a debt of gratitude to Casey for 
> his thorough and detailed review of this document. He raises some good 
> questions that should be considered and responded to directly.
>
>
> TL;DR - The question before this group is whether this document is 
> ready to be released for public comment on Thursday, 27 January 2022. 
> Speaking as a co-Chair, the question I’m considering is whether or 
> not the Key Findings of this document are at risk since, if so, the 
> document would not be ready for public comment. It is my considered 
> opinion the Key Findings are not at risk and that this document should 
> be released for public comment on Thursday. In addition, the 
> discussion that Casey has started should continue on the mailing list.
>
> In our 26 January 2022 meeting, absent any substantive objections, we 
> will declare that the consensus of the past several months of analysis 
> and discussion is that the Discussion Group believes the document is ready 
> to be released for public comment.
>
>
> Long-winded response:
>
> Some of Casey’s concerns are focused on whether we have complete 
> data. This is a fair concern because we don’t have complete data. In 
> addition, in a few cases we have chosen to set aside some data sets, 
> e.g., the 5 root server data sets that were excluded from the root 
> server analysis. This is an ordinary thing to do in data science. The 
> most important thing to do is to be very clear about the data you are 
> using and note that any conclusions are only based on what you know 
> (i.e., you don’t know what you don’t know). We do this.
>
> We also know that we will never get complete data. We have not said a 
> lot about this. So far we have noted that there are some legal 
> constraints associated with a number of parties sharing the data we do 
> have. In fact, although Verisign has used its data for some detailed 
> analysis, for which we are extremely grateful, consistent with other 
> root server operators, they have not made their data generally 
> available to others. Public Recursive Resolvers have presented the 
> same issue to us; there was only one that did some level of analysis 
> for us and another with even more limited analysis.
>
> As a result of both of these points, we have not done a complete and 
> thorough analysis of all possible data. Nonetheless, I do believe that 
> our conclusions are supported by the data we have.
>
> The first key finding is that analysis of the data at any root server 
> identifier is sufficiently representative of the root server system in 
> total. Bottom line - there is some subjectivity within this statement. 
> However, statistics provides us with methods to measure the quality of 
> comparisons and Matt Thomas has done this and presented it to us. It 
> is as good as it can be, given the data we’re working with.
>
> Of course, since it is not a “perfect” analysis, there is some 
> residual risk. It is essential we capture this point and explain it in 
> our final work product. In fact, if you review the text that is 
> already under development there, the point of capturing residual risk 
> is already listed.
>
> The second key finding is that traffic observed at root servers is not 
> sufficiently representative of traffic at recursive resolvers. 
> Frankly, this point is self-evident. We may have incomplete data from 
> a single recursive resolver, but it nonetheless proves exactly this 
> point. Certainly there might be many things we could learn from a more 
> complete analysis and study of a more complete set of data at 
> recursive resolvers, but none of that changes the key finding that 
> public recursive resolvers see a different DNS infrastructure.
>
> By the way, there is also residual risk here, which is also captured 
> in the final report draft text. There’s much more to say here to explain 
> it, but it too does not change the key finding.
>
> As Casey points out, the implications noted in these key findings are 
> subject to discussion in a broader context. These implications will be 
> brought forward to the final work product and discussed within the 
> context of the workflow we have developed.
>
> In summary, there are two important things to consider. First, is the 
> data analysis sufficient to support the key findings? Second, is the 
> residual risk a fundamental concern or an ordinary risk management 
> question to be considered?
>
> Some may be concerned they cannot evaluate these issues directly 
> themselves. My suggestion is that we continue this discussion of these 
> technical issues on the mailing list. This will facilitate thoughtful 
> and detailed responses, and allow everyone the opportunity to share 
> the discussion with other experts to review the details.
>
> See you all tomorrow,
>
> Jim
>
>
>
>
> On 25 Jan 2022, at 11:19, Casey Deccio wrote:
> My apologies that I am responding to my own email.  Someone noted to 
> me that I neglected two very important points.  First, I mentioned 
> that I didn’t agree with the conclusions of the document, but I only 
> provided my critique of the analysis, not the conclusions.  Second, I 
> have not explicitly provided any suggestions for a path forward.  Let 
> me correct that by acting on those suggestions.
>
> Conclusions
> ---------------
>
> ** Study 1 Key Observations
> I have no disagreement with this section.  These are an accurate 
> summary of the results of Study 1.
>
> ** Study 2 Key Observations
> The following statement is true, but the qualifying factor “top” 
> is based on a comparison (query count and IP address diversity) that 
> is unfair (see Concerns 5 and 6).
>
> “Initial results from one PRR indicate there is a difference in top 
> non-existent TLDs using either query volume or source diversity 
> measurements.”
>
> The following two statements are generalities inferred from the 
> previous statement, and which are not supported by the data, precisely 
> because of Concerns 5 and 6.
>
> “Many non-existent TLDs (roughly 40%) observed at the PRR are not in 
> the top RSIs based on query volume. Nearly 30% observed at the PRR are 
> not in the top RSIs based on source diversity.”
>
> “… name collision strings cannot be measured or assessed properly 
> based on only using data from the RSS.”
>
> I agree and sympathize with the notion that there were heavy 
> constraints of privacy and data aggregation associated with the 
> analysis of the public recursive resolver data, but the comparison 
> made thus far is an unfair comparison.
>
>
> ** Key Findings
>
> The following statement is true, but based on analysis that was 
> performed on highly biased data, specifically less than 1% of IP 
> addresses observed at the root servers and top 10,000 and 1,000 of 
> non-existent TLDs (see Concerns 1, 2, 3, 4, and 7):
>
> “Non-existent DNS queries for top querying and top source diversity 
> TLDs appear to be comparable and representative at any RSI.”
>
> (Nit: Sentence reads “Non-existent DNS queries”, but I think what 
> is meant is “DNS queries for non-existent TLDs”.)
>
> The following statement is inconclusive because of the bias in the 
> data that was analyzed.
>
> “PRR data further indicates that there is a very different view of 
> the top non-existent TLDs based both on query volume and source 
> diversity.”
>
> I do not *disagree* with the following statement:
>
> “ICANN, as the operator for the L RSI, is well-positioned to 
> instrument, collect, analyze, and disseminate name collision 
> measurements to subsequent gTLD applicants both prior to submission 
> and during the application review.”
>
> But I feel like the point of the document was to motivate this with 
> “You’ve seen one, you’ve seen them all.”  And the analysis 
> does not support that, at least not as generally as it was stated.
>
> I do not believe that the following statements are supported by the data:
>
> “Name collision traffic observed at the root is not sufficiently 
> representative of traffic received at recursive resolvers to guarantee 
> a complete and or accurate representation of a string’s potential 
> name collision risks and impacts.”
>
> “Name collision strings cannot be measured or assessed properly 
> based on only using data from the RSS. Obtaining an accurate picture 
> of name collision risks can only be obtained via delegation.”
>
> These might well be true, but the analysis in this document does not 
> motivate this (see Concerns 5 and 6).  There are other factors that 
> might contribute to these, some considered in this document (negative 
> caching) and others not (local root and aggressive negative caching).  
> But my point is that the current analysis does not lead me to the 
> conclusions included in the sentences above.
>
>
> Suggestions for Improvement
> ---------------
>
> ** Limit conclusions in “Key Findings”.
>
> Study 1 Key Observations is an example of conclusions that *can* be 
> drawn from the existing analysis.  The step from these to generalities 
> of representativeness of data in “Key Findings” is where my 
> concerns lie.  If the “Key Findings” related to Study 1 are honed 
> to be in scope with the analysis, their impact might be significantly 
> less, but it gives me less concern.
>
> ** Revamp Root Server Analysis.
> If the purpose of the document is to come to more general conclusions, 
> such as those previously mentioned, then the analysis needs to be 
> revamped:
> -       Rather than selecting biased data (i.e., top talkers), the 
> data must be representative of IP prefix, ASN, IP version.  Those are 
> completely missing from the current analysis.
> -       Before any filtering is done, a comprehensive analysis should 
> be performed, with all IP addresses.  Even if the conclusion is that a 
> filter of some sort is appropriate, and a representative set can be 
> yielded with such filtering, there is no comparison given, and the 
> reader is simply left to make that leap of faith.
> -       Any filtering should not limit the analysis to the IP 
> addresses with the most queries—certainly not the top 1%.  Some 
> filter by query count might be fine, but it should be a low bar, 
> and it should be justified by behavior and representation (IP prefix, 
> ASN, IP version).
>
> ** Revamp Comparison of Root Server Queries and Public Recursive 
> Resolver
> I understand that there are data constraints within which the 
> recursive data must be analyzed, but there are analyses that *can* be 
> done, even within those constraints.  For example, rather than sorting 
> by top query count and IP address diversity, start with the complete 
> set of non-existent TLDs and (if the data includes it) full QNAME 
> diversity, or at least SLD diversity.  As it is, the comparison of 
> root server data and public recursive resolver data is unfair and 
> therefore does not provide substance.
>
>
> ** Accept Data-Driven Conclusions
> There might be some desirable conclusions that the data simply does 
> not support.  I have little concern with listing conclusions that are 
> not what was anticipated beforehand (i.e., hypotheses).  Even very 
> caveated conclusions coming from an analysis can be enough to inform 
> decisions, if those caveats are considered for what they are worth.  I 
> have great concern, however, with conclusions that are not data-driven.
>
>
>
> On Jan 24, 2022, at 11:25 PM, Casey Deccio 
> <casey at deccio.net> wrote:
>
> Dear all,
>
> I have taken the time to study the “Perspective” document, as well 
> as the document “Case Study of Collision Strings”, which is also 
> being produced by NCAP in connection with Study 2.  I appreciate all 
> the time and effort that has gone into the analysis contained in “A 
> Perspective Study of DNS Queries for Non-Existent Top-Level 
> Domains”.  I know that it has required no small effort.
>
> Nonetheless, I have fundamental concerns about the analysis contained 
> in “Perspective”, and I also do not agree with the conclusions 
> that are drawn from the analysis.  Additionally, I find the analysis 
> and conclusions in “Perspective” to be at odds with those 
> contained in “Case Study”.  Finally, I believe my concerns to be 
> substantial enough that they cannot be corrected with minor edits, and 
> I *do not* support the document moving forward.  I herein detail my 
> concerns.
>
> Sincerely,
> Casey
>
>
>
> Summary:
> Concern 1: Analysis based on biased sample of querying IP addresses.
> Concern 2: Sample data refined to support the conclusion.
> Concern 3: Analysis based on biased sample of non-existent TLDs.
> Concern 4:  TLDs considered without QNAME context.
> Concern 5: Query count used as comparison between recursive server and 
> root servers.
> Concern 6: Unique IP addresses used as comparison between recursive 
> server and root servers.
> Concern 7: Disagrees with findings from “Case Study of Collision 
> Strings”.
>
>
> Details:
>
>
> **Concern 1: Analysis based on biased sample of querying IP addresses.
>
> The sample on which the analysis and conclusions are based is selected 
> exclusively by proportion of queries observed during the collection 
> period.  Specifically, fewer than 1% (0.67% or 115K) of the 17M IP 
> addresses (15.51M IPv4 and 1.56M IPv6) observed across the 2020 DITL 
> are considered for the analysis—those producing the most queries 
> (90% of the DITL data); that excludes 99% of IP addresses from 
> analysis.  Because the set of “top talker” IP addresses is 
> selected based only on the volume of traffic, it is severely biased 
> and is not necessarily representative of resolvers world-wide.  Those 
> that query most—for whatever reasons—are the loudest, and without 
> further examination, it’s hard to even know why. The concern is not 
> even just whether or not it is okay to exclude non-top-talkers, but 
> whether top-talkers are themselves an appropriate representation.  
> Other metrics that could be used to quantify network representation 
> for the selection process and/or analysis of top-talkers are missing 
> from the analysis, including IP prefix (e.g., /16, /24, /48, /64), 
> ASN, and even IP version.  See also Concern 7 for more.
>
> The analysis in Annex 2 is very interesting, but does not, by itself, 
> resolve this concern.  The annex provides some very helpful lists of 
> top queries from the few-queries resulting in NXDOMAIN responses, and 
> there are some comparisons of the percentage of queries resulting 
> in NXDOMAIN responses for the total given number of queries, but even 
> those are difficult to assess without a full behavioral analysis.
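A minimal sketch of the kind of aggregation Concern 1 asks for, assuming per-source query counts are available as (IP address, count) pairs. The function name `aggregate_sources`, the sample addresses, and the /24 and /48 grouping lengths are illustrative assumptions, not taken from the study. Grouping by ASN would need an external IP-to-ASN mapping and is left out here.

```python
import ipaddress
from collections import Counter

def aggregate_sources(counts, v4_len=24, v6_len=48):
    """Sum query counts per covering prefix and per IP version."""
    by_prefix = Counter()
    by_version = Counter()
    for addr, n in counts:
        ip = ipaddress.ip_address(addr)
        plen = v4_len if ip.version == 4 else v6_len
        # strict=False masks the host bits, giving the covering prefix
        net = ipaddress.ip_network(f"{ip}/{plen}", strict=False)
        by_prefix[str(net)] += n
        by_version[ip.version] += n
    return by_prefix, by_version

# Hypothetical sample using documentation address ranges
sample = [
    ("192.0.2.10", 500),
    ("192.0.2.77", 300),
    ("198.51.100.5", 20),
    ("2001:db8:1::1", 40),
    ("2001:db8:1:ff::2", 10),
]
prefixes, versions = aggregate_sources(sample)
```

In this made-up sample the two loudest IPv4 sources collapse into a single /24, which is the sort of context that a ranking by individual IP address hides.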
>
>
> **Concern 2: Sample data refined to support the conclusion.
>
> While the original sample is already of questionable representation 
> (less than 1% of IP addresses observed, based solely on query volume), 
> that dataset is further refined, according to the following text 
> (i.e., from the document):
>
>  “On average, each RSI observed 96% of the top talkers that account 
> for 90% of total traffic.  That percentage drops to 94% when using the 
> 95th percentile top talkers. Based on these findings, only the 90th 
> percentile top talkers were used for the remaining measurements in 
> this study.”
>
> If the objective of the analysis is to quantify the overlap of 
> observed query data across the root servers, and to ultimately 
> determine whether the queries observed at one server are 
> representative of the queries observed across all samples, then 
> refinement of sampled IP addresses to support that conclusion is 
> inappropriate.
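To make the selection step concrete, here is a rough sketch of how a top-talker set at a given traffic share, and its coverage at one RSI, might be computed. The function names, the source labels, and the counts are hypothetical. This is not the method used in the document, only one plausible reading of it.

```python
def top_talkers(counts, share=0.90):
    """Smallest set of sources, taken by descending volume, covering `share` of queries."""
    total = sum(counts.values())
    selected, running = set(), 0
    for src, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        if running >= share * total:
            break
        selected.add(src)
        running += n
    return selected

def coverage(talkers, seen_at_rsi):
    """Fraction of the top-talker set that a single RSI observed."""
    return len(talkers & seen_at_rsi) / len(talkers)

# Hypothetical per-source query counts across all RSIs
counts = {"a": 600, "b": 250, "c": 80, "d": 40, "e": 30}
t90 = top_talkers(counts, 0.90)   # sources covering 90% of traffic
t95 = top_talkers(counts, 0.95)   # sources covering 95% of traffic

# Hypothetical set of sources seen at one RSI
rsi_view = {"a", "b", "d"}
cov90 = coverage(t90, rsi_view)
cov95 = coverage(t95, rsi_view)
```

The coverage figure changes with the chosen share, so picking the share after looking at the coverage is a choice that shapes the result.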
>
>
> **Concern 3: Analysis based on biased sample of non-existent TLDs.
>
> The queries for non-existent TLDs, which result in NXDOMAIN responses 
> at the root servers, are compared across the root servers, to see how 
> well they are represented.  However, like observed IP addresses 
> (Concern 1), the non-existent TLDs are limited to those corresponding 
> to the most queries observed—both the top 10,000 and the top 1,000.  
> This is independent of querying IP address, ASN, and other aggregating 
> features, which would help better understand the diversity of the 
> queries for each non-existent TLD.  For example, it might be that the 
> non-existent TLDs most queried for come from a small pool of IP 
> addresses or networks, and others are being excluded simply because 
> they are outside that sample.
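A small sketch of how sensitive a "top N" comparison can be to the cutoff, assuming per-TLD query counts at two vantage points. The helper names and all counts are invented for illustration, and Jaccard similarity is used here only as one simple overlap measure, not as the measure used in the study.

```python
from collections import Counter

def top_n(tld_counts, n):
    """Set of the n most-queried TLD labels."""
    return {tld for tld, _ in Counter(tld_counts).most_common(n)}

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Hypothetical query counts for non-existent TLDs at two vantage points
rsi_x = {"corp": 900, "home": 800, "mail": 300, "lan": 50, "internal": 40}
rsi_y = {"home": 700, "corp": 650, "lan": 400, "mail": 30, "local": 20}

overlap_top3 = jaccard(top_n(rsi_x, 3), top_n(rsi_y, 3))
overlap_top5 = jaccard(top_n(rsi_x, 5), top_n(rsi_y, 5))
```

Strings just below the cutoff at one vantage point drop out of the comparison entirely, regardless of how many distinct networks query for them.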
>
>
> **Concern 4:  TLDs considered without QNAME context.
>
> While comparisons are made to measure the representativeness of 
> non-existent TLDs, one primary feature missing from the analysis is 
> the QNAME.  In all cases, the non-existent TLD is considered in 
> isolation, yet QNAME context is shown in the analysis to be a 
> significant contributor to quantifying name collisions potential (see 
> Concern 7).
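As an illustration of QNAME context, the sketch below counts distinct second-level labels under each TLD label. It assumes full QNAMEs are available, which may not hold under the data-sharing constraints discussed in this thread. The function name and the sample names are hypothetical.

```python
from collections import defaultdict

def sld_diversity(qnames):
    """Map each TLD label to the number of distinct second-level labels seen under it."""
    slds = defaultdict(set)
    for qname in qnames:
        labels = qname.rstrip(".").lower().split(".")
        if len(labels) < 2:
            continue  # a bare TLD query carries no SLD context
        slds[labels[-1]].add(labels[-2])
    return {tld: len(seen) for tld, seen in slds.items()}

# Hypothetical QNAMEs
queries = [
    "host1.example.corp.",
    "host2.example.corp.",
    "wpad.corp.",
    "printer.home.",
    "PRINTER.home",
    "corp.",
]
diversity = sld_diversity(queries)
```

Two TLDs with the same query count can differ widely on this measure, which is why the TLD considered in isolation says little about how broadly a string is used.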
>
>
> **Concern 5: Query count used as comparison between recursive server 
> and root servers.
>
> Because of (negative) caching at recursive servers, it is expected 
> that queries observed at the root servers for a given non-existent TLD 
> will be fewer than those at a recursive resolver for that same 
> non-existent TLD.  It is this very caching behavior that makes the 
> comparison of query count for a given non-existent TLD, as observed by 
> the root servers vs. a recursive resolver, an apples-to-oranges 
> comparison.  Yet the analysis includes a comparison of the top 1,000 
> non-existent TLDs, ranked by query count.  Thus, no meaningful 
> conclusions can be drawn from this comparison.
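The caching effect can be illustrated with a toy model: one resolver, one non-existent name, and a fixed negative-cache lifetime. The function name, the 900-second lifetime, and the arrival pattern are assumptions chosen for the example, not measured values.

```python
def upstream_queries(arrival_times, negative_ttl):
    """Count the queries a caching resolver forwards upstream for one non-existent name."""
    forwarded = 0
    cache_expires = None
    for t in sorted(arrival_times):
        if cache_expires is None or t >= cache_expires:
            # No valid negative cache entry: ask upstream and cache the answer
            forwarded += 1
            cache_expires = t + negative_ttl
    return forwarded

# Hypothetical load: one client query every 10 seconds for an hour
client_queries = list(range(0, 3600, 10))
seen_at_resolver = len(client_queries)
seen_at_root = upstream_queries(client_queries, negative_ttl=900)
```

Under these assumptions the resolver sees 360 queries while the root sees 4, and a name queried once per hour would produce the same 4 at the root only over a much longer window. Rankings by raw query count at the two layers are therefore measuring different things.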
>
>
> **Concern 6: Unique IP addresses used as comparison between recursive 
> server and root servers.
>
> Study 2 includes source diversity when comparing the query counts for 
> non-existent TLDs.  There is certainly more value in investigating IP 
> source diversity when considering the query counts for non-existent 
> TLDs than considering query counts alone (Concern 5).  However, it is 
> expected that recursive resolvers serve a very different client base 
> than authoritative servers, specifically the root servers.  Whereas 
> the former might expect queries from stub resolvers, the latter 
> might expect queries from recursive resolvers.  In such a case, 
> analyzing client IP addresses independently of one another leaves 
> significant meaningful context out, such as the diversity of IP 
> prefixes or ASNs from which queries arrive.  A large number of IP 
> addresses from the same IP prefix or ASN might be responsible for the 
> queries associated with several “top” non-existent TLDs, excluding 
> non-existent TLDs that might have non-trivial presence but do not have 
> the top IP address diversity.  See also Concern 7.
>
> **Concern 7: Disagrees with findings from “Case Study of Collision 
> Strings”.
>
> The document “Case Study of Collision Strings”, also written in 
> connection with NCAP Study 2, contains the following findings:
>
> 1.     “A relatively small number of origin ASNs account for the 
> vast majority of query traffic for .CORP, .HOME, and .MAIL. In all 
> cases roughly 200 ASNs make up nearly 90% of the volume” (section 
> 4.1.5).
>
> 2.     “Label analysis provides a unique observational context into 
> the underlying systems, networks, and protocols inducing leakage of 
> DNS queries to the global DNS ecosystem. Understanding the diversity 
> of labels can help provide a sense of how broadly disseminated the 
> leakage is throughout the DNS” (section 4.2.1).
>
> 3.     “The .CORP SLDs seen at both A and J (approximately 16 
> thousand) is almost equal to those seen at A-root alone, but J-root 
> sees over 30,000 .CORP SLDs that A-root does not see” (section 
> 4.3.1).
>
> 4.     “Across all names studied, while A and J saw much in common, 
> there was a non-negligible amount of uniqueness to each view. For 
> example, A and J each saw queries from the same 5717 originating ASNs, 
> but J saw 2477 ASNs that A didn't see and A saw 901 that didn't see” 
> (section 4.3.2).
>
> 5.     “A more intensive and thorough analysis would include other 
> root server vantage points to minimize potential bias in the A and J 
> catchments” (section 5.2).
>
> 6.     “Additional measurement from large recursive resolvers would 
> also help elucidate any behaviors masked by negative caching and the 
> population of stub resolvers” (section 5.2).
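The ASN figures quoted in finding 4 can be worked through directly. The sketch assumes the 901 refers to ASNs seen at A but not at J, and the variable names are illustrative.

```python
# Figures quoted from finding 4 of "Case Study of Collision Strings"
common = 5717   # ASNs seen at both A and J
only_j = 2477   # ASNs seen at J but not at A
only_a = 901    # ASNs seen at A but not at J (assumed reading of the quote)

seen_a = common + only_a           # 6618 ASNs visible from A
seen_j = common + only_j           # 8194 ASNs visible from J
union = common + only_j + only_a   # 9095 ASNs visible from either

jaccard = common / union                   # shared fraction of the combined view
j_missing_from_a = only_j / seen_j         # fraction of J's ASNs that A never saw
```

On these numbers the two roots share roughly 63% of the combined set of originating ASNs, and roughly 30% of the ASNs seen at J were not seen at A.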
>
>
> These findings emphasize the following points, which are at odds with 
> the "Perspective" document:
>
> -       Including ASN (and IP prefix) in an analysis can make a 
> significant difference in the overall diversity associated with 
> observed queries.
>
> -       There is significance in the context provided by the QNAME, 
> not only in measuring diversity, but also in query representativeness 
> across root servers.
>
> -       Root servers—even just A and J—have a non-negligible 
> amount of uniqueness that is not captured—or even addressed in this 
> document.
>
> -       More root servers have a greater perspective of potential name 
> collisions than one.
>
> -       The population of stub resolvers should be considered in the 
> analysis of large recursive resolvers.
>
>
>
>
> On Jan 24, 2022, at 2:43 PM, Thomas, Matthew via NCAP-Discuss 
> <ncap-discuss at icann.org> wrote:
>
> NCAP DG,
> As set during our last meeting on 19 January 2022, we pushed the start 
> of the public comment period for “A Perspective Study of DNS Queries 
> for Non-Existent Top-Level Domains” to this Thursday, 27 January 
> 2022, in order to accommodate some last minute questions. 
> Additionally, as previously announced, today ends the comment period 
> for the release of this document.
> Attached is the FINAL DRAFT version of “A Perspective Study of DNS 
> Queries for Non-Existent Top-Level Domains”. If you have any 
> objections to this document being released for public comment please 
> reply to this message on the list. The objection period will close at 
> the end of our weekly meeting on Wednesday, 26 January 2022. Comments 
> that do not substantially change our stated conclusions will be 
> captured and considered after the public comment period when we will 
> be reviewing all public comments received.
> A view only version of the document is 
> here: https://docs.google.com/document/d/1lMoicMusWGB0u3Jpqe0CB7kJHw_gR-alIXZErjxNEws/view
> Matt Thomas
> <Last Call A Perspective Study of DNS Queries for Non-Existent 
> Top-Level Domains.pdf>
> _______________________________________________
> NCAP-Discuss mailing list
> NCAP-Discuss at icann.org
> https://mm.icann.org/mailman/listinfo/ncap-discuss
>
> _______________________________________________
> By submitting your personal data, you consent to the processing of 
> your personal data for purposes of subscribing to this mailing list 
> accordance with the ICANN Privacy Policy 
> (https://www.icann.org/privacy/policy) and the website Terms of 
> Service (https://www.icann.org/privacy/tos). You can visit the Mailman 
> link above to change your membership status or configuration, 
> including unsubscribing, setting digest-style delivery or disabling 
> delivery altogether (e.g., for a vacation), and so on.
>