[NCAP-Discuss] Last call for A Perspective Study of DNS Queries for Non-Existent Top-Level Domains
James Galvin
galvin at elistx.com
Tue Jan 25 19:57:46 UTC 2022
Thank you for your comments, Anne.
Jim
On 25 Jan 2022, at 14:44, Aikman-Scalese, Anne wrote:
> Thanks Jim. As you know, I am not a technical person, but your
> observations make sense to me. Something else I wanted to point out
> from a procedural standpoint is that a key finding supports the
> workflow process we have all been discussing for some time now. That
> key finding is one that Casey questioned:
>
> “Name collision strings cannot be measured or assessed properly
> based on only using data from the RSS. Obtaining an accurate picture
> of name collision risks can only be obtained via delegation.”
>
> To my mind, the language in bold above is fairly critical to the
> analysis in relation to the recommendation the DG will be making to
> the SSAC (and possibly then flowing through to the ICANN Board in
> response to its questions to the SSAC.)
>
> I believe that over many DG meetings and discussions, there has been a
> general consensus on the view that delegation will be necessary in the
> workflow in order for the ICANN Board to determine Name Collision Risk
> Assessment. (This step, as I understand it, is not delegation in the
> sense of a contract having been awarded and a permanent (or
> semi-permanent) delegation to the root. Rather it is an initial test
> to be supervised by the Technical Review Committee we are
> recommending, with results to be provided to the Board for further
> determination.)
>
> Anne
>
> Anne E. Aikman-Scalese
> Of Counsel
> AAikman at lewisroca.com
> D. 520.629.4428
>
> From: NCAP-Discuss <ncap-discuss-bounces at icann.org> On Behalf Of James
> Galvin
> Sent: Tuesday, January 25, 2022 12:27 PM
> To: Casey Deccio <casey at deccio.net>
> Cc: ncap-discuss at icann.org
> Subject: Re: [NCAP-Discuss] Last call for A Perspective Study of DNS
> Queries for Non-Existent Top-Level Domains
>
>
> First, I have to say that we all owe a debt of gratitude to Casey for
> his thorough and detailed review of this document. He raises some good
> questions that should be considered and responded to directly.
>
>
> TL;DR - The question before this group is whether this document is
> ready to be released for public comment on Thursday, 27 January 2022.
> Speaking as a co-Chair, the question I’m considering is whether or
> not the Key Findings of this document are at risk since, if so, the
> document would not be ready for public comment. It is my considered
> opinion the Key Findings are not at risk and that this document should
> be released for public comment on Thursday. In addition, the
> discussion that Casey has started should continue on the mailing list.
>
> In our 26 January 2022 meeting, absent any substantive objections, we
> will declare that the consensus of the past several months of analysis
> and discussion is that the Discussion Group believes the document is ready
> to be released for public comment.
>
>
> Long-winded response:
>
> Some of Casey’s concerns are focused on whether we have complete
> data. This is a fair concern because we don’t have complete data. In
> addition, in a few cases we have chosen to set aside some data sets,
> e.g., the 5 root server data sets that were excluded from the root
> server analysis. This is an ordinary thing to do in data science. The
> most important thing to do is to be very clear about the data you are
> using and note that any conclusions are only based on what you know
> (i.e., you don’t know what you don’t know). We do this.
>
> We also know that we will never get complete data. We have not said a
> lot about this. So far we have noted that there are some legal
> constraints associated with a number of parties sharing the data we do
> have. In fact, although Verisign has used its data for some detailed
> analysis, for which we are extremely grateful, consistent with other
> root server operators, they have not made their data generally
> available to others. Public Recursive Resolvers have presented the
> same issue to us; there was only one that did some level of analysis
> for us and another with even more limited analysis.
>
> As a result of both of these points, we have not done a complete and
> thorough analysis of all possible data. Nonetheless, I do believe that
> our conclusions are supported by the data we have.
>
> The first key finding is that analysis of the data at any root server
> identifier is sufficiently representative of the root server system in
> total. Bottom line - there is some subjectivity within this statement.
> However, statistics provides us with methods to measure the quality of
> comparisons and Matt Thomas has done this and presented it to us. It
> is as good as it can be, given the data we’re working with.
>
> Of course, since it is not a “perfect” analysis, there is some
> residual risk. It is essential we capture this point and explain it in
> our final work product. In fact, if you review the text that is
> already under development there, the point of capturing residual risk
> is already listed.
>
> The second key finding is that traffic observed at root servers is not
> sufficiently representative of traffic at recursive resolvers.
> Frankly, this point is self-evident. We may have incomplete data from
> a single recursive resolver, but it nonetheless proves exactly this
> point. Certainly there might be many things we could learn from a more
> complete analysis and study of a more complete set of data at
> recursive resolvers, but none of that changes the key finding that
> public recursive resolvers see a different DNS infrastructure.
>
> By the way, there is also residual risk here, which is also captured
> in the final report draft text. There’s much more to say here to explain
> it, but it too does not change the key finding.
>
> As Casey points out, the implications noted in these key findings are
> subject to discussion in a broader context. These implications will be
> brought forward to the final work product and discussed within the
> context of the workflow we have developed.
>
> In summary, there are two important things to consider. First, is the
> data analysis sufficient to support the key findings? Second, is the
> residual risk a fundamental concern or an ordinary risk management
> question to be considered?
>
> Some may be concerned they cannot evaluate these issues directly
> themselves. My suggestion is that we continue this discussion of these
> technical issues on the mailing list. This will facilitate thoughtful
> and detailed responses, and allow everyone the opportunity to share
> the discussion with other experts to review the details.
>
> See you all tomorrow,
>
> Jim
>
>
>
>
> On 25 Jan 2022, at 11:19, Casey Deccio wrote:
> My apologies that I am responding to my own email. Someone noted to
> me that I neglected two very important points. First, I mentioned
> that I didn’t agree with the conclusions of the document, but I only
> provided my critique of the analysis, not the conclusions. Second, I
> have not explicitly provided any suggestions for a path forward. Let
> me correct that by acting on those suggestions.
>
> Conclusions
> ---------------
>
> ** Study 1 Key Observations
> I have no disagreement with this section. These are an accurate
> summary of the results of Study 1.
>
> ** Study 2 Key Observations
> The following statement is true, but the qualifying factor “top”
> is based on a comparison (query count and IP address diversity) that
> is unfair (see Concerns 5 and 6).
>
> “Initial results from one PRR indicate there is a difference in top
> non-existent TLDs using either query volume or source diversity
> measurements.”
>
> The following two statements are generalities inferred from the
> previous statement, and which are not supported by the data, precisely
> because of Concerns 5 and 6.
>
> “Many non-existent TLDs (roughly 40%) observed at the PRR are not in
> the top RSIs based on query volume. Nearly 30% observed at the PRR are
> not in the top RSIs based on source diversity.”
>
> “… name collision strings cannot be measured or assessed properly
> based on only using data from the RSS.”
>
> I agree and sympathize with the notion that there were heavy
> constraints of privacy and data aggregation associated with the
> analysis of the public recursive resolver data, but the comparison
> made thus far is an unfair comparison.
>
>
> ** Key Findings
>
> The following statement is true, but based on analysis that was
> performed on highly biased data, specifically less than 1% of IP
> addresses observed at the root servers and top 10,000 and 1,000 of
> non-existent TLDs (see Concerns 1, 2, 3, 4, and 7):
>
> “Non-existent DNS queries for top querying and top source diversity
> TLDs appear to be comparable and representative at any RSI.”
>
> (Nit: Sentence reads “Non-existent DNS queries”, but I think what
> is meant is “DNS queries for non-existent TLDs”.)
>
> The following statement is inconclusive because of the bias in the
> data that was analyzed.
>
> “PRR data further indicates that there is a very different view of
> the top non-existent TLDs based both on query volume and source
> diversity.”
>
> I do not *disagree* with the following statement:
>
> “ICANN, as the operator for the L RSI, is well-positioned to
> instrument, collect, analyze, and disseminate name collision
> measurements to subsequent gTLD applicants both prior to submission
> and during the application review.”
>
> But I feel like the point of the document was to motivate this with
> “You’ve seen one, you’ve seen them all.” And the analysis
> does not support that, at least not as generally as it was stated.
>
> I do not believe that the following statements are supported by the data:
>
> “Name collision traffic observed at the root is not sufficiently
> representative of traffic received at recursive resolvers to guarantee
> a complete and or accurate representation of a string’s potential
> name collision risks and impacts.”
>
> “Name collision strings cannot be measured or assessed properly
> based on only using data from the RSS. Obtaining an accurate picture
> of name collision risks can only be obtained via delegation.”
>
> These might well be true, but the analysis in this document does not
> motivate this (see Concerns 5 and 6). There are other factors that
> might contribute to these, some considered in this document (negative
> caching) and others not (local root and aggressive negative caching).
> But my point is that the current analysis does not lead me to the
> conclusions included in the sentences above.
>
>
> Suggestions for Improvement
> ---------------
>
> ** Limit conclusions in “Key Findings”.
>
> Study 1 Key Observations is an example of conclusions that *can* be
> drawn from the existing analysis. The step from these to generalities
> of representativeness of data in “Key Findings” is where my
> concerns lie. If the “Key Findings” related to Study 1 are honed
> to be in scope with the analysis, their impact might be significantly
> less, but it gives me less concern.
>
> ** Revamp Root Server Analysis.
> If the purpose of the document is to come to more general conclusions,
> such as those previously mentioned, then the analysis needs to be
> revamped:
> - Rather than selecting biased data (i.e., top talkers), the
> data must be representative of IP prefix, ASN, IP version. Those are
> completely missing from the current analysis.
> - Before any filtering is done, a comprehensive analysis should
> be performed, with all IP addresses. Even if the conclusion is that a
> filter of some sort is appropriate, and a representative set can be
> yielded with such filtering, there is no comparison given, and the
> reader is simply left to make that leap of faith.
> - Any filtering should not limit the analysis to the IP
> addresses with the most queries—certainly not the top 1%. Some
> filter by query count might be fine, but it should be a low bar,
> and it should be justified by behavior and representation (IP prefix,
> ASN, IP version).
>
> ** Revamp Comparison of Root Server Queries and Public Recursive
> Resolver
> I understand that there are data constraints within which the
> recursive data must be analyzed, but there are analyses that *can* be
> done, even within those constraints. For example, rather than sorting
> by top query count and IP address diversity, start with the complete
> set of non-existent TLDs and (if the data includes it) full QNAME
> diversity, or at least SLD diversity. As it is, the comparison of
> root server data and public recursive resolver data is unfair and
> therefore does not provide substance.
>
>
> ** Accept Data-Driven Conclusions
> There might be some desirable conclusions that the data simply does
> not support. I have little concern with listing conclusions that are
> not what was anticipated beforehand (i.e., hypotheses). Even very
> caveated conclusions coming from an analysis can be enough to inform
> decisions, if those caveats are considered for what they are worth. I
> have great concern, however, with conclusions that are not data-driven.
>
>
>
> On Jan 24, 2022, at 11:25 PM, Casey Deccio
> <casey at deccio.net> wrote:
>
> Dear all,
>
> I have taken the time to study the “Perspective” document, as well
> as the document “Case Study of Collision Strings”, which is also
> being produced by NCAP in connection with Study 2. I appreciate all
> the time and effort that has gone into the analysis contained in “A
> Perspective Study of DNS Queries for Non-Existent Top-Level
> Domains”. I know that it has required no small effort.
>
> Nonetheless, I have fundamental concerns about the analysis contained
> in “Perspective”, and I also do not agree with the conclusions
> that are drawn from the analysis. Additionally, I find the analysis
> and conclusions in “Perspective” to be at odds with those
> contained in “Case Study”. Finally, I believe my concerns to be
> substantial enough that they cannot be corrected with minor edits, and
> I *do not* support the document moving forward. I herein detail my
> concerns.
>
> Sincerely,
> Casey
>
>
>
> Summary:
> Concern 1: Analysis based on biased sample of querying IP addresses.
> Concern 2: Sample data refined to support the conclusion.
> Concern 3: Analysis based on biased sample of non-existent TLDs.
> Concern 4: TLDs considered without QNAME context.
> Concern 5: Query count used as comparison between recursive server and
> root servers.
> Concern 6: Unique IP addresses used as comparison between recursive
> server and root servers.
> Concern 7: Disagrees with findings from “Case Study of Collision
> Strings”.
>
>
> Details:
>
>
> **Concern 1: Analysis based on biased sample of querying IP addresses.
>
> The sample on which the analysis and conclusions are based is selected
> exclusively by proportion of queries observed during the collection
> period. Specifically, fewer than 1% (0.67% or 115K) of the 17M IP
> addresses (15.51M IPv4 and 1.56M IPv6) observed across the 2020 DITL
> are considered for the analysis—those producing the most queries
> (90% of the DITL data); that excludes 99% of IP addresses from
> analysis. Because the set of “top talker” IP addresses is
> selected based only on the volume of traffic, it is severely biased
> and is not necessarily representative of resolvers world-wide. Those
> that query most—for whatever reasons—are the loudest, and without
> further examination, it’s hard to even know why. The concern is not
> even just whether or not it is okay to exclude non-top-talkers, but
> whether top-talkers are themselves an appropriate representation.
> Other metrics that could be used to quantify network representation
> for the selection process and/or analysis of top-talkers are missing
> from the analysis, including IP prefix (e.g., /16, /24, /48, /64),
> ASN, and even IP version. See also Concern 7 for more.
>
> The analysis in Annex 2 is very interesting, but does not, by itself,
> resolve this concern. The annex provides some very helpful lists of
> top queries from the few-queries resulting in NXDOMAIN responses, and
> there are some comparisons of the percentage of queries resulting in
> in NXDOMAIN responses for the total given number of queries, but even
> those are difficult to assess without a full behavioral analysis.
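
As a rough illustration of the selection Concern 1 describes, the sketch below picks the smallest set of sources that accounts for 90% of query volume and reports what share of all sources that is. The function name `top_talkers` and the toy data are hypothetical and are not taken from the "Perspective" document; only the rounded figures 115K, 15.51M, and 1.56M come from the text above.

```python
from collections import Counter

def top_talkers(query_counts, coverage=0.90):
    """Return the smallest set of sources whose queries reach `coverage` of the total."""
    total = sum(query_counts.values())
    selected, running = [], 0
    # Walk sources from loudest to quietest until the coverage target is met.
    for source, count in sorted(query_counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if running >= coverage * total:
            break
        selected.append(source)
        running += count
    return selected

# Hypothetical toy data: 3 loud sources and 97 quiet ones.
counts = Counter({f"loud-{i}": 3000 for i in range(3)})
counts.update({f"quiet-{i}": 10 for i in range(97)})

selected = top_talkers(counts)
share_of_sources = len(selected) / len(counts)
print(f"{len(selected)} of {len(counts)} sources cover 90% of queries "
      f"({share_of_sources:.0%} of sources)")

# Same ratio using the rounded figures quoted in Concern 1.
reported_fraction = 115_000 / (15_510_000 + 1_560_000)
print(f"reported top-talker share of IP addresses: {reported_fraction:.2%}")
```

A volume-only cut like this says nothing about which prefixes, ASNs, or IP versions the remaining sources belong to, which is the gap the concern points at.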
>
>
> **Concern 2: Sample data refined to support the conclusion.
>
> While the original sample is already of questionable representation
> (less than 1% of IP addresses observed, based solely on query volume),
> that dataset is further refined, according to the following text
> (i.e., from the document):
>
> “On average, each RSI observed 96% of the top talkers that account
> for 90% of total traffic. That percentage drops to 94% when using the
> 95th percentile top talkers. Based on these findings, only the 90th
> percentile top talkers were used for the remaining measurements in
> this study.”
>
> If the objective of the analysis is to quantify the overlap of
> observed query data across the root servers, and to ultimately
> determine whether the queries observed at one server are
> representative of the queries observed across all samples, then
> refinement of sampled IP addresses to support that conclusion is
> inappropriate.
>
>
> **Concern 3: Analysis based on biased sample of non-existent TLDs.
>
> The queries for non-existent TLDs, which result in NXDOMAIN responses
> at the root servers, are compared across the root servers, to see how
> well they are represented. However, like observed IP addresses
> (Concern 1), the non-existent TLDs are limited to those corresponding
> to the most queries observed—both the top 10,000 and the top 1,000.
> This is independent of querying IP address, ASN, and other aggregating
> features, which would help better understand the diversity of the
> queries for each non-existent TLD. For example, it might be that the
> non-existent TLDs most queried for come from a small pool of IP
> addresses or networks, and others are being excluded simply because
> they are outside that sample.
>
>
> **Concern 4: TLDs considered without QNAME context.
>
> While comparisons are made to measure the representativeness of
> non-existent TLDs, one primary feature missing from the analysis is
> the QNAME. In all cases, the non-existent TLD is considered in
> isolation, yet QNAME context is shown in the analysis to be a
> significant contributor to quantifying name collisions potential (see
> Concern 7).
>
>
> **Concern 5: Query count used as comparison between recursive server
> and root servers.
>
> Because of (negative) caching at recursive servers, it is expected
> that queries observed at the root servers for a given non-existent TLD
> will be fewer than those at a recursive resolver for that same
> non-existent TLD. It is this very caching behavior that makes the
> comparison of query count for a given non-existent TLD, as observed by
> the root servers vs. a recursive resolver, an apples-to-oranges
> comparison. Yet the analysis includes a comparison of the top 1,000
> non-existent TLDs, ranked by query count. Thus, no meaningful
> conclusions can be drawn from this comparison.
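
The caching effect behind Concern 5 can be sketched with a deliberately simplified model: one caching resolver, one non-existent name, and a fixed negative-caching TTL. The function name, the workloads, and the 900-second TTL are all assumptions chosen for illustration, not measurements from either study.

```python
def queries_reaching_root(client_query_times, negative_ttl):
    """Count queries a caching resolver forwards upstream for one non-existent name.

    A query is forwarded only when there is no unexpired negative-cache entry.
    """
    forwarded = 0
    cache_expires_at = None
    for t in sorted(client_query_times):
        if cache_expires_at is None or t >= cache_expires_at:
            forwarded += 1                       # cache miss: ask upstream
            cache_expires_at = t + negative_ttl  # remember the NXDOMAIN answer
    return forwarded

# Hypothetical workloads (timestamps in seconds) for the same name.
chatty = list(range(0, 3600, 1))    # one client query per second for an hour
sparse = list(range(0, 3600, 900))  # one client query every 15 minutes

ttl = 900  # assumed negative-caching TTL in seconds; real values vary

chatty_upstream = queries_reaching_root(chatty, ttl)
sparse_upstream = queries_reaching_root(sparse, ttl)
print(len(chatty), "client queries ->", chatty_upstream, "upstream queries")
print(len(sparse), "client queries ->", sparse_upstream, "upstream queries")
```

Under these assumptions the two workloads differ by a factor of 900 at the resolver yet look identical upstream, so ranking strings by raw query count at the two vantage points compares different quantities.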
>
>
> **Concern 6: Unique IP addresses used as comparison between recursive
> server and root servers.
>
> Study 2 includes source diversity when comparing the query counts for
> non-existent TLDs. There is certainly more value in investigating IP
> source diversity when considering the query counts for non-existent
> TLDs than considering query counts alone (Concern 5). However, it is
> expected that recursive resolvers serve a very different client base
> than authoritative servers, specifically the root servers. Whereas
> the former might expect queries from stub resolvers, the latter
> might expect queries from recursive resolvers. In such a case,
> analyzing client IP addresses independently of one another leaves
> significant meaningful context out, such as the diversity of IP
> prefixes or ASNs from which queries arrive. A large number of IP
> addresses from the same IP prefix or ASN might be responsible for the
> queries associated with several “top” non-existent TLDs, excluding
> non-existent TLDs that might have non-trivial presence but do not have
> the top IP address diversity. See also Concern 7.
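
To make the prefix point in Concern 6 concrete, here is a minimal sketch that counts unique source addresses and unique covering prefixes per TLD using Python's standard `ipaddress` module. The /24 and /48 aggregation lengths, the TLD labels, and the sample addresses (documentation ranges) are assumptions for illustration only.

```python
import ipaddress
from collections import defaultdict

def diversity(queries, v4_prefix=24, v6_prefix=48):
    """Map each TLD to (unique source IPs, unique source prefixes)."""
    ips, prefixes = defaultdict(set), defaultdict(set)
    for source, tld in queries:
        addr = ipaddress.ip_address(source)
        length = v4_prefix if addr.version == 4 else v6_prefix
        # strict=False masks the host bits so the address maps to its covering prefix.
        net = ipaddress.ip_network(f"{addr}/{length}", strict=False)
        ips[tld].add(addr)
        prefixes[tld].add(net)
    return {tld: (len(ips[tld]), len(prefixes[tld])) for tld in ips}

# Hypothetical observations: (source address, queried non-existent TLD).
queries = [(f"192.0.2.{i}", "example-a") for i in range(1, 51)]
queries += [
    ("198.51.100.7", "example-b"),
    ("203.0.113.9", "example-b"),
    ("2001:db8:1::1", "example-b"),
]

result = diversity(queries)
for tld, (n_ips, n_prefixes) in sorted(result.items()):
    print(f"{tld}: {n_ips} unique IPs across {n_prefixes} prefixes")
```

In this toy data "example-a" wins on unique IP addresses while "example-b" is spread across more networks, so a ranking by IP count alone would order them the opposite way from a ranking by prefix.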
>
> **Concern 7: Disagrees with findings from “Case Study of Collision
> Strings”.
>
> The document “Case Study of Collision Strings”, also written in
> connection with NCAP Study 2, contains the following findings:
>
> 1. “A relatively small number of origin ASNs account for the
> vast majority of query traffic for .CORP, .HOME, and .MAIL. In all
> cases roughly 200 ASNs make up nearly 90% of the volume” (section
> 4.1.5).
>
> 2. “Label analysis provides a unique observational context into
> the underlying systems, networks, and protocols inducing leakage of
> DNS queries to the global DNS ecosystem. Understanding the diversity
> of labels can help provide a sense of how broadly disseminated the
> leakage is throughout the DNS” (section 4.2.1).
>
> 3. “The .CORP SLDs seen at both A and J (approximately 16
> thousand) is almost equal to those seen at A-root alone, but J-root
> sees over 30,000 .CORP SLDs that A-root does not see” (section
> 4.3.1).
>
> 4. “Across all names studied, while A and J saw much in common,
> there was a non-negligible amount of uniqueness to each view. For
> example, A and J each saw queries from the same 5717 originating ASNs,
> but J saw 2477 ASNs that A didn't see and A saw 901 that didn't see”
> (section 4.3.2).
>
> 5. “A more intensive and thorough analysis would include other
> root server vantage points to minimize potential bias in the A and J
> catchments” (section 5.2).
>
> 6. “Additional measurement from large recursive resolvers would
> also help elucidate any behaviors masked by negative caching and the
> population of stub resolvers” (section 5.2).
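
The ASN counts quoted in finding 4 can be turned into overlap ratios with simple set arithmetic. The sketch below assumes the 901 ASNs are those A saw and J did not (the quoted sentence omits the subject); the function name and the choice of ratios are illustrative, not taken from "Case Study".

```python
def overlap_summary(common, only_a, only_j):
    """Summarize the overlap of two vantage points from their set sizes."""
    total_a = common + only_a
    total_j = common + only_j
    union = common + only_a + only_j
    return {
        "total_a": total_a,
        "total_j": total_j,
        "union": union,
        "jaccard": common / union,             # shared share of everything seen
        "a_coverage_of_union": total_a / union,
        "j_coverage_of_union": total_j / union,
    }

# Counts from finding 4; only_a=901 rests on the assumption stated above.
summary = overlap_summary(common=5717, only_a=901, only_j=2477)
for key, value in summary.items():
    print(key, round(value, 3) if isinstance(value, float) else value)
```

On these figures, A alone would see roughly 73% of the ASNs visible from A and J combined, which is the kind of per-server uniqueness the points below refer to.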
>
>
> These findings emphasize the following points, which are at odds with
> the "Perspective" document:
>
> - Including ASN (and IP prefix) in an analysis can make a
> significant difference in the overall diversity associated with
> observed queries.
>
> - There is significance in the context provided by the QNAME,
> not only in measuring diversity, but also in query representativeness
> across root servers.
>
> - Root servers—even just A and J—have a non-negligible
> amount of uniqueness that is not captured—or even addressed in this
> document.
>
> - More root servers have a greater perspective of potential name
> collisions than one.
>
> - The population of stub resolvers should be considered in the
> analysis of large recursive resolvers.
>
>
>
>
> On Jan 24, 2022, at 2:43 PM, Thomas, Matthew via NCAP-Discuss
> <ncap-discuss at icann.org> wrote:
>
> NCAP DG,
> As set during our last meeting on 19 January 2022, we pushed the start
> of the public comment period for “A Perspective Study of DNS Queries
> for Non-Existent Top-Level Domains” to this Thursday, 27 January
> 2022, in order to accommodate some last minute questions.
> Additionally, as previously announced, today ends the comment period
> for the release of this document.
> Attached is the FINAL DRAFT version of “A Perspective Study of DNS
> Queries for Non-Existent Top-Level Domains”. If you have any
> objections to this document being released for public comment please
> reply to this message on the list. The objection period will close at
> the end of our weekly meeting on Wednesday, 26 January 2022. Comments
> that do not substantially change our stated conclusions will be
> captured and considered after the public comment period when we will
> be reviewing all public comments received.
> A view only version of the document is
> here: https://docs.google.com/document/d/1lMoicMusWGB0u3Jpqe0CB7kJHw_gR-alIXZErjxNEws/view
> Matt Thomas
> <Last Call A Perspective Study of DNS Queries for Non-Existent
> Top-Level Domains.pdf>_______________________________________________
> NCAP-Discuss mailing list
> NCAP-Discuss at icann.org
> https://mm.icann.org/mailman/listinfo/ncap-discuss
>
> _______________________________________________
> By submitting your personal data, you consent to the processing of
> your personal data for purposes of subscribing to this mailing list
> in accordance with the ICANN Privacy Policy
> (https://www.icann.org/privacy/policy) and the website Terms of
> Service (https://www.icann.org/privacy/tos). You can visit the Mailman
> link above to change your membership status or configuration,
> including unsubscribing, setting digest-style delivery or disabling
> delivery altogether (e.g., for a vacation), and so on.
>
>
>
>
> ________________________________
>
> This message and any attachments are intended only for the use of the
> individual or entity to which they are addressed. If the reader of
> this message or an attachment is not the intended recipient or the
> employee or agent responsible for delivering the message or attachment
> to the intended recipient you are hereby notified that any
> dissemination, distribution or copying of this message or any
> attachment is strictly prohibited. If you have received this
> communication in error, please notify us immediately by replying to
> the sender. The information transmitted in this message and any
> attachments may be privileged, is intended only for the personal and
> confidential use of the intended recipients, and is covered by the
> Electronic Communications Privacy Act, 18 U.S.C. §2510-2521.