[NCAP-Discuss] Last call for A Perspective Study of DNS Queries for Non-Existent Top-Level Domains

Tue Jan 25 20:54:50 UTC 2022

I also appreciate the thorough analysis conducted by Matt and Casey.  We
have learned a lot from the analysis but it has also uncovered some
new questions worth exploring.

We could add a catch-all or placeholder section at the end of the document
called "Potential areas for further study".  Casey's points could be
captured here.

best regards,

Tom

On Tue, Jan 25, 2022 at 2:26 PM James Galvin <galvin at elistx.com> wrote:

> First, I have to say that we all owe a debt of gratitude to Casey for his
> thorough and detailed review of this document. He raises some good
> questions that should be considered and responded to directly.
>
> TL;DR - The question before this group is whether this document is ready
> to be released for public comment on Thursday, 27 January 2022. Speaking as
> a co-Chair, the question I’m considering is whether or not the Key Findings
> of this document are at risk since, if so, the document would not be ready
> for public comment. It is my considered opinion the Key Findings are not at
> risk and that this document should be released for public comment on
> Thursday. In addition, the discussion that Casey has started should
> continue on the mailing list.
>
> In our 26 January 2022 meeting, absent any substantive objections, we will
> declare that the consensus of the past several months of analysis
> discussion is that the Discussion Group believes the document is ready to
> be released for public comment.
>
> Long-winded response:
>
> Some of Casey’s concerns are focused on whether we have complete data.
> This is a fair concern because we don’t have complete data. In addition, in
> a few cases we have chosen to set aside some data sets, e.g., the 5 root
> server data sets that were excluded from the root server analysis. This is
> an ordinary thing to do in data science. The most important thing to do is
> to be very clear about the data you are using and note that any conclusions
> are only based on what you know (i.e., you don’t know what you don’t know).
> We do this.
>
> We also know that we will never get complete data. We have not said a lot
> about this. So far we have noted that there are some legal constraints
> associated with a number of parties sharing the data we do have. In fact,
> although Verisign has used its data for some detailed analysis, for which
> we are extremely grateful, consistent with other root server operators,
> they have not made their data generally available to others. Public
> Recursive Resolvers have presented the same issue to us; there was only one
> that did some level of analysis for us and another with even more limited
> analysis.
>
> As a result of both of these points, we have not done a complete and
> thorough analysis of all possible data. Nonetheless, I do believe that our
> conclusions are supported by the data we have.
>
> The first key finding is that analysis of the data at any root server
> identifier is sufficiently representative of the root server system in
> total. Bottom line - there is some subjectivity within this statement.
> However, statistics provides us with methods to measure the quality of
> comparisons and Matt Thomas has done this and presented it to us. It is as
> good as it can be, given the data we’re working with.
>
> Of course, since it is not a “perfect” analysis, there is some residual
> risk. It is essential we capture this point and explain it in our final
> work product. In fact, if you review the text that is already under
> development there, the point of capturing residual risk is already listed.
>
> The second key finding is that traffic observed at root servers is not
> sufficiently representative of traffic at recursive resolvers. Frankly,
> this point is self-evident. We may have incomplete data from a single
> recursive resolver, but it nonetheless proves exactly this point. Certainly
> there might be many things we could learn from a more complete analysis and
> study of a more complete set of data at recursive resolvers, but none of
> that changes the key finding that public recursive resolvers see a
> different DNS infrastructure.
>
> By the way, there is also residual risk here, which is also captured in
> the final report draft text. There’s much more to say here to explain it, but
> it too does not change the key finding.
>
> As Casey points out, the implications noted in these key findings are
> subject to discussion in a broader context. These implications will be
> brought forward to the final work product and discussed within the context
> of the workflow we have developed.
>
> In summary, there are two important things to consider. First, is the data
> analysis sufficient to support the key findings? Second, is the residual
> risk a fundamental concern or an ordinary risk management question to be
> considered?
>
> Some may be concerned they cannot evaluate these issues directly
> themselves. My suggestion is that we continue this discussion of these
> technical issues on the mailing list. This will facilitate thoughtful and
> detailed responses, and allow everyone the opportunity to share the
> discussion with other experts to review the details.
>
> See you all tomorrow,
>
> Jim
>
>
>
>
> On 25 Jan 2022, at 11:19, Casey Deccio wrote:
>
> My apologies that I am responding to my own email.  Someone noted to me
> that I neglected two very important points.  First, I mentioned that I
> didn’t agree with the conclusions of the document, but I only provided
> my critique of the analysis, not the conclusions.  Second, I have not
> explicitly provided any suggestions for a path forward.  Let me correct
> that by acting on those suggestions.
>
>
> Conclusions
> ---------------
>
> ** Study 1 Key Observations
> I have no disagreement with this section.  These are an accurate summary
> of the results of Study 1.
>
>
> ** Study 2 Key Observations
> The following statement is true, but the qualifying factor “top” is based
> on a comparison (query count and IP address diversity) that is unfair (see
> Concerns 5 and 6).
>
> “Initial results from one PRR indicate there is a difference in top
> non-existent TLDs using either query volume or source diversity
> measurements.”
>
> The following two statements are generalities inferred from the previous
> statement, and which are not supported by the data, precisely because
> of Concerns 5 and 6.
>
> “Many non-existent TLDs (roughly 40%) observed at the PRR are not in the
> top RSIs based on query volume. Nearly 30% observed at the PRR are not in
> the top RSIs based on source diversity.”
>
> “… name collision strings cannot be measured or assessed properly based on
> only using data from the RSS.”
>
> I agree and sympathize with the notion that there were heavy constraints
> of privacy and data aggregation associated with the analysis of the public
> recursive resolver data, but the comparison made thus far is an unfair
> comparison.
>
>
> ** Key Findings
>
> The following statement is true, but based on analysis that was performed
> on highly biased data, specifically less than 1% of IP addresses observed
> at the root servers and top 10,000 and 1,000 of non-existent TLDs (see
> Concerns 1, 2, 3, 4, and 7):
>
> “Non-existent DNS queries for top querying and top source diversity TLDs
> appear to be comparable and representative at any RSI.”
>
> (Nit: Sentence reads “Non-existent DNS queries”, but I think what is meant
> is “DNS queries for non-existent TLDs”.)
>
> The following statement is inconclusive because of the bias in the data
> that was analyzed.
>
> “PRR data further indicates that there is a very different view of the top
> non-existent TLDs based both on query volume and source diversity.”
>
> I do not *disagree* with the following statement:
>
> “ICANN, as the operator for the L RSI, is well-positioned to instrument,
> collect, analyze, and disseminate name collision measurements to subsequent
> gTLD applicants both prior to submission and during the application review.”
>
> But I feel like the point of the document was to motivate this with
> “You’ve seen one, you’ve seen them all.”  And the analysis does not support
> that, at least not as generally as it was stated.
>
> I do not believe that the following statements are supported by the data:
>
> “Name collision traffic observed at the root is not sufficiently
> representative of traffic received at recursive resolvers to guarantee a
> complete and or accurate representation of a string’s potential name
> collision risks and impacts.”
>
> “Name collision strings cannot be measured or assessed properly based on
> only using data from the RSS. Obtaining an accurate picture of name
> collision risks can only be obtained via delegation.”
>
> These might well be true, but the analysis in this document does not
> motivate this (see Concerns 5 and 6).  There are other factors that might
> contribute to these, some considered in this document (negative
> caching) and others not (local root and aggressive negative caching).  But
> my point is that the current analysis does not lead me to the conclusions
> included in the sentences above.
>
>
> Suggestions for Improvement
> ---------------
>
> ** Limit conclusions in “Key Findings”.
>
> Study 1 Key Observations is an example of conclusions that *can* be drawn
> from the existing analysis.  The step from these to generalities of
> representativeness of data in “Key Findings” is where my concerns lie.  If
> the “Key Findings” related to Study 1 are honed to be in scope with the
> analysis, their impact might be significantly less, but it gives me less
> concern.
>
>
> ** Revamp Root Server Analysis.
> If the purpose of the document is to come to more general conclusions,
> such as those previously mentioned, then the analysis needs to be revamped:
> -       Rather than selecting biased data (i.e., top talkers), the data
> must be representative of IP prefix, ASN, IP version.  Those are completely
> missing from the current analysis.
> -       Before any filtering is done, a comprehensive analysis should be
> performed, with all IP addresses.  Even if the conclusion is that a filter
> of some sort is appropriate, and a representative set can be yielded with
> such filtering, there is no comparison given, and the reader is simply left
> to make that leap of faith.
> -       Any filtering should not limit the analysis to the IP addresses
> with the most queries, and certainly not the top 1%.  Some filter by query
> count might be fine, but it should be a low bar, and it should be justified
> by behavior and representation (IP prefix, ASN, IP version).
>
>
> ** Revamp Comparison of Root Server Queries and Public Recursive Resolver
> I understand that there are data constraints within which the recursive
> data must be analyzed, but there are analyses that *can* be done, even
> within those constraints.  For example, rather than sorting by top
> query count and IP address diversity, start with the complete set of
> non-existent TLDs and (if the data includes it) full QNAME diversity, or at
> least SLD diversity.  As it is, the comparison of root server data and
> public recursive resolver data is unfair and therefore does not provide
> substance.
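
The SLD-diversity measurement suggested above could look roughly like the sketch below. It assumes query names are available as plain strings; the function name `sld_diversity` and the sample names are hypothetical and are not taken from either dataset.

```python
from collections import defaultdict

def sld_diversity(qnames):
    """Map each TLD to the number of distinct second-level labels seen under it."""
    slds = defaultdict(set)
    for qname in qnames:
        labels = qname.rstrip(".").lower().split(".")
        if len(labels) < 2:
            continue  # bare TLD or root query: no SLD context to record
        slds[labels[-1]].add(labels[-2])
    return {tld: len(names) for tld, names in slds.items()}

# Hypothetical sample: four queries under "corp", but only two distinct SLDs.
sample = ["a.hq.corp.", "b.hq.corp", "HQ.corp", "mail.lab.corp", "printer.home"]
diversity = sld_diversity(sample)
```

Counting distinct SLDs rather than raw queries keeps one chatty name from dominating the ranking, which is the property the suggestion relies on.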
>
>
> ** Accept Data-Driven Conclusions
> There might be some desirable conclusions that the data simply does not
> support.  I have little concern with listing conclusions that are not what
> was anticipated beforehand (i.e., hypotheses).  Even very caveated
> conclusions coming from an analysis can be enough to inform decisions, if
> those caveats are considered for what they are worth.  I
> have great concern, however, with conclusions that are not data-driven.
>
>
> On Jan 24, 2022, at 11:25 PM, Casey Deccio <casey at deccio.net> wrote:
>
> Dear all,
>
> I have taken the time to study the “Perspective” document, as well as the
> document “Case Study of Collision Strings”, which is also being produced by
> NCAP in connection with Study 2.  I appreciate all the time and effort that
> has gone into the analysis contained in “A Perspective Study of DNS Queries
> for Non-Existent Top-Level Domains”.  I know that it has required no small
> effort.
>
> Nonetheless, I have fundamental concerns about the analysis contained in
> “Perspective”, and I also do not agree with the conclusions that are
> drawn from the analysis.  Additionally, I find the analysis and conclusions
> in “Perspective” to be at odds with those contained in “Case
> Study”.  Finally, I believe my concerns to be substantial enough that they
> cannot be corrected with minor edits, and I *do not* support the document
> moving forward.  I herein detail my concerns.
>
> Sincerely,
> Casey
>
>
>
> Summary:
> Concern 1: Analysis based on biased sample of querying IP addresses.
> Concern 2: Sample data refined to support the conclusion.
> Concern 3: Analysis based on biased sample of non-existent TLDs.
> Concern 4: TLDs considered without QNAME context.
> Concern 5: Query count used as comparison between recursive server and
> root servers.
> Concern 6: Unique IP addresses used as comparison between recursive server
> and root servers.
> Concern 7: Disagrees with findings from “Case Study of Collision Strings”.
>
>
> Details:
>
>
> **Concern 1: Analysis based on biased sample of querying IP addresses.
>
> The sample on which the analysis and conclusions are based is selected
> exclusively by proportion of queries observed during the
> collection period.  Specifically, fewer than 1% (0.67% or 115K) of the 17M
> IP addresses (15.51M IPv4 and 1.56M IPv6) observed across the 2020 DITL are
> considered for the analysis—those producing the most queries (90% of the
> DITL data); that excludes 99% of IP addresses from analysis.  Because the
> set of “top talker” IP addresses is selected based only on the volume of
> traffic, it is severely biased and is not necessarily representative of
> resolvers world-wide.  Those that query most—for whatever
> reasons—are the loudest, and without further examination, it’s hard to even
> know why. The concern is not even just whether or not it is okay to
> exclude non-top-talkers, but whether top-talkers are themselves an
> appropriate representation.  Other metrics that could be used to
> quantify network representation for the selection process and/or analysis
> of top-talkers are missing from the analysis, including IP prefix
> (e.g., /16, /24, /48, /64), ASN, and even IP version.  See
> also Concern 7 for more.
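
A minimal sketch of the volume-based selection described in Concern 1, followed by a check of how many network prefixes survive it. The per-IP counts are invented documentation addresses, the helper names `top_talkers` and `prefix_of` are hypothetical, and the /24 and /48 prefix lengths are just two of the options listed above. Mapping to ASNs would need external routing data and is left out.

```python
import ipaddress

def top_talkers(counts, share=0.90):
    """Return the heaviest sources, in order, until `share` of all queries is covered."""
    total = sum(counts.values())
    selected, covered = [], 0
    for ip, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        if covered >= share * total:
            break
        selected.append(ip)
        covered += n
    return selected

def prefix_of(ip, v4_len=24, v6_len=48):
    """Aggregate an address to its covering prefix (lengths are assumptions)."""
    addr = ipaddress.ip_address(ip)
    length = v4_len if addr.version == 4 else v6_len
    return str(ipaddress.ip_network(f"{ip}/{length}", strict=False))

# Hypothetical query counts per source address.
counts = {
    "192.0.2.1": 8000,
    "192.0.2.2": 1500,
    "198.51.100.7": 300,
    "203.0.113.9": 150,
    "2001:db8::1": 50,
}
chosen = top_talkers(counts)
all_prefixes = {prefix_of(ip) for ip in counts}
chosen_prefixes = {prefix_of(ip) for ip in chosen}
```

In this toy input the selection covers 95% of the queries yet keeps one of four prefixes and no IPv6 source, which is the kind of representation gap the concern describes.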
>
> The analysis in Annex 2 is very interesting, but does not, by itself,
> resolve this concern.  The annex provides some very helpful lists of top
> queries from the few-queries resulting in NXDOMAIN responses, and there are
> some comparisons of the percentage of queries resulting in NXDOMAIN
> responses for the total given number of queries, but even those are
> difficult to assess without a full behavioral analysis.
>
>
> **Concern 2: Sample data refined to support the conclusion.
>
> While the original sample is already of questionable representation (less
> than 1% of IP addresses observed, based solely on query volume), that
> dataset is further refined, according to the following text (i.e., from the
> document):
>
>  “On average, each RSI observed 96% of the top talkers that account for
> 90% of total traffic.  That percentage drops to 94% when using the 95th
> percentile top talkers. Based on these findings, only the 90th
> percentile top talkers were used for the remaining measurements in this
> study.”
>
> If the objective of the analysis is to quantify the overlap of observed
> query data across the root servers, and to ultimately determine whether the
> queries observed at one server are representative of the queries
> observed across all samples, then refinement of sampled IP addresses to
> support that conclusion is inappropriate.
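
To make the coverage figure in the quoted text concrete, the sketch below computes, for each candidate threshold, the average fraction of top talkers that each RSI observed. The RSI labels, source identifiers, and talker sets are hypothetical; the point is only that every threshold is reported side by side rather than one being chosen after seeing the result.

```python
def mean_coverage(observed_by_rsi, talkers):
    """Average, over RSIs, of the fraction of `talkers` that each RSI observed."""
    talkers = set(talkers)
    fractions = [len(talkers & seen) / len(talkers) for seen in observed_by_rsi.values()]
    return sum(fractions) / len(fractions)

# Hypothetical sources observed at three RSIs.
observed_by_rsi = {
    "rsi-1": {"a", "b", "c", "d"},
    "rsi-2": {"a", "b", "c"},
    "rsi-3": {"a", "b", "e"},
}
# Hypothetical top-talker sets at two volume thresholds.
talkers_by_threshold = {"p90": ["a", "b"], "p95": ["a", "b", "c", "d"]}

report = {
    label: mean_coverage(observed_by_rsi, talkers)
    for label, talkers in talkers_by_threshold.items()
}
```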
>
>
> **Concern 3: Analysis based on biased sample of non-existent TLDs.
>
> The queries for non-existent TLDs, which result in NXDOMAIN responses at
> the root servers, are compared across the root servers, to see how well
> they are represented.  However, like observed IP addresses (Concern 1), the
> non-existent TLDs are limited to those corresponding to the most queries
> observed—both the top 10,000 and the top 1,000.  This is independent of
> querying IP address, ASN, and other aggregating features, which would help
> better understand the diversity of the queries for each non-existent
> TLD.  For example, it might be that the non-existent TLDs most queried for
> come from a small pool of IP addresses or networks, and others are being
> excluded simply because they are outside that sample.
>
>
> **Concern 4: TLDs considered without QNAME context.
>
> While comparisons are made to measure the representativeness of
> non-existent TLDs, one primary feature missing from the analysis is the
> QNAME.  In all cases, the non-existent TLD is considered in isolation, yet
> QNAME context is shown in the analysis to be a significant contributor to
> quantifying name collisions potential (see Concern 7).
>
>
> **Concern 5: Query count used as comparison between recursive server and
> root servers.
>
> Because of (negative) caching at recursive servers, it is expected that
> queries observed at the root servers for a given non-existent TLD will be
> fewer than those at a recursive resolver for that
> same non-existent TLD.  It is this very caching behavior that
> makes the comparison of query count for a given non-existent TLD, as
> observed by the root servers vs. a recursive resolver, an apples-to-oranges
> comparison.  Yet the analysis includes a comparison of the top 1,000
> non-existent TLDs, ranked by query count.  Thus, no meaningful conclusions
> can be drawn from this comparison.
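
The caching effect in Concern 5 can be illustrated with a deliberately simplified model: one resolver, one cache, and one non-existent TLD. The function name `simulate`, the 900-second negative TTL, and the arrival patterns are all assumptions made for the illustration, not values from the study.

```python
def simulate(query_times, negative_ttl):
    """Return (queries seen by the resolver, queries forwarded to the root).

    `query_times` are client arrival times in seconds for one non-existent
    TLD; an NXDOMAIN answer is cached for `negative_ttl` seconds.
    """
    forwarded = 0
    cached_until = None
    for t in sorted(query_times):
        if cached_until is None or t >= cached_until:
            forwarded += 1                  # cache miss: the root sees this query
            cached_until = t + negative_ttl
    return len(query_times), forwarded

# Hypothetical hour of traffic: one query per second vs. one every 15 minutes.
busy = simulate(range(0, 3600, 1), negative_ttl=900)
quiet = simulate(range(0, 3600, 900), negative_ttl=900)
```

Under these assumptions the resolver sees 3600 queries in the busy case and 4 in the quiet case, while the root sees 4 in both, so ranking by query count at the two vantage points measures different things.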
>
>
> **Concern 6: Unique IP addresses used as comparison between recursive
> server and root servers.
>
> Study 2 includes source diversity when comparing the query counts for
> non-existent TLDs.  There is certainly more value in investigating IP
> source diversity when considering the query counts for non-existent TLDs
> than considering query counts alone (Concern 5).  However, it is expected
> that recursive resolvers serve a very different client base than
> authoritative servers, specifically the root servers.  Whereas the
> former might expect queries from stub resolvers, the latter might
> expect queries from recursive resolvers.  In such a case, analyzing client
> IP addresses independently of one another leaves significant meaningful
> context out, such as the diversity of IP prefixes or ASNs from
> which queries arrive.  A large number of IP addresses from the same IP
> prefix or ASN might be responsible for the queries associated with several
> “top” non-existent TLDs, excluding non-existent TLDs that
> might have non-trivial presence but do not have the top IP address
> diversity.  See also Concern 7.
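
A small sketch of the distinction drawn in Concern 6: counting distinct source addresses per TLD next to distinct source prefixes. The records, the TLD labels `tld-a` and `tld-b`, and the /24 and /64 aggregation lengths are hypothetical; prefixes stand in for ASNs here because ASN lookup needs external routing data.

```python
import ipaddress
from collections import defaultdict

def diversity(records, v4_len=24, v6_len=64):
    """For each TLD, return (distinct source IPs, distinct source prefixes)."""
    ips, prefixes = defaultdict(set), defaultdict(set)
    for tld, src in records:
        addr = ipaddress.ip_address(src)
        length = v4_len if addr.version == 4 else v6_len
        ips[tld].add(addr)
        prefixes[tld].add(ipaddress.ip_network(f"{src}/{length}", strict=False))
    return {tld: (len(ips[tld]), len(prefixes[tld])) for tld in ips}

# Hypothetical input: 50 sources in one /24 vs. 3 sources in 3 networks.
records = [("tld-a", f"192.0.2.{i}") for i in range(1, 51)]
records += [
    ("tld-b", "198.51.100.1"),
    ("tld-b", "203.0.113.1"),
    ("tld-b", "2001:db8::1"),
]
result = diversity(records)
```

Ranked by address count, `tld-a` looks far more diverse than `tld-b`; ranked by prefix count, the order reverses.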
>
>
> **Concern 7: Disagrees with findings from “Case Study of Collision
> Strings”.
>
> The document “Case Study of Collision Strings”, also written in connection
> with NCAP Study 2, contains the following findings:
>
> 1.     “A relatively small number of origin ASNs account for the vast
> majority of query traffic for .CORP, .HOME, and .MAIL. In all cases roughly
> 200 ASNs make up nearly 90% of the volume” (section 4.1.5).
>
> 2.     “Label analysis provides a unique observational context into the
> underlying systems, networks, and protocols inducing leakage of DNS queries
> to the global DNS ecosystem. Understanding the diversity of labels can help
> provide a sense of how broadly disseminated the leakage is throughout the
> DNS” (section 4.2.1).
>
> 3.     “The .CORP SLDs seen at both A and J (approximately 16 thousand) is
> almost equal to those seen at A-root alone, but J-root sees over 30,000
> .CORP SLDs that A-root does not see” (section 4.3.1).
>
> 4.     “Across all names studied, while A and J saw much in common, there
> was a non-negligible amount of uniqueness to each view. For example, A and
> J each saw queries from the same 5717 originating ASNs, but J saw 2477 ASNs
> that A didn't see and A saw 901 that [J] didn't see” (section 4.3.2).
>
> 5.     “A more intensive and thorough analysis would include other root
> server vantage points to minimize potential bias in the A and J catchments”
> (section 5.2).
>
> 6.     “Additional measurement from large recursive resolvers would also
> help elucidate any behaviors masked by negative caching and the population
> of stub resolvers” (section 5.2).
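
The ASN figures quoted in finding 4 can be turned into overlap ratios with a few lines of arithmetic. This assumes the 901 ASNs are those seen at A but not at J, which the quoted sentence implies but does not state outright.

```python
# ASN counts quoted in finding 4 (the reading of 901 is an assumption).
common, only_j, only_a = 5717, 2477, 901

union = common + only_j + only_a   # every ASN seen at A or J
seen_by_a = common + only_a
seen_by_j = common + only_j

jaccard = common / union           # overlap between the two views
a_share = seen_by_a / union        # fraction of all observed ASNs visible at A
j_share = seen_by_j / union        # fraction of all observed ASNs visible at J
```

With these numbers the two views share roughly 63% of the 9095 ASNs observed; A alone sees about 73% of them and J alone about 90%.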
>
>
> These findings emphasize the following points, which are at odds with the
> "Perspective" document:
>
> -       Including ASN (and IP prefix) in an analysis can make a
> significant difference in the overall diversity associated with
> observed queries.
>
> -       There is significance in the context provided by the QNAME, not
> only in measuring diversity, but also in query representativeness across
> root servers.
>
> -       Root servers, even just A and J, have a non-negligible amount of
> uniqueness that is not captured, or even addressed, in this document.
>
> -       More root servers have a greater perspective of potential name
> collisions than one.
>
> -       The population of stub resolvers should be considered in the
> analysis of large recursive resolvers.
>
>
>
> On Jan 24, 2022, at 2:43 PM, Thomas, Matthew via NCAP-Discuss <
> ncap-discuss at icann.org> wrote:
>
> NCAP DG,
> As set during our last meeting on 19 January 2022, we pushed the start of
> the public comment period for “A Perspective Study of DNS Queries for
> Non-Existent Top-Level Domains” to this Thursday, 27 January 2022, in order
> to accommodate some last minute questions. Additionally, as previously
> announced, today ends the comment period for the release of this document.
> Attached is the FINAL DRAFT version of “A Perspective Study of DNS Queries
> for Non-Existent Top-Level Domains”. If you have any objections to this
> document being released for public comment please reply to this message on
> the list. The objection period will close at the end of our weekly meeting
> on Wednesday, 26 January 2022. Comments that do not substantially change
> our stated conclusions will be captured and considered after the public
> comment period when we will be reviewing all public comments received.
> A view only version of the document is here:
> https://docs.google.com/document/d/1lMoicMusWGB0u3Jpqe0CB7kJHw_gR-alIXZErjxNEws/view#
> Matt Thomas
> <Last Call A Perspective Study of DNS Queries for Non-Existent Top-Level
> Domains.pdf>_______________________________________________
> NCAP-Discuss mailing list
> NCAP-Discuss at icann.org
> https://mm.icann.org/mailman/listinfo/ncap-discuss
>
> _______________________________________________
> By submitting your personal data, you consent to the processing of your
> personal data for purposes of subscribing to this mailing list accordance
> with the ICANN Privacy Policy (https://www.icann.org/privacy/policy) and
> the website Terms of Service (https://www.icann.org/privacy/tos). You can
> visit the Mailman link above to change your membership status or
> configuration, including unsubscribing, setting digest-style delivery or
> disabling delivery altogether (e.g., for a vacation), and so on.
>

-- 
Thomas Barrett
President
EnCirca, Inc
+1.781.942.9975 (office)
400 W. Cummings Park, Suite 1725
Woburn, MA 01801 USA