[NCAP-Discuss] Last call for A Perspective Study of DNS Queries for Non-Existent Top-Level Domains
James Galvin
galvin at elistx.com
Tue Jan 25 21:56:24 UTC 2022
Thank you for your comments, Tom.
Jim
On 25 Jan 2022, at 15:54, Tom Barrett wrote:
> I also appreciate the thorough analysis conducted by Matt and Casey.
> We
> have learned a lot from the analysis but it has also uncovered some
> new questions worth exploring.
>
> We could add a catch-all or placeholder section at the end of the
> document
> called "Potential areas for further study". Casey's points could be
> captured here.
>
> best regards,
>
> Tom
>
>
>
> On Tue, Jan 25, 2022 at 2:26 PM James Galvin <galvin at elistx.com>
> wrote:
>
>> First, I have to say that we all owe a debt of gratitude to Casey for
>> his
>> thorough and detailed review of this document. He raises some good
>> questions that should be considered and responded to directly.
>>
>> TL;DR - The question before this group is whether this document is
>> ready to be released for public comment on Thursday, 27 January 2022.
>> Speaking as
>> a co-Chair, the question I’m considering is whether or not the Key
>> Findings
>> of this document are at risk since, if so, the document would not be
>> ready
>> for public comment. It is my considered opinion the Key Findings are
>> not at
>> risk and that this document should be released for public comment on
>> Thursday. In addition, the discussion that Casey has started should
>> continue on the mailing list.
>>
>> In our 26 January 2022 meeting, absent any substantive objections, we
>> will
>> declare that the consensus of the past several months of analysis
>> discussion is that the Discussion Group believes the document is
>> ready to
>> be released for public comment.
>>
>> Long-winded response:
>>
>> Some of Casey’s concerns are focused on whether we have complete
>> data.
>> This is a fair concern because we don’t have complete data. In
>> addition, in
>> a few cases we have chosen to set aside some data sets, e.g., the 5
>> root
>> server data sets that were excluded from the root server analysis.
>> This is
>> an ordinary thing to do in data science. The most important thing to
>> do is
>> to be very clear about the data you are using and note that any
>> conclusions
>> are only based on what you know (i.e., you don’t know what you
>> don’t know).
>> We do this.
>>
>> We also know that we will never get complete data. We have not said a
>> lot
>> about this. So far we have noted that there are some legal
>> constraints
>> associated with a number of parties sharing the data we do have. In
>> fact,
>> although Verisign has used its data for some detailed analysis, for
>> which
>> we are extremely grateful, consistent with other root server
>> operators,
>> they have not made their data generally available to others. Public
>> Recursive Resolvers have presented the same issue to us; there was
>> only one
>> that did some level of analysis for us and another with even more
>> limited
>> analysis.
>>
>> As a result of both of these points, we have not done a complete and
>> thorough analysis of all possible data. Nonetheless, I do believe
>> that our
>> conclusions are supported by the data we have.
>>
>> The first key finding is that analysis of the data at any root server
>> identifier is sufficiently representative of the root server system
>> in
>> total. Bottom line - there is some subjectivity within this
>> statement.
>> However, statistics provides us with methods to measure the quality
>> of
>> comparisons and Matt Thomas has done this and presented it to us. It
>> is as
>> good as it can be, given the data we’re working with.
>>
>> Of course, since it is not a “perfect” analysis, there is some
>> residual
>> risk. It is essential we capture this point and explain it in our
>> final
>> work product. In fact, if you review the text that is already under
>> development there, the point of capturing residual risk is already
>> listed.
>>
>> The second key finding is that traffic observed at root servers is
>> not
>> sufficiently representative of traffic at recursive resolvers.
>> Frankly,
>> this point is self-evident. We may have incomplete data from a single
>> recursive resolver, but it nonetheless proves exactly this point.
>> Certainly
>> there might be many things we could learn from a more complete
>> analysis and
>> study of a more complete set of data at recursive resolvers, but none
>> of
>> that changes the key finding that public recursive resolvers see a
>> different DNS infrastructure.
>>
>> By the way, there is also residual risk here, which is also captured
>> in
>> final report draft text. There’s much more to say here to explain
>> it, but
>> it too does not change the key finding.
>>
>> As Casey points out, the implications noted in these key findings are
>> subject to discussion in a broader context. These implications will
>> be
>> brought forward to the final work product and discussed within the
>> context
>> of the workflow we have developed.
>>
>> In summary, there are two important things to consider. First, is the
>> data
>> analysis sufficient to support the key findings? Second, is the
>> residual
>> risk a fundamental concern or an ordinary risk management question to
>> be
>> considered?
>>
>> Some may be concerned they cannot evaluate these issues directly
>> themselves. My suggestion is that we continue this discussion of
>> these
>> technical issues on the mailing list. This will facilitate thoughtful
>> and
>> detailed responses, and allow everyone the opportunity to share the
>> discussion with other experts to review the details.
>>
>> See you all tomorrow,
>>
>> Jim
>>
>>
>>
>>
>> On 25 Jan 2022, at 11:19, Casey Deccio wrote:
>>
>> My apologies that I am responding to my own email. Someone noted to
>> me
>> that I neglected two very important points. First, I mentioned that
>> I
>> didn’t agree with the conclusions of the document, but I only
>> provided
>> my critique of the analysis, not the conclusions. Second, I have not
>> explicitly provided any suggestions for a path forward. Let me
>> correct both omissions here.
>>
>>
>> Conclusions
>> ---------------
>>
>> ** Study 1 Key Observations
>> I have no disagreement with this section. It is an accurate
>> summary of the results of Study 1.
>>
>>
>> ** Study 2 Key Observations
>> The following statement is true, but the qualifying factor “top”
>> is based
>> on a comparison (query count and IP address diversity) that is unfair
>> (see
>> Concerns 5 and 6).
>>
>> “Initial results from one PRR indicate there is a difference in top
>> non-existent TLDs using either query volume or source diversity
>> measurements.”
>>
>> The following two statements are generalities inferred from the
>> previous
>> statement, and which are not supported by the data, precisely because
>> of Concerns 5 and 6.
>>
>> “Many non-existent TLDs (roughly 40%) observed at the PRR are not
>> in the
>> top RSIs based on query volume. Nearly 30% observed at the PRR are
>> not in
>> the top RSIs based on source diversity.”
>>
>> “… name collision strings cannot be measured or assessed properly
>> based on
>> only using data from the RSS.”
>>
>> I agree and sympathize with the notion that there were heavy
>> constraints
>> of privacy and data aggregation associated with the analysis of the
>> public
>> recursive resolver data, but the comparison made thus far is an
>> unfair
>> comparison.
>>
>>
>> ** Key Findings
>>
>> The following statement is true, but based on analysis that was
>> performed
>> on highly biased data, specifically less than 1% of IP addresses
>> observed
>> at the root servers and top 10,000 and 1,000 of non-existent TLDs
>> (see
>> Concerns 1, 2, 3, 4, and 7):
>>
>> “Non-existent DNS queries for top querying and top source diversity
>> TLDs
>> appear to be comparable and representative at any RSI.”
>>
>> (Nit: Sentence reads “Non-existent DNS queries”, but I think what
>> is meant
>> is “DNS queries for non-existent TLDs.”)
>>
>> The following statement is inconclusive because of the bias in the
>> data
>> that was analyzed.
>>
>> “PRR data further indicates that there is a very different view of
>> the top
>> non-existent TLDs based both on query volume and source diversity.”
>>
>> I do not *disagree* with the following statement:
>>
>> “ICANN, as the operator for the L RSI, is well-positioned to
>> instrument,
>> collect, analyze, and disseminate name collision measurements to
>> subsequent
>> gTLD applicants both prior to submission and during the application
>> review.”
>>
>> But I feel like the point of the document was to motivate this with
>> “You’ve seen one, you’ve seen them all.” And the analysis
>> does not support
>> that, at least not as generally as it was stated.
>>
>> I do not believe that following statements are supported by the data:
>>
>> “Name collision traffic observed at the root is not sufficiently
>> representative of traffic received at recursive resolvers to
>> guarantee a
>> complete and or accurate representation of a string’s potential
>> name
>> collision risks and impacts.”
>>
>> “Name collision strings cannot be measured or assessed properly
>> based on
>> only using data from the RSS. Obtaining an accurate picture of name
>> collision risks can only be obtained via delegation.”
>>
>> These might well be true, but the analysis in this document does not
>> motivate this (see Concerns 5 and 6). There are other factors that
>> might
>> contribute to these, some considered in this document (negative
>> caching) and others not (local root and aggressive negative caching).
>> But
>> my point is that the current analysis does not lead me to the
>> conclusions
>> included in the sentences above.
>>
>>
>> Suggestions for Improvement
>> ---------------
>>
>> ** Limit conclusions in “Key Findings”.
>>
>> Study 1 Key Observations is an example of conclusions that *can* be
>> drawn
>> from the existing analysis. The step from these to generalities of
>> representativeness of data in “Key Findings” is where my concerns
>> lie. If
>> the “Key Findings” related to Study 1 are honed to be in scope
>> with the
>> analysis, their impact might be significantly less, but it gives me
>> less
>> concern.
>>
>>
>> ** Revamp Root Server Analysis.
>> If the purpose of the document is to come to more general
>> conclusions,
>> such as those previously mentioned, then the analysis needs to be
>> revamped:
>> - Rather than selecting biased data (i.e., top talkers), the
>> data must be representative across IP prefix, ASN, and IP version.
>> Those dimensions are completely missing from the current analysis.
>> - Before any filtering is done, a comprehensive analysis should
>> be
>> performed, with all IP addresses. Even if the conclusion is that a
>> filter
>> of some sort is appropriate, and a representative set can be yielded
>> with
>> such filtering, there is no comparison given, and the reader is
>> simply left
>> to make that leap of faith.
>> - Any filtering should not limit the analysis to the IP
>> addresses
>> with the most queries—certainly not the top 1%. Some filter by query
>> count might be fine, but it should be a low bar, and it should be
>> justified by behavior and representation (IP prefix, ASN, IP version).
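One way to act on the first two bullets is to measure how far a filtered sample drifts from the full population along a dimension such as ASN. A minimal sketch, with invented ASNs and counts (not the actual DITL data):

```python
# Hypothetical representativeness check: compare the ASN distribution of
# the full dataset against a filtered subset. All values here are
# invented for illustration.
from collections import Counter

def asn_shares(records):
    """records: list of (ip, asn) pairs; return each ASN's share of sources."""
    counts = Counter(asn for _, asn in records)
    total = sum(counts.values())
    return {asn: n / total for asn, n in counts.items()}

full = [("a", 64496), ("b", 64496), ("c", 64497), ("d", 64498)]
kept = [("a", 64496), ("b", 64496)]  # e.g., after a top-talker filter

# Per-ASN drift between the full population and the filtered sample.
drift = {asn: abs(asn_shares(kept).get(asn, 0.0) - share)
         for asn, share in asn_shares(full).items()}
# Large drift values flag ASNs over- or under-represented by the filter.
```

The same comparison could be run per IP prefix or IP version; the point is that any filter should be justified by showing such drift is small.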
>>
>>
>> ** Revamp Comparison of Root Server Queries and Public Recursive
>> Resolver Data
>> I understand that there are data constraints within which the
>> recursive
>> data must be analyzed, but there are analyses that *can* be done,
>> even
>> within those constraints. For example, rather than sorting by top
>> query count and IP address diversity, start with the complete set of
>> non-existent TLDs and (if the data includes it) full QNAME diversity,
>> or at
>> least SLD diversity. As it is, the comparison of root server data
>> and
>> public recursive resolver data is unfair and therefore does not
>> provide
>> substance.
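The suggested comparison could start from something like the following sketch, which ranks non-existent TLDs by SLD diversity instead of raw query count. The input format and names are assumptions, not the actual PRR data schema:

```python
# Hypothetical sketch: measure SLD diversity per non-existent TLD.
# The QNAME list and its format are invented for illustration.
from collections import defaultdict

def sld_diversity(qnames):
    """Map each TLD to the number of distinct SLDs queried under it."""
    slds = defaultdict(set)
    for qname in qnames:
        labels = qname.rstrip(".").lower().split(".")
        if len(labels) >= 2:
            tld, sld = labels[-1], labels[-2]
            slds[tld].add(sld)
    return {tld: len(s) for tld, s in slds.items()}

# "home" receives more queries, but "corp" shows more SLD diversity.
root_view = sld_diversity(["a.corp", "b.corp", "x.home", "x.home", "x.home"])
```

Running the same measurement over both the root and PRR datasets would give a comparison that is not distorted by caching-driven differences in query volume.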
>>
>>
>> ** Accept Data-Driven Conclusions
>> There might be some desirable conclusions that the data simply does
>> not
>> support. I have little concern with listing conclusions that are not
>> what
>> was anticipated beforehand (i.e., hypotheses). Even very caveated
>> conclusions coming from an analysis can be enough to inform
>> decisions, if
>> those caveats are considered for what they are worth. I
>> have great concern, however, with conclusions that are not
>> data-driven.
>>
>>
>> On Jan 24, 2022, at 11:25 PM, Casey Deccio <casey at deccio.net> wrote:
>>
>> Dear all,
>>
>> I have taken the time to study the “Perspective” document, as
>> well as the
>> document “Case Study of Collision Strings”, which is also being
>> produced by
>> NCAP in connection with Study 2. I appreciate all the time and
>> effort that
>> has gone into the analysis contained in “A Perspective Study of DNS
>> Queries
>> for Non-Existent Top-Level Domains”. I know that it has required
>> no small
>> effort.
>>
>> Nonetheless, I have fundamental concerns about the analysis contained
>> in
>> “Perspective”, and I also do not agree with the conclusions that
>> are
>> drawn from the analysis. Additionally, I find the analysis and
>> conclusions
>> in “Perspective” to be at odds with those contained in “Case
>> Study”. Finally, I believe my concerns to be substantial enough
>> that they
>> cannot be corrected with minor edits, and I *do not* support the
>> document
>> moving forward. I herein detail my concerns.
>>
>> Sincerely,
>> Casey
>>
>>
>>
>> Summary:
>> Concern 1: Analysis based on biased sample of querying IP addresses.
>> Concern 2: Sample data refined to support the conclusion.
>> Concern 3: Analysis based on biased sample of non-existent TLDs.
>> Concern 4: TLDs considered without QNAME context.
>> Concern 5: Query count used as comparison between recursive server
>> and
>> root servers.
>> Concern 6: Unique IP addresses used as comparison between recursive
>> server
>> and root servers.
>> Concern 7: Disagrees with findings from “Case Study of Collision
>> Strings”.
>>
>>
>> Details:
>>
>>
>> **Concern 1: Analysis based on biased sample of querying IP
>> addresses.
>>
>> The sample on which the analysis and conclusions are based is
>> selected
>> exclusively by proportion of queries observed during the
>> collection period. Specifically, fewer than 1% (0.67% or 115K) of
>> the 17M
>> IP addresses (15.51M IPv4 and 1.56M IPv6) observed across the 2020
>> DITL are
>> considered for the analysis—those producing the most queries (90%
>> of the
>> DITL data); that excludes 99% of IP addresses from analysis. Because
>> the
>> set of “top talker” IP addresses is selected based only on the
>> volume of
>> traffic, it is severely biased and is not necessarily representative
>> of
>> resolvers world-wide. Those that query most—for whatever
>> reasons—are the loudest, and without further examination, it’s
>> hard to even
>> know why. The concern is not even just whether or not it is okay to
>> exclude non-top-talkers, but whether top-talkers are themselves an
>> appropriate representation. Other metrics that could be used to
>> quantify network representation for the selection process and/or
>> analysis
>> of top-talkers are missing from the analysis, including IP prefix
>> (e.g., /16, /24, /48, /64), ASN, and even IP version. See
>> also Concern 7 for more.
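The volume-based selection described above can be sketched as follows; the counts are invented, but they show how a couple of loud sources can satisfy a 90%-of-traffic cutoff while almost every IP address is excluded:

```python
# Illustrative sketch of "top talker" selection by cumulative query
# share. Counts are made up; this is the shape of the bias, not the
# actual NCAP methodology.
def top_talkers(query_counts, share=0.90):
    """Return the smallest set of IPs whose queries cover `share` of total."""
    total = sum(query_counts.values())
    selected, covered = [], 0
    for ip, n in sorted(query_counts.items(), key=lambda kv: -kv[1]):
        selected.append(ip)
        covered += n
        if covered >= share * total:
            break
    return selected

counts = {"10.0.0.1": 8000, "10.0.0.2": 1500, "10.0.0.3": 300, "10.0.0.4": 200}
# 8000 + 1500 already covers 90% of 10000 queries, so only 2 of the 4
# IPs are kept; nothing about the selection says they are representative.
loud = top_talkers(counts)
```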
>>
>> The analysis in Annex 2 is very interesting, but does not, by itself,
>> resolve this concern. The annex provides some very helpful lists of
>> the top queries from low-query-volume sources resulting in NXDOMAIN
>> responses, and there are some comparisons of the percentage of
>> queries resulting in NXDOMAIN responses out of the total number of
>> queries, but even those are difficult to assess without a full
>> behavioral analysis.
>>
>>
>> **Concern 2: Sample data refined to support the conclusion.
>>
>> While the original sample is already of questionable representation
>> (less
>> than 1% of IP addresses observed, based solely on query volume), that
>> dataset is further refined, according to the following text (i.e.,
>> from the
>> document):
>>
>> “On average, each RSI observed 96% of the top talkers that account
>> for
>> 90% of total traffic. That percentage drops to 94% when using the
>> 95th
>> percentile top talkers. Based on these findings, only the 90th
>> percentile top talkers were used for the remaining measurements in
>> this
>> study.”
>>
>> If the objective of the analysis is to quantify the overlap of
>> observed
>> query data across the root servers, and to ultimately determine
>> whether the
>> queries observed at one server are representative of the queries
>> observed across all samples, then refinement of sampled IP addresses
>> to
>> support that conclusion is inappropriate.
>>
>>
>> **Concern 3: Analysis based on biased sample of non-existent TLDs.
>>
>> The queries for non-existent TLDs, which result in NXDOMAIN responses
>> at
>> the root servers, are compared across the root servers, to see how
>> well
>> they are represented. However, like observed IP addresses (Concern
>> 1), the
>> non-existent TLDs are limited to those corresponding to the most
>> queries
>> observed—both the top 10,000 and the top 1,000. This is
>> independent of
>> querying IP address, ASN, and other aggregating features, which would
>> help
>> better understand the diversity of the queries for each non-existent
>> TLD. For example, it might be that the non-existent TLDs most
>> queried for
>> come from a small pool of IP addresses or networks, and others are
>> being
>> excluded simply because they are outside that sample.
>>
>>
>> **Concern 4: TLDs considered without QNAME context.
>>
>> While comparisons are made to measure the representativeness of
>> non-existent TLDs, one primary feature missing from the analysis is
>> the
>> QNAME. In all cases, the non-existent TLD is considered in
>> isolation, yet
>> QNAME context is shown in the analysis to be a significant
>> contributor to
>> quantifying name collisions potential (see Concern 7).
>>
>>
>> **Concern 5: Query count used as comparison between recursive server
>> and
>> root servers.
>>
>> Because of (negative) caching at recursive servers, it is expected
>> that
>> queries observed at the root servers for a given non-existent TLD
>> will be
>> fewer than those at a recursive resolver for that
>> same non-existent TLD. It is this very caching behavior that
>> makes the comparison of query count for a given non-existent TLD, as
>> observed by the root servers vs. a recursive resolver, an
>> apples-to-oranges
>> comparison. Yet the analysis includes a comparison of the top 1,000
>> non-existent TLDs, ranked by query count. Thus, no meaningful
>> conclusions
>> can be drawn from this comparison.
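The caching effect can be made concrete with a small simulation. The TTL and query pattern are invented; the point is only that a recursive resolver absorbs most repeat queries, so the root's count for the same non-existent TLD is structurally lower:

```python
# Minimal sketch of why negative caching makes root-vs-recursive query
# counts an apples-to-oranges comparison. The negative-cache TTL and
# query timing are invented for illustration.
def root_queries_seen(client_query_times, neg_ttl):
    """Count queries a root server sees when the recursive negatively
    caches the NXDOMAIN for `neg_ttl` seconds."""
    seen, cache_expires = 0, -1.0
    for t in sorted(client_query_times):
        if t >= cache_expires:      # cache miss: the query reaches the root
            seen += 1
            cache_expires = t + neg_ttl
        # cache hit: answered locally, invisible to the root
    return seen

# 100 client queries over 100 seconds, one per second.
times = list(range(100))
hits_at_root = root_queries_seen(times, neg_ttl=30)  # the root sees only 4
```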
>>
>>
>> **Concern 6: Unique IP addresses used as comparison between recursive
>> server and root servers.
>>
>> Study 2 includes source diversity when comparing the query counts for
>> non-existent TLDs. There is certainly more value in investigating IP
>> source diversity when considering the query counts for non-existent
>> TLDs
>> than considering query counts alone (Concern 5). However, it is
>> expected
>> that recursive resolvers serve a very different client base than
>> authoritative servers, specifically the root servers. Whereas the
>> former might expect queries from stub resolvers, the latter might
>> expect queries from recursive resolvers. In such a case, analyzing
>> client
>> IP addresses independently of one another leaves out significant
>> context, such as the diversity of IP prefixes or ASNs from
>> which queries arrive. A large number of IP addresses from the same
>> IP
>> prefix or ASN might be responsible for the queries associated with
>> several
>> “top” non-existent TLDs, excluding non-existent TLDs that
>> might have non-trivial presence but do not have the top IP address
>> diversity. See also Concern 7.
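Aggregating source IPs by prefix before counting, as suggested here, might look like the following sketch (example addresses only; a real analysis would also group by ASN):

```python
# Sketch of counting distinct /24 prefixes rather than raw IPs per
# non-existent TLD. Addresses are documentation-range examples.
import ipaddress

def prefix_diversity(ips_by_tld, prefixlen=24):
    """Count distinct /`prefixlen` prefixes per non-existent TLD."""
    result = {}
    for tld, ips in ips_by_tld.items():
        prefixes = {ipaddress.ip_network(f"{ip}/{prefixlen}", strict=False)
                    for ip in ips}
        result[tld] = len(prefixes)
    return result

# "corp" has 4 unique IPs but they all share one /24, so its prefix
# diversity (1) is lower than "home" (2), despite fewer raw IPs.
ips = {"corp": ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"],
       "home": ["192.0.2.5", "198.51.100.7"]}
div = prefix_diversity(ips)
```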
>>
>>
>> **Concern 7: Disagrees with findings from “Case Study of Collision
>> Strings”.
>>
>> The document “Case Study of Collision Strings”, also written in
>> connection
>> with NCAP Study 2, contains the following findings:
>>
>> 1. “A relatively small number of origin ASNs account for the
>> vast
>> majority of query traffic for .CORP, .HOME, and .MAIL. In all cases
>> roughly
>> 200 ASNs make up nearly 90% of the volume” (section 4.1.5).
>>
>> 2. “Label analysis provides a unique observational context into
>> the
>> underlying systems, networks, and protocols inducing leakage of DNS
>> queries
>> to the global DNS ecosystem. Understanding the diversity of labels
>> can help
>> provide a sense of how broadly disseminated the leakage is throughout
>> the
>> DNS” (section 4.2.1).
>>
>> 3. “The .CORP SLDs seen at both A and J (approximately 16
>> thousand) is
>> almost equal to those seen at A-root alone, but J-root sees over
>> 30,000
>> .CORP SLDs that A-root does not see” (section 4.3.1).
>>
>> 4. “Across all names studied, while A and J saw much in common,
>> there
>> was a non-negligible amount of uniqueness to each view. For example,
>> A and
>> J each saw queries from the same 5717 originating ASNs, but J saw
>> 2477 ASNs
>> that A didn't see and A saw 901 that [J] didn't see” (section 4.3.2).
>>
>> 5. “A more intensive and thorough analysis would include other
>> root
>> server vantage points to minimize potential bias in the A and J
>> catchments”
>> (section 5.2).
>>
>> 6. “Additional measurement from large recursive resolvers would
>> also
>> help elucidate any behaviors masked by negative caching and the
>> population
>> of stub resolvers” (section 5.2).
>>
>>
>> These findings emphasize the following points, which are at odds with
>> the
>> "Perspective" document:
>>
>> - Including ASN (and IP prefix) in an analysis can make a
>> significant difference in the overall diversity associated with
>> observed queries.
>>
>> - There is significance in the context provided by the QNAME,
>> not
>> only in measuring diversity, but also in query representativeness
>> across
>> root servers.
>>
>> - Root servers—even just A and J—have a non-negligible
>> amount of
>> uniqueness that is not captured, or even addressed, in this document.
>>
>> - Multiple root servers provide a broader perspective on
>> potential name collisions than any single one.
>>
>> - The population of stub resolvers should be considered in the
>> analysis of large recursive resolvers.
>>
>>
>>
>> On Jan 24, 2022, at 2:43 PM, Thomas, Matthew via NCAP-Discuss <
>> ncap-discuss at icann.org> wrote:
>>
>> NCAP DG,
>> As decided during our last meeting on 19 January 2022, we pushed the
>> start of
>> the public comment period for “A Perspective Study of DNS Queries
>> for
>> Non-Existent Top-Level Domains” to this Thursday, 27 January 2022,
>> in order
>> to accommodate some last-minute questions. Additionally, as
>> previously
>> announced, today ends the comment period for the release of this
>> document.
>> Attached is the FINAL DRAFT version of “A Perspective Study of DNS
>> Queries
>> for Non-Existent Top-Level Domains”. If you have any objections to
>> this
>> document being released for public comment please reply to this
>> message on
>> the list. The objection period will close at the end of our weekly
>> meeting
>> on Wednesday, 26 January 2022. Comments that do not substantially
>> change
>> our stated conclusions will be captured and considered after the
>> public
>> comment period when we will be reviewing all public comments
>> received.
>> A view only version of the document is here:
>> https://docs.google.com/document/d/1lMoicMusWGB0u3Jpqe0CB7kJHw_gR-alIXZErjxNEws/view#
>> Matt Thomas
>> <Last Call A Perspective Study of DNS Queries for Non-Existent
>> Top-Level Domains.pdf>
>> _______________________________________________
>> NCAP-Discuss mailing list
>> NCAP-Discuss at icann.org
>> https://mm.icann.org/mailman/listinfo/ncap-discuss
>>
>> _______________________________________________
>> By submitting your personal data, you consent to the processing of
>> your
>> personal data for purposes of subscribing to this mailing list in
>> accordance
>> with the ICANN Privacy Policy (https://www.icann.org/privacy/policy)
>> and
>> the website Terms of Service (https://www.icann.org/privacy/tos). You
>> can
>> visit the Mailman link above to change your membership status or
>> configuration, including unsubscribing, setting digest-style delivery
>> or
>> disabling delivery altogether (e.g., for a vacation), and so on.
>>
>>
>>
>
>
>
> --
> Thomas Barrett
> President
> EnCirca, Inc
> +1.781.942.9975 (office)
> 400 W. Cummings Park, Suite 1725
> Woburn, MA 01801 USA