[NCAP-Discuss] Last call for A Perspective Study of DNS Queries for Non-Existent Top-Level Domains
James Galvin
galvin at elistx.com
Tue Jan 25 21:56:24 UTC 2022
Thank you for your comments, Tom.
Jim
On 25 Jan 2022, at 15:54, Tom Barrett wrote:
> I also appreciate the thorough analysis conducted by Matt and Casey.
> We
> have learned a lot from the analysis but it has also uncovered some
> new questions worth exploring.
>
> We could add a catch-all or placeholder section at the end of the
> document
> called "Potential areas for further study". Casey's points could be
> captured here.
>
> best regards,
>
> Tom
>
>
>
> On Tue, Jan 25, 2022 at 2:26 PM James Galvin <galvin at elistx.com>
> wrote:
>
>> First, I have to say that we all owe a debt of gratitude to Casey for
>> his
>> thorough and detailed review of this document. He raises some good
>> questions that should be considered and responded to directly.
>>
>> TL;DR - The question before this group is whether this document is
>> ready to be released for public comment on Thursday, 27 January 2022.
>> Speaking as
>> a co-Chair, the question I’m considering is whether or not the Key
>> Findings
>> of this document are at risk since, if so, the document would not be
>> ready
>> for public comment. It is my considered opinion the Key Findings are
>> not at
>> risk and that this document should be released for public comment on
>> Thursday. In addition, the discussion that Casey has started should
>> continue on the mailing list.
>>
>> In our 26 January 2022 meeting, absent any substantive objections, we
>> will
>> declare that the consensus of the past several months of analysis
>> discussion is that the Discussion Group believes the document is
>> ready to
>> be released for public comment.
>>
>> Long-winded response:
>>
>> Some of Casey’s concerns are focused on whether we have complete
>> data.
>> This is a fair concern because we don’t have complete data. In
>> addition, in
>> a few cases we have chosen to set aside some data sets, e.g., the 5
>> root
>> server data sets that were excluded from the root server analysis.
>> This is
>> an ordinary thing to do in data science. The most important thing to
>> do is
>> to be very clear about the data you are using and note that any
>> conclusions
>> are only based on what you know (i.e., you don’t know what you
>> don’t know).
>> We do this.
>>
>> We also know that we will never get complete data. We have not said a
>> lot
>> about this. So far we have noted that there are some legal
>> constraints
>> associated with a number of parties sharing the data we do have. In
>> fact,
>> although Verisign has used its data for some detailed analysis, for
>> which
>> we are extremely grateful, consistent with other root server
>> operators,
>> they have not made their data generally available to others. Public
>> Recursive Resolvers have presented the same issue to us; there was
>> only one
>> that did some level of analysis for us and another with even more
>> limited
>> analysis.
>>
>> As a result of both of these points, we have not done a complete and
>> thorough analysis of all possible data. Nonetheless, I do believe
>> that our
>> conclusions are supported by the data we have.
>>
>> The first key finding is that analysis of the data at any root server
>> identifier is sufficiently representative of the root server system
>> in
>> total. Bottom line - there is some subjectivity within this
>> statement.
>> However, statistics provides us with methods to measure the quality
>> of
>> comparisons and Matt Thomas has done this and presented it to us. It
>> is as
>> good as it can be, given the data we’re working with.
>>
>> Of course, since it is not a “perfect” analysis, there is some
>> residual
>> risk. It is essential we capture this point and explain it in our
>> final
>> work product. In fact, if you review the text that is already under
>> development there, the point of capturing residual risk is already
>> listed.
>>
>> The second key finding is that traffic observed at root servers is
>> not
>> sufficiently representative of traffic at recursive resolvers.
>> Frankly,
>> this point is self-evident. We may have incomplete data from a single
>> recursive resolver, but it nonetheless proves exactly this point.
>> Certainly
>> there might be many things we could learn from a more complete
>> analysis and
>> study of a more complete set of data at recursive resolvers, but none
>> of
>> that changes the key finding that public recursive resolvers see a
>> different DNS infrastructure.
>>
>> By the way, there is also residual risk here, which is also captured
>> in
>> final report draft text. There’s much more to say here to explain
>> it, but
>> it too does not change the key finding.
>>
>> As Casey points out, the implications noted in these key findings are
>> subject to discussion in a broader context. These implications will
>> be
>> brought forward to the final work product and discussed within the
>> context
>> of the workflow we have developed.
>>
>> In summary, there are two important things to consider. First, is the
>> data
>> analysis sufficient to support the key findings? Second, is the
>> residual
>> risk a fundamental concern or an ordinary risk management question to
>> be
>> considered?
>>
>> Some may be concerned they cannot evaluate these issues directly
>> themselves. My suggestion is that we continue this discussion of
>> these
>> technical issues on the mailing list. This will facilitate thoughtful
>> and
>> detailed responses, and allow everyone the opportunity to share the
>> discussion with other experts to review the details.
>>
>> See you all tomorrow,
>>
>> Jim
>>
>>
>>
>>
>> On 25 Jan 2022, at 11:19, Casey Deccio wrote:
>>
>> My apologies that I am responding to my own email. Someone noted to
>> me
>> that I neglected two very important points. First, I mentioned that
>> I
>> didn’t agree with the conclusions of the document, but I only
>> provided
>> my critique of the analysis, not the conclusions. Second, I have not
>> explicitly provided any suggestions for a path forward. Let me
>> correct both omissions here.
>>
>>
>> Conclusions
>> ---------------
>>
>> ** Study 1 Key Observations
>> I have no disagreement with this section. It is an accurate
>> summary of the results of Study 1.
>>
>>
>> ** Study 2 Key Observations
>> The following statement is true, but the qualifying factor “top”
>> is based
>> on a comparison (query count and IP address diversity) that is unfair
>> (see
>> Concerns 5 and 6).
>>
>> “Initial results from one PRR indicate there is a difference in top
>> non-existent TLDs using either query volume or source diversity
>> measurements.”
>>
>> The following two statements are generalities inferred from the
>> previous
>> statement, and which are not supported by the data, precisely because
>> of Concerns 5 and 6.
>>
>> “Many non-existent TLDs (roughly 40%) observed at the PRR are not
>> in the
>> top RSIs based on query volume. Nearly 30% observed at the PRR are
>> not in
>> the top RSIs based on source diversity.”
>>
>> “… name collision strings cannot be measured or assessed properly
>> based on
>> only using data from the RSS.”
>>
>> I agree and sympathize with the notion that there were heavy
>> constraints
>> of privacy and data aggregation associated with the analysis of the
>> public
>> recursive resolver data, but the comparison made thus far is an
>> unfair
>> comparison.
>>
>>
>> ** Key Findings
>>
>> The following statement is true, but based on analysis that was
>> performed
>> on highly biased data, specifically less than 1% of IP addresses
>> observed
>> at the root servers and top 10,000 and 1,000 of non-existent TLDs
>> (see
>> Concerns 1, 2, 3, 4, and 7):
>>
>> “Non-existent DNS queries for top querying and top source diversity
>> TLDs
>> appear to be comparable and representative at any RSI.”
>>
>> (Nit: Sentence reads “Non-existent DNS queries”, but I think what
>> is meant
>> is “DNS queries for non-existent TLDs.”)
>>
>> The following statement is inconclusive because of the bias in the
>> data
>> that was analyzed.
>>
>> “PRR data further indicates that there is a very different view of
>> the top
>> non-existent TLDs based both on query volume and source diversity.”
>>
>> I do not *disagree* with the following statement:
>>
>> “ICANN, as the operator for the L RSI, is well-positioned to
>> instrument,
>> collect, analyze, and disseminate name collision measurements to
>> subsequent
>> gTLD applicants both prior to submission and during the application
>> review.”
>>
>> But I feel like the point of the document was to motivate this with
>> “You’ve seen one, you’ve seen them all.” And the analysis
>> does not support
>> that, at least not as generally as it was stated.
>>
>> I do not believe that following statements are supported by the data:
>>
>> “Name collision traffic observed at the root is not sufficiently
>> representative of traffic received at recursive resolvers to
>> guarantee a
>> complete and or accurate representation of a string’s potential
>> name
>> collision risks and impacts.”
>>
>> “Name collision strings cannot be measured or assessed properly
>> based on
>> only using data from the RSS. Obtaining an accurate picture of name
>> collision risks can only be obtained via delegation.”
>>
>> These might well be true, but the analysis in this document does not
>> motivate this (see Concerns 5 and 6). There are other factors that
>> might
>> contribute to these, some considered in this document (negative
>> caching) and others not (local root and aggressive negative caching).
>> But
>> my point is that the current analysis does not lead me to the
>> conclusions
>> included in the sentences above.
>>
>>
>> Suggestions for Improvement
>> ---------------
>>
>> ** Limit conclusions in “Key Findings”.
>>
>> Study 1 Key Observations is an example of conclusions that *can* be
>> drawn
>> from the existing analysis. The step from these to generalities of
>> representativeness of data in “Key Findings” is where my concerns
>> lie. If
>> the “Key Findings” related to Study 1 are honed to be in scope
>> with the
>> analysis, their impact might be significantly less, but it gives me
>> less
>> concern.
>>
>>
>> ** Revamp Root Server Analysis.
>> If the purpose of the document is to come to more general
>> conclusions,
>> such as those previously mentioned, then the analysis needs to be
>> revamped:
>> - Rather than selecting biased data (i.e., top talkers), the
>> data must be representative across IP prefix, ASN, and IP version.
>> Those dimensions are completely missing from the current analysis.
>> - Before any filtering is done, a comprehensive analysis should
>> be
>> performed, with all IP addresses. Even if the conclusion is that a
>> filter
>> of some sort is appropriate, and a representative set can be yielded
>> with
>> such filtering, there is no comparison given, and the reader is
>> simply left
>> to make that leap of faith.
>> - Any filtering should not limit the analysis to the IP
>> addresses
>> with the most queries—certainly not the top 1%. Some filter by query
>> count might be fine, but it should be a low bar, and it should be
>> justified by behavior and representation (IP prefix, ASN, IP version).
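One way to act on the first two bullets is to measure how far a filtered sample drifts from the full population along a dimension such as ASN. A minimal sketch, with invented ASNs and counts (not the actual DITL data):

```python
# Hypothetical representativeness check: compare the ASN distribution of
# the full dataset against a filtered subset. All values here are
# invented for illustration.
from collections import Counter

def asn_shares(records):
    """records: list of (ip, asn) pairs; return each ASN's share of sources."""
    counts = Counter(asn for _, asn in records)
    total = sum(counts.values())
    return {asn: n / total for asn, n in counts.items()}

full = [("a", 64496), ("b", 64496), ("c", 64497), ("d", 64498)]
kept = [("a", 64496), ("b", 64496)]  # e.g., after a top-talker filter

# Per-ASN drift between the full population and the filtered sample.
drift = {asn: abs(asn_shares(kept).get(asn, 0.0) - share)
         for asn, share in asn_shares(full).items()}
# Large drift values flag ASNs over- or under-represented by the filter.
```

The same comparison could be run per IP prefix or IP version; the point is that any filter should be justified by showing such drift is small.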
>>
>>
>> ** Revamp Comparison of Root Server Queries and Public Recursive
>> Resolver Data
>> I understand that there are data constraints within which the
>> recursive
>> data must be analyzed, but there are analyses that *can* be done,
>> even
>> within those constraints. For example, rather than sorting by top
>> query count and IP address diversity, start with the complete set of
>> non-existent TLDs and (if the data includes it) full QNAME diversity,
>> or at
>> least SLD diversity. As it is, the comparison of root server data
>> and
>> public recursive resolver data is unfair and therefore does not
>> provide
>> substance.
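The suggested comparison could start from something like the following sketch, which ranks non-existent TLDs by SLD diversity instead of raw query count. The input format and names are assumptions, not the actual PRR data schema:

```python
# Hypothetical sketch: measure SLD diversity per non-existent TLD.
# The QNAME list and its format are invented for illustration.
from collections import defaultdict

def sld_diversity(qnames):
    """Map each TLD to the number of distinct SLDs queried under it."""
    slds = defaultdict(set)
    for qname in qnames:
        labels = qname.rstrip(".").lower().split(".")
        if len(labels) >= 2:
            tld, sld = labels[-1], labels[-2]
            slds[tld].add(sld)
    return {tld: len(s) for tld, s in slds.items()}

# "home" receives more queries, but "corp" shows more SLD diversity.
root_view = sld_diversity(["a.corp", "b.corp", "x.home", "x.home", "x.home"])
```

Running the same measurement over both the root and PRR datasets would give a comparison that is not distorted by caching-driven differences in query volume.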
>>
>>
>> ** Accept Data-Driven Conclusions
>> There might be some desirable conclusions that the data simply does
>> not
>> support. I have little concern with listing conclusions that are not
>> what
>> was anticipated beforehand (i.e., hypotheses). Even very caveated
>> conclusions coming from an analysis can be enough to inform
>> decisions, if
>> those caveats are considered for what they are worth. I
>> have great concern, however, with conclusions that are not
>> data-driven.
>>
>>
>> On Jan 24, 2022, at 11:25 PM, Casey Deccio <casey at deccio.net> wrote:
>>
>> Dear all,
>>
>> I have taken the time to study the “Perspective” document, as
>> well as the
>> document “Case Study of Collision Strings”, which is also being
>> produced by
>> NCAP in connection with Study 2. I appreciate all the time and
>> effort that
>> has gone into the analysis contained in “A Perspective Study of DNS
>> Queries
>> for Non-Existent Top-Level Domains”. I know that it has required
>> no small
>> effort.
>>
>> Nonetheless, I have fundamental concerns about the analysis contained
>> in
>> “Perspective”, and I also do not agree with the conclusions that
>> are
>> drawn from the analysis. Additionally, I find the analysis and
>> conclusions
>> in “Perspective” to be at odds with those contained in “Case
>> Study”. Finally, I believe my concerns to be substantial enough
>> that they
>> cannot be corrected with minor edits, and I *do not* support the
>> document
>> moving forward. I herein detail my concerns.
>>
>> Sincerely,
>> Casey
>>
>>
>>
>> Summary:
>> Concern 1: Analysis based on biased sample of querying IP addresses.
>> Concern 2: Sample data refined to support the conclusion.
>> Concern 3: Analysis based on biased sample of non-existent TLDs.
>> Concern 4: TLDs considered without QNAME context.
>> Concern 5: Query count used as comparison between recursive server
>> and
>> root servers.
>> Concern 6: Unique IP addresses used as comparison between recursive
>> server
>> and root servers.
>> Concern 7: Disagrees with findings from “Case Study of Collision
>> Strings”.
>>
>>
>> Details:
>>
>>
>> **Concern 1: Analysis based on biased sample of querying IP
>> addresses.
>>
>> The sample on which the analysis and conclusions are based is
>> selected
>> exclusively by proportion of queries observed during the
>> collection period. Specifically, fewer than 1% (0.67% or 115K) of
>> the 17M
>> IP addresses (15.51M IPv4 and 1.56M IPv6) observed across the 2020
>> DITL are
>> considered for the analysis—those producing the most queries (90%
>> of the
>> DITL data); that excludes 99% of IP addresses from analysis. Because
>> the
>> set of “top talker” IP addresses is selected based only on the
>> volume of
>> traffic, it is severely biased and is not necessarily representative
>> of
>> resolvers world-wide. Those that query most—for whatever
>> reasons—are the loudest, and without further examination, it’s
>> hard to even
>> know why. The concern is not even just whether or not it is okay to
>> exclude non-top-talkers, but whether top-talkers are themselves an
>> appropriate representation. Other metrics that could be used to
>> quantify network representation for the selection process and/or
>> analysis
>> of top-talkers are missing from the analysis, including IP prefix
>> (e.g., /16, /24, /48, /64), ASN, and even IP version. See
>> also Concern 7 for more.
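The volume-based selection described above can be sketched as follows; the counts are invented, but they show how a couple of loud sources can satisfy a 90%-of-traffic cutoff while almost every IP address is excluded:

```python
# Illustrative sketch of "top talker" selection by cumulative query
# share. Counts are made up; this is the shape of the bias, not the
# actual NCAP methodology.
def top_talkers(query_counts, share=0.90):
    """Return the smallest set of IPs whose queries cover `share` of total."""
    total = sum(query_counts.values())
    selected, covered = [], 0
    for ip, n in sorted(query_counts.items(), key=lambda kv: -kv[1]):
        selected.append(ip)
        covered += n
        if covered >= share * total:
            break
    return selected

counts = {"10.0.0.1": 8000, "10.0.0.2": 1500, "10.0.0.3": 300, "10.0.0.4": 200}
# 8000 + 1500 already covers 90% of 10000 queries, so only 2 of the 4
# IPs are kept; nothing about the selection says they are representative.
loud = top_talkers(counts)
```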
>>
>> The analysis in Annex 2 is very interesting, but does not, by itself,
>> resolve this concern. The annex provides some very helpful lists of
>> the top queries from low-query-volume sources resulting in NXDOMAIN
>> responses, and there are some comparisons of the percentage of
>> queries resulting in NXDOMAIN responses out of the total number of
>> queries, but even those are difficult to assess without a full
>> behavioral analysis.
>>
>>
>> **Concern 2: Sample data refined to support the conclusion.
>>
>> While the original sample is already of questionable representation
>> (less
>> than 1% of IP addresses observed, based solely on query volume), that
>> dataset is further refined, according to the following text (i.e.,
>> from the
>> document):
>>
>> “On average, each RSI observed 96% of the top talkers that account
>> for
>> 90% of total traffic. That percentage drops to 94% when using the
>> 95th
>> percentile top talkers. Based on these findings, only the 90th
>> percentile top talkers were used for the remaining measurements in
>> this
>> study.”
>>
>> If the objective of the analysis is to quantify the overlap of
>> observed
>> query data across the root servers, and to ultimately determine
>> whether the
>> queries observed at one server are representative of the queries
>> observed across all samples, then refinement of sampled IP addresses
>> to
>> support that conclusion is inappropriate.
>>
>>
>> **Concern 3: Analysis based on biased sample of non-existent TLDs.
>>
>> The queries for non-existent TLDs, which result in NXDOMAIN responses
>> at
>> the root servers, are compared across the root servers, to see how
>> well
>> they are represented. However, like observed IP addresses (Concern
>> 1), the
>> non-existent TLDs are limited to those corresponding to the most
>> queries
>> observed—both the top 10,000 and the top 1,000. This is
>> independent of
>> querying IP address, ASN, and other aggregating features, which would
>> help
>> better understand the diversity of the queries for each non-existent
>> TLD. For example, it might be that the non-existent TLDs most
>> queried for
>> come from a small pool of IP addresses or networks, and others are
>> being
>> excluded simply because they are outside that sample.
>>
>>
>> **Concern 4: TLDs considered without QNAME context.
>>
>> While comparisons are made to measure the representativeness of
>> non-existent TLDs, one primary feature missing from the analysis is
>> the
>> QNAME. In all cases, the non-existent TLD is considered in
>> isolation, yet
>> QNAME context is shown in the analysis to be a significant
>> contributor to
>> quantifying name collisions potential (see Concern 7).
>>
>>
>> **Concern 5: Query count used as comparison between recursive server
>> and
>> root servers.
>>
>> Because of (negative) caching at recursive servers, it is expected
>> that
>> queries observed at the root servers for a given non-existent TLD
>> will be
>> fewer than those at a recursive resolver for that
>> same non-existent TLD. It is this very caching behavior that
>> makes the comparison of query count for a given non-existent TLD, as
>> observed by the root servers vs. a recursive resolver, an
>> apples-to-oranges
>> comparison. Yet the analysis includes a comparison of the top 1,000
>> non-existent TLDs, ranked by query count. Thus, no meaningful
>> conclusions
>> can be drawn from this comparison.
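The caching effect can be made concrete with a small simulation. The TTL and query pattern are invented; the point is only that a recursive resolver absorbs most repeat queries, so the root's count for the same non-existent TLD is structurally lower:

```python
# Minimal sketch of why negative caching makes root-vs-recursive query
# counts an apples-to-oranges comparison. The negative-cache TTL and
# query timing are invented for illustration.
def root_queries_seen(client_query_times, neg_ttl):
    """Count queries a root server sees when the recursive negatively
    caches the NXDOMAIN for `neg_ttl` seconds."""
    seen, cache_expires = 0, -1.0
    for t in sorted(client_query_times):
        if t >= cache_expires:      # cache miss: the query reaches the root
            seen += 1
            cache_expires = t + neg_ttl
        # cache hit: answered locally, invisible to the root
    return seen

# 100 client queries over 100 seconds, one per second.
times = list(range(100))
hits_at_root = root_queries_seen(times, neg_ttl=30)  # the root sees only 4
```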
>>
>>
>> **Concern 6: Unique IP addresses used as comparison between recursive
>> server and root servers.
>>
>> Study 2 includes source diversity when comparing the query counts for
>> non-existent TLDs. There is certainly more value in investigating IP
>> source diversity when considering the query counts for non-existent
>> TLDs
>> than considering query counts alone (Concern 5). However, it is
>> expected
>> that recursive resolvers serve a very different client base than
>> authoritative servers, specifically the root servers. Whereas the
>> former might expect queries from stub resolvers, the latter might
>> expect queries from recursive resolvers. In such a case, analyzing
>> client
>> IP addresses independently of one another leaves out significant
>> context, such as the diversity of IP prefixes or ASNs from
>> which queries arrive. A large number of IP addresses from the same
>> IP
>> prefix or ASN might be responsible for the queries associated with
>> several
>> “top” non-existent TLDs, excluding non-existent TLDs that
>> might have non-trivial presence but do not have the top IP address
>> diversity. See also Concern 7.
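Aggregating source IPs by prefix before counting, as suggested here, might look like the following sketch (example addresses only; a real analysis would also group by ASN):

```python
# Sketch of counting distinct /24 prefixes rather than raw IPs per
# non-existent TLD. Addresses are documentation-range examples.
import ipaddress

def prefix_diversity(ips_by_tld, prefixlen=24):
    """Count distinct /`prefixlen` prefixes per non-existent TLD."""
    result = {}
    for tld, ips in ips_by_tld.items():
        prefixes = {ipaddress.ip_network(f"{ip}/{prefixlen}", strict=False)
                    for ip in ips}
        result[tld] = len(prefixes)
    return result

# "corp" has 4 unique IPs but they all share one /24, so its prefix
# diversity (1) is lower than "home" (2), despite fewer raw IPs.
ips = {"corp": ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"],
       "home": ["192.0.2.5", "198.51.100.7"]}
div = prefix_diversity(ips)
```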
>>
>>
>> **Concern 7: Disagrees with findings from “Case Study of Collision
>> Strings”.
>>
>> The document “Case Study of Collision Strings”, also written in
>> connection
>> with NCAP Study 2, contains the following findings:
>>
>> 1. “A relatively small number of origin ASNs account for the
>> vast
>> majority of query traffic for .CORP, .HOME, and .MAIL. In all cases
>> roughly
>> 200 ASNs make up nearly 90% of the volume” (section 4.1.5).
>>
>> 2. “Label analysis provides a unique observational context into
>> the
>> underlying systems, networks, and protocols inducing leakage of DNS
>> queries
>> to the global DNS ecosystem. Understanding the diversity of labels
>> can help
>> provide a sense of how broadly disseminated the leakage is throughout
>> the
>> DNS” (section 4.2.1).
>>
>> 3. “The .CORP SLDs seen at both A and J (approximately 16
>> thousand) is
>> almost equal to those seen at A-root alone, but J-root sees over
>> 30,000
>> .CORP SLDs that A-root does not see” (section 4.3.1).
>>
>> 4. “Across all names studied, while A and J saw much in common,
>> there
>> was a non-negligible amount of uniqueness to each view. For example,
>> A and
>> J each saw queries from the same 5717 originating ASNs, but J saw
>> 2477 ASNs
>> that A didn't see and A saw 901 that [J] didn't see” (section 4.3.2).
>>
>> 5. “A more intensive and thorough analysis would include other
>> root
>> server vantage points to minimize potential bias in the A and J
>> catchments”
>> (section 5.2).
>>
>> 6. “Additional measurement from large recursive resolvers would
>> also
>> help elucidate any behaviors masked by negative caching and the
>> population
>> of stub resolvers” (section 5.2).
>>
>>
>> These findings emphasize the following points, which are at odds with
>> the
>> "Perspective" document:
>>
>> - Including ASN (and IP prefix) in an analysis can make a
>> significant difference in the overall diversity associated with
>> observed queries.
>>
>> - There is significance in the context provided by the QNAME,
>> not
>> only in measuring diversity, but also in query representativeness
>> across
>> root servers.
>>
>> - Root servers—even just A and J—have a non-negligible
>> amount of
>> uniqueness that is not captured, or even addressed, in this document.
>>
>> - Multiple root servers provide a broader perspective on
>> potential name collisions than any single one.
>>
>> - The population of stub resolvers should be considered in the
>> analysis of large recursive resolvers.
>>
>>
>>
>> On Jan 24, 2022, at 2:43 PM, Thomas, Matthew via NCAP-Discuss <
>> ncap-discuss at icann.org> wrote:
>>
>> NCAP DG,
>> As decided during our last meeting on 19 January 2022, we pushed the
>> start of
>> the public comment period for “A Perspective Study of DNS Queries
>> for
>> Non-Existent Top-Level Domains” to this Thursday, 27 January 2022,
>> in order
>> to accommodate some last-minute questions. Additionally, as
>> previously
>> announced, today ends the comment period for the release of this
>> document.
>> Attached is the FINAL DRAFT version of “A Perspective Study of DNS
>> Queries
>> for Non-Existent Top-Level Domains”. If you have any objections to
>> this
>> document being released for public comment please reply to this
>> message on
>> the list. The objection period will close at the end of our weekly
>> meeting
>> on Wednesday, 26 January 2022. Comments that do not substantially
>> change
>> our stated conclusions will be captured and considered after the
>> public
>> comment period when we will be reviewing all public comments
>> received.
>> A view only version of the document is here:
>> https://docs.google.com/document/d/1lMoicMusWGB0u3Jpqe0CB7kJHw_gR-alIXZErjxNEws/view#
>> Matt Thomas
>> <Last Call A Perspective Study of DNS Queries for Non-Existent
>> Top-Level Domains.pdf>
>> _______________________________________________
>> NCAP-Discuss mailing list
>> NCAP-Discuss at icann.org
>> https://mm.icann.org/mailman/listinfo/ncap-discuss
>>
>> _______________________________________________
>> By submitting your personal data, you consent to the processing of
>> your
>> personal data for purposes of subscribing to this mailing list in
>> accordance
>> with the ICANN Privacy Policy (https://www.icann.org/privacy/policy)
>> and
>> the website Terms of Service (https://www.icann.org/privacy/tos). You
>> can
>> visit the Mailman link above to change your membership status or
>> configuration, including unsubscribing, setting digest-style delivery
>> or
>> disabling delivery altogether (e.g., for a vacation), and so on.
>>
>>
>>
>
>
>
> --
> Thomas Barrett
> President
> EnCirca, Inc
> +1.781.942.9975 (office)
> 400 W. Cummings Park, Suite 1725
> Woburn, MA 01801 USA