[rssac-caucus] [Ext] Handing the anonymization document off to RSSAC

Fri Apr 13 20:04:38 UTC 2018

Apologies -- for top-posting a response, and for being late to the
"party"...

If I understand correctly, the root of the issue is the privacy concern around
data that contains both source IP addresses and DNS QNAMEs in queries. And
specifically being able to recover original data from published anon...
data?

The document talks (only) of anonymizing (a11g) the source IP (v4 and v6)
addresses. Has there been any discussion of instead doing something similar
to the QNAMEs?

The underlying problem(s) still exist, in terms of use of "secrets" for
doing the coordinated a11g of the data between a largish set of operators.
However, I think the QNAMEs, being a much more diverse set of data, have
better privacy characteristics after a11g.
(And for ease of processing and correlating, I think the QNAME portion that
should be handled via a11g is the stuff excluding the TLD portion of the
QNAME. Possibly map NULL to NULL for queries doing QNAME minimization too?)

To be clear, what I'm talking about is KEEPING the original source
addresses, and ONLY a11g the QNAMEs. Perhaps the use of a daily nonce plus
hash works. E.g. nonces shared or used only when DITL sets are
published/shared, and the nonces subsequently destroyed? Maybe centralizing
the nonce/hashing, and maybe using a suitable secured processing facility,
and/or "certified" set-ups if bigger players want to offload some of the
effort, would address any of the residual security issues/concerns on the
published data. (NTP sync is obviously presumed if the correlation is to be
done with daily nonces.)

Obviously all of the providers of data would generally still keep the
originals, but there would not be a need for that to be kept in any common
(centralized) location shared by the DITL participants.

Thoughts?

Does this do a better job than the IP a11g?

Brian

On Thu, Apr 12, 2018 at 2:21 PM, Warren Kumari <warren at kumari.net> wrote:

> Birthday collisions make my brain hurt -- I got into a shouting match
> once with Dan Harkins where I was claiming that with 32 bits of random
> MAC address and 2000 stations you would basically never have a
> collision; he disagreed...
>
> In a fit of pique I wrote a small AppEngine app to prove him wrong --
> and did exactly the opposite -- with 32 bits of random and 2000
> stations you will get a collision roughly once every 2150 times - app
> is here if people want to play:
> http://mac-collision-probability.appspot.com/calculate
> We had a similar discussion on IPv6 - slightly tweaked code here:
> http://ipv6-collision-probability.appspot.com/calculate
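
The "once every 2150 times" figure above can be checked with the standard
birthday-problem calculation. This is a sketch, not the code behind the
AppEngine apps (which is not shown in this thread); the function name is
made up for illustration.

```python
import math


def collision_probability(bits: int, stations: int) -> float:
    """Probability that at least two of `stations` uniform random
    `bits`-bit values are equal."""
    space = 2 ** bits
    # log P(all distinct) = sum over i of log(1 - i/space);
    # log1p/expm1 keep precision when i/space is tiny.
    log_p_unique = sum(math.log1p(-i / space) for i in range(stations))
    return -math.expm1(log_p_unique)


p = collision_probability(32, 2000)
print(f"p = {p:.6f}, i.e. about 1 in {1 / p:.0f}")
```

For 32 bits and 2000 stations this comes out close to 1 in 2150, consistent
with the number quoted above.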
>
> Sometime I'll tweak this to do something other than bitlengths, and to
> report how many collisions there would be...
>
> Funnily enough, Wes and I were driving to the San Jose NANOG a few
> months back, and stopped in a niceish restaurant for dinner. There
> were roughly 30 other people -- and while we were there there were 2
> groups of people celebrating birthdays (cake, singing, etc). It was
> only after we left that Wes pointed out that this was the archetypal
> Birthday Paradox example :-) [0].
>
> W
> [0]: Yes yes, I know that this isn't representative - people go out to
> dinner to celebrate which biases the results, some other people might
> also have been having birthdays and didn't have cake and singing, the
> groups who were (obviously) celebrating may have had their birthdays a
> few days back / in the future, etc. Great, now you've ruined it, hope
> you are happy...
>
> On Wed, Apr 11, 2018 at 11:54 PM, John Heidemann <johnh at isi.edu> wrote:
> >
> > (about the document at
> > https://docs.google.com/document/d/1jpFcEjlwd11kqbsd1oAUf2Hq3gNskqN595RdmvyKkU8/edit#
> > )
> >
> > On Thu, 12 Apr 2018 02:19:15 -0000, Paul Hoffman wrote:
> >>On Apr 11, 2018, at 11:06 AM, John Heidemann <johnh at isi.edu> wrote:
> > ...
> >>> - section 4.1: the analysis of collisions was for an average day.
> >>>  Collisions are dramatically higher for worst cases, and that's when
> >>>  accurate counts most matter for some research.  I suggest this text
> >>>  there to address this gap:
> >>>
> >>>          (Although the birthday problem has few collisions when the
> >>>          number of active IPv4 addresses is small, it is much worse
> >>>          when the number is large.  For example, reports of the Nov. 30,
> >>>          2015 DDoS attack on the roots indicate that roots saw about
> >>>          891k unique addresses, and with n=900k, there are 170M
> >>>          collisions.  While many of these addresses were spoofed, this
> >>>          count represents one factor in the cost of some DDoS defenses,
> >>>          so accuracy is important.)
> >>
> >>See the comment in the text. Those numbers make no sense. How can you
> >>get 20x more collisions than there are values?
> >
> > You're right.  I went back to the source and the right number is 895M
> > unique addresses, not 891k.  With n=900M there are 170M expected
> > collisions.  Thanks for catching this.
> >
> > (The formula is in the text, so anyone can check the math.  The point
> > is collisions grow precipitously as the number of addresses approaches a
> > substantial fraction of the total space.)
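
The formula referred to above is in the document, not in this message, so
the following is only a guess at it: counting the expected number of
addresses that share their anonymized value with at least one other address,
assuming uniform mapping into a 32-bit space. That assumption happens to
reproduce roughly 170M for n=900M; the function name is hypothetical.

```python
import math


def expected_colliding(n: int, bits: int = 32) -> float:
    """Expected number of inputs (out of n) whose output value is shared
    with at least one other input, for uniform random `bits`-bit outputs."""
    space = 2 ** bits
    # P(a given input shares its value with one of the other n-1 inputs)
    #   = 1 - (1 - 1/space)^(n-1)
    p_shared = -math.expm1((n - 1) * math.log1p(-1 / space))
    return n * p_shared


for n in (891_000, 895_000_000, 900_000_000):
    print(f"n = {n:>11,}: about {expected_colliding(n):,.0f} colliding addresses")
```

Under this assumption the count is small for n near 900k and grows to a
large fraction of n as n approaches the size of the space, which is the
point being made above.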
> >
> >    -John
> > _______________________________________________
> > rssac-caucus mailing list
> > rssac-caucus at icann.org
> > https://mm.icann.org/mailman/listinfo/rssac-caucus
>
>
>
> --
> I don't think the execution is relevant when it was obviously a bad
> idea in the first place.
> This is like putting rabid weasels in your pants, and later expressing
> regret at having chosen those particular rabid weasels and that pair
> of pants.
>    ---maf