[rssac-caucus] FOR REVIEW: Harmonizing the Anonymization of Queries to the Root

Wed Feb 21 22:34:39 UTC 2018

On Feb 14, 2018, at 6:21 AM, John Bond <john.bond at icann.org> wrote:
>> 1.2 Terminology 
> I think we should just reference RSSAC026 instead of repeating the definition of RSO in this document

We normally quote other documents for terminology, and always give a reference in a footnote.

>> 2. Introduction to Anonymization
> Duane already commented on other identifiable information in DNS packets; however, I wanted to specifically highlight EDNS Client Subnet and suggest that anything that works on the IP source address should also work on the EDNS Client Subnet, if present.

Your request caused me to go back to the RSSAC request to the RSSAC Caucus for this work from a year ago; see the attached file. That request says "source IP address of the queries" a few times. (I also realized that we were using the wrong title for this document, and will fix it based on the RSSAC request.)

Having said that, I will add text indicating that whatever procedure is used for anonymizing source addresses can also be used for other addresses.
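
As a rough illustration of applying one procedure to more than one address field, here is a minimal sketch. The record layout (`src`, `ecs_address`, `ecs_prefix_len`) and both function names are hypothetical, and the truncated HMAC is only a stand-in, not one of the methods described in the document.

```python
import hashlib
import hmac
import ipaddress


def anonymize_address(addr: str, key: bytes) -> str:
    """Stand-in anonymizer (an assumption): truncated HMAC-SHA256 of the address."""
    ip = ipaddress.ip_address(addr)
    digest = hmac.new(key, ip.packed, hashlib.sha256).digest()
    # Keep as many bytes as the original address had (4 for IPv4, 16 for IPv6).
    return str(type(ip)(digest[:len(ip.packed)]))


def anonymize_record(record: dict, key: bytes) -> dict:
    """Apply the same procedure to the source address and, if present,
    to the EDNS Client Subnet address. Field names are hypothetical."""
    out = dict(record)
    out["src"] = anonymize_address(record["src"], key)
    if record.get("ecs_address") is not None:
        out["ecs_address"] = anonymize_address(record["ecs_address"], key)
    return out


key = b"example key, not a real secret"
rec = {"src": "192.0.2.1", "ecs_address": "192.0.2.1", "ecs_prefix_len": 24}
anon = anonymize_record(rec, key)
```

Because both fields go through the same function with the same key, an address that appears as both the source and the client subnet maps to the same anonymized value.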

>> 2.1 Benefits and Drawbacks of Harmonization of Anonymization
> When discussing the drawbacks, the document only concerns itself with key distribution issues and doesn't address any of the privacy concerns.  It seems to assume that the datasets have to be harmonised so research can continue.  This may be by design; however, I think that the document should at least mention that this harmonisation of data does make it easier to personally identify individuals.  IANAL, but anonymization of data in this manner may not be enough to prevent it from being considered personally identifiable under regimes such as GDPR, especially given that third parties not under the jurisdiction of the EU have access to the shared key(s).  If I were to weigh privacy against the ability to research, then the following options would seem to be worth considering, ordered with the highest level of privacy (and most difficult to research) first.
> 
> 
> 1) remove IP addresses completely
> 2) Each operator encrypts the IP address with their own key and rotates the salt every x minutes
> 3) Each operator encrypts the IP address with their own key
> 4) Operators encrypt the IP address with a shared key and rotate the salt every x minutes
> 5) Operators encrypt the IP address with a shared key
> 6) no change
> 
> 
> In my mind, options 2 and 4 are worth considering, as they would allow researchers to track patterns and see data shifting, but would make it difficult to track an individual user across the entire time series.  I'm not a researcher, so I don't know what impact this would have, but I think it adds a lot to the privacy of the data set.  For instance, in the schemes suggested, if you see that IP 192.0.2.1 (or whatever it is hashed to) always goes to smtp.johnbond.org, then you can probably assume that IP 192.0.2.1 belongs to me.  If IPs only ever have a one-to-one mapping, then someone could track my usage through the entire time series.  It makes little difference that 192.0.2.1 is not my real IP address and has been anonymised.
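
The salt rotation in options 2 and 4 above might look roughly like the following sketch. The function name, the 600-second default window, and the use of truncated HMAC-SHA256 are all assumptions made for illustration; none of them comes from the document under review.

```python
import hashlib
import hmac
import ipaddress


def anonymize_with_rotating_salt(addr: str, key: bytes, timestamp: float,
                                 window_seconds: int = 600) -> str:
    """Hypothetical sketch: the salt is the index of the time window, so the
    mapping for an address changes every `window_seconds`."""
    ip = ipaddress.ip_address(addr)
    window = int(timestamp) // window_seconds
    salt = window.to_bytes(8, "big")
    digest = hmac.new(key, salt + ip.packed, hashlib.sha256).digest()
    # Truncate to the address length so the result is still an IP address.
    return str(type(ip)(digest[:len(ip.packed)]))


key = b"example key, not a real secret"
same_window_a = anonymize_with_rotating_salt("192.0.2.1", key, 1000)
same_window_b = anonymize_with_rotating_salt("192.0.2.1", key, 1100)
next_window = anonymize_with_rotating_salt("192.0.2.1", key, 1300)
```

Within one window the address maps consistently, so short-term patterns remain visible; across windows the mapping changes, which is what makes tracking one client through the whole time series harder.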

Nothing in this document is meant to be advice about whether the type of anonymization is "good enough" for any particular purpose. If RSSAC wants such a document, they need to ask for something different.

>> 3.2 Mixing Bit-By-Bit: Cryptopan
> The Cryptopan paper acknowledges that, due to the one-to-one mapping, it is susceptible to known-plaintext attacks[1], and some services will be trivial to identify regardless of how we anonymise them.  I wonder if we could get the paper's authors to re-run their attack scenarios on a Cryptopan-encrypted DITL and see how much of the data could be de-anonymised.

The fact that Cryptopan is prefix-preserving directly leads to it being useful for approximately determining other addresses in the same prefix.
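
To make the prefix-preserving property concrete, here is a toy sketch of the general bit-by-bit construction: each output bit is the input bit XORed with a keyed function of the bits before it. This is not the Cryptopan implementation; it substitutes HMAC-SHA256 for the block cipher, and the function names are hypothetical.

```python
import hashlib
import hmac
import ipaddress


def prefix_preserving_anonymize(addr: str, key: bytes) -> str:
    """Toy prefix-preserving mapping (assumption: HMAC as the keyed function)."""
    ip = ipaddress.ip_address(addr)
    nbits = ip.max_prefixlen
    value = int(ip)
    result = 0
    for i in range(nbits):
        # The i most significant bits of the original address.
        prefix = value >> (nbits - i)
        msg = bytes([i]) + prefix.to_bytes(16, "big")
        # One pseudorandom bit that depends only on the key and that prefix.
        flip = hmac.new(key, msg, hashlib.sha256).digest()[0] >> 7
        original_bit = (value >> (nbits - 1 - i)) & 1
        result = (result << 1) | (original_bit ^ flip)
    return str(type(ip)(result))


def common_prefix_len(a: str, b: str) -> int:
    """Number of leading bits two addresses of the same family share."""
    x, y = ipaddress.ip_address(a), ipaddress.ip_address(b)
    return x.max_prefixlen - (int(x) ^ int(y)).bit_length()


key = b"example key, not a real secret"
a = prefix_preserving_anonymize("192.0.2.1", key)
b = prefix_preserving_anonymize("192.0.2.200", key)
shared = common_prefix_len(a, b)
```

The two inputs share a 24-bit prefix, and so do the two outputs. That is why learning the mapping of one address in a prefix reveals the leading bits of every other anonymized address in the same prefix.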

>> 3.3 ipcrypt
> The one-to-one mapping also means it is susceptible to a known-plaintext attack, but to what severity is unknown; however, the lack of prefix preservation would likely make any attack harder [than Cryptopan attacks].

A known-plaintext attack returns the key used, or allows the attacker some other way of de-anonymizing other addresses. That is not possible in the methods other than Cryptopan. However, if I can inject a query using a known source address to a particular root using an identifiable QNAME, I can find the result in the anonymized PCAP. What is important is that an attacker cannot use this to then determine the random key that was used.
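
The injection scenario above can be sketched as a simple search over anonymized records. The record layout, the marker name, and the function name are hypothetical; the sketch shows only that the attacker learns the anonymized form of an address they already know, and nothing about the key.

```python
def find_anonymized_sources(records, marker_qname):
    """Return the anonymized source addresses of every query whose QNAME
    matches the attacker's marker name (case-insensitive, trailing dot ignored)."""
    wanted = marker_qname.lower().rstrip(".")
    return {r["src"] for r in records
            if r["qname"].lower().rstrip(".") == wanted}


# Hypothetical anonymized capture: addresses here are already anonymized.
records = [
    {"src": "203.0.113.77", "qname": "example.org."},
    {"src": "198.51.100.9", "qname": "Marker-7f3a.example."},
    {"src": "198.51.100.9", "qname": "www.example.net."},
]

# The attacker sent "marker-7f3a.example" from a source address they know.
hits = find_anonymized_sources(records, "marker-7f3a.example.")
linked = [r["qname"] for r in records if r["src"] in hits]
```

With a one-to-one mapping, every other query from the same anonymized address can then be linked to the known source, which is the limit of what this observation gives the attacker.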

> 
>> 4 ASN and recommendation 3
> I'm strongly opposed to this, as it would make de-anonymising the information and the known-plaintext attacks mentioned above much simpler to execute.

Are you suggesting that we remove the recommendation (which Geoff Huston made) or simply make it clear that it is optional?

--Paul Hoffman

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 20170203-work-statement-anonymization.pdf
Type: application/pdf
Size: 64339 bytes
Desc: not available
URL: <http://mm.icann.org/pipermail/rssac-caucus/attachments/20180221/d344a416/20170203-work-statement-anonymization.pdf>