[ksk-rollover] Description of my analysis of the too-many-KSK queries problem

Wed Apr 3 20:56:14 UTC 2019

TL;DR Background: After the revoke bit was set, all the roots suddenly
started receiving significant increases in DNSKEY queries.  No one knew
why.

TL;DR below: I was able to start a bind resolver with specific DNSSEC
config settings that caused 6+ outgoing requests for root/DNSKEYs for
every incoming request. Then it would go silent, and then it would start
over-querying again...

----

In the end, shortly before the revoked key was removed from the
published keys list on March 22, I took a quick look into possible
causes.  I discussed some of my findings, without conclusions, at the
DNSSEC workshop [1], as did Duane.

After the workshop concluded, I noticed that a bunch of addresses
sending lots of queries were from the CS department at Purdue.  We (ISI)
reached out to them asking for help in discovering what the problem was.
They discovered that one of the CS labs, which made use of VMs, had left
a bunch of VMs running that had DNS resolvers and authoritative servers
on them.  They were kind enough to send me a copy of the lab setup
instructions (which included bind running on Lubuntu), which allowed me
to reproduce the situation.

The key to the problem turns out to be that they were specifically using
these settings in the bind config file:

    //dnssec-validation auto;
    dnssec-enable no;

The first line clearly leaves the validation state at the default
behavior (i.e., it was explicitly commented out in the file, and not set
to a particular state).  I believe the default state is "auto", though I
don't know for which versions of bind this is true.

The second line, dnssec-enable, was deliberately set to "no".
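
For anyone wanting to check their own config for this same combination,
here is a minimal sketch in Python.  It is a naive text scan, not a real
parser for the bind config syntax (it ignores quoting, includes, and
views), and the function names are made up for illustration.  It only
flags the pattern described above: dnssec-enable set to "no" with no
active dnssec-validation statement.

```python
import re


def strip_comments(text):
    """Drop /* ... */, // and # comments (naive: does not handle quoted strings)."""
    text = re.sub(r"/\*.*?\*/", "", text, flags=re.S)
    return re.sub(r"(//|#).*", "", text)


def has_reported_combination(conf_text):
    """True if dnssec-enable is 'no' and dnssec-validation is left unset."""
    active = strip_comments(conf_text)
    enable_no = re.search(r"\bdnssec-enable\s+no\s*;", active) is not None
    validation_set = re.search(r"\bdnssec-validation\s+\S+\s*;", active) is not None
    return enable_no and not validation_set


if __name__ == "__main__":
    sample = """
    options {
        //dnssec-validation auto;
        dnssec-enable no;
    };
    """
    print(has_reported_combination(sample))
```

A match only says the config resembles the one from the lab; as noted
below, the behavior itself was intermittent.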

With these two settings adjusted in my default bind.conf file from
Fedora (I didn't have Lubuntu to test against), and a bind "managed
keys" file that contained only the KSK2010 key and the dlv.isc.org key,
I started up bind (bind-9.11.5-4.P4 from Fedora) and tcpdump in
parallel.

The results were (frustratingly) intermittent.  Sometimes there was an
issue, and sometimes not.  In the end the managed-keys file was updated
to contain just the KSK2017 key.

When there was an issue, every query I sent from dig at localhost to the
resolver at localhost caused 6-7 outgoing DNSKEY queries to the root
servers (I think each request had to be for a new TLD to trigger this).
After a number of queries, negative caching seemed to finally kick in, as
the resolver would stop sending DNSKEY queries for a while.  Then, on
the order of minutes later, it would start sending DNSKEY queries in
bulk again for every incoming request.  And as I said, sometimes
the resolver failed to even enter the "broken" state.
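
To put a number on the amplification from a capture, a sketch along
these lines could tally incoming queries against outgoing root DNSKEY
queries.  It assumes tcpdump's usual one-line text output for DNS
queries and a resolver listening on 127.0.0.1; the function name and
the regular expression are illustrative assumptions, not something
taken from the actual captures.

```python
import re
import sys

# Assumed shape of a tcpdump text line for a DNS query, for example:
#   "20:56:14.1 IP 192.0.2.10.51234 > 198.41.0.4.53: 5678 [1au] DNSKEY? . (28)"
# Responses are skipped because their destination port is not 53.
QUERY_RE = re.compile(
    r" > (?P<dst>[0-9a-fA-F.:]+)\.53: .*? (?P<qtype>[A-Z0-9]+)\? (?P<qname>\S+)"
)


def summarize(lines, resolver_addr="127.0.0.1"):
    """Return (incoming queries to the resolver, outgoing root DNSKEY queries)."""
    incoming = 0
    outgoing_root_dnskey = 0
    for line in lines:
        m = QUERY_RE.search(line)
        if m is None:
            continue
        if m["dst"] == resolver_addr:
            # A client (e.g. dig) asking the local resolver.
            incoming += 1
        elif m["qtype"] == "DNSKEY" and m["qname"] == ".":
            # The resolver asking an upstream server for the root DNSKEY set.
            outgoing_root_dnskey += 1
    return incoming, outgoing_root_dnskey


if __name__ == "__main__":
    inc, out = summarize(sys.stdin)
    print(f"incoming={inc} outgoing_root_dnskey={out}")
```

Piping the text output of something like "tcpdump -n -i any port 53"
into the script would give the two counts; running it over separate
time windows would show the quiet periods between bursts.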

Evan, at the IETF, reported in a few meetings and conversations that
they had discovered a bug in bind previously that would exhibit this
roll-over-and-die type behavior but that it was only present in
out-of-date versions of bind (9.10 and below I believe he stated).  I
appreciate the bind team looking into the problem.  Unfortunately, I
can't reconcile that with my bind 9.11 on Fedora, which seemed to have
the same issue.  My colleague Robert Story reproduced the issue as
well and had tried Fedora 27-29, though I don't remember which versions
had the issue.

[1]: https://icann.zoom.us/recording/play/OYJ4R7IQCxnF5Rw6Kk_nnJQqrp4037W7eS9eeJegKbC-CoPrxa9Z2YpD-594FNJR?startTime=1552439865000

-- 
Wes Hardaker
USC/ISI