One of my racks, because who doesn't have two racks of network gear in their basement?
My home network is way more complicated than it needs to be, but I can account for everything in it as I personally set it all up. As a longtime UNIX guy who has been employed at networking companies for a few decades, I have what some might call an unreasonable amount of hardware and software infrastructure in place. Both to complement that, and to help me deal with it, I run a pair of ExtraHop virtual appliances at home (an EDA and an EXA).
Things go wrong
The other day I suddenly noticed that I couldn't access resources on my local network from my laptop. Well, that wasn't quite true: existing connections were just fine, I just couldn't open anything new... no, wait... I could, just not by hostname.
It turns out, rather than using either of my local resolvers (which have a DNS zone for my private network), my laptop was using 4.2.2.2 for DNS. Now, I've intentionally used that plenty of times before (I even have a special exception included in my firewall config allowing me to reach it), so it was pretty familiar to me - but neither of my DHCP servers should have been handing that out... so why did it change suddenly? I had a vague memory of this happening once before a few months ago. I recall I just refreshed my DHCP lease and things went back to normal - I probably even thought "well, if that happens again I'll care about it then."
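For anyone who wants to automate that "which resolver am I actually using?" check, here is a minimal Python sketch. It assumes a resolv.conf-style file (on some systems the active resolvers live elsewhere), and the function name and the local resolver addresses are made up for illustration.

```python
def unexpected_nameservers(resolv_conf_text, expected):
    """Return nameservers from resolv.conf-style text that are not in `expected`."""
    found = []
    for line in resolv_conf_text.splitlines():
        parts = line.split()
        # Only lines of the form "nameserver <address>" matter here;
        # comments and other directives are skipped.
        if len(parts) >= 2 and parts[0] == "nameserver":
            found.append(parts[1])
    return [ns for ns in found if ns not in expected]


# Hypothetical addresses standing in for the two local resolvers.
LOCAL_RESOLVERS = {"192.168.1.2", "192.168.1.3"}

sample = "# written by the DHCP client\nnameserver 4.2.2.2\n"
print(unexpected_nameservers(sample, LOCAL_RESOLVERS))
```

On a real machine you would pass in the contents of /etc/resolv.conf rather than the sample string.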
Well, it was happening again.
ExtraHop to the rescue!
Luckily though, I just so happened to have recently upgraded my ExtraHop Discover Appliance to v5.0 (which added support for DHCP amongst some other nifty features) and installed the new DHCP dashboard that one of the engineers on my team made, so I figured I'd take a look at it. I immediately noticed that I had a third DHCP server on my network: a Debian system that was not a member of my failover pair of DHCP servers. Looking at the records for its DHCP responses in my ExtraHop Explore Appliance, I saw that it was handing out 4.2.2.2 as the nameserver and had given such a response to my laptop a few minutes earlier. Then the pieces started to fall into place: the mtime for /etc/dhcp/dhcpd.conf on the rogue DHCP server was about a year old, from when I was living somewhere else and that system was being used as a router as well as for some light server duties such as DHCP (did I mention I like to mess with things unnecessarily?). Apparently back then I wasn't making use of local DNS and was sending DNS clients directly to 4.2.2.2, and I didn't clean up after myself once I decommissioned the system from those duties. Totally my fault.
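The check the dashboard made obvious boils down to "did any DHCP response come from a server I don't recognize?" Here is a rough Python sketch of that idea. It is not how ExtraHop does it; the record layout, the function name, and every address below are hypothetical stand-ins for whatever your own capture or logging produces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DhcpResponse:
    server_ip: str      # address of the server that sent the response
    client_mac: str     # client the response was addressed to
    dns_servers: tuple  # nameservers handed out in the response


def find_rogue_responses(responses, known_servers):
    """Return every response sent by a server outside the known set."""
    return [r for r in responses if r.server_ip not in known_servers]


# Hypothetical addresses: a failover pair plus one forgotten box.
KNOWN = {"192.168.1.2", "192.168.1.3"}
records = [
    DhcpResponse("192.168.1.2", "aa:bb:cc:00:00:01", ("192.168.1.2", "192.168.1.3")),
    DhcpResponse("192.168.1.50", "aa:bb:cc:00:00:01", ("4.2.2.2",)),
]

for r in find_rogue_responses(records, KNOWN):
    print(f"rogue DHCP server {r.server_ip} offered DNS "
          f"{', '.join(r.dns_servers)} to {r.client_mac}")
```

The allow-list approach is deliberately dumb: anything answering DHCP that isn't on the list is worth a look, whatever it happens to be handing out.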
But why wasn't this causing constant chaos? Well, because that system was on a different switch (in a different rack - because who doesn't have two racks in their basement?), DHCP requests and responses had to traverse an extra L2 hop, so it generally tended to respond after the real servers. It occasionally did win the race to respond (and my ExtraHop appliances showed me exactly when that happened), but only very rarely. It had basically been sitting there on my network for months, trying desperately to respond to every DHCP request and occasionally causing trouble.
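To put a number on "only very rarely," you could group responses by DHCP transaction and count which server answered first. The Python below is a sketch under assumptions: the (timestamp, transaction ID, server) tuples and the sample values are invented for illustration, not pulled from my appliances.

```python
from collections import Counter


def first_responders(responses):
    """Count how many transactions each server answered first.

    `responses` is an iterable of (timestamp, transaction_id, server_ip).
    """
    earliest = {}
    for ts, xid, server in responses:
        # Keep only the earliest response seen for each transaction.
        if xid not in earliest or ts < earliest[xid][0]:
            earliest[xid] = (ts, server)
    return Counter(server for _, server in earliest.values())


# Hypothetical observations: the rogue server wins one race out of three.
observed = [
    (0.010, 1, "192.168.1.2"), (0.014, 1, "192.168.1.50"),
    (0.020, 2, "192.168.1.50"), (0.021, 2, "192.168.1.2"),
    (0.005, 3, "192.168.1.3"), (0.009, 3, "192.168.1.50"),
]

wins = first_responders(observed)
for server, count in wins.most_common():
    print(f"{server} answered first in {count} transaction(s)")
```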
That is, until I turned the powers of ExtraHop on it and squashed it. A quick apt-get purge isc-dhcp-server and that problem was gone.
Epilogue
I immediately messaged Heath, the engineer who built the DHCP dashboard, to let him know that I had just used it to locate a rogue DHCP server on my own network - a mere day or two after he published it! I almost couldn't believe it myself.
I was already a believer in the power of wire data, but this really drove it home. This was a system that I wasn't monitoring with synthetic transactions, that didn't have an agent installed on it for me to collect data from, and that wasn't feeding into my centralized logging. I was blind to it without wire data.