
DNS misconfiguration can happen to anyone - the question is how fast can you detect it?

Learn from a real-world site outage: the need for proactive monitoring across IPv4/IPv6, DNS accuracy, and why APM tools alone can miss critical micro-outages.

Even after decades of building web applications and troubleshooting live production issues, the thrill of figuring out why some random website is failing never fades.

Last week, a colleague shared a link to ONUG’s website about their upcoming event in NYC this fall.

I clicked the link and waited, and waited, and waited for the page to load. It never did. Finally, after about 30 seconds, Chrome greeted me with “ERR_CONNECTION_TIMED_OUT.”


Initial troubleshooting

My first move was to try a few other sites (the quickest way to confirm you are connected when you are already in the browser), and every website I could think of was working fine. The link was also working for some of my colleagues, which made it even more interesting.

I was quite surprised by the issue; you wouldn’t expect it from an organization with “networking” in its name - ONUG stands for Open Networking User Group and focuses on IT leaders at large enterprises. So I was intrigued: what could be so tricky that even ONUG would be impacted?

Curious as to what was going on, I launched Developer Tools and went back to ONUG’s site. It again failed to connect, and although Chrome tried to automatically reload the URL, it again greeted me with “This site can’t be reached - onug.net took too long to respond.”

Instinctively, I clicked on one of the failing requests in Developer Tools to see what IP address the browser was connecting to, but saw no IP.

Unfortunately, this reminded me that Chrome doesn’t show the IP address in Developer Tools unless it received some response from the server. A sad reminder that 11 years ago we asked for a Chromium enhancement to add IP addresses to Developer Tools and its APIs for exactly these cases, and now, almost 100 Chrome versions later, a user still cannot see which IP address the browser couldn’t connect to.
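Since the browser won’t show it, the quickest workaround is to resolve the hostname yourself and list every address the operating system would hand to Chrome. A minimal sketch in Python, using only the standard library (the hostname and port here are just examples):

    import socket

    def resolve_all(host, port=443):
        # List every IPv4 and IPv6 address the OS resolver returns for this host,
        # i.e. the same candidate set the browser picks from.
        for family, _, _, _, sockaddr in socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP):
            label = "IPv6" if family == socket.AF_INET6 else "IPv4"
            print(f"{label}: {sockaddr[0]}")

    resolve_all("onug.net")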


But hey, it was nice of Chrome to assume that reloading the request over and over would somehow fix the problem of connecting to the server! (Recall when the solution to any failure was “reboot the machine.”)

Diving deeper with ping & traceroute

I moved on to the next handy tools on my desktop: ping and traceroute. But everything looked healthy on the way to 162.159.135.42 - no packet loss, no high latency.

Still perplexed as to what was going on, I launched the IPM platform we built at Catchpoint and started monitoring onug.net from the 1,274 locations we have globally. Within seconds the issue became clear: tests reaching the IPv4 address worked perfectly, while tests hitting the IPv6 address could not establish a connection.

The data from the network tests clearly showed a 100% failure rate when connecting to ONUG’s IPv6 address.
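You can reproduce a per-family check like this from any dual-stack machine. A rough sketch, assuming port 443 and a 10-second timeout (both arbitrary choices), that attempts a TCP connection to every address the resolver returns:

    import socket

    def check_connect(host, port=443, timeout=10):
        # Try a TCP connect to every resolved address, labelled by address family.
        for family, socktype, proto, _, sockaddr in socket.getaddrinfo(
                host, port, type=socket.SOCK_STREAM):
            label = "IPv6" if family == socket.AF_INET6 else "IPv4"
            sock = socket.socket(family, socktype, proto)
            sock.settimeout(timeout)
            try:
                sock.connect(sockaddr)
                print(f"{label} {sockaddr[0]}: connected")
            except OSError as exc:
                print(f"{label} {sockaddr[0]}: FAILED ({exc})")
            finally:
                sock.close()

    check_connect("onug.net")

On a dual-stack machine hitting this misconfiguration, the IPv4 attempts succeed while the IPv6 attempts time out.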

Catchpoint Explorer shows time to load the ONUG URL and failures in red.  

Catchpoint waterfall showing the TCP connection timing out after 15 seconds

Network paths showing working and failing routes

The network path showed that onug.net sends IPv6 traffic to Akamai Linode and IPv4 traffic to Cloudflare. The site’s DNS is configured with both A and AAAA records, so users on IPv4 only see it working fine; however, any user on a dual-stack IPv4/IPv6 connection is at the mercy of which record the OS and Chrome pick. If the AAAA record (IPv6) is picked, they experience the same issue I did, while everything works just fine when the A record is picked.
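One way to see the split for yourself is to query the two record types directly. A small sketch using the third-party dnspython package (the record values will of course vary over time):

    import dns.resolver  # third-party: pip install dnspython

    # Query both record types to see where each address family is pointed.
    for rtype in ("A", "AAAA"):
        try:
            for rr in dns.resolver.resolve("onug.net", rtype):
                print(f"{rtype}: {rr.address}")
        except dns.resolver.NoAnswer:
            print(f"{rtype}: no record")

Comparing the two lists against the ranges of your current CDN or hosting provider is often enough to spot a record that was left pointing at the previous one.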

Key lessons learned

There are some key lessons to learn from this experience:

  1. First, do not rely on end users to tell you that you are down and why. Few users will go through this much troubleshooting to report an issue. Most will just move on to a different (competitor) site, or forget why they came to your site in the first place.
  2. Stop thinking of service outages as globally down. A service outage means some users cannot get the experience they expect from a service, whether caused by you or by your vendors/partners, and it is within your control to fix it. Over the years it has become clear that micro-outages (outages that impact a select group of users) are becoming the most common issue facing IT teams, given the increase in cloud infrastructure and services.
  3. You can’t rely on APM or “inside-out” tools, like flow-based network tools, to monitor the real-world user experience. I can see an issue like this resulting in IT saying “it works for me,” “my APM shows no errors,” or “the network is fine, traffic is coming through and getting what it requested,” and concluding it must be user error. You need to monitor from where your users are, proactively.
  4. If you are going to support IPv6, make sure you monitor both IPv4 and IPv6 synthetically, and don’t assume that a synthetic test from a single location of your observability tool will tell you what is happening everywhere.
  5. Make sure you monitor your DNS, and validate that you have the right entries so you can catch misconfigurations, like a stale AAAA record left over from a previous CDN, datacenter, or hosting provider (see the sketch after this list).
  6. As the saying goes, it’s always DNS. Well, it is not always DNS. Maybe a more accurate saying would be “troubleshooting starts with DNS,” but that does not have the same ring to it. At the end of the day, though, it is always you who is accountable for keeping your service working, whether the issue was caused by you, your team, or your vendors. Don’t deflect; take ownership.
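As a concrete illustration of lessons 4 and 5, here is a rough sketch of a DNS validation check that could run on a schedule. The expected prefix below is a hypothetical placeholder; you would substitute whatever ranges your current CDN or hosting provider actually uses:

    import ipaddress
    import dns.resolver  # third-party: pip install dnspython

    # Hypothetical example: the IPv6 ranges you expect your current provider to use.
    EXPECTED_V6_PREFIXES = [ipaddress.ip_network("2606:4700::/32")]  # placeholder value

    def validate_aaaa(host):
        # Flag any AAAA record that points outside the expected provider ranges.
        try:
            answers = dns.resolver.resolve(host, "AAAA")
        except dns.resolver.NoAnswer:
            print(f"{host}: no AAAA records published")
            return
        for rr in answers:
            addr = ipaddress.ip_address(rr.address)
            if any(addr in net for net in EXPECTED_V6_PREFIXES):
                print(f"{host}: {addr} OK")
            else:
                print(f"{host}: {addr} UNEXPECTED - possible stale record")

    validate_aaaa("onug.net")

Pair a check like this with synthetic connect tests over both address families, and a stale AAAA record stops being something a confused visitor has to report.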

Situations like this one cause many micro-outages, intermittent errors, and regional errors to go undetected or to be blamed on the users. We must recognize the many factors, from DNS resolution to ISP performance to routing to dozens of others, that can impact the performance or availability of applications for users in particular regions.

This simple situation highlights the importance of continuous, proactive monitoring from where your users are, and of having the right technology at hand to catch errors like this one and to detect and resolve issues before users are impacted.

