
Can your local NFS connections get broken by your external Internet connection?

Long story short: Yes! A flaky Internet connection to the outside world can make your local NFS client-server connections unusable. Even when they run on a dedicated storage network using dedicated switches and cables. This is a tale of dependencies, wrong assumptions, desperate restart of services, the butterfly effect and learning something new.

The company I work for operates 1000+ production servers in three data centers around the globe. This all started after a planned, trivial mass-restart of some internal services which are used by the online Control Panel. A couple of minutes after the restarts, the support team alerted me that the Backup section of the Control Panel was not working. I acted like a typical System Administrator and simply checked whether the NFS backups were accessible from an SSH console. They were. So I concluded that it most probably wasn’t the NFS service itself but a problem caused by the restart of the internal services which keep the Control Panel running.

So we called a system developer to look into this. In the meantime I discovered through testing that the issue was limited to only one of our data centers. This raised an eyebrow, but with no further debug info and with everything working from the SSH console, I had to wait for input from the system development team. About an hour later they came up with a super simple reproducer:

perl -e 'use Path::Tiny; print path("/nfs/backup/somefile")->slurp;'

strace showed that this hung on “flock(LOCK_SH)”. OMG! So it was a problem with the System Administrators’ systems after all. My previous test had been to simply browse and read the files, and it didn’t occur to me to try file locking. I didn’t even know the Control Panel used it; it turns out to be a (weird) default of Path::Tiny. A couple of minutes later I simplified the reproducer even further, down to the following:

flock --shared /nfs/backup/somefile true

This also hung on “flock(LOCK_SH)”. Only in the USA data center. The backup servers were complaining about the following:

statd: server rpc.statd not responding, timed out
lockd: cannot monitor %server-XXX-of-ours%

The NFS clients were reporting:

lockd: server %backup-IP% not responding, still trying
xs_tcp_setup_socket: connect returned unhandled error -107

Right! So it was “rpc.statd” that had just died! On both of our backup servers, simultaneously? Hmm… I raised the eyebrow even more. All servers had weeks of uptime, there had been no changes at the time the incident started, etc. Nothing suspicious caused by activity from any of our teams. Nevertheless, it doesn’t hurt to restart the NFS services. So I did: I restarted the backup NFS services (twice), restarted the client NFS services on one of the production servers, and unmounted and re-mounted the NFS directories. Nothing. Finally, I rebooted the backup servers because a “[lockd]” kernel process was hung in “D” state. After all, it is possible that two backup servers with the same uptime hit the same kernel bug at the same time…

The restart of the server machines fixed it! Phew! Yet another unresolved mystery fixed by a restart. Wait! Three minutes later the joy was gone, because the Control Panel Backup section became sluggish again. The production machine where I was testing was only intermittently able to use NFS locking.

2h30m had already elapsed. Now it finally occurred to me that I needed to pay closer attention to what the “rpc.statd” process was doing. To my surprise, strace showed that the process was waiting 5+ seconds for some… DNS queries! It was trying to resolve “a.b.c.x.in-addr.arpa” and timing out. The request was going to the local DNS cache server. The public DNS resolvers 8.8.8.8 and 1.1.1.1 were working properly and immediately returned “NXDOMAIN” for this query, so I configured them on the backup servers and the NFS connections became much more stable. Still not perfect, though.
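
For the record, the comparison boils down to a few dig queries. A minimal sketch, where 10.1.2.3 and 10.0.0.53 are hypothetical stand-ins for the masked client address (“a.b.c.x”) and our local DNS cache:

dig +time=2 +tries=1 -x 10.1.2.3 @10.0.0.53   # via the local cache: hangs and times out
dig +time=2 +tries=1 -x 10.1.2.3 @8.8.8.8     # via Google: instant NXDOMAIN
dig +time=2 +tries=1 -x 10.1.2.3 @1.1.1.1     # via Cloudflare: instant NXDOMAIN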

The situation started to clear up. The NFS client was connecting to the NFS server. The server then tried to resolve the client’s private IP address to a hostname, failed, and each failed DNS lookup took many seconds. The reverse DNS zone for this private IPv4 network is served by the DNS servers “blackhole-1.iana.org” and “blackhole-2.iana.org”. Unfortunately, our upstream Internet provider was experiencing a problem, and the connection to those DNS servers was failing with “Time to live exceeded” because of a network loop.
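
Confirming such a loop from a shell is simple enough. A sketch, commands only (the exact hops obviously depend on the provider):

traceroute -n blackhole-1.iana.org   # the same provider hops repeat until the maximum TTL
ping -c 3 blackhole-1.iana.org       # every probe comes back with "Time to live exceeded"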

But why was the NFS locking still a bit sluggish after I fixed the NFS servers? It turned out that “rpc.statd” on the NFS clients also does a DNS lookup for the IP address of the NFS server.

30 minutes later I blacklisted the whole “x.in-addr.arpa” DNS zone for the private IPv4 network in all our local DNS resolvers, so they immediately replied with SERVFAIL. The NFS locking became fast again and the online Control Panels were responding as expected.
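
For illustration, here is roughly what that looks like on a Debian-style BIND resolver. A minimal sketch, assuming the private network is 10.0.0.0/8 (the real first octet is masked as “x” above); an empty authoritative zone actually answers with NXDOMAIN rather than SERVFAIL, but the effect is the same: an instant local reply instead of a multi-second upstream timeout.

cat >> /etc/bind/named.conf.local <<'EOF'
// Serve the private reverse zone locally so these queries never leave the resolver.
zone "10.in-addr.arpa" {
    type master;
    file "/etc/bind/db.empty";   // empty zone file shipped with the bind9 package
};
EOF
rndc reconfig   # reload the configuration without restarting named

Recent BIND releases also ship built-in empty zones for the RFC 1918 reverse ranges (RFC 6303), so a plain recursive resolver may already behave like this unless it forwards everything upstream.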

Case closed, in three hours. It could have been done much faster, if only I had known NFS and our NFS usage pattern better, and hadn’t jumped to the wrong assumptions. Still, I’m happy that I got to the root cause and have the confidence that the service is completely fixed for our customers.



Dynamic DNS using AWS Route 53

The Internet ecosystem and technologies have advanced so much lately that you can rebuild an entire business from scratch in a few hours of coding and at pretty acceptable costs. I’m referring to the dynamic DNS (aka DDNS or DynDNS) service which was a hit a few years back. It took me less than a hundred lines of code to create a simple dynamic DNS using AWS Route 53. The AWS API and backend provide the DNS service, while the free “ipify” service lets you look up your real remote IP address. While this solution is not free as in speech, it is nearly free as in beer, costing less than a dollar per month.
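
The code itself isn’t in this post, but the idea fits in a short shell sketch using the AWS CLI and curl; the hosted zone ID and record name below are placeholders, and the AWS credentials are assumed to be allowed to change records in that zone:

#!/bin/bash
# Minimal sketch of the idea, not the full script.
set -euo pipefail

ZONE_ID="ZXXXXXXXXXXXXX"      # placeholder: your Route 53 hosted zone ID
RECORD="home.example.com"     # placeholder: the DNS name to keep updated

# ipify returns your public IP address as plain text.
IP="$(curl -fsS https://api.ipify.org)"

# UPSERT an A record with a short TTL so changes propagate quickly.
aws route53 change-resource-record-sets \
  --hosted-zone-id "$ZONE_ID" \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "'"$RECORD"'",
        "Type": "A",
        "TTL": 300,
        "ResourceRecords": [{"Value": "'"$IP"'"}]
      }
    }]
  }'

Run it from cron every few minutes and you have a working DynDNS replacement; the Route 53 hosted zone fee is what makes up most of the “less than a dollar per month”.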


ResellerClub technical problems with DNS and Domain Forwarding

Today I woke up and found out that many of my domains aren’t working. I’m using ResellerClub (aka. DirectI) free DNS and free Domain Forwarding. Well, they deleted all my DNS records. The Domain Forwarding service stopped working too.

I’ve filed a support ticket with an “Emergency” priority. Let’s see what happens now… I’ll keep you updated.

UPDATE: Instead of just fixing things, ResellerClub really disappointed me. I am their long-term and very faithful customer, but now I see that I shouldn’t have trusted them so much.

It took two days for ResellerClub’s support to respond with the following ridiculous statement:

DNS records are merely record entries made in the server, and not a space occupying entity for which a ‘backup’ would be available/generated.
Hence there is no backup for the records at our end.

ResellerClub keep no backup whatsoever of their DNS and Domain Forwarding configurations! The explanation of this fact is hilarious — because DNS records do not occupy space. 🙂

This really pissed me off, and I asked them how they could operate without backups. What if the disks of their DNS or database servers fail? What if an operator deletes a record by mistake? The answer from ResellerClub’s support was that they monitor their servers, so this is not an issue. Ha-ha! Since when is monitoring a substitute for backups?? Furthermore, they added that the system was fully automated, so no operator’s mistake was possible. I replied that even if the system operates automatically, it is still maintained by humans, who may delete data by mistake. And how would they explain why my DNS and forwarding data was lost…

Anyway, no need to dig into this any further. I re-created the DNS records, and also learnt an important lesson — never trust an (IT) organization unless you really know that they operate in a professional way by following all well-established principles in the industry. ResellerClub ain’t one of them!