Tuesday, February 5, 2013

Fun when mail server receives SERVFAIL instead of NXDOMAIN...

Ok, I got log files overflowed with error messages like this one:
Feb 5 11:01:35 mail named[994]: error (host unreachable) resolving 'sbs-music.com/NS/IN':
In essence, name server for this domain ( is unreachable from the DNS server used by mail server. Trying to manually query the server, I get:
$ host -t ns sbs-music.com
Host sbs-music.com not found: 2(SERVFAIL)
Note the status, it's SERVFAIL. The result is that mail server thinks it is a temporary error and retries later, with the same results. Trying this on another host (that uses another DNS server) I get:
$ host -t ns sbs-music.com
Host sbs-music.com not found: 3(NXDOMAIN)
Well, this time it tells me that there is no such domain. An error message like this would tell mail server to give up and return error response.

So, why is there discrepancy between the two? Using tcpdump in the first case (i.e. when we get SERVFAIL) the following requests/responses are exchanged (slightly edited for readability):
192.168.x.y.51892 > 39504 [1au] NS? sbs-music.com. (42) > 192.168.x.y.51892: 39504 4/0/5 NS ns4.shepherdhosting.com., NS ns1.shepherdhosting.com., NS ns2.shepherdhosting.com., NS ns3.shepherdhosting.com. (194)
192.168.x.y.10749 > 40104 [1au] NS? sbs-music.com. (42)
192.168.x.y.63081 > 56636 [1au] NS? sbs-music.com. (42) > 192.168.x.y.63081: 56636 4/0/5 NS ns1.shepherdhosting.com., NS ns2.shepherdhosting.com., NS ns3.shepherdhosting.com., NS ns4.shepherdhosting.com. (194)
192.168.x.y.31948 > 27220 [1au] NS? sbs-music.com. (42)
So, let me interpret this trace. The first query is to IP address and it asks for the name server of a domain sbs-music.com. Doing reverse DNS query, we get:
# host domain name pointer sh214.shepherdhosting.com.
So, it's some hosting provider. Now, what we get in response is that name servers for that domain are ns1.shepherdhosting.com through ns4.shepherdhosting.com. Ok, our DNS server choose ns4.shepherdhosting.com with IP address Then, it queried it for sbs-music.com domain. This time the query timed out. So, our DNS server decided to query again for name servers. It again received the same list and then it again queried ns4 which didn't answer the query.

Let us manually try some other server. So, querying ns1.shepherdhosting.com we get:
# host -t ns sbs-music.com.
Using domain server:

sbs-music.com name server ns2.shepherdhosting.com.
sbs-music.com name server ns3.shepherdhosting.com.
sbs-music.com name server ns4.shepherdhosting.com.
sbs-music.com name server ns1.shepherdhosting.com.
This is the same response we saw in the first part of the trace. Trying ns2:
$ host -t ns sbs-music.com.
;; connection timed out; trying next origin
;; connection timed out; no servers could be reached
Well, ns2 doesn't respond. Neither do ns3, nor as was always obvious, ns4. What does the trace looks like (again, slightly edited):
12:20:22.532877 IP 192.168.x.y.34364 > 31539+ NS? sbs-music.com. (31)
12:20:27.532864 IP 192.168.x.y.34364 > 31539+ NS? sbs-music.com. (31)
12:20:32.533445 IP 192.168.x.y.45322 > 5827+ NS? sbs-music.com. (31)
12:20:52.733389 IP > 192.168.x.y.34364: 31539 ServFail 0/0/0 (31)
12:20:52.733410 IP 192.168.x.y > ICMP 192.168.x.y udp port 34364 unreachable, length 67
12:20:52.734042 IP > 192.168.x.y.45322: 5827 ServFail 0/0/0 (31)
12:20:52.734053 IP 192.168.x.y > ICMP 192.168.x.y udp port 45322 unreachable, length 67
Now, we have interesting situation here. First, DNS server for sbs-music.com takes significant time to answer, and when it answers our local DNS isn't listening any more (thus those ICMP error messages). But, in the end local DNS concludes correctly that something's wrong with the name servers for that domain.

The final piece of puzzle comes from querying com name server for sbs-music.com domain:
$ host -t ns sbs-music.com l.gtld-servers.net.
Using domain server:
Name: l.gtld-servers.net.

sbs-music.com has no NS record
Obviously, this domain was removed. It is clear now that this sbs-music.com domain existed for some time, and DNS server that produces error cached IP address of its domain name. If it were to query com domain name server again, it would receive NXDOMAIN error and properly notify mail server.

To see currently cached entries of BIND name server use the following command:
rndc dumpdb
Then look for file cache_dump.db in /var/named/data (or in /var/named/chroot/var/named/data if you are running BIND in chroot). It is a textual file that you can inspect with text editor, less or something similar. In my case there were the following lines there:
; glue
sbs-music.com.          172669  NS      ns1.shepherdhosting.com.
                        172669  NS      ns2.shepherdhosting.com.
                        172669  NS      ns3.shepherdhosting.com.
                        172669  NS      ns4.shepherdhosting.com.

To flush a single entry use the following command
rndc flushname sbs-music.com internal
This removes sbs-music.com from caches in internal view (I configured viewes so that server behaves differently depending on who asks it). Yet, this didn't help. Then I tried to flush everything in internal view using:
rndc flush internal
But this, while helped, didn't actually solve the problem. Namely, looking into packet trace it turns out that  BIND server receives from b.gtld-servers.net. that given domain doesn't exist and from somewhere it pulls the old IP address!

So, I finally decided to look into log and there are a lot of the following messages:
error (FORMERR) resolving 'sbs-music.com/NS/IN':
along with the one that triggered all this. Ok, I also tried to upgrade bind, but no luck, still SERVFAIL errors.

Now, its time for heavy artillery, or Wireshark. So I saved packet trace and loaded it into Wireshark. Guess what! Wireshark crashed on requests sent by local DNS server!?

Ok, after a bit more fiddling I realised that some com domain name servers do know for this domain. But note the difference between the output of host command (that I've used previously) and nslookup command:
$ nslookup -type=ns sbs-music.com

Non-authoritative answer:
*** Can't find sbs-music.com: No answer

Authoritative answers can be found from:
sbs-music.com nameserver = ns1.shepherdhosting.com.
sbs-music.com nameserver = ns2.shepherdhosting.com.
sbs-music.com nameserver = ns3.shepherdhosting.com.
sbs-music.com nameserver = ns4.shepherdhosting.com.
ns1.shepherdhosting.com internet address =
ns2.shepherdhosting.com internet address =
ns3.shepherdhosting.com internet address =
ns4.shepherdhosting.com internet address =
Ok, if this com server tells me that there is no this domain, why is then it pointing me to those nameservers? dig command is a bit more informative:
$ dig @ sbs-music.com

; <<>> DiG 9.9.2-P1-RedHat-9.9.2-6.P1.fc18 <<>> @ sbs-music.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 64382
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 4, ADDITIONAL: 5
;; WARNING: recursion requested but not available

; EDNS: version: 0, flags:; udp: 4096
;sbs-music.com. IN A

sbs-music.com. 172800 IN NS ns1.shepherdhosting.com.
sbs-music.com. 172800 IN NS ns2.shepherdhosting.com.
sbs-music.com. 172800 IN NS ns3.shepherdhosting.com.
sbs-music.com. 172800 IN NS ns4.shepherdhosting.com.

ns1.shepherdhosting.com. 172800 IN A
ns2.shepherdhosting.com. 172800 IN A
ns3.shepherdhosting.com. 172800 IN A
ns4.shepherdhosting.com. 172800 IN A
;; Query time: 149 msec
;; WHEN: Tue Feb  5 13:49:22 2013
;; MSG SIZE  rcvd: 194
What a mess?!

Then, I googled a bit to find why BIND is returning SERVFAIL instead of NXDOMAIN. This is something interesting that I found:
  • BIND could have returned SERVFAIL instead of NXDOMAIN responses for nonexistent resource records from the unsigned child zone if the parent zone was signed. (BZ#643012)
Trying to lookup that bug in RedHat's Bugzilla gives be big red square which tells me that I'm not allowed to see it (despite being logged in) so it's some security issue obviously!?

Looking at how different BIND versions behave I get the following results:
  • bind-9.3.6-20.P1.el5_8.6 and bind-9.9.2-6.P1.fc18.x86_64 return NXDOMAIN.
  • bind-9.8.2-0.10.rc1.el6_3.6.x86_64 and bind-9.8.2-0.10.rc1.el6_3.5.x86_64 return SERVFAIL.
Could it be somehow related to DNSSEC?

Ok, let me conclude. The problem is that name servers for domain sbs-music.com aren't correctly configured, while the domain itself is registered with com domain servers. This triggers different behavior from BIND. So, there are two possibilities from here:
  1. Persuade somehow BIND to return NXDOMAIN instead of SERVFAIL.
  2. Find what is causing queries for this domain in the first place.
Stay tuned... :)

No comments:

About Me

scientist, consultant, security specialist, networking guy, system administrator, philosopher ;)