Yesterday, I was pinging a server when I noticed that the output of ping seemed
to be rather slow. In fact, I'd noticed it before but never really thought about
it until I was pinging anther server at the same time and the saw the drastic
difference in output speeds.
Pinging google.co.uk, there was a ping every second:
fpr@callisto:~$ ping google.co.uk | perl -ne 'use Time::Format; print "$time{\"hh:mm:ss.mmm\"} - $_"'
14:44:34.898 - PING google.co.uk (74.125.77.104) 56(84) bytes of data.
14:44:34.915 - 64 bytes from ew-in-f104.google.com (74.125.77.104): icmp_seq=1 ttl=238 time=32.9 ms
14:44:35.883 - 64 bytes from ew-in-f104.google.com (74.125.77.104): icmp_seq=2 ttl=238 time=35.9 ms
14:44:36.882 - 64 bytes from ew-in-f104.google.com (74.125.77.104): icmp_seq=3 ttl=238 time=33.4 ms
On another server, it was closer to five:
fpr@callisto:~$ ping server1.fsckvps.com | perl -ne 'use Time::Format; print "$time{\"hh:mm:ss.mmm\"} - $_"'
14:49:42.389 - PING server1.fsckvps.com (66.71.248.146) 56(84) bytes of data.
14:49:42.408 - 64 bytes from 66.71.248.146: icmp_seq=1 ttl=45 time=122 ms
14:49:47.500 - 64 bytes from 66.71.248.146: icmp_seq=2 ttl=45 time=123 ms
14:49:52.625 - 64 bytes from 66.71.248.146: icmp_seq=3 ttl=45 time=122 ms
This didn't make much sense since the round-trip times were small by comparison
and ping sends one request per second by default. A Google search indicated that
the problem might lie with my
resolv.conf file. Unfortunately, mine
seemed to be fine and my local DNS server was completely responsive. However, if
I pinged the server by IP address instead of by hostname, the delay was gone.
To see what was blocking, I ran strace on
ping.
15:04:26.789 - munmap(0x7f69ed319000, 129482) = 0
15:04:26.790 - socket(PF_FILE, SOCK_STREAM, 0) = 4
15:04:26.790 - fcntl(4, F_GETFD) = 0
15:04:26.790 - fcntl(4, F_SETFD, FD_CLOEXEC) = 0
15:04:26.790 - connect(4, {sa_family=AF_FILE, path="/var/run/avahi-daemon/socket"...}, 110) = 0
15:04:26.790 - fcntl(4, F_GETFL) = 0x2 (flags O_RDWR)
15:04:26.790 - fstat(4, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
15:04:26.790 - mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f69ed359000
15:04:26.791 - lseek(4, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
15:04:26.791 - write(4, "RESOLVE-ADDRESS 66.71.248.146\n"..., 30) = 30
15:04:31.789 - read(4, "-15 Timeout reached\n"..., 4096) = 20
15:04:31.789 - close(4) = 0
It was blocking each ping on a read from /var/run/avahi-daemon/socket, a socket
used by Avahi, an implementation of the network
auto-configuration and service discovery mechanism
Zeroconf that
augments DNS. Killing the Avahi daemon solved the problem, but I still wanted to
work out why ping was talking to Avahi and why it only occurred on certain
hosts, so I ran ltrace:
15:14:33.062 - gettimeofday(0x7fff90ec0550, NULL) = 0
15:14:33.062 - gettimeofday(0x7fff90ec0520, NULL) = 0
15:14:33.062 - memcpy(0x00608988, "\311\004\030J", 16) = 0x00608988
15:14:33.062 - sendmsg(3, 0x6077c0, 2048, 2, 61091) = 64
15:14:33.184 - recvmsg(3, 0x7fff90ec15f0, 0, 0, 61091) = 84
15:14:38.193 - gethostbyaddr("BG\370\222T\315(\002", 4, 2) = NULL
15:14:38.193 - inet_ntoa(0x92f84742) = "66.71.248.146"
15:14:38.194 - strcpy(0x006078e0, "66.71.248.146") = 0x006078e0
I then wrote a little test in C just to check that I could replicate the delay
with gethostbyaddr(). I could, and it
was then that I finally realised that the delay occurred when pinging hosts that had
no PTR record (reverse DNS). Slightly confusingly, ping performs a reverse DNS lookup
on the IP address when provided with a hostname, but not when given an IP address.
gethostbyaddr() was calling Avahi because it had plugged itself into the glibc
resolver using NSS. When an
attempt to resolve an IP address to a hostname failed, glibc would then call
Avahi to try to find it. For whatever reason, Avahi cannot answer this request
instantly and times out after 5 long seconds. Avahi also resolves host names
to IP addresses but the delay in looking up unresolvable host names only occurs
if the domain is under the .local pseudo top-level
domain.
The Debian package dependencies make it a bit difficult to remove Avahi so the
easiest way to fix this is to remove references to mdns4 from
/etc/nsswitch.conf. If you want to
kill the daemon entirely then you can always disable its init script of course.
Amusingly (or tragically) this bug is listed in
Ubuntu and
Debian bug reports
which are both over two years old. At time time of writing, it still exists in
Ubuntu Jaunty and Debian testing. What makes me really angry is that somewhere,
someone decided that it would be a great idea to enable this daemon by default
on desktop installs and as a result, performance of applications is being
degraded. Obviously ping doesn't matter that much, but as mentioned in the bug
reports, this hits people using ssh and IMAP as well, causing anything from
mild delays to almost unusable systems. The worst part of this is that many
people (and there could be a lot) suffering these issues are probably
attributing them to a slow network, or packet loss, or ssh key verification or
just about anything else other than a dubiously designed daemon running on
their own machine. The fact that it only occurs on certain hosts only
reinforces this and unless they suddenly realise that this delay is their
local machine's fault and put a lot of effort into debugging, they'll
probably never know.