Strange NFS-related messages (related to lockd/statd)

Discussion:

Jeremy Chadwick

2010-03-29 16:58:30 UTC

I recently brought up rpc.lockd and rpc.statd on all of our NFS clients
(mixed RELENG_6, RELENG_7, and RELENG_8), and our NFS server (RELENG_8).

All clients had nfs_client_enable="yes" in rc.conf prior to their last
reboot, but lacked rpcbind_enable="yes", rpc_lockd_enable="yes", and
rpc_statd_enable="yes" prior to the below.

The 8.x clients started rpcbind, rpc.lockd, rpc.statd -- then said:

NLM: failed to contact remote rpcbind, stat = 0, port = 0
Can't start NLM - unable to contact NSM

The 7.x clients started rpcbind, rpc.lockd, rpc.statd -- then said:

Can't start NLM - unable to contact NSM

One of the 7.x clients also kernel panic'd when starting rpc.lockd,
in some nlm_* kernel functions. Looking at commits showed that the bug
that caused the panic was fixed in a later 7.x release.

The 6.x clients started rpcbind, rpc.lockd, rpc.statd -- and said
nothing.

The above daemons were all started in that order, per the FreeBSD
Handbook.

I can't find a definition of what the acronyms NLM and NSM stand for,
nor does Googling the error messages return relevant results (except one
FreeBSD committer reporting something similar, to which nobody replied).
I don't know the implications of these messages.

The only thing I can think might cause such errors would be the fact
that these machines all have dual NICs with firewall rules applied only
to their primary (WAN-side) interface. The NFS server exists only on
the private (LAN-side) interface. I'm thinking rpcbind may have tried
to "do stuff" on the WAN interface, since no -h option was applied.

I haven't tried making use of -h yet, nor have I tried restarting the
daemons to see if the errors recur (or if it was just a one-time thing).

Any information/tips/advice would be appreciated. Danke!

Rick Macklem

2010-03-29 21:55:57 UTC

Post by Jeremy Chadwick
I can't find a definition of what the acronyms NLM and NSM stand for,
nor does Googling the error messages return relevant results (except one
FreeBSD committer reporting similar, but nobody replied). I don't know
the implications of these messages.

NLM - Network Lock Manager
NSM - Network Status Monitor (I think?)

These two protocols (separate from NFS) were what Sun implemented in
the 1980s to provide locking on NFS mount points. Imho, these protocols
were poorly designed:
- The NLM allows blocking locks at the server, which can cause assorted
nasty issues when the client crashes or gets network partitioned.
- It also depended on the NSM to decide when machines were up/down and
the NSM protocol basically did this in a rather poor way.

A big part of NFSv4 was the integration of locking, in order to avoid
use of the above. (As you might have guessed, lockd and statd implement
the above two protocols.)

rick

Rick Macklem

2010-03-30 14:33:08 UTC

Post by Jeremy Chadwick
I recently brought up rpc.lockd and rpc.statd on all of our NFS clients
(mixed RELENG_6, RELENG_7, and RELENG_8), and our NFS server (RELENG_8).
All clients had nfs_client_enable="yes" in rc.conf prior to their last
reboot, but lacked rpcbind_enable="yes", rpc_lockd_enable="yes", and
rpc_statd_enable="yes" prior to the below.
NLM: failed to contact remote rpcbind, stat = 0, port = 0
Can't start NLM - unable to contact NSM
Can't start NLM - unable to contact NSM

Oh, I forgot to mention... I can't help much, but these protocols/daemons
are SunRPC, so they will be using portmapper (now called rpcbind) to get
port #s assigned dynamically. I also believe (not sure, don't know much
about it) that the NSM will poll for other machines, so it needs to be
able to talk to all clients and server(s), including doing IP broadcast
that gets to them all. (These were designed in the 1980s for a LAN, which
was just a chunk of coax in those days:-)

Hope this helps, rick

Jeremy Chadwick

2010-03-30 15:39:36 UTC

Post by Rick Macklem

In fact it did! Your hint led me to try my earlier idea: using the -h
flag to rpcbind.

Turns out lockd wasn't running on any of the systems (rpcinfo didn't
show it, and ps didn't show it). I ended up modifying all of the boxes
to use:

rpcbind_flags="-h <ipaddr of em1>"

(where em1 = LAN and em0 = WAN; em0 also carries the default route)
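
For anyone landing here later, the settings mentioned in this thread can be collected into one rc.conf fragment. This is only a sketch assembled from the variable names quoted above; the address is left as the same placeholder used in the post and has to be replaced with the real LAN-side address.

```shell
# /etc/rc.conf fragment for an NFS client using lockd/statd.
# All variable names are the ones quoted earlier in this thread.
nfs_client_enable="yes"
rpcbind_enable="yes"
rpc_statd_enable="yes"
rpc_lockd_enable="yes"
# Ask rpcbind to bind to the LAN-side address (em1 in this thread)
# instead of the default INADDR_ANY. The value is a placeholder.
rpcbind_flags="-h <ipaddr of em1>"
```

After editing, the daemons need restarting in the order used in the thread: rpcbind, then statd, then lockd.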

Restarted rpcbind + statd + lockd (in that order). Voila, everything
started up, and no messages. rpcinfo shows all correct services. So my
guess is that by binding to INADDR_ANY by default, packets were going
out the primary interface (em0) or going to broadcast on em0 -- which
would return nothing, since pf blocked such packets. Makes sense to me
anyway.
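
The "rpcinfo shows all correct services" check can be scripted roughly as below. This is a sketch, not something taken from the thread: check_rpc_services is a made-up helper name, and it assumes the usual "rpcinfo -p" layout with the RPC program number in the first column. It matches on the standard program numbers (100000 rpcbind/portmapper, 100024 status, 100021 nlockmgr) rather than on service names.

```shell
#!/bin/sh
# Read an `rpcinfo -p` listing on stdin and report whether rpcbind,
# statd (status) and lockd (nlockmgr) are all registered.
# check_rpc_services is a hypothetical helper, not part of FreeBSD.
check_rpc_services() {
    listing=$(cat)
    missing=""
    for pair in 100000:rpcbind 100024:status 100021:nlockmgr; do
        prog=${pair%%:*}    # RPC program number
        name=${pair#*:}     # human-readable label for the report
        # Succeeds if any line has this program number in column 1.
        if ! printf '%s\n' "$listing" |
            awk -v p="$prog" '$1 == p { found = 1 } END { exit !found }'
        then
            missing="$missing $name"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "all registered"
}
```

Typical use would be "rpcinfo -p | check_rpc_services" on each client and on the server. If rpcbind itself is unreachable, rpcinfo prints nothing on stdout and all three services are reported missing.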

Thanks!