[ewg] nfs-rdma hanging with Ubuntu 9.10

Ross Smith myxiplx at googlemail.com
Fri Jan 15 05:28:31 PST 2010


Hi folks, it's me again I'm afraid.

Thanks to the help from this list, I have ipoib working. However, I
seem to be having a few problems, not least of which is commands
hanging if I attempt to use nfs-rdma.

Although the rdma mount command completes, the system then becomes
unresponsive if I attempt any command such as 'ls', even outside of
the mounted folder.  Umount also fails with the error "device is
busy".

If anybody can spare the time to help it would be very much
appreciated.  I do seem to have a lot of warnings in the logs, but
although I've tried searching for solutions, I haven't found anything
yet.


System details
============
 - Ubuntu 9.10
   (kernel 2.6.31)
 - Mellanox ConnectX QDR card
 - Flextronics DDR switch
 - OpenSolaris NFS server, running one of the latest builds for troubleshooting
 - OpenSM running on another Ubuntu 9.10 box with a Mellanox
   InfiniHost III Lx card

I am using the kernel drivers only; I have not installed OFED on this machine.


Loading driver
============
The driver appears to load, and ipoib works, but there are rather a
lot of warnings from dmesg.

I am loading the driver with:
$ sudo modprobe mlx4_ib
$ sudo modprobe ib_ipoib
$ sudo ifconfig ib0 192.168.101.4 netmask 255.255.255.0 up

And that leaves me with:
$ lsmod
Module                  Size  Used by
ib_ipoib               72452  0
ib_cm                  37196  1 ib_ipoib
ib_sa                  19812  2 ib_ipoib,ib_cm
mlx4_ib                42720  0
ib_mad                 37524  3 ib_cm,ib_sa,mlx4_ib
ib_core                57884  5 ib_ipoib,ib_cm,ib_sa,mlx4_ib,ib_mad
binfmt_misc             8356  1
ppdev                   6688  0
psmouse                56180  0
serio_raw               5280  0
mlx4_core              84728  1 mlx4_ib
joydev                 10272  0
lp                      8964  0
parport                35340  2 ppdev,lp
iptable_filter          3100  0
ip_tables              11692  1 iptable_filter
x_tables               16544  1 ip_tables
usbhid                 38208  0
e1000e                122124  0


At this point I can ping the Solaris server over the IP link, although
I do need to issue a ping from Solaris before I get a reply.  I'm
mentioning it in case it's relevant, but for now I'm assuming that's
just a firewall setting on the server.

But although ping works, I am starting to get some dmesg warnings, and
I don't know whether they are relevant:
[  313.692072] mlx4_ib: Mellanox ConnectX InfiniBand driver v1.0 (April 4, 2008)
[  313.885220] ADDRCONF(NETDEV_UP): ib0: link is not ready
[  316.880450] ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
[  316.880573] ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
[  316.880789] ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
[  320.873613] ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
[  327.147114] ib0: no IPv6 routers present
[  328.861550] ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
[  344.834440] ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
[  360.808312] ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
[  376.782186] ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
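To see which IPv4 multicast group the failing join refers to, the MGID in the log can be decoded by hand. The following is only a sketch: it assumes the IPoIB mapping from RFC 4391, where the low 28 bits of the MGID carry the low 28 bits of the IPv4 group address, and the helper name mgid_to_ipv4 is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: decode an IPoIB IPv4 multicast GID into the
# IPv4 group address, assuming the RFC 4391 mapping.
mgid_to_ipv4() {
    # Drop the colons and keep the last 8 hex digits (32 bits).
    hex=$(printf '%s' "$1" | tr -d ':' | tail -c 8)
    # Only the low 28 bits belong to the IPv4 group address.
    low28=$(( 0x$hex & 0x0FFFFFFF ))
    # IPv4 multicast addresses always start with the bits 1110 (224.0.0.0/4).
    addr=$(( 0xE0000000 | low28 ))
    printf '%d.%d.%d.%d\n' \
        $(( (addr >> 24) & 255 )) $(( (addr >> 16) & 255 )) \
        $(( (addr >> 8) & 255 ))  $(( addr & 255 ))
}

mgid_to_ipv4 ff12:401b:ffff:0000:0000:0000:0000:00fb
```

For the MGID in the log this gives 224.0.0.251, which is the mDNS group, so the failing join may come from a service discovery daemon rather than from NFS itself.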

At this point, however, regular nfs mounts work fine over the ipoib link:
$ sudo mount 192.168.100.1:/test/rdma ./nfstest

But again, that adds warnings to dmesg:
[  826.456902] RPC: Registered udp transport module.
[  826.456905] RPC: Registered tcp transport module.
[  841.553135] svc: failed to register lockdv1 RPC service (errno 97).

And the speed is definitely nothing to write home about; copying a
100 MB file takes over 10 seconds:
$ time cp ./100mb ./100mb2

real	0m10.472s
user	0m0.000s
sys	0m1.248s

And again with warnings appearing in dmesg:
[  872.373364] ib0: post_send failed
[  872.373407] ib0: post_send failed
[  872.373448] ib0: post_send failed

I think this is a client issue rather than a problem on the server, as
the same test on an OpenSolaris client takes under half a second:
# time cp ./100mb ./100mb2

real    0m0.334s
user    0m0.001s
sys     0m0.176s
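To put the two timings side by side, the "real" times can be turned into rough throughput figures. This is a sketch that assumes the test file is exactly 100 MB; throughput_mb_s is a hypothetical helper name, and the numbers say nothing about caching effects on either client.

```shell
#!/bin/sh
# Hypothetical helper: throughput_mb_s SIZE_MB SECONDS
# Prints size/time in MB/s with one decimal place.
throughput_mb_s() {
    awk -v size="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", size / secs }'
}

throughput_mb_s 100 10.472   # Ubuntu client, real time from above
throughput_mb_s 100 0.334    # OpenSolaris client, real time from above
```

That works out to roughly 9.5 MB/s on the Ubuntu client against about 299 MB/s on the OpenSolaris client.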

Although the system is definitely not right, my long term aim is to
run nfs-rdma on this system, so my next test was to try that and see
if the speed improved:

$ sudo umount ./nfstest
$ sudo mount -o rdma,port=20049 192.168.101.1:/test/rdma ./nfstest

That takes a long time to connect.  It does eventually go through, but
only after the following errors in dmesg:

[ 1140.698659] RPC: Registered rdma transport module.
[ 1155.697672] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0, memreg 5 slots 32 ird 4
[ 1160.688455] rpcrdma: connection to 192.168.101.1:20049 closed (-103)
[ 1160.693818] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0, memreg 5 slots 32 ird 4
[ 1160.695131] svc: failed to register lockdv1 RPC service (errno 97).
[ 1170.676049] rpcrdma: connection to 192.168.101.1:20049 closed (-103)
[ 1170.681458] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0, memreg 5 slots 32 ird 4
[ 1190.647355] rpcrdma: connection to 192.168.101.1:20049 closed (-103)
[ 1190.652778] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0, memreg 5 slots 32 ird 4
[ 1220.602353] rpcrdma: connection to 192.168.101.1:20049 closed (-103)
[ 1220.607809] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0, memreg 5 slots 32 ird 4
[ 1250.557397] rpcrdma: connection to 192.168.101.1:20049 closed (-103)
[ 1250.562817] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0, memreg 5 slots 32 ird 4
[ 1281.522735] rpcrdma: connection to 192.168.101.1:20049 closed (-103)
[ 1281.528442] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0, memreg 5 slots 32 ird 4
[ 1311.477845] rpcrdma: connection to 192.168.101.1:20049 closed (-103)
[ 1311.482983] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0, memreg 5 slots 32 ird 4
[ 1341.432758] rpcrdma: connection to 192.168.101.1:20049 closed (-103)
[ 1341.438212] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0, memreg 5 slots 32 ird 4
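The numeric codes in the dmesg output above (-22, 97 and -103) are easier to search for by name. The lookup below is a small sketch that assumes they are Linux errno values (negated in the usual kernel style) and uses the generic Linux numbering; errno_name is a hypothetical helper and only covers the three codes seen here.

```shell
#!/bin/sh
# Hypothetical helper: map the errno numbers seen in the logs to their
# symbolic names, assuming the generic Linux errno numbering.
errno_name() {
    n=${1#-}    # kernel messages often print the value negated
    case "$n" in
        22)  echo "EINVAL (Invalid argument)" ;;
        97)  echo "EAFNOSUPPORT (Address family not supported by protocol)" ;;
        103) echo "ECONNABORTED (Software caused connection abort)" ;;
        *)   echo "unknown ($n); check errno.h" ;;
    esac
}

for code in -22 97 -103; do
    printf '%s -> %s\n' "$code" "$(errno_name "$code")"
done
```

Under that assumption, the repeated "closed (-103)" lines mean each rdma connection attempt is being aborted, and the lockd registration failure reports an unsupported address family.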

At this point, however, my shell session becomes unresponsive if I
attempt so much as an 'ls'.  The system hasn't hung completely, though,
as I can still connect another ssh session and restart with
$ sudo init 6

Can anybody help?  Is there anything obvious I am doing wrong here?

thanks,

Ross


