[ewg] nfs-rdma hanging with Ubuntu 9.10
Ross Smith
myxiplx at googlemail.com
Fri Jan 15 05:28:31 PST 2010
Hi folks, it's me again I'm afraid.
Thanks to the help from this list, I have ipoib working. However, I
seem to be having a few problems, not least of which is commands
hanging if I attempt to use nfs-rdma.
Although the rdma mount command completes, the system then becomes
unresponsive if I attempt any command such as 'ls', even outside of
the mounted folder. Umount also fails with the error "device is
busy".
If anybody can spare the time to help it would be very much
appreciated. I do seem to have a lot of warnings in the logs, but
although I've searched for solutions I haven't found anything
yet.
System details
============
- Ubuntu 9.10 (kernel 2.6.31)
- Mellanox ConnectX QDR card
- Flextronics DDR switch
- OpenSolaris NFS server, running one of the latest builds for troubleshooting
- OpenSM running on another Ubuntu 9.10 box with a Mellanox
Infinihost III Lx card
I am using the kernel drivers only; I have not installed OFED on this machine.
Loading driver
============
The driver appears to load, and ipoib works, but there are rather a
lot of warnings from dmesg.
I am loading the driver with:
$ sudo modprobe mlx4_ib
$ sudo modprobe ib_ipoib
$ sudo ifconfig ib0 192.168.101.4 netmask 255.255.255.0 up
And that leaves me with:
$ lsmod
Module                  Size  Used by
ib_ipoib               72452  0
ib_cm                  37196  1 ib_ipoib
ib_sa                  19812  2 ib_ipoib,ib_cm
mlx4_ib                42720  0
ib_mad                 37524  3 ib_cm,ib_sa,mlx4_ib
ib_core                57884  5 ib_ipoib,ib_cm,ib_sa,mlx4_ib,ib_mad
binfmt_misc             8356  1
ppdev                   6688  0
psmouse                56180  0
serio_raw               5280  0
mlx4_core              84728  1 mlx4_ib
joydev                 10272  0
lp                      8964  0
parport                35340  2 ppdev,lp
iptable_filter          3100  0
ip_tables              11692  1 iptable_filter
x_tables               16544  1 ip_tables
usbhid                 38208  0
e1000e                122124  0
At this point I can ping the Solaris server over the IP link,
although I do need to issue a ping from Solaris before I get a reply.
I'm mentioning that in case it's relevant, but for now I'm
assuming it's just a firewall setting on the server.
But although ping works, I am starting to get some dmesg warnings; I
just don't know if they are relevant:
[ 313.692072] mlx4_ib: Mellanox ConnectX InfiniBand driver v1.0 (April 4, 2008)
[ 313.885220] ADDRCONF(NETDEV_UP): ib0: link is not ready
[ 316.880450] ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
[ 316.880573] ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
[ 316.880789] ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
[ 320.873613] ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
[ 327.147114] ib0: no IPv6 routers present
[ 328.861550] ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
[ 344.834440] ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
[ 360.808312] ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
[ 376.782186] ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
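For anyone reading these messages, it may help to know which IP group the failing join is for. The sketch below assumes the MGID follows the IPv4-over-InfiniBand layout from RFC 4391 (signature 401b, then the P_Key, with the low 28 bits of the group address at the end). The function name is made up for illustration, and the top four address bits are filled in on the assumption that the group is an ordinary IPv4 multicast address in 224.0.0.0/4.

```python
import ipaddress


def ipoib_mgid_to_ipv4(mgid: str) -> str:
    """Recover the IPv4 multicast group from an IPoIB MGID (hypothetical helper)."""
    groups = [int(g, 16) for g in mgid.split(":")]
    if len(groups) != 8:
        raise ValueError("expected 8 colon-separated 16-bit groups")
    if groups[1] != 0x401B:
        raise ValueError("not the IPv4-over-IB signature (expected 401b)")
    # The low 28 bits of the MGID carry the low 28 bits of the group address.
    low28 = ((groups[6] << 16) | groups[7]) & 0x0FFFFFFF
    # Assumption: IPv4 multicast, so the top 4 address bits are 1110.
    return str(ipaddress.IPv4Address(0xE0000000 | low28))


print(ipoib_mgid_to_ipv4("ff12:401b:ffff:0000:0000:0000:0000:00fb"))
```

Under those assumptions the group works out to 224.0.0.251, the address used by mDNS. That only identifies the group being joined; it does not by itself explain why the join is rejected.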
At this point, however, regular nfs mounts work fine over the ipoib link:
$ sudo mount 192.168.100.1:/test/rdma ./nfstest
But again, that adds warnings to dmesg:
[ 826.456902] RPC: Registered udp transport module.
[ 826.456905] RPC: Registered tcp transport module.
[ 841.553135] svc: failed to register lockdv1 RPC service (errno 97).
And the speed is definitely nothing to write home about; copying a
100 MB file takes over 10 seconds:
$ time cp ./100mb ./100mb2
real 0m10.472s
user 0m0.000s
sys 0m1.248s
And again with warnings appearing in dmesg:
[ 872.373364] ib0: post_send failed
[ 872.373407] ib0: post_send failed
[ 872.373448] ib0: post_send failed
I think this is a client issue rather than a problem on the server as
the same test on an OpenSolaris client takes under half a second:
# time cp ./100mb ./100mb2
real 0m0.334s
user 0m0.001s
sys 0m0.176s
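To put the two timings side by side, here is a rough calculation of the effective copy rate. It assumes the test file is exactly 100 MB and uses the "real" times quoted above. Since cp both reads from and writes to the mounted folder, the figures describe the copy as a whole rather than raw link bandwidth.

```python
def throughput_mb_per_s(size_mb: float, seconds: float) -> float:
    """Effective copy rate in MB/s for a file of size_mb copied in seconds."""
    return size_mb / seconds


FILE_MB = 100  # assumption: the "100mb" file is 100 MB

# "real" wall-clock times taken from the two cp runs in this post
timings = [("Ubuntu client", 10.472), ("OpenSolaris client", 0.334)]

for label, secs in timings:
    print(f"{label}: {throughput_mb_per_s(FILE_MB, secs):.1f} MB/s")

# How many times slower the Ubuntu client is on this test
print(f"ratio: {timings[0][1] / timings[1][1]:.1f}x")
```

On these numbers the Ubuntu client manages roughly 9.5 MB/s against roughly 299 MB/s on the OpenSolaris client, a factor of about 31.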
Although the system is definitely not right, my long term aim is to
run nfs-rdma on this system, so my next test was to try that and see
if the speed improved:
$ sudo umount ./nfstest
$ sudo mount -o rdma,port=20049 192.168.101.1:/test/rdma ./nfstest
That takes a long time to connect. It does eventually go through, but
only after the following errors in dmesg:
[ 1140.698659] RPC: Registered rdma transport module.
[ 1155.697672] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0, memreg 5 slots 32 ird 4
[ 1160.688455] rpcrdma: connection to 192.168.101.1:20049 closed (-103)
[ 1160.693818] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0, memreg 5 slots 32 ird 4
[ 1160.695131] svc: failed to register lockdv1 RPC service (errno 97).
[ 1170.676049] rpcrdma: connection to 192.168.101.1:20049 closed (-103)
[ 1170.681458] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0, memreg 5 slots 32 ird 4
[ 1190.647355] rpcrdma: connection to 192.168.101.1:20049 closed (-103)
[ 1190.652778] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0, memreg 5 slots 32 ird 4
[ 1220.602353] rpcrdma: connection to 192.168.101.1:20049 closed (-103)
[ 1220.607809] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0, memreg 5 slots 32 ird 4
[ 1250.557397] rpcrdma: connection to 192.168.101.1:20049 closed (-103)
[ 1250.562817] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0, memreg 5 slots 32 ird 4
[ 1281.522735] rpcrdma: connection to 192.168.101.1:20049 closed (-103)
[ 1281.528442] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0, memreg 5 slots 32 ird 4
[ 1311.477845] rpcrdma: connection to 192.168.101.1:20049 closed (-103)
[ 1311.482983] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0, memreg 5 slots 32 ird 4
[ 1341.432758] rpcrdma: connection to 192.168.101.1:20049 closed (-103)
[ 1341.438212] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0, memreg 5 slots 32 ird 4
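The numeric codes in the dmesg output throughout this post (status -22, errno 97 and -103) look like standard Linux errno values, which the kernel usually reports negated. The small lookup below is a sketch under that assumption, using the generic Linux numbering found on x86. The table and the decode helper are illustrative only and cover just the three codes seen here.

```python
# Assumption: generic Linux errno numbering (as on x86); only the
# codes that appear in this post are listed.
LINUX_ERRNO = {
    22: ("EINVAL", "Invalid argument"),
    97: ("EAFNOSUPPORT", "Address family not supported by protocol"),
    103: ("ECONNABORTED", "Software caused connection abort"),
}


def decode(code: int) -> str:
    """Return the symbolic errno name for a (possibly negated) kernel code."""
    name, _description = LINUX_ERRNO.get(abs(code), ("UNKNOWN", ""))
    return name


for code in (-22, 97, -103):
    name, description = LINUX_ERRNO[abs(code)]
    print(f"{code:>5}: {name} ({description})")
```

Read that way, the multicast joins fail with EINVAL, the lockd registration fails with EAFNOSUPPORT, and each rpcrdma connection is closed with ECONNABORTED. The names say what the kernel reported, not what caused it.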
However, at this point my shell session becomes unresponsive if I
attempt so much as an 'ls'. The system hasn't hung completely, though,
as I can still connect another ssh session and restart with:
$ sudo init 6
Can anybody help? Is there anything obvious I am doing wrong here?
thanks,
Ross