<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
<title>Re: [ewg] nfs-rdma hanging with Ubuntu 9.10</title>
</head>
<body bgcolor="#ffffff" text="#000000">
Hello,<br>
<br>
I agree, update the HCA firmware before proceeding. [The description
in Bugzilla Bug <a
href="https://bugs.openfabrics.org/show_bug.cgi?id=1711">1711</a>
seems to match the problem that you are observing.]<br>
<br>
Also, if you want to help diagnose the "ib0: post_send failed", take a
look at
<a class="moz-txt-link-freetext" href="http://lists.openfabrics.org/pipermail/general/2009-July/061118.html">http://lists.openfabrics.org/pipermail/general/2009-July/061118.html</a>.<br>
<br>
-David<br>
<br>
Ross Smith wrote:
<blockquote
cite="mid:7b160d241001180902w2724a017s95674dacbb321747@mail.gmail.com"
type="cite">
<pre wrap="">Hi Tom,

No, you're right - I'm just using the support that's built into the
kernel, and I agree, diagnostics from Solaris are proving very tricky.
I do have a Solaris client connected to this and showing some decent
speeds (over 900Mb/s), but I've been thinking that I might need to get
a Linux server running for testing before I spend much more time
trying to get the two separate systems working.

However, I found over the weekend that I'm running older firmware and
need to update it. I'd missed that in the nfs-rdma readme, so I'm
pretty sure that's what is causing problems. I'm trying to get that
resolved before I do too much other testing.

Regular NFS running over the IPoIB link seems fine, and I don't get
any extra warnings using it. I can also run a full virtual machine
quite happily over NFS, so despite the warnings, the link does appear
stable and reliable.

Ross

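For reference, the firmware currently flashed on the HCA can be read
with the libibverbs utilities (a sketch; the device name mlx4_0 is
taken from the logs later in this thread, and the exact output lines
may differ by driver release):

```shell
# Query the ConnectX HCA and show its identity and firmware version.
# Assumes the ibverbs userspace tools are installed; device name may vary.
ibv_devinfo -d mlx4_0 | grep -E 'hca_id|fw_ver'
```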
On Mon, Jan 18, 2010 at 4:30 PM, Tom Tucker <a class="moz-txt-link-rfc2396E" href="mailto:tom@opengridcomputing.com"><tom@opengridcomputing.com></a> wrote:
</pre>
<blockquote type="cite">
<pre wrap="">Hi Ross:

I would check that you have IB RDMA actually working. The core transport
issues suggest that there may be network problems that will prevent NFSRDMA
from working properly.

The first question is whether or not you are actually using OFED. You're not
-- right? You're just using the support built into the 2.6.31 kernel?

Second, I don't think the mount is actually completing. I think the command
is returning, but the mount never actually finishes. It's sitting there hung
trying to perform the first RPC to the server (RPC_NOP) and never
succeeding. That's why you see all those connect/disconnect messages in your
log file. It tries to send, gets an error, disconnects, reconnects, tries to
send .... you get the picture.

Step 1, I think, would be to ensure that you actually have IB up and running.
IPoIB between the two seems a little dodgy given the dmesg log. Do you have
another Linux box you can use to test out connectivity/configuration with
your victim? There are test programs in OFED (rping) that would help you do
this, but I don't believe they are available on Solaris.

Tom

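Tom's rping suggestion can be run roughly like this once a second
Linux box is available (a sketch; rping ships with the librdmacm
utilities, and the address below is the IPoIB one used elsewhere in
this thread):

```shell
# On the machine under test, bind an rping server to its IPoIB address:
rping -s -a 192.168.101.4 -v

# From the other Linux box, run a short ping-pong test against it:
rping -c -a 192.168.101.4 -v -C 10
```

If both directions complete without errors, the RDMA CM path itself is
working and the problem is more likely above the transport.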
Steve Wise wrote:
</pre>
<blockquote type="cite">
<pre wrap="">nfsrdma hang on ewg...
-------- Original Message --------
Subject: [ewg] nfs-rdma hanging with Ubuntu 9.10
Date: Fri, 15 Jan 2010 13:28:31 +0000
From: Ross Smith <a class="moz-txt-link-rfc2396E" href="mailto:myxiplx@googlemail.com"><myxiplx@googlemail.com></a>
To: <a class="moz-txt-link-abbreviated" href="mailto:ewg@openfabrics.org">ewg@openfabrics.org</a>
Hi folks, it's me again, I'm afraid.

Thanks to the help from this list, I have IPoIB working; however, I
seem to be having a few problems, not least of which is commands
hanging if I attempt to use nfs-rdma.

Although the rdma mount command completes, the system then becomes
unresponsive if I attempt any command such as 'ls', even outside of
the mounted folder. Umount also fails with the error "device is
busy".

If anybody can spare the time to help, it would be very much
appreciated. I do seem to have a lot of warnings in the logs, but
although I've tried searching for solutions, I haven't found anything
yet.

System details
============
- Ubuntu 9.10
(kernel 2.6.31)
- Mellanox ConnectX QDR card
- Flextronics DDR switch
- OpenSolaris NFS server, running one of the latest builds for
troubleshooting
- OpenSM running on another Ubuntu 9.10 box with a Mellanox
Infinihost III Lx card
I am using the kernel drivers only; I have not installed OFED on this
machine.
Loading driver
============
The driver appears to load, and ipoib works, but there are rather a
lot of warnings from dmesg.
I am loading the driver with:
$ sudo modprobe mlx4_ib
$ sudo modprobe ib_ipoib
$ sudo ifconfig ib0 192.168.101.4 netmask 255.255.255.0 up
And that leaves me with:
$ lsmod
Module                  Size  Used by
ib_ipoib               72452  0
ib_cm                  37196  1 ib_ipoib
ib_sa                  19812  2 ib_ipoib,ib_cm
mlx4_ib                42720  0
ib_mad                 37524  3 ib_cm,ib_sa,mlx4_ib
ib_core                57884  5 ib_ipoib,ib_cm,ib_sa,mlx4_ib,ib_mad
binfmt_misc             8356  1
ppdev                   6688  0
psmouse                56180  0
serio_raw               5280  0
mlx4_core              84728  1 mlx4_ib
joydev                 10272  0
lp                      8964  0
parport                35340  2 ppdev,lp
iptable_filter          3100  0
ip_tables              11692  1 iptable_filter
x_tables               16544  1 ip_tables
usbhid                 38208  0
e1000e                122124  0
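At a point like this it is also worth confirming that the IB port
itself reached the ACTIVE state after the ifconfig above (a sketch;
the device name mlx4_0 and port number 1 are assumptions based on the
logs in this thread):

```shell
# The kernel exposes per-port link state under sysfs; "4: ACTIVE"
# means the link is up and the subnet manager has programmed the port.
cat /sys/class/infiniband/mlx4_0/ports/1/state
```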
At this point I can ping the Solaris server over the IP link,
although I do need to issue a ping from Solaris first before I get a
reply. I'm mentioning it in case it's relevant, but for now I'm
assuming that's just a firewall setting on the server.

But although ping works, I am starting to get some dmesg warnings; I
just don't know if they are relevant:
[ 313.692072] mlx4_ib: Mellanox ConnectX InfiniBand driver v1.0 (April 4,
2008)
[ 313.885220] ADDRCONF(NETDEV_UP): ib0: link is not ready
[ 316.880450] ib0: multicast join failed for
ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
[ 316.880573] ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
[ 316.880789] ib0: multicast join failed for
ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
[ 320.873613] ib0: multicast join failed for
ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
[ 327.147114] ib0: no IPv6 routers present
[ 328.861550] ib0: multicast join failed for
ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
[ 344.834440] ib0: multicast join failed for
ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
[ 360.808312] ib0: multicast join failed for
ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
[ 376.782186] ib0: multicast join failed for
ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
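As an aside, the GID in these messages can be decoded back to an IPv4
multicast group using the IPoIB mapping from RFC 4391: the low 28 bits
of the MGID carry the low bits of the 224.0.0.0/4 group address. A
small sketch:

```shell
# Decode an IPoIB IPv4 multicast GID back to its IP multicast group.
mgid="ff12:401b:ffff:0000:0000:0000:0000:00fb"  # GID from the log above
g7=$(echo "$mgid" | cut -d: -f7)
g8=$(echo "$mgid" | cut -d: -f8)
low28=$(( ( (0x$g7 << 16) | 0x$g8 ) & 0x0FFFFFFF ))
addr=$(( 0xE0000000 | low28 ))                  # 224.0.0.0/4 base
group=$(printf '%d.%d.%d.%d' $(( (addr >> 24) & 255 )) \
    $(( (addr >> 16) & 255 )) $(( (addr >> 8) & 255 )) $(( addr & 255 )))
echo "$group"
```

Here that works out to 224.0.0.251, the mDNS group, so the failed
joins are most likely avahi/mDNS traffic rather than anything
NFS-related.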
At this point, however, regular NFS mounts work fine over the IPoIB
link:

$ sudo mount 192.168.100.1:/test/rdma ./nfstest

But again, that adds warnings to dmesg:
[ 826.456902] RPC: Registered udp transport module.
[ 826.456905] RPC: Registered tcp transport module.
[ 841.553135] svc: failed to register lockdv1 RPC service (errno 97).
And the speed is definitely nothing to write home about; copying a
100mb file takes over 10 seconds:
$ time cp ./100mb ./100mb2
real 0m10.472s
user 0m0.000s
sys 0m1.248s
And again with warnings appearing in dmesg:
[ 872.373364] ib0: post_send failed
[ 872.373407] ib0: post_send failed
[ 872.373448] ib0: post_send failed
I think this is a client issue rather than a problem on the server as
the same test on an OpenSolaris client takes under half a second:
# time cp ./100mb ./100mb2
real 0m0.334s
user 0m0.001s
sys 0m0.176s
Although the system is definitely not right, my long-term aim is to
run nfs-rdma on this system, so my next test was to try that and see
if the speed improved:
$ sudo umount ./nfstest
$ sudo mount -o rdma,port=20049 192.168.101.1:/test/rdma ./nfstest
That takes a long time to connect. It does eventually go through, but
only after the following errors in dmesg:
[ 1140.698659] RPC: Registered rdma transport module.
[ 1155.697672] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0,
memreg 5 slots 32 ird 4
[ 1160.688455] rpcrdma: connection to 192.168.101.1:20049 closed (-103)
[ 1160.693818] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0,
memreg 5 slots 32 ird 4
[ 1160.695131] svc: failed to register lockdv1 RPC service (errno 97).
[ 1170.676049] rpcrdma: connection to 192.168.101.1:20049 closed (-103)
[ 1170.681458] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0,
memreg 5 slots 32 ird 4
[ 1190.647355] rpcrdma: connection to 192.168.101.1:20049 closed (-103)
[ 1190.652778] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0,
memreg 5 slots 32 ird 4
[ 1220.602353] rpcrdma: connection to 192.168.101.1:20049 closed (-103)
[ 1220.607809] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0,
memreg 5 slots 32 ird 4
[ 1250.557397] rpcrdma: connection to 192.168.101.1:20049 closed (-103)
[ 1250.562817] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0,
memreg 5 slots 32 ird 4
[ 1281.522735] rpcrdma: connection to 192.168.101.1:20049 closed (-103)
[ 1281.528442] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0,
memreg 5 slots 32 ird 4
[ 1311.477845] rpcrdma: connection to 192.168.101.1:20049 closed (-103)
[ 1311.482983] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0,
memreg 5 slots 32 ird 4
[ 1341.432758] rpcrdma: connection to 192.168.101.1:20049 closed (-103)
[ 1341.438212] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0,
memreg 5 slots 32 ird 4
However, at this point my shell session becomes unresponsive if I
attempt so much as an 'ls'. The system hasn't hung completely,
though, as I can still connect another ssh session and restart with:

$ sudo init 6
Can anybody help? Is there anything obvious I am doing wrong here?
thanks,
Ross
_______________________________________________
ewg mailing list
<a class="moz-txt-link-abbreviated" href="mailto:ewg@lists.openfabrics.org">ewg@lists.openfabrics.org</a>
<a class="moz-txt-link-freetext" href="http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg">http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg</a>
</pre>
</blockquote>
<pre wrap="">
</pre>
</blockquote>
<pre wrap=""></pre>
</blockquote>
</body>
</html>