[ewg] [Fwd: nfs-rdma hanging with Ubuntu 9.10]

Ross Smith myxiplx at googlemail.com
Tue Jan 26 08:23:26 PST 2010


Interesting: the single '>' didn't work for me; it removed the tcp and
udp entries, leaving me with just rdma.  It looks like you do need the
'>>' on Ubuntu 9.10.
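
For anyone else who hits this, the full sequence that worked here was:

# echo "rdma 20049" >> /proc/fs/nfsd/portlist
# cat /proc/fs/nfsd/portlist
rdma 20049
tcp 2049
udp 2049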


On Tue, Jan 26, 2010 at 4:20 PM, Tom Tucker <tom at opengridcomputing.com> wrote:
> Ross Smith wrote:
>>
>> No problem, but I think you need an extra > too :)
>>
>> # echo "rdma 20049" >> /proc/fs/nfsd/portlist
>>
>
> Actually, it's not a "real" file. So just the single '>' will work fine.
> There is logic inside the kernel that handles the write and converts the
> 'rdma 20049' to calls inside the kernel that create a listening endpoint
> for the rdma transport.
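>
> A quick sanity check (not required) is that the nfsd pseudo-filesystem
> is what's mounted there, e.g. something like:
>
> # mount | grep nfsd
> nfsd on /proc/fs/nfsd type nfsd (rw)
>
> assuming your distro mounts it, which the nfs-kernel-server scripts
> normally do.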
>>
>> And that was enough to get me going, although there was one more step
>> I'd missed:
>>
>> # mount 192.168.101.5:/home/ross/nfsexport ./nfstest -o proto=rdma,port=20049
>> mount.nfs: Operation not permitted
>>
>> Googling that led me to modify /etc/exports on the server to add the
>> insecure option.  With that added it works fine.
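>>
>> For reference, an exports entry with the insecure option looks
>> something like this (rw and no_subtree_check here are just
>> illustrative; insecure is the part that matters):
>>
>> /home/ross/nfsexport 192.168.101.0/24(rw,insecure,no_subtree_check)
>>
>> and "exportfs -ra" reloads it on the server.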
>>
>>
>
> Awesome.
>
>> Now I just need to get it connecting to Solaris without crashing the
>> server :)
>>
>> Many thanks for all the help.
>>
>> Ross
>>
>> On Tue, Jan 26, 2010 at 3:49 PM, Tom Tucker <tom at opengridcomputing.com>
>> wrote:
>>
>>>
>>> Ross Smith wrote:
>>>
>>>>
>>>> Hmm, the portlist doesn't look good:
>>>>
>>>> $ cat /proc/fs/nfsd/portlist
>>>> tcp 2049
>>>> udp 2049
>>>>
>>>
>>> No, it looks great; that's an easy one! No one is listening on 20049,
>>> so you get 111 (ECONNREFUSED).
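>>>
>>> (If you ever want to double-check an errno number like that, the
>>> mapping lives in the kernel headers, e.g.:
>>>
>>> # grep -w 111 /usr/include/asm-generic/errno.h
>>> #define ECONNREFUSED    111     /* Connection refused */
>>>
>>> assuming the header package is installed.)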
>>>
>>>
>>>>
>>>> But attempting to modify that fails:
>>>>
>>>> # echo 20049 > /proc/fs/nfsd/portlist
>>>> -bash: echo: write error: Bad file descriptor
>>>>
>>>
>>> That's because I gave you the wrong syntax for the write command. It
>>> should be the following:
>>>
>>> # echo "rdma 20049" > /proc/fs/nfsd/portlist
>>>
>>> Sorry about that.
>>>
>>> Tom
>>>
>>>
>>>>
>>>> And I get similar problems attempting to enable the debugging logs:
>>>>
>>>> # echo 32767 > /proc/sys/sunrpc/rpc_debug
>>>> -bash: /proc/sys/sunrpc/rpc_debug: Permission denied
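>>>>
>>>> (One thing I still need to rule out: if the echo is run via sudo, the
>>>> redirection still happens in the unprivileged shell, so it may need
>>>> to be something like:
>>>>
>>>> $ sudo sh -c 'echo 32767 > /proc/sys/sunrpc/rpc_debug'
>>>>
>>>> instead.)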
>>>>
>>>> Up to that point, though, everything looks like it's loading fine:
>>>>
>>>> Ubuntu server:
>>>> ===========
>>>> # modprobe mlx4_ib
>>>> # modprobe ib_ipoib
>>>> # ifconfig ib0 192.168.101.5 netmask 255.255.255.0 up
>>>>
>>>> dmesg results:
>>>> [  456.793661] mlx4_ib: Mellanox ConnectX InfiniBand driver v1.0 (April 4, 2008)
>>>> [  456.987043] ADDRCONF(NETDEV_UP): ib0: link is not ready
>>>> [  459.988683] ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
>>>> [  470.686631] ib0: no IPv6 routers present
>>>>
>>>> # modprobe svcrdma
>>>> # /etc/init.d/nfs-kernel-server restart
>>>>
>>>> dmesg:
>>>> [  524.520198] nfsd: last server has exited, flushing export cache
>>>> [  529.292366] svc: failed to register lockdv1 RPC service (errno 97).
>>>> [  529.293289] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state
>>>> recovery directory
>>>> [  529.293304] NFSD: starting 90-second grace period
>>>>
>>>> Ubuntu client:
>>>> ==========
>>>> # modprobe mlx4_ib
>>>> # modprobe ib_ipoib
>>>> # ifconfig ib0 192.168.101.4 netmask 255.255.255.0 up
>>>>
>>>> dmesg:
>>>> [   97.576507] mlx4_ib: Mellanox ConnectX InfiniBand driver v1.0 (April 4, 2008)
>>>> [   97.769582] ADDRCONF(NETDEV_UP): ib0: link is not ready
>>>> [  100.765318] ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
>>>> [  110.899591] ib0: no IPv6 routers present
>>>>
>>>> # modprobe xprtrdma
>>>>
>>>> dmesg:
>>>> [  169.269689] RPC: Registered udp transport module.
>>>> [  169.269691] RPC: Registered tcp transport module.
>>>> [  169.289755] RPC: Registered rdma transport module.
>>>>
>>>> Ross
>>>>
>>>>
>>>> On Tue, Jan 26, 2010 at 2:32 PM, Tom Tucker <tom at opengridcomputing.com>
>>>> wrote:
>>>>
>>>>
>>>>>
>>>>> Ross Smith wrote:
>>>>>
>>>>>
>>>>>>
>>>>>> A quick addendum to that: I've just had a look at rpcinfo on both the
>>>>>> Ubuntu and Solaris NFS servers; does this indicate that nfs-rdma is
>>>>>> not actually running?
>>>>>>
>>>>>> rpcinfo -p
>>>>>>  program vers proto   port
>>>>>>  100000    2   tcp    111  portmapper
>>>>>>  100000    2   udp    111  portmapper
>>>>>>  100024    1   udp  37031  status
>>>>>>  100024    1   tcp  58463  status
>>>>>>  100021    1   udp  34989  nlockmgr
>>>>>>  100021    3   udp  34989  nlockmgr
>>>>>>  100021    4   udp  34989  nlockmgr
>>>>>>  100021    1   tcp  47979  nlockmgr
>>>>>>  100021    3   tcp  47979  nlockmgr
>>>>>>  100021    4   tcp  47979  nlockmgr
>>>>>>  100003    2   udp   2049  nfs
>>>>>>  100003    3   udp   2049  nfs
>>>>>>  100003    4   udp   2049  nfs
>>>>>>  100003    2   tcp   2049  nfs
>>>>>>  100003    3   tcp   2049  nfs
>>>>>>  100003    4   tcp   2049  nfs
>>>>>>
>>>>>>
>>>>>
>>>>> Hi Ross:
>>>>>
>>>>> No, although that would be very nice, the Linux network maintainer
>>>>> didn't want RDMA transports sharing the network port space,
>>>>> unfortunately.
>>>>>
>>>>> You would need to do this on the server to see if it is listening:
>>>>>
>>>>> # cat /proc/fs/nfsd/portlist
>>>>>
>>>>> You should see something like this:
>>>>>
>>>>> rdma 20049
>>>>> tcp 2049
>>>>> udp 2049
>>>>>
>>>>> The top line indicates that the rdma transport is listening on
>>>>> port 20049.
>>>>>
>>>>> If it's not showing, do this:
>>>>>
>>>>> # echo 20049 > /proc/fs/nfsd/portlist
>>>>>
>>>>> and repeat the 'cat' step above.
>>>>>
>>>>> To give us a little more detail to help debug, do this:
>>>>>
>>>>> # echo 32767 > /proc/sys/sunrpc/rpc_debug
>>>>>
>>>>> on both the client and server, then try the mount again. The dmesg
>>>>> log should have a detailed trace of what is happening.
>>>>>
>>>>> Turn off the debug output as follows:
>>>>>
>>>>> # echo 0 > /proc/sys/sunrpc/rpc_debug
>>>>>
>>>>> Tom
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> On Tue, Jan 26, 2010 at 12:24 PM, Ross Smith <myxiplx at googlemail.com>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Hey everyone,
>>>>>>>
>>>>>>> It's taken me a week, but I've finally gotten the 2.7.00 firmware for
>>>>>>> this system.  I've also taken the step of installing an Ubuntu 9.10
>>>>>>> server for testing in addition to the Solaris server I already have.
>>>>>>>
>>>>>>> So far I'm still having no joy: nfs mounts fine over TCP, but if I
>>>>>>> try to use RDMA it fails.
>>>>>>>
>>>>>>> Machines in use:
>>>>>>> ============
>>>>>>> Solaris Server, build 129 (about 4 weeks old), using built-in
>>>>>>> Infiniband drivers
>>>>>>> Solaris Client, same build
>>>>>>> Ubuntu 9.10 Server, using kernel drivers
>>>>>>> Ubuntu 9.10 Client
>>>>>>> CentOS 5.2 Client, with OFED 1.4.2 and nfs-utils 1.1.6
>>>>>>>
>>>>>>> All five machines are on identical hardware, with Mellanox ConnectX
>>>>>>> infiniband cards running firmware 2.7.00.
>>>>>>>
>>>>>>> They all seem to be running Infiniband fine: ipoib works perfectly,
>>>>>>> and I can connect regular tcp nfs mounts over the infiniband links
>>>>>>> without any issues.
>>>>>>>
>>>>>>> With regular tcp nfs I'm getting consistent speeds of 300MB/s.
>>>>>>>
>>>>>>> However, nfs-rdma just does not want to work, no matter which
>>>>>>> combination of servers and clients I try:
>>>>>>>
>>>>>>> Ubuntu Client -> Solaris
>>>>>>> =================
>>>>>>> Commands used:
>>>>>>> # modprobe xprtrdma
>>>>>>> # mount -o proto=rdma,port=20049 192.168.101.1:/test/rdma ./nfstest
>>>>>>>
>>>>>>> This is the entire dmesg log, from first loading the driver, to
>>>>>>> attempting to connect nfs-rdma:
>>>>>>>
>>>>>>> [   46.834146] mlx4_ib: Mellanox ConnectX InfiniBand driver v1.0 (April 4, 2008)
>>>>>>> [   47.028093] ADDRCONF(NETDEV_UP): ib0: link is not ready
>>>>>>> [   52.018562] ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
>>>>>>> [   52.018698] ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
>>>>>>> [   54.014289] ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
>>>>>>> [   58.006864] ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
>>>>>>> [   62.027202] ib0: no IPv6 routers present
>>>>>>> [   65.120791] RPC: Registered udp transport module.
>>>>>>> [   65.120795] RPC: Registered tcp transport module.
>>>>>>> [   65.129162] RPC: Registered rdma transport module.
>>>>>>> [   65.992081] ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
>>>>>>> [   81.962465] ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
>>>>>>> [   83.593144] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0, memreg 5 slots 32 ird 4
>>>>>>> [  148.476967] rpcrdma: connection to 192.168.101.1:20049 closed (-111)
>>>>>>> [  148.480488] rpcrdma: connection to 192.168.101.1:20049 closed (-111)
>>>>>>> [  148.484421] rpcrdma: connection to 192.168.101.1:20049 closed (-111)
>>>>>>> [  148.488376] rpcrdma: connection to 192.168.101.1:20049 closed (-111)
>>>>>>> [ 4311.663188] svc: failed to register lockdv1 RPC service (errno 97).
>>>>>>>
>>>>>>> At this point, the attempt crashed the Solaris server and hung the
>>>>>>> mount attempt on the Ubuntu client, requiring a ctrl-c on the
>>>>>>> client; the server rebooted itself automatically.
>>>>>>>
>>>>>>> I then tried again, connecting to the Ubuntu nfs server.  This time
>>>>>>> neither machine hung nor crashed, but I had very similar messages in
>>>>>>> the client log:
>>>>>>>
>>>>>>> # mount -o proto=rdma,port=20049 192.168.101.5:/home/ross/nfsexport ./nfstest
>>>>>>>
>>>>>>> [ 4435.102852] rpcrdma: connection to 192.168.101.5:20049 closed (-111)
>>>>>>> [ 4435.107492] rpcrdma: connection to 192.168.101.5:20049 closed (-111)
>>>>>>> [ 4435.111471] rpcrdma: connection to 192.168.101.5:20049 closed (-111)
>>>>>>> [ 4435.115468] rpcrdma: connection to 192.168.101.5:20049 closed (-111)
>>>>>>>
>>>>>>> So it seems that it's not the server: both Solaris and Ubuntu have
>>>>>>> the same problem, although Ubuntu at least does not crash when
>>>>>>> clients attempt to connect.
>>>>>>>
>>>>>>> I also get the same error if I attempt to connect from the CentOS
>>>>>>> 5.2 machine (which is using regular OFED) to the Ubuntu server:
>>>>>>>
>>>>>>> CentOS 5.2 -> Ubuntu
>>>>>>> ================
>>>>>>> This time I'm running mount.rnfs directly as per the instructions in
>>>>>>> the OFED nfs-rdma release notes.
>>>>>>>
>>>>>>> commands used:
>>>>>>> # modprobe xprtrdma
>>>>>>> # mount.rnfs 192.168.101.5:/home/ross/nfsexport ./rdmatest -i -o proto=rdma,port=20049
>>>>>>>
>>>>>>> dmesg results look very similar:
>>>>>>> rpcrdma: connection to 192.168.101.5:20049 closed (-111)
>>>>>>> rpcrdma: connection to 192.168.101.5:20049 closed (-111)
>>>>>>> rpcrdma: connection to 192.168.101.5:20049 closed (-111)
>>>>>>> rpcrdma: connection to 192.168.101.5:20049 closed (-111)
>>>>>>>
>>>>>>> However, attempting this has a bad effect on CentOS: the client
>>>>>>> crashes and I lose my ssh session.
>>>>>>>
>>>>>>> Does anybody have any ideas?
>>>>>>>
>>>>>>> thanks,
>>>>>>>
>>>>>>> Ross
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jan 18, 2010 at 6:31 PM, David Brean <David.Brean at sun.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I agree, update the HCA firmware before proceeding.  [The description
>>>>>>>> in Bugzilla Bug 1711 seems to match the problem that you are
>>>>>>>> observing.]
>>>>>>>>
>>>>>>>> Also, if you want to help diagnose the "ib0: post_send failed", take
>>>>>>>> a look at
>>>>>>>> http://lists.openfabrics.org/pipermail/general/2009-July/061118.html.
>>>>>>>>
>>>>>>>> -David
>>>>>>>>
>>>>>>>> Ross Smith wrote:
>>>>>>>>
>>>>>>>> Hi Tom,
>>>>>>>>
>>>>>>>> No, you're right - I'm just using the support that's built into the
>>>>>>>> kernel, and I agree, diagnostics from Solaris is proving very tricky.
>>>>>>>> I do have a Solaris client connected to this and showing some decent
>>>>>>>> speeds (over 900Mb/s), but I've been thinking that I might need to
>>>>>>>> get a Linux server running for testing before I spend much more time
>>>>>>>> trying to get the two separate systems working.
>>>>>>>>
>>>>>>>> However, I have found over the weekend that I'm running older
>>>>>>>> firmware that needs updating.  I'd missed that in the nfs-rdma
>>>>>>>> readme, so I'm pretty sure that's going to be causing problems.  I'm
>>>>>>>> trying to get that resolved before I do too much other testing.
>>>>>>>>
>>>>>>>> Regular NFS running over the ipoib link seems fine, and I don't get
>>>>>>>> any extra warnings using that.  I can also run a full virtual machine
>>>>>>>> quite happily over NFS, so despite the warnings, the link does appear
>>>>>>>> stable and reliable.
>>>>>>>>
>>>>>>>> Ross
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Jan 18, 2010 at 4:30 PM, Tom Tucker
>>>>>>>> <tom at opengridcomputing.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Ross:
>>>>>>>>
>>>>>>>> I would check that you have IB RDMA actually working. The core
>>>>>>>> transport issues suggest that there may be network problems that will
>>>>>>>> prevent NFSRDMA from working properly.
>>>>>>>>
>>>>>>>> The first question is whether or not you are actually using OFED.
>>>>>>>> You're not -- right? You're just using the support built into the
>>>>>>>> 2.6.31 kernel?
>>>>>>>>
>>>>>>>> Second I don't think the mount is actually completing. I think the
>>>>>>>> command is returning, but the mount never actually finishes. It's
>>>>>>>> sitting there hung trying to perform the first RPC to the server
>>>>>>>> (RPC_NOP) and it's never succeeding. That's why you see all those
>>>>>>>> connect/disconnect messages in your log file. It tries to send, gets
>>>>>>>> an error, disconnects, reconnects, tries to send .... you get the
>>>>>>>> picture.
>>>>>>>>
>>>>>>>> Step 1 I think would be to ensure that you actually have IB up and
>>>>>>>> running. IPoIB between the two seems a little dodgy given the dmesg
>>>>>>>> log. Do you have another Linux box you can use to test out
>>>>>>>> connectivity/configuration with your victim? There are test programs
>>>>>>>> in OFED (rping) that would help you do this, but I don't believe they
>>>>>>>> are available on Solaris.
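>>>>>>>>
>>>>>>>> For reference, a minimal rping check between two Linux boxes looks
>>>>>>>> something like this (substitute whichever IPoIB address applies):
>>>>>>>>
>>>>>>>> server:  # rping -s -a 192.168.101.5 -v -C 10
>>>>>>>> client:  # rping -c -a 192.168.101.5 -v -C 10
>>>>>>>>
>>>>>>>> If that completes cleanly, the RDMA CM path itself is working.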
>>>>>>>>
>>>>>>>> Tom
>>>>>>>>
>>>>>>>> Steve Wise wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> nfsrdma hang on ewg...
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> -------- Original Message --------
>>>>>>>> Subject:     [ewg] nfs-rdma hanging with Ubuntu 9.10
>>>>>>>> Date:     Fri, 15 Jan 2010 13:28:31 +0000
>>>>>>>> From:     Ross Smith <myxiplx at googlemail.com>
>>>>>>>> To:     ewg at openfabrics.org
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi folks, it's me again I'm afraid.
>>>>>>>>
>>>>>>>> Thanks to the help from this list, I have ipoib working; however, I
>>>>>>>> seem to be having a few problems, not least of which is commands
>>>>>>>> hanging if I attempt to use nfs-rdma.
>>>>>>>>
>>>>>>>> Although the rdma mount command completes, the system then becomes
>>>>>>>> unresponsive if I attempt any command such as 'ls', even outside of
>>>>>>>> the mounted folder.  Umount also fails with the error "device is
>>>>>>>> busy".
>>>>>>>>
>>>>>>>> If anybody can spare the time to help it would be very much
>>>>>>>> appreciated.  I do seem to have a lot of warnings in the logs, but
>>>>>>>> although I've tried searching for solutions I haven't found anything
>>>>>>>> yet.
>>>>>>>>
>>>>>>>>
>>>>>>>> System details
>>>>>>>> ============
>>>>>>>> - Ubuntu 9.10 (kernel 2.6.31)
>>>>>>>> - Mellanox ConnectX QDR card
>>>>>>>> - Flextronics DDR switch
>>>>>>>> - OpenSolaris NFS server, running one of the latest builds for
>>>>>>>> troubleshooting
>>>>>>>> - OpenSM running on another Ubuntu 9.10 box with a Mellanox
>>>>>>>> Infinihost III Lx card
>>>>>>>>
>>>>>>>> I am using the kernel drivers only; I have not installed OFED on
>>>>>>>> this machine.
>>>>>>>>
>>>>>>>>
>>>>>>>> Loading driver
>>>>>>>> ============
>>>>>>>> The driver appears to load, and ipoib works, but there are rather a
>>>>>>>> lot of warnings from dmesg.
>>>>>>>>
>>>>>>>> I am loading the driver with:
>>>>>>>> $ sudo modprobe mlx4_ib
>>>>>>>> $ sudo modprobe ib_ipoib
>>>>>>>> $ sudo ifconfig ib0 192.168.101.4 netmask 255.255.255.0 up
>>>>>>>>
>>>>>>>> And that leaves me with:
>>>>>>>> $ lsmod
>>>>>>>> Module                  Size  Used by
>>>>>>>> ib_ipoib               72452  0
>>>>>>>> ib_cm                  37196  1 ib_ipoib
>>>>>>>> ib_sa                  19812  2 ib_ipoib,ib_cm
>>>>>>>> mlx4_ib                42720  0
>>>>>>>> ib_mad                 37524  3 ib_cm,ib_sa,mlx4_ib
>>>>>>>> ib_core                57884  5 ib_ipoib,ib_cm,ib_sa,mlx4_ib,ib_mad
>>>>>>>> binfmt_misc             8356  1
>>>>>>>> ppdev                   6688  0
>>>>>>>> psmouse                56180  0
>>>>>>>> serio_raw               5280  0
>>>>>>>> mlx4_core              84728  1 mlx4_ib
>>>>>>>> joydev                 10272  0
>>>>>>>> lp                      8964  0
>>>>>>>> parport                35340  2 ppdev,lp
>>>>>>>> iptable_filter          3100  0
>>>>>>>> ip_tables              11692  1 iptable_filter
>>>>>>>> x_tables               16544  1 ip_tables
>>>>>>>> usbhid                 38208  0
>>>>>>>> e1000e                122124  0
>>>>>>>>
>>>>>>>>
>>>>>>>> At this point I can ping the Solaris server over the IP link,
>>>>>>>> although I do need to issue a ping from Solaris before I get a
>>>>>>>> reply.  I'm mentioning it in case it's relevant, but at this point
>>>>>>>> I'm assuming that's just a firewall setting on the server.
>>>>>>>>
>>>>>>>> But although ping works, I am starting to get some dmesg warnings;
>>>>>>>> I just don't know if they are relevant:
>>>>>>>> [  313.692072] mlx4_ib: Mellanox ConnectX InfiniBand driver v1.0 (April 4, 2008)
>>>>>>>> [  313.885220] ADDRCONF(NETDEV_UP): ib0: link is not ready
>>>>>>>> [  316.880450] ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
>>>>>>>> [  316.880573] ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
>>>>>>>> [  316.880789] ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
>>>>>>>> [  320.873613] ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
>>>>>>>> [  327.147114] ib0: no IPv6 routers present
>>>>>>>> [  328.861550] ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
>>>>>>>> [  344.834440] ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
>>>>>>>> [  360.808312] ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
>>>>>>>> [  376.782186] ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
>>>>>>>>
>>>>>>>> And at this point, however, regular nfs mounts work fine over the
>>>>>>>> ipoib link:
>>>>>>>> $ sudo mount 192.168.100.1:/test/rdma ./nfstest
>>>>>>>>
>>>>>>>> But again, that adds warnings to dmesg:
>>>>>>>> [  826.456902] RPC: Registered udp transport module.
>>>>>>>> [  826.456905] RPC: Registered tcp transport module.
>>>>>>>> [  841.553135] svc: failed to register lockdv1 RPC service (errno 97).
>>>>>>>>
>>>>>>>> And the speed is definitely nothing to write home about, copying a
>>>>>>>> 100mb file takes over 10 seconds:
>>>>>>>> $ time cp ./100mb ./100mb2
>>>>>>>>
>>>>>>>> real    0m10.472s
>>>>>>>> user    0m0.000s
>>>>>>>> sys    0m1.248s
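>>>>>>>>
>>>>>>>> (That's roughly 100MB in 10.5s, i.e. under 10MB/s.)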
>>>>>>>>
>>>>>>>> And again with warnings appearing in dmesg:
>>>>>>>> [  872.373364] ib0: post_send failed
>>>>>>>> [  872.373407] ib0: post_send failed
>>>>>>>> [  872.373448] ib0: post_send failed
>>>>>>>>
>>>>>>>> I think this is a client issue rather than a problem on the server,
>>>>>>>> as the same test on an OpenSolaris client takes under half a second:
>>>>>>>> # time cp ./100mb ./100mb2
>>>>>>>>
>>>>>>>> real    0m0.334s
>>>>>>>> user    0m0.001s
>>>>>>>> sys     0m0.176s
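>>>>>>>>
>>>>>>>> (Which works out to roughly 300MB/s apparent throughput, assuming
>>>>>>>> the file really is 100MB.)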
>>>>>>>>
>>>>>>>> Although the system is definitely not right, my long term aim is to
>>>>>>>> run nfs-rdma on this system, so my next test was to try that and see
>>>>>>>> if the speed improved:
>>>>>>>>
>>>>>>>> $ sudo umount ./nfstest
>>>>>>>> $ sudo mount -o rdma,port=20049 192.168.101.1:/test/rdma ./nfstest
>>>>>>>>
>>>>>>>> That takes a long time to connect.  It does eventually go through,
>>>>>>>> but only after the following errors in dmesg:
>>>>>>>>
>>>>>>>> [ 1140.698659] RPC: Registered rdma transport module.
>>>>>>>> [ 1155.697672] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0, memreg 5 slots 32 ird 4
>>>>>>>> [ 1160.688455] rpcrdma: connection to 192.168.101.1:20049 closed (-103)
>>>>>>>> [ 1160.693818] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0, memreg 5 slots 32 ird 4
>>>>>>>> [ 1160.695131] svc: failed to register lockdv1 RPC service (errno 97).
>>>>>>>> [ 1170.676049] rpcrdma: connection to 192.168.101.1:20049 closed (-103)
>>>>>>>> [ 1170.681458] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0, memreg 5 slots 32 ird 4
>>>>>>>> [ 1190.647355] rpcrdma: connection to 192.168.101.1:20049 closed (-103)
>>>>>>>> [ 1190.652778] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0, memreg 5 slots 32 ird 4
>>>>>>>> [ 1220.602353] rpcrdma: connection to 192.168.101.1:20049 closed (-103)
>>>>>>>> [ 1220.607809] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0, memreg 5 slots 32 ird 4
>>>>>>>> [ 1250.557397] rpcrdma: connection to 192.168.101.1:20049 closed (-103)
>>>>>>>> [ 1250.562817] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0, memreg 5 slots 32 ird 4
>>>>>>>> [ 1281.522735] rpcrdma: connection to 192.168.101.1:20049 closed (-103)
>>>>>>>> [ 1281.528442] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0, memreg 5 slots 32 ird 4
>>>>>>>> [ 1311.477845] rpcrdma: connection to 192.168.101.1:20049 closed (-103)
>>>>>>>> [ 1311.482983] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0, memreg 5 slots 32 ird 4
>>>>>>>> [ 1341.432758] rpcrdma: connection to 192.168.101.1:20049 closed (-103)
>>>>>>>> [ 1341.438212] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0, memreg 5 slots 32 ird 4
>>>>>>>>
>>>>>>>> However, at this point my shell session becomes unresponsive if I
>>>>>>>> attempt so much as an 'ls'.  The system hasn't hung completely,
>>>>>>>> though, as I can still connect another ssh session and restart with
>>>>>>>> $ sudo init 6
>>>>>>>>
>>>>>>>> Can anybody help?  Is there anything obvious I am doing wrong here?
>>>>>>>>
>>>>>>>> thanks,
>>>>>>>>
>>>>>>>> Ross
>>>>>>>> _______________________________________________
>>>>>>>> ewg mailing list
>>>>>>>> ewg at lists.openfabrics.org
>>>>>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
>>>>>>>>
>>>>>
>>>>>
>>>
>>>
>
>


