[ewg] [Fwd: nfs-rdma hanging with Ubuntu 9.10]
Tom Tucker
tom at opengridcomputing.com
Tue Jan 26 08:26:14 PST 2010
Ross Smith wrote:
> Interesting, the single '>' didn't work for me, it removed the tcp and
> udp entries, leaving me with just rdma. It looks like you do need the
> '>>' on Ubuntu 9.10.
>
Huh. Can you do a 'uname -a' for me? Someone has changed that. Thank you
for the heads up.
Tom
>
>
> On Tue, Jan 26, 2010 at 4:20 PM, Tom Tucker <tom at opengridcomputing.com> wrote:
>
>> Ross Smith wrote:
>>
>>> No problem, but I think you need an extra > too :)
>>>
>>> # echo "rdma 20049" >> /proc/fs/nfsd/portlist
>>>
>>>
>> Actually, it's not a "real" file, so just the single '>' will work fine.
>> There is logic inside the kernel that handles the write and converts the
>> 'rdma 20049' into calls that create a listening endpoint for the rdma
>> transport.
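>>
>> Putting it together, the whole server-side sequence is roughly the
>> following (a sketch, assuming nfsd is already running and 20049 is the
>> port you want):
>>
>> # modprobe svcrdma
>> # echo "rdma 20049" > /proc/fs/nfsd/portlist
>> # cat /proc/fs/nfsd/portlist
>> rdma 20049
>> tcp 2049
>> udp 2049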
>>
>>> And that was enough to get me going, although there was one more step
>>> I'd missed:
>>>
>>> # mount 192.168.101.5:/home/ross/nfsexport ./nfstest -o
>>> proto=rdma,port=20049
>>> mount.nfs: Operation not permitted
>>>
>>> Googling that led me to modify /etc/exports on the server to add the
>>> insecure option. With that added it works fine.
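>>>
>>> For reference, the export line now looks something like this (the path
>>> and subnet are just my test setup):
>>>
>>> /home/ross/nfsexport 192.168.101.0/24(rw,insecure,no_subtree_check)
>>>
>>> The insecure option is needed because the client's connection comes from
>>> a non-privileged (>1023) source port.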
>>>
>>>
>>>
>> Awesome.
>>
>>
>>> Now I just need to get it connecting to Solaris without crashing the
>>> server :)
>>>
>>> Many thanks for all the help.
>>>
>>> Ross
>>>
>>>
>>>
>>> On Tue, Jan 26, 2010 at 3:49 PM, Tom Tucker <tom at opengridcomputing.com>
>>> wrote:
>>>
>>>
>>>> Ross Smith wrote:
>>>>
>>>>
>>>>> Hmm, the portlist doesn't look good:
>>>>>
>>>>> $ cat /proc/fs/nfsd/portlist
>>>>> tcp 2049
>>>>> udp 2049
>>>>>
>>>>>
>>>>>
>>>>>
>>>> No, it looks great - that's an easy one! No one is listening on 20049,
>>>> so you get 111 (ECONNREFUSED).
>>>>
>>>>
>>>>
>>>>> But attempting to modify that fails:
>>>>>
>>>>> # echo 20049 > /proc/fs/nfsd/portlist
>>>>> -bash: echo: write error: Bad file descriptor
>>>>>
>>>>>
>>>>>
>>>>>
>>>> That's because I gave you the wrong syntax for the write command. It
>>>> should
>>>> be the following:
>>>>
>>>> # echo "rdma 20049" > /proc/fs/nfsd/portlist
>>>>
>>>> Sorry about that.
>>>>
>>>> Tom
>>>>
>>>>
>>>>
>>>>> And I get similar problems attempting to enable the debugging logs:
>>>>>
>>>>> # echo 32767 > /proc/sys/sunrpc/rpc_debug
>>>>> -bash: /proc/sys/sunrpc/rpc_debug: Permission denied
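>>>>>
>>>>> (If the direct write keeps being refused, the same knob should also be
>>>>> reachable via sysctl, e.g. "sysctl -w sunrpc.rpc_debug=32767" - that is
>>>>> just the sysctl name for the same /proc entry, untested here.)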
>>>>>
>>>>> Up to that point, though, everything looks like it's loading fine:
>>>>>
>>>>> Ubuntu server:
>>>>> ===========
>>>>> # modprobe mlx4_ib
>>>>> # modprobe ib_ipoib
>>>>> # ifconfig ib0 192.168.101.5 netmask 255.255.255.0 up
>>>>>
>>>>> dmesg results:
>>>>> [ 456.793661] mlx4_ib: Mellanox ConnectX InfiniBand driver v1.0 (April
>>>>> 4,
>>>>> 2008)
>>>>> [ 456.987043] ADDRCONF(NETDEV_UP): ib0: link is not ready
>>>>> [ 459.988683] ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
>>>>> [ 470.686631] ib0: no IPv6 routers present
>>>>>
>>>>> # modprobe svcrdma
>>>>> # /etc/init.d/nfs-kernel-server restart
>>>>>
>>>>> dmesg:
>>>>> [ 524.520198] nfsd: last server has exited, flushing export cache
>>>>> [ 529.292366] svc: failed to register lockdv1 RPC service (errno 97).
>>>>> [ 529.293289] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state
>>>>> recovery directory
>>>>> [ 529.293304] NFSD: starting 90-second grace period
>>>>>
>>>>> Ubuntu client:
>>>>> ==========
>>>>> # modprobe mlx4_ib
>>>>> # modprobe ib_ipoib
>>>>> # ifconfig ib0 192.168.101.4 netmask 255.255.255.0 up
>>>>>
>>>>> dmesg:
>>>>> [ 97.576507] mlx4_ib: Mellanox ConnectX InfiniBand driver v1.0 (April
>>>>> 4,
>>>>> 2008)
>>>>> [ 97.769582] ADDRCONF(NETDEV_UP): ib0: link is not ready
>>>>> [ 100.765318] ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
>>>>> [ 110.899591] ib0: no IPv6 routers present
>>>>>
>>>>> # modprobe xprtrdma
>>>>>
>>>>> dmesg:
>>>>> [ 169.269689] RPC: Registered udp transport module.
>>>>> [ 169.269691] RPC: Registered tcp transport module.
>>>>> [ 169.289755] RPC: Registered rdma transport module.
>>>>>
>>>>> Ross
>>>>>
>>>>>
>>>>> On Tue, Jan 26, 2010 at 2:32 PM, Tom Tucker <tom at opengridcomputing.com>
>>>>> wrote:
>>>>>
>>>>>
>>>>>
>>>>>> Ross Smith wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> A quick addendum to that: I've just had a look at rpcinfo on both the
>>>>>>> Ubuntu and Solaris NFS servers. Does this indicate that nfs-rdma is
>>>>>>> not actually running?
>>>>>>>
>>>>>>> rpcinfo -p
>>>>>>> program vers proto port
>>>>>>> 100000 2 tcp 111 portmapper
>>>>>>> 100000 2 udp 111 portmapper
>>>>>>> 100024 1 udp 37031 status
>>>>>>> 100024 1 tcp 58463 status
>>>>>>> 100021 1 udp 34989 nlockmgr
>>>>>>> 100021 3 udp 34989 nlockmgr
>>>>>>> 100021 4 udp 34989 nlockmgr
>>>>>>> 100021 1 tcp 47979 nlockmgr
>>>>>>> 100021 3 tcp 47979 nlockmgr
>>>>>>> 100021 4 tcp 47979 nlockmgr
>>>>>>> 100003 2 udp 2049 nfs
>>>>>>> 100003 3 udp 2049 nfs
>>>>>>> 100003 4 udp 2049 nfs
>>>>>>> 100003 2 tcp 2049 nfs
>>>>>>> 100003 3 tcp 2049 nfs
>>>>>>> 100003 4 tcp 2049 nfs
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> Hi Ross:
>>>>>>
>>>>>> No. Although that would be very nice, the Linux network maintainer
>>>>>> unfortunately didn't want RDMA transports sharing the network port
>>>>>> space.
>>>>>>
>>>>>> You would need to do this on the server to see if it is listening:
>>>>>>
>>>>>> # cat /proc/fs/nfsd/portlist
>>>>>>
>>>>>> You should see something like this:
>>>>>>
>>>>>> rdma 20049
>>>>>> tcp 2049
>>>>>> udp 2049
>>>>>>
>>>>>> The top line indicates that the rdma transport is listening on port
>>>>>> 20049.
>>>>>>
>>>>>> If it's not showing, do this:
>>>>>>
>>>>>> # echo 20049 > /proc/fs/nfsd/portlist
>>>>>>
>>>>>> and repeat the 'cat' step above.
>>>>>>
>>>>>> To give us a little more detail to help debug, do this:
>>>>>>
>>>>>> # echo 32767 > /proc/sys/sunrpc/rpc_debug
>>>>>>
>>>>>> on both the client and server, then try the mount again. The dmesg log
>>>>>> should have a detailed trace of what is happening.
>>>>>>
>>>>>> Turn off the debug output as follows:
>>>>>>
>>>>>> # echo 0 > /proc/sys/sunrpc/rpc_debug
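>>>>>>
>>>>>> In other words, roughly: enable, reproduce, collect, disable - with the
>>>>>> server address and export path filled in for your setup:
>>>>>>
>>>>>> # echo 32767 > /proc/sys/sunrpc/rpc_debug
>>>>>> # mount -o proto=rdma,port=20049 <server>:/<export> /mnt/nfstest
>>>>>> # dmesg | tail -n 100
>>>>>> # echo 0 > /proc/sys/sunrpc/rpc_debug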
>>>>>>
>>>>>> Tom
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On Tue, Jan 26, 2010 at 12:24 PM, Ross Smith <myxiplx at googlemail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Hey everyone,
>>>>>>>>
>>>>>>>> It's taken me a week, but I've finally gotten the 2.7.00 firmware for
>>>>>>>> this system. I've also taken the step of installing a Ubuntu 9.10
>>>>>>>> server for testing in addition to the Solaris server I already have.
>>>>>>>>
>>>>>>>> So far I'm still having no joy: nfs mounts fine over TCP, but if I
>>>>>>>> try to use RDMA it fails.
>>>>>>>>
>>>>>>>> Machines in use:
>>>>>>>> ============
>>>>>>>> Solaris Server, build 129 (about 4 weeks old), using built in
>>>>>>>> Infiniband
>>>>>>>> drivers
>>>>>>>> Solaris Client, same build
>>>>>>>> Ubuntu 9.10 Server, using kernel drivers
>>>>>>>> Ubuntu 9.10 Client
>>>>>>>> CentOS 5.2 Client, with OFED 1.4.2 and nfs-utils 1.1.6
>>>>>>>>
>>>>>>>> All five machines are on identical hardware, with Mellanox ConnectX
>>>>>>>> infiniband cards running firmware 2.7.00.
>>>>>>>>
>>>>>>>> They all seem to be running Infiniband fine; ipoib works perfectly
>>>>>>>> and I can connect regular tcp nfs mounts over the infiniband links
>>>>>>>> without any issues.
>>>>>>>>
>>>>>>>> With regular tcp nfs I'm getting consistent speeds of 300MB/s.
>>>>>>>>
>>>>>>>> However, nfs-rdma just does not want to work, no matter which
>>>>>>>> combination of servers and clients I try:
>>>>>>>>
>>>>>>>> Ubuntu Client -> Solaris
>>>>>>>> =================
>>>>>>>> Commands used:
>>>>>>>> # modprobe xprtrdma
>>>>>>>> # mount -o proto=rdma,port=20049 192.168.101.1:/test/rdma ./nfstest
>>>>>>>>
>>>>>>>> This is the entire dmesg log, from first loading the driver, to
>>>>>>>> attempting to connect nfs-rdma:
>>>>>>>>
>>>>>>>> [ 46.834146] mlx4_ib: Mellanox ConnectX InfiniBand driver v1.0
>>>>>>>> (April
>>>>>>>> 4, 2008)
>>>>>>>> [ 47.028093] ADDRCONF(NETDEV_UP): ib0: link is not ready
>>>>>>>> [ 52.018562] ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
>>>>>>>> [ 52.018698] ib0: multicast join failed for
>>>>>>>> ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
>>>>>>>> [ 54.014289] ib0: multicast join failed for
>>>>>>>> ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
>>>>>>>> [ 58.006864] ib0: multicast join failed for
>>>>>>>> ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
>>>>>>>> [ 62.027202] ib0: no IPv6 routers present
>>>>>>>> [ 65.120791] RPC: Registered udp transport module.
>>>>>>>> [ 65.120795] RPC: Registered tcp transport module.
>>>>>>>> [ 65.129162] RPC: Registered rdma transport module.
>>>>>>>> [ 65.992081] ib0: multicast join failed for
>>>>>>>> ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
>>>>>>>> [ 81.962465] ib0: multicast join failed for
>>>>>>>> ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
>>>>>>>> [ 83.593144] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0,
>>>>>>>> memreg 5 slots 32 ird 4
>>>>>>>> [ 148.476967] rpcrdma: connection to 192.168.101.1:20049 closed
>>>>>>>> (-111)
>>>>>>>> [ 148.480488] rpcrdma: connection to 192.168.101.1:20049 closed
>>>>>>>> (-111)
>>>>>>>> [ 148.484421] rpcrdma: connection to 192.168.101.1:20049 closed
>>>>>>>> (-111)
>>>>>>>> [ 148.488376] rpcrdma: connection to 192.168.101.1:20049 closed
>>>>>>>> (-111)
>>>>>>>> [ 4311.663188] svc: failed to register lockdv1 RPC service (errno
>>>>>>>> 97).
>>>>>>>>
>>>>>>>> At this point, the attempt crashed the Solaris server and hung the
>>>>>>>> mount attempt on the Ubuntu client, requiring ctrl-c on the client;
>>>>>>>> the server then rebooted automatically.
>>>>>>>>
>>>>>>>> I then tried again, connecting to the Ubuntu nfs server. This time
>>>>>>>> neither device hung nor crashed, but I had very similar messages in
>>>>>>>> the client log:
>>>>>>>>
>>>>>>>> # mount -o proto=rdma,port=20049 192.168.101.5:/home/ross/nfsexport
>>>>>>>> ./nfstest
>>>>>>>>
>>>>>>>> [ 4435.102852] rpcrdma: connection to 192.168.101.5:20049 closed
>>>>>>>> (-111)
>>>>>>>> [ 4435.107492] rpcrdma: connection to 192.168.101.5:20049 closed
>>>>>>>> (-111)
>>>>>>>> [ 4435.111471] rpcrdma: connection to 192.168.101.5:20049 closed
>>>>>>>> (-111)
>>>>>>>> [ 4435.115468] rpcrdma: connection to 192.168.101.5:20049 closed
>>>>>>>> (-111)
>>>>>>>>
>>>>>>>> So it seems that it's not the server: both Solaris and Ubuntu have
>>>>>>>> the same problem, although Ubuntu at least does not crash when
>>>>>>>> clients
>>>>>>>> attempt to connect.
>>>>>>>>
>>>>>>>> I also get the same error if I attempt to connect from the CentOS 5.2
>>>>>>>> machine (which is using regular OFED) to the Ubuntu server:
>>>>>>>>
>>>>>>>> CentOS 5.2 -> Ubuntu
>>>>>>>> ================
>>>>>>>> This time I'm running mount.rnfs directly as per the instructions in
>>>>>>>> the OFED nfs-rdma release notes.
>>>>>>>>
>>>>>>>> commands used:
>>>>>>>> # modprobe xprtrdma
>>>>>>>> # mount.rnfs 192.168.101.5:/home/ross/nfsexport ./rdmatest -i -o
>>>>>>>> proto=rdma,port=20049
>>>>>>>>
>>>>>>>> dmesg results look very similar:
>>>>>>>> rpcrdma: connection to 192.168.101.5:20049 closed (-111)
>>>>>>>> rpcrdma: connection to 192.168.101.5:20049 closed (-111)
>>>>>>>> rpcrdma: connection to 192.168.101.5:20049 closed (-111)
>>>>>>>> rpcrdma: connection to 192.168.101.5:20049 closed (-111)
>>>>>>>>
>>>>>>>> However, attempting this has a bad effect on CentOS - the client
>>>>>>>> crashes and I lose my ssh session.
>>>>>>>>
>>>>>>>> Does anybody have any ideas?
>>>>>>>>
>>>>>>>> thanks,
>>>>>>>>
>>>>>>>> Ross
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Jan 18, 2010 at 6:31 PM, David Brean <David.Brean at sun.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I agree, update the HCA firmware before proceeding. [The
>>>>>>>>> description
>>>>>>>>> in
>>>>>>>>> Bugzilla Bug 1711 seems to match the problem that you are
>>>>>>>>> observing.]
>>>>>>>>>
>>>>>>>>> Also, if you want to help diagnose the "ib0: post_send failed", take
>>>>>>>>> a
>>>>>>>>> look
>>>>>>>>> at
>>>>>>>>>
>>>>>>>>> http://lists.openfabrics.org/pipermail/general/2009-July/061118.html.
>>>>>>>>>
>>>>>>>>> -David
>>>>>>>>>
>>>>>>>>> Ross Smith wrote:
>>>>>>>>>
>>>>>>>>> Hi Tom,
>>>>>>>>>
>>>>>>>>> No, you're right - I'm just using the support that's built into the
>>>>>>>>> kernel, and I agree, diagnostics from Solaris are proving very
>>>>>>>>> tricky.
>>>>>>>>> I do have a Solaris client connected to this and showing some decent
>>>>>>>>> speeds (over 900Mb/s), but I've been thinking that I might need to
>>>>>>>>> get
>>>>>>>>> a Linux server running for testing before I spend much more time
>>>>>>>>> trying to get the two separate systems working.
>>>>>>>>>
>>>>>>>>> However, I found over the weekend that I'm running older firmware
>>>>>>>>> which needs updating. I'd missed that in the nfs-rdma readme, so I'm
>>>>>>>>> pretty sure that's going to be causing problems. I'm trying to get
>>>>>>>>> that resolved before I do too much other testing.
>>>>>>>>>
>>>>>>>>> Regular NFS running over the ipoib link seems fine, and I don't get
>>>>>>>>> any extra warnings using that. I can also run a full virtual
>>>>>>>>> machine
>>>>>>>>> quite happily over NFS, so despite the warnings, the link does
>>>>>>>>> appear
>>>>>>>>> stable and reliable.
>>>>>>>>>
>>>>>>>>> Ross
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Jan 18, 2010 at 4:30 PM, Tom Tucker
>>>>>>>>> <tom at opengridcomputing.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi Ross:
>>>>>>>>>
>>>>>>>>> I would check that you have IB RDMA actually working. The core
>>>>>>>>> transport
>>>>>>>>> issues suggest that there may be network problems that will prevent
>>>>>>>>> NFSRDMA
>>>>>>>>> from working properly.
>>>>>>>>>
>>>>>>>>> The first question is whether or not you are actually using OFED.
>>>>>>>>> You're
>>>>>>>>> not
>>>>>>>>> -- right? You're just using the support built into the 2.6.31
>>>>>>>>> kernel?
>>>>>>>>>
>>>>>>>>> Second, I don't think the mount is actually completing. I think the
>>>>>>>>> command
>>>>>>>>> is returning, but the mount never actually finishes. It's sitting
>>>>>>>>> there
>>>>>>>>> hung
>>>>>>>>> trying to perform the first RPC to the server (RPC_NOP) and it's
>>>>>>>>> never
>>>>>>>>> succeeding. That's why you see all those connect/disconnect messages
>>>>>>>>> in
>>>>>>>>> your
>>>>>>>>> log file. It tries to send, gets an error, disconnects, reconnects,
>>>>>>>>> tries to
>>>>>>>>> send .... you get the picture.
>>>>>>>>>
>>>>>>>>> Step 1 I think would be to ensure that you actually have IB up and
>>>>>>>>> running.
>>>>>>>>> IPoIB between the two seems a little dodgy given the dmesg log. Do
>>>>>>>>> you
>>>>>>>>> have
>>>>>>>>> another Linux box you can use to test out connectivity/configuration
>>>>>>>>> with
>>>>>>>>> your victim? There are test programs in OFED (rping) that would help
>>>>>>>>> you
>>>>>>>>> do
>>>>>>>>> this, but I don't believe they are available on Solaris.
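>>>>>>>>>
>>>>>>>>> Between two Linux boxes, something like this would confirm basic RDMA
>>>>>>>>> connectivity (a sketch, assuming rping from the librdmacm examples is
>>>>>>>>> installed on both Linux nodes, with the listener's IPoIB address on
>>>>>>>>> the client side, e.g. the Ubuntu box at 192.168.101.4):
>>>>>>>>>
>>>>>>>>> server# rping -s -v -C 10
>>>>>>>>> client# rping -c -a 192.168.101.4 -v -C 10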
>>>>>>>>>
>>>>>>>>> Tom
>>>>>>>>>
>>>>>>>>> Steve Wise wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> nfsrdma hang on ewg...
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -------- Original Message --------
>>>>>>>>> Subject: [ewg] nfs-rdma hanging with Ubuntu 9.10
>>>>>>>>> Date: Fri, 15 Jan 2010 13:28:31 +0000
>>>>>>>>> From: Ross Smith <myxiplx at googlemail.com>
>>>>>>>>> To: ewg at openfabrics.org
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi folks, it's me again, I'm afraid.
>>>>>>>>>
>>>>>>>>> Thanks to the help from this list, I have ipoib working, however I
>>>>>>>>> seem to be having a few problems, not least of which is commands
>>>>>>>>> hanging if I attempt to use nfs-rdma.
>>>>>>>>>
>>>>>>>>> Although the rdma mount command completes, the system then becomes
>>>>>>>>> unresponsive if I attempt any command such as 'ls', even outside of
>>>>>>>>> the mounted folder. Umount also fails with the error "device is
>>>>>>>>> busy".
>>>>>>>>>
>>>>>>>>> If anybody can spare the time to help it would be very much
>>>>>>>>> appreciated. I do seem to have a lot of warnings in the logs, but
>>>>>>>>> although I've tried searching for solutions, I haven't found anything
>>>>>>>>> yet.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> System details
>>>>>>>>> ============
>>>>>>>>> - Ubuntu 9.10
>>>>>>>>> (kernel 2.6.31)
>>>>>>>>> - Mellanox ConnectX QDR card
>>>>>>>>> - Flextronics DDR switch
>>>>>>>>> - OpenSolaris NFS server, running one of the latest builds for
>>>>>>>>> troubleshooting
>>>>>>>>> - OpenSM running on another Ubuntu 9.10 box with a Mellanox
>>>>>>>>> Infinihost III Lx card
>>>>>>>>>
>>>>>>>>> I am using the kernel drivers only, I have not installed OFED on
>>>>>>>>> this
>>>>>>>>> machine.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Loading driver
>>>>>>>>> ============
>>>>>>>>> The driver appears to load, and ipoib works, but there are rather a
>>>>>>>>> lot of warnings from dmesg.
>>>>>>>>>
>>>>>>>>> I am loading the driver with:
>>>>>>>>> $ sudo modprobe mlx4_ib
>>>>>>>>> $ sudo modprobe ib_ipoib
>>>>>>>>> $ sudo ifconfig ib0 192.168.101.4 netmask 255.255.255.0 up
>>>>>>>>>
>>>>>>>>> And that leaves me with:
>>>>>>>>> $ lsmod
>>>>>>>>> Module Size Used by
>>>>>>>>> ib_ipoib 72452 0
>>>>>>>>> ib_cm 37196 1 ib_ipoib
>>>>>>>>> ib_sa 19812 2 ib_ipoib,ib_cm
>>>>>>>>> mlx4_ib 42720 0
>>>>>>>>> ib_mad 37524 3 ib_cm,ib_sa,mlx4_ib
>>>>>>>>> ib_core 57884 5 ib_ipoib,ib_cm,ib_sa,mlx4_ib,ib_mad
>>>>>>>>> binfmt_misc 8356 1
>>>>>>>>> ppdev 6688 0
>>>>>>>>> psmouse 56180 0
>>>>>>>>> serio_raw 5280 0
>>>>>>>>> mlx4_core 84728 1 mlx4_ib
>>>>>>>>> joydev 10272 0
>>>>>>>>> lp 8964 0
>>>>>>>>> parport 35340 2 ppdev,lp
>>>>>>>>> iptable_filter 3100 0
>>>>>>>>> ip_tables 11692 1 iptable_filter
>>>>>>>>> x_tables 16544 1 ip_tables
>>>>>>>>> usbhid 38208 0
>>>>>>>>> e1000e 122124 0
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> At this point I can ping the Solaris server over the IP link,
>>>>>>>>> although I do need to issue a ping from Solaris before I get a reply.
>>>>>>>>> I'm mentioning it in case it's relevant, but at this point I'm
>>>>>>>>> assuming that's just a firewall setting on the server.
>>>>>>>>>
>>>>>>>>> But although ping works, I am starting to get some dmesg warnings; I
>>>>>>>>> just don't know if they are relevant:
>>>>>>>>> [ 313.692072] mlx4_ib: Mellanox ConnectX InfiniBand driver v1.0
>>>>>>>>> (April
>>>>>>>>> 4,
>>>>>>>>> 2008)
>>>>>>>>> [ 313.885220] ADDRCONF(NETDEV_UP): ib0: link is not ready
>>>>>>>>> [ 316.880450] ib0: multicast join failed for
>>>>>>>>> ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
>>>>>>>>> [ 316.880573] ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
>>>>>>>>> [ 316.880789] ib0: multicast join failed for
>>>>>>>>> ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
>>>>>>>>> [ 320.873613] ib0: multicast join failed for
>>>>>>>>> ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
>>>>>>>>> [ 327.147114] ib0: no IPv6 routers present
>>>>>>>>> [ 328.861550] ib0: multicast join failed for
>>>>>>>>> ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
>>>>>>>>> [ 344.834440] ib0: multicast join failed for
>>>>>>>>> ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
>>>>>>>>> [ 360.808312] ib0: multicast join failed for
>>>>>>>>> ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
>>>>>>>>> [ 376.782186] ib0: multicast join failed for
>>>>>>>>> ff12:401b:ffff:0000:0000:0000:0000:00fb, status -22
>>>>>>>>>
>>>>>>>>> And at this point, however, regular nfs mounts work fine over the
>>>>>>>>> ipoib link:
>>>>>>>>> $ sudo mount 192.168.100.1:/test/rdma ./nfstest
>>>>>>>>>
>>>>>>>>> But again, that adds warnings to dmesg:
>>>>>>>>> [ 826.456902] RPC: Registered udp transport module.
>>>>>>>>> [ 826.456905] RPC: Registered tcp transport module.
>>>>>>>>> [ 841.553135] svc: failed to register lockdv1 RPC service (errno
>>>>>>>>> 97).
>>>>>>>>>
>>>>>>>>> And the speed is definitely nothing to write home about; copying a
>>>>>>>>> 100mb file takes over 10 seconds:
>>>>>>>>> $ time cp ./100mb ./100mb2
>>>>>>>>>
>>>>>>>>> real 0m10.472s
>>>>>>>>> user 0m0.000s
>>>>>>>>> sys 0m1.248s
>>>>>>>>>
>>>>>>>>> And again with warnings appearing in dmesg:
>>>>>>>>> [ 872.373364] ib0: post_send failed
>>>>>>>>> [ 872.373407] ib0: post_send failed
>>>>>>>>> [ 872.373448] ib0: post_send failed
>>>>>>>>>
>>>>>>>>> I think this is a client issue rather than a problem on the server
>>>>>>>>> as
>>>>>>>>> the same test on an OpenSolaris client takes under half a second:
>>>>>>>>> # time cp ./100mb ./100mb2
>>>>>>>>>
>>>>>>>>> real 0m0.334s
>>>>>>>>> user 0m0.001s
>>>>>>>>> sys 0m0.176s
>>>>>>>>>
>>>>>>>>> Although the system is definitely not right, my long-term aim is to
>>>>>>>>> run nfs-rdma on this system, so my next test was to try that and see
>>>>>>>>> if the speed improved:
>>>>>>>>>
>>>>>>>>> $ sudo umount ./nfstest
>>>>>>>>> $ sudo mount -o rdma,port=20049 192.168.101.1:/test/rdma ./nfstest
>>>>>>>>>
>>>>>>>>> That takes a long time to connect. It does eventually go through,
>>>>>>>>> but
>>>>>>>>> only after the following errors in dmesg:
>>>>>>>>>
>>>>>>>>> [ 1140.698659] RPC: Registered rdma transport module.
>>>>>>>>> [ 1155.697672] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0,
>>>>>>>>> memreg 5 slots 32 ird 4
>>>>>>>>> [ 1160.688455] rpcrdma: connection to 192.168.101.1:20049 closed
>>>>>>>>> (-103)
>>>>>>>>> [ 1160.693818] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0,
>>>>>>>>> memreg 5 slots 32 ird 4
>>>>>>>>> [ 1160.695131] svc: failed to register lockdv1 RPC service (errno
>>>>>>>>> 97).
>>>>>>>>> [ 1170.676049] rpcrdma: connection to 192.168.101.1:20049 closed
>>>>>>>>> (-103)
>>>>>>>>> [ 1170.681458] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0,
>>>>>>>>> memreg 5 slots 32 ird 4
>>>>>>>>> [ 1190.647355] rpcrdma: connection to 192.168.101.1:20049 closed
>>>>>>>>> (-103)
>>>>>>>>> [ 1190.652778] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0,
>>>>>>>>> memreg 5 slots 32 ird 4
>>>>>>>>> [ 1220.602353] rpcrdma: connection to 192.168.101.1:20049 closed
>>>>>>>>> (-103)
>>>>>>>>> [ 1220.607809] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0,
>>>>>>>>> memreg 5 slots 32 ird 4
>>>>>>>>> [ 1250.557397] rpcrdma: connection to 192.168.101.1:20049 closed
>>>>>>>>> (-103)
>>>>>>>>> [ 1250.562817] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0,
>>>>>>>>> memreg 5 slots 32 ird 4
>>>>>>>>> [ 1281.522735] rpcrdma: connection to 192.168.101.1:20049 closed
>>>>>>>>> (-103)
>>>>>>>>> [ 1281.528442] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0,
>>>>>>>>> memreg 5 slots 32 ird 4
>>>>>>>>> [ 1311.477845] rpcrdma: connection to 192.168.101.1:20049 closed
>>>>>>>>> (-103)
>>>>>>>>> [ 1311.482983] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0,
>>>>>>>>> memreg 5 slots 32 ird 4
>>>>>>>>> [ 1341.432758] rpcrdma: connection to 192.168.101.1:20049 closed
>>>>>>>>> (-103)
>>>>>>>>> [ 1341.438212] rpcrdma: connection to 192.168.101.1:20049 on mlx4_0,
>>>>>>>>> memreg 5 slots 32 ird 4
>>>>>>>>>
>>>>>>>>> However, at this point my shell session becomes unresponsive if I
>>>>>>>>> attempt so much as an 'ls'. The system hasn't hung completely,
>>>>>>>>> though, as I can still connect another ssh session and restart with
>>>>>>>>> $ sudo init 6
>>>>>>>>>
>>>>>>>>> Can anybody help? Is there anything obvious I am doing wrong here?
>>>>>>>>>
>>>>>>>>> thanks,
>>>>>>>>>
>>>>>>>>> Ross
>>>>>>>>> _______________________________________________
>>>>>>>>> ewg mailing list
>>>>>>>>> ewg at lists.openfabrics.org
>>>>>>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>
More information about the ewg mailing list