[Fwd: Re: [ofa-general] [NFS/RDMA] Can't mount NFS/RDMA partition]

Mon Apr 27 05:47:56 PDT 2009

At 06:56 AM 4/27/2009, Celine Bourde wrote:
>Thanks for the explanation.
>Let me know if you have additional information.
>
>We have a contact at Mellanox. I will contact him.
>
>Thanks,
>
>Céline.
>
>Vu Pham wrote:
>> Celine,
>>
>> I'm seeing mlx4 in the log, so it is ConnectX.
>>
>> nfsrdma does not work with any official ConnectX firmware release
>> (2.6.0) because of problems with fast-register work requests between
>> nfsrdma and the firmware.

There is a very simple workaround if you don't have the latest mlx4 firmware.

Just set the client to use the all-physical memory registration mode. This
avoids issuing the reregistration requests that the firmware advertises but
does not actually support.

Before mounting, enter (as root)

	sysctl -w sunrpc.rdma_memreg_strategy=6

The client should work properly after this.

If you do have access to the fixed firmware, I recommend using the default
setting (5) as it provides greater safety on the client.
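
The steps above can be wrapped in a small shell sketch. The function names
and the MEMREG_FILE variable are hypothetical; only the sysctl name and the
values 5 and 6 come from this thread, and the /proc path is assumed to be
the usual mirror of that sysctl.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative, not part of any tool.
# Assumed /proc path for the sunrpc.rdma_memreg_strategy sysctl.
MEMREG_FILE="${MEMREG_FILE:-/proc/sys/sunrpc/rdma_memreg_strategy}"

# Describe the two memreg values discussed in this thread.
memreg_describe() {
    case "$1" in
        5) echo "default (fast registration)" ;;
        6) echo "all-physical" ;;
        *) echo "other" ;;
    esac
}

# Print the value currently in effect, or "unknown" if the file is not
# readable (for example, when the RDMA transport module is not loaded).
memreg_current() {
    if [ -r "$MEMREG_FILE" ]; then
        cat "$MEMREG_FILE"
    else
        echo "unknown"
    fi
}

# Switch to all-physical mode. Must run as root, before mounting.
memreg_use_allphysical() {
    sysctl -w sunrpc.rdma_memreg_strategy=6
}
```

After sourcing this, memreg_describe "$(memreg_current)" shows which mode is
active, and memreg_use_allphysical applies the workaround before the mount
is attempted.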

Tom.

>>
>> We are currently debugging/fixing those problems.
>>
>> Do you have a direct contact with a Mellanox field application
>> engineer? If so, please contact him/her.
>> If not, I can send you a contact privately.
>>
>> thanks,
>> -vu
>>
>>> Hi Celine,
>>>
>>> What HCA do you have on your system? Is it ConnectX? If yes, what is 
>>> its firmware version?
>>>
>>> -vu
>>>
>>>> Hey Celine,
>>>>
>>>> Thanks for gathering all this info!  So the rdma connections work 
>>>> fine with everything _but_ nfsrdma.  And errno 103 indicates the 
>>>> connection was aborted, maybe by the server (since no failures are 
>>>> logged by the client).
>>>>
>>>>
>>>> More below:
>>>>
>>>>
>>>> Celine Bourde wrote:
>>>>> Hi Steve,
>>>>>
>>>>> This email summarizes the situation:
>>>>>
>>>>> Standard mount -> OK
>>>>> ---------------------
>>>>>
>>>>> [root@twind ~]# mount -o rw 192.168.0.215:/vol0 /mnt/
>>>>> Command works fine.
>>>>>
>>>>> rdma mount -> KO
>>>>> -----------------
>>>>>
>>>>> [root@twind ~]# mount -o rdma,port=2050 192.168.0.215:/vol0 /mnt/
>>>>> Command blocks! I have to press Ctrl+C to kill the process.
>>>>>
>>>>> or
>>>>>
>>>>> [root@twind ofa_kernel-1.4.1]# strace mount.nfs 192.168.0.215:/vol0 /mnt/ -o rdma,port=2050
>>>>> [..]
>>>>> fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
>>>>> connect(3, {sa_family=AF_INET, sin_port=htons(610), 
>>>>> sin_addr=inet_addr("127.0.0.1")}, 16) = 0
>>>>> fcntl(3, F_SETFL, O_RDWR)               = 0
>>>>> sendto(3, 
>>>>> "-3\245\357\0\0\0\0\0\0\0\2\0\1\206\270\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0"..., 
>>>>> 40, 0, {sa_family=AF_INET, sin_port=htons(610), 
>>>>> sin_addr=inet_addr("127.0.0.1")}, 16) = 40
>>>>> poll([{fd=3, events=POLLIN}], 1, 3000)  = 1 ([{fd=3, revents=POLLIN}])
>>>>> recvfrom(3, "-3\245\357\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 
>>>>> 8800, MSG_DONTWAIT, {sa_family=AF_INET, sin_port=htons(610), 
>>>>> sin_addr=inet_addr("127.0.0.1")}, [16]) = 24
>>>>> close(3)                                = 0
>>>>> mount("192.168.0.215:/vol0", "/mnt", "nfs", 0, 
>>>>> "rdma,port=2050,addr=192.168.0.215"
>>>>> ..same problem
>>>>>
>>>>> [root@twind tmp]# dmesg
>>>>> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 32 ird 16
>>>>> rpcrdma: connection to 192.168.0.215:2050 closed (-103)
>>>>> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 32 ird 16
>>>>> rpcrdma: connection to 192.168.0.215:2050 closed (-103)
>>>>> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 32 ird 16
>>>>> rpcrdma: connection to 192.168.0.215:2050 closed (-103)
>>>>> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 32 ird 16
>>>>> rpcrdma: connection to 192.168.0.215:2050 closed (-103)
>>>>>
>>>>>
>>>>
>>>> Is there anything logged on the server side?
>>>>
>>>> Also, can you try this again, but on both systems do this before 
>>>> attempting the mount:
>>>>
>>>> echo 32768 > /proc/sys/sunrpc/rpc_debug
>>>>
>>>> This will enable all the rpc trace points and add a bunch of logging 
>>>> to /var/log/messages.
>>>> Maybe that will show us something.  I think the server is aborting 
>>>> the connection for some reason.
>>>>
>>>> Steve.
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> general mailing list
>>>> general at lists.openfabrics.org
>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>>>
>>>> To unsubscribe, please visit 
>>>> http://openib.org/mailman/listinfo/openib-general