[Fwd: Re: [ofa-general] [NFS/RDMA] Can't mount NFS/RDMA partition]

Celine Bourde celine.bourde at ext.bull.net
Mon Apr 27 07:05:33 PDT 2009


We still have the same problem, even after changing the registration method.

mount does not return, and this is the dmesg output on the client:

rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 6 slots 32 ird 16
rpcrdma: connection to 192.168.0.215:2050 closed (-103)
rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 6 slots 32 ird 16
rpcrdma: connection to 192.168.0.215:2050 closed (-103)
ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:0001, status -22
rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 6 slots 32 ird 16
rpcrdma: connection to 192.168.0.215:2050 closed (-103)
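
For reading logs like the one above, here is a minimal sketch (not part of the original exchange) that extracts the memreg mode from each rpcrdma connection line and translates the negative close status into an errno name. It assumes Linux errno numbering (103 is ECONNABORTED, 22 is EINVAL); the function names `errno_name` and `summarize` are hypothetical, and the regular expressions only cover the two message shapes shown in this thread.

```python
import errno
import re

# Linux errno values seen in this thread; other codes fall back to the
# local errno table, which matches only when this runs on Linux.
LINUX_ERRNO = {22: "EINVAL", 103: "ECONNABORTED"}

# The two rpcrdma message shapes that appear in the dmesg output.
OPEN_RE = re.compile(
    r"rpcrdma: connection to (\S+):(\d+) on (\S+), "
    r"memreg (\d+) slots (\d+) ird (\d+)"
)
CLOSE_RE = re.compile(r"rpcrdma: connection to (\S+):(\d+) closed \((-?\d+)\)")


def errno_name(status):
    """Map a (possibly negative) kernel status code to its errno name."""
    code = abs(status)
    return LINUX_ERRNO.get(code, errno.errorcode.get(code, "unknown"))


def summarize(lines):
    """Return ("open", memreg) and ("closed", errno name) events in order."""
    events = []
    for line in lines:
        match = OPEN_RE.search(line)
        if match:
            events.append(("open", int(match.group(4))))
            continue
        match = CLOSE_RE.search(line)
        if match:
            events.append(("closed", errno_name(int(match.group(3)))))
    return events


sample = [
    "rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 6 slots 32 ird 16",
    "rpcrdma: connection to 192.168.0.215:2050 closed (-103)",
]
print(summarize(sample))
```

Run against the full dmesg output, the "open" events show whether the memreg setting actually changed (6 here, 5 in the earlier log), and the "closed" events show that every attempt ends with the same connection-aborted status.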

I still have one more question: if the firmware is the problem, why does
NFS/RDMA work with kernel 2.6.27.10, without OFED 1.4, on these
same cards?

Thanks,

Céline Bourde. 


Tom Talpey wrote:
> At 06:56 AM 4/27/2009, Celine Bourde wrote:
>   
>> Thanks for the explanation.
>> Let me know if you have additional information.
>>
>> We have a contact at Mellanox. I will contact him.
>>
>> Thanks,
>>
>> Céline.
>>
>> Vu Pham wrote:
>>     
>>> Celine,
>>>
>>> I'm seeing mlx4 in the log, so it is ConnectX.
>>>
>>> nfsrdma does not work with any official ConnectX fw release 2.6.0 
>>> because of fast-register work request problems between nfsrdma and 
>>> the firmware.
>>>       
>
> There is a very simple workaround if you don't have the latest mlx4 firmware.
>
> Just set the client to use the all-physical memory registration mode. This will
> avoid making reregistration requests that the firmware advertises but does not support.
>
> Before mounting, enter (as root):
>
> 	sysctl -w sunrpc.rdma_memreg_strategy=6
>
> The client should work properly after this.
>
> If you do have access to the fixed firmware, I recommend using the default
> setting (5) as it provides greater safety on the client.
>
> Tom.
>
>   
>>> We are currently debugging/fixing those problems.
>>>
>>> Do you have direct contact with Mellanox field application engineer? 
>>> Please contact him/her.
>>> If not I can send you a contact on private channel.
>>>
>>> thanks,
>>> -vu
>>>
>>>       
>>>> Hi Celine,
>>>>
>>>> What HCA do you have on your system? Is it ConnectX? If yes, what is 
>>>> its firmware version?
>>>>
>>>> -vu
>>>>
>>>>         
>>>>> Hey Celine,
>>>>>
>>>>> Thanks for gathering all this info!  So the rdma connections work 
>>>>> fine with everything _but_ nfsrdma.  And errno 103 indicates the 
>>>>> connection was aborted, maybe by the server (since no failures are 
>>>>> logged by the client).
>>>>>
>>>>>
>>>>> More below:
>>>>>
>>>>>
>>>>> Celine Bourde wrote:
>>>>>           
>>>>>> Hi Steve,
>>>>>>
>>>>>> This email summarizes the situation:
>>>>>>
>>>>>> Standard mount -> OK
>>>>>> ---------------------
>>>>>>
>>>>>> [root at twind ~]# mount -o rw 192.168.0.215:/vol0 /mnt/
>>>>>> Command works fine.
>>>>>>
>>>>>> rdma mount -> KO
>>>>>> -----------------
>>>>>>
>>>>>> [root at twind ~]# mount -o rdma,port=2050 192.168.0.215:/vol0 /mnt/
>>>>>> The command blocks! I have to press Ctrl+C to kill the process.
>>>>>>
>>>>>> or
>>>>>>
>>>>>> [root at twind ofa_kernel-1.4.1]# strace mount.nfs 192.168.0.215:/vol0 
>>>>>> /mnt/ -o rdma,port=2050
>>>>>> [..]
>>>>>> fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
>>>>>> connect(3, {sa_family=AF_INET, sin_port=htons(610), 
>>>>>> sin_addr=inet_addr("127.0.0.1")}, 16) = 0
>>>>>> fcntl(3, F_SETFL, O_RDWR)               = 0
>>>>>> sendto(3, 
>>>>>> "-3\245\357\0\0\0\0\0\0\0\2\0\1\206\270\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0"..., 
>>>>>> 40, 0, {sa_family=AF_INET, sin_port=htons(610), 
>>>>>> sin_addr=inet_addr("127.0.0.1")}, 16) = 40
>>>>>> poll([{fd=3, events=POLLIN}], 1, 3000)  = 1 ([{fd=3, revents=POLLIN}])
>>>>>> recvfrom(3, "-3\245\357\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 
>>>>>> 8800, MSG_DONTWAIT, {sa_family=AF_INET, sin_port=htons(610), 
>>>>>> sin_addr=inet_addr("127.0.0.1")}, [16]) = 24
>>>>>> close(3)                                = 0
>>>>>> mount("192.168.0.215:/vol0", "/mnt", "nfs", 0, 
>>>>>> "rdma,port=2050,addr=192.168.0.215"
>>>>>> ..same problem
>>>>>>
>>>>>> [root at twind tmp]# dmesg
>>>>>> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 
>>>>>> 32 ird 16
>>>>>> rpcrdma: connection to 192.168.0.215:2050 closed (-103)
>>>>>> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 
>>>>>> 32 ird 16
>>>>>> rpcrdma: connection to 192.168.0.215:2050 closed (-103)
>>>>>> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 
>>>>>> 32 ird 16
>>>>>> rpcrdma: connection to 192.168.0.215:2050 closed (-103)
>>>>>> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 
>>>>>> 32 ird 16
>>>>>> rpcrdma: connection to 192.168.0.215:2050 closed (-103)
>>>>>>
>>>>>>
>>>>>>             
>>>>> Is there anything logged on the server side?
>>>>>
>>>>> Also, can you try this again, but on both systems do this before 
>>>>> attempting the mount:
>>>>>
>>>>> echo 32768 > /proc/sys/sunrpc/rpc_debug
>>>>>
>>>>> This will enable all the rpc trace points and add a bunch of logging 
>>>>> to /var/log/messages.
>>>>> Maybe that will show us something.  I think the server is aborting 
>>>>> the connection for some reason.
>>>>>
>>>>> Steve.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> general mailing list
>>>>> general at lists.openfabrics.org
>>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>>>>
>>>>> To unsubscribe, please visit 
>>>>> http://openib.org/mailman/listinfo/openib-general
>>>>>           