[Fwd: Re: [ofa-general] [NFS/RDMA] Can't mount NFS/RDMA partition]

Jeffrey Lang jrlang at uwyo.edu
Mon Apr 27 07:46:02 PDT 2009


I was recently having the "ib0: multicast join failed" issue. Once I 
upgraded the firmware in my switch, everything started working again.

I would give the firmware upgrade a try.
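
A rough sanity check after the upgrade (a sketch only, assuming the standard 
OFED tools are installed and that 192.168.0.215 is the server's IPoIB address, 
as in the logs below):

	ibstat | grep State                  # port state should be Active
	dmesg | grep -i "multicast join"     # no new join failures after the upgrade
	ping -c 3 192.168.0.215              # basic IPoIB reachability check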

jeff


Celine Bourde wrote:
> We still have the same problem, even after changing the registration method.
>
> mount doesn't return, and this is the output of dmesg on the client:
>
> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 6 slots 32 ird 16
> rpcrdma: connection to 192.168.0.215:2050 closed (-103)
> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 6 slots 32 ird 16
> rpcrdma: connection to 192.168.0.215:2050 closed (-103)
> ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:0001, status -22
> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 6 slots 32 ird 16
> rpcrdma: connection to 192.168.0.215:2050 closed (-103)
>
> I still have another question: if the firmware is the problem, why does 
> NFS/RDMA work with kernel 2.6.27.10 and without OFED 1.4, using these 
> same cards?
>
> Thanks,
>
> Céline Bourde. 
>
>
> Tom Talpey wrote:
>   
>> At 06:56 AM 4/27/2009, Celine Bourde wrote:
>>> Thanks for the explanation.
>>> Let me know if you have additional information.
>>>
>>> We have a contact at Mellanox. I will contact him.
>>>
>>> Thanks,
>>>
>>> Céline.
>>>
>>> Vu Pham wrote:
>>>> Celine,
>>>>
>>>> I'm seeing mlx4 in the log, so it is ConnectX.
>>>>
>>>> nfsrdma does not work with the official ConnectX firmware release 2.6.0 
>>>> because of fast-register work request problems between nfsrdma and 
>>>> the firmware.
>> There is a very simple workaround if you don't have the latest mlx4 firmware.
>>
>> Just set the client to use the all-physical memory registration mode. This will
>> avoid making the fast-registration requests that the firmware advertises but does not support.
>>
>> Before mounting, enter (as root)
>>
>> 	sysctl -w sunrpc.rdma_memreg_strategy=6
>>
>> The client should work properly after this.
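>>
>> As a minimal sketch of the whole sequence (assuming the sunrpc.rdma_memreg_strategy
>> sysctl is present, i.e. the xprtrdma module is loaded):
>>
>> 	sysctl sunrpc.rdma_memreg_strategy          # check the current mode (default is 5)
>> 	sysctl -w sunrpc.rdma_memreg_strategy=6     # switch to all-physical registration
>> 	mount -o rdma,port=2050 192.168.0.215:/vol0 /mnt/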
>>
>> If you do have access to the fixed firmware, I recommend using the default
>> setting (5) as it provides greater safety on the client.
>>
>> Tom.
>>
>>>> We are currently debugging/fixing those problems.
>>>>
>>>> Do you have a direct contact with a Mellanox field application engineer? 
>>>> Please contact him/her.
>>>> If not, I can send you a contact on a private channel.
>>>>
>>>> thanks,
>>>> -vu
>>>>
>>>>> Hi Celine,
>>>>>
>>>>> What HCA do you have on your system? Is it ConnectX? If yes, what is 
>>>>> its firmware version?
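>>>>>
>>>>> (As a quick sketch, assuming the OFED userspace tools are installed, either 
>>>>> of these prints the HCA firmware version:)
>>>>>
>>>>> ibv_devinfo | grep fw_ver
>>>>> ibstat | grep "Firmware version"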
>>>>>
>>>>> -vu
>>>>>
>>>>>> Hey Celine,
>>>>>>
>>>>>> Thanks for gathering all this info!  So the RDMA connections work 
>>>>>> fine with everything _but_ nfsrdma.  And errno 103 (ECONNABORTED) indicates 
>>>>>> the connection was aborted, maybe by the server (since no failures are 
>>>>>> logged by the client).
>>>>>>
>>>>>>
>>>>>> More below:
>>>>>>
>>>>>>
>>>>>> Celine Bourde wrote:
>>>>>>> Hi Steve,
>>>>>>>
>>>>>>> This email summarizes the situation:
>>>>>>>
>>>>>>> Standard mount -> OK
>>>>>>> ---------------------
>>>>>>>
>>>>>>> [root@twind ~]# mount -o rw 192.168.0.215:/vol0 /mnt/
>>>>>>> Command works fine.
>>>>>>>
>>>>>>> rdma mount -> FAILS
>>>>>>> -----------------
>>>>>>>
>>>>>>> [root@twind ~]# mount -o rdma,port=2050 192.168.0.215:/vol0 /mnt/
>>>>>>> The command hangs; I have to press Ctrl+C to kill the process.
>>>>>>>
>>>>>>> or
>>>>>>>
>>>>>>> [root@twind ofa_kernel-1.4.1]# strace mount.nfs 192.168.0.215:/vol0 
>>>>>>> /mnt/ -o rdma,port=2050
>>>>>>> [..]
>>>>>>> fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
>>>>>>> connect(3, {sa_family=AF_INET, sin_port=htons(610), 
>>>>>>> sin_addr=inet_addr("127.0.0.1")}, 16) = 0
>>>>>>> fcntl(3, F_SETFL, O_RDWR)               = 0
>>>>>>> sendto(3, "-3\245\357\0\0\0\0\0\0\0\2\0\1\206\270\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0"..., 
>>>>>>> 40, 0, {sa_family=AF_INET, sin_port=htons(610), 
>>>>>>> sin_addr=inet_addr("127.0.0.1")}, 16) = 40
>>>>>>> poll([{fd=3, events=POLLIN}], 1, 3000)  = 1 ([{fd=3, revents=POLLIN}])
>>>>>>> recvfrom(3, "-3\245\357\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 
>>>>>>> 8800, MSG_DONTWAIT, {sa_family=AF_INET, sin_port=htons(610), 
>>>>>>> sin_addr=inet_addr("127.0.0.1")}, [16]) = 24
>>>>>>> close(3)                                = 0
>>>>>>> mount("192.168.0.215:/vol0", "/mnt", "nfs", 0, 
>>>>>>> "rdma,port=2050,addr=192.168.0.215"
>>>>>>> ..same problem
>>>>>>>
>>>>>>> [root@twind tmp]# dmesg
>>>>>>> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 
>>>>>>> 32 ird 16
>>>>>>> rpcrdma: connection to 192.168.0.215:2050 closed (-103)
>>>>>>> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 
>>>>>>> 32 ird 16
>>>>>>> rpcrdma: connection to 192.168.0.215:2050 closed (-103)
>>>>>>> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 
>>>>>>> 32 ird 16
>>>>>>> rpcrdma: connection to 192.168.0.215:2050 closed (-103)
>>>>>>> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 
>>>>>>> 32 ird 16
>>>>>>> rpcrdma: connection to 192.168.0.215:2050 closed (-103)
>>>>>>>
>>>>>>>
>>>>>> Is there anything logged on the server side?
>>>>>>
>>>>>> Also, can you try this again, but run this on both systems before 
>>>>>> attempting the mount:
>>>>>>
>>>>>> echo 32768 > /proc/sys/sunrpc/rpc_debug
>>>>>>
>>>>>> This will enable all the rpc trace points and add a bunch of logging 
>>>>>> to /var/log/messages.
>>>>>> Maybe that will show us something.  I think the server is aborting 
>>>>>> the connection for some reason.
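>>>>>>
>>>>>> As a minimal sketch (assuming the standard /proc interface; enable the 
>>>>>> logging as root on both machines, then retry the mount on the client):
>>>>>>
>>>>>> echo 32768 > /proc/sys/sunrpc/rpc_debug     # enable RPC debug logging
>>>>>> mount -o rdma,port=2050 192.168.0.215:/vol0 /mnt/
>>>>>> tail -f /var/log/messages                   # watch the trace output
>>>>>> echo 0 > /proc/sys/sunrpc/rpc_debug         # turn the logging back off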
>>>>>>
>>>>>> Steve.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>
>