[Fwd: Re: [ofa-general] [NFS/RDMA] Can't mount NFS/RDMA partition]

Tom Talpey tmtalpey at gmail.com
Mon Apr 27 07:50:06 PDT 2009


At 10:05 AM 4/27/2009, Celine Bourde wrote:
>We still have the same problem, even after changing the registration method.
>
>mount doesn't return, and this is the dmesg output on the client:
>
>rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 6 slots 32 ird 16
>rpcrdma: connection to 192.168.0.215:2050 closed (-103)
>rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 6 slots 32 ird 16
>rpcrdma: connection to 192.168.0.215:2050 closed (-103)
>ib0: multicast join failed for 
>ff12:401b:ffff:0000:0000:0000:0000:0001, status -22
>rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 6 slots 32 ird 16
>rpcrdma: connection to 192.168.0.215:2050 closed (-103)

I need to see the log on the server. Errno 103 is ECONNABORTED which means
the connection was closed spontaneously. Let's look for a server artifact.
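To capture that artifact, something along these lines on the server, run around a mount attempt, should surface any relevant kernel messages (a sketch; the grep patterns are guesses at the likely message prefixes, and the syslog path varies by distro):

```shell
# On the NFS/RDMA server, right after the client's failed mount attempts:
dmesg | grep -iE 'rpcrdma|svcrdma|rdma|nfsd' | tail -50

# syslog may hold additional context (path varies by distro):
grep -iE 'rpcrdma|svcrdma|nfsd' /var/log/messages | tail -50
```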

>
>I still have another doubt: if the firmware is the problem, why does NFS/RDMA 
>work with kernel 2.6.27.10, without OFED 1.4, on these same cards?

There were a number of changes in the 2.6.28 cycle, especially on the
server. So it's quite possible that 2.6.27, without the changes, would behave
differently. Have you tried this with 2.6.29, or with different cards?

Tom.

>
>Thanks,
>
>Céline Bourde. 
>
>
>Tom Talpey wrote:
>> At 06:56 AM 4/27/2009, Celine Bourde wrote:
>>   
>>> Thanks for the explanation.
>>> Let me know if you have additional information.
>>>
>>> We have a contact at Mellanox. I will contact him.
>>>
>>> Thanks,
>>>
>>> Céline.
>>>
>>> Vu Pham wrote:
>>>     
>>>> Celine,
>>>>
>>>> I'm seeing mlx4 in the log, so it is ConnectX.
>>>>
>>>> nfsrdma does not work with the official ConnectX firmware release 2.6.0 
>>>> because of fast-register work request problems between nfsrdma and 
>>>> the firmware.
>>>>       
>>
>> There is a very simple workaround if you don't have the latest mlx4 firmware.
>>
>> Just set the client to use the all-physical memory registration mode. 
>> This will avoid issuing the registration requests that the firmware 
>> advertised support for but does not handle correctly.
>>
>> Before mounting, enter (as root)
>>
>> 	sysctl -w sunrpc.rdma_memreg_strategy=6
>>
>> The client should work properly after this.
>>
>> If you do have access to the fixed firmware, I recommend using the default
>> setting (5) as it provides greater safety on the client.
>>
>> Tom.
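Putting the quoted workaround together, the full client-side sequence would look something like this (the sysctl name and mount options are the ones from this thread; the server address and export path are the thread's examples, so substitute your own):

```shell
# Force the all-physical memory registration mode (value 6) before mounting;
# note: no spaces around '=' with sysctl -w.
sysctl -w sunrpc.rdma_memreg_strategy=6

# Then mount as before (server/export taken from this thread):
mount -o rdma,port=2050 192.168.0.215:/vol0 /mnt/

# Once running fixed firmware, switch back to the safer default (5):
sysctl -w sunrpc.rdma_memreg_strategy=5
```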
>>
>>   
>>>> We are currently debugging/fixing those problems.
>>>>
>>>> Do you have direct contact with Mellanox field application engineer? 
>>>> Please contact him/her.
>>>> If not I can send you a contact on private channel.
>>>>
>>>> thanks,
>>>> -vu
>>>>
>>>>       
>>>>> Hi Celine,
>>>>>
>>>>> What HCA do you have on your system? Is it ConnectX? If yes, what is 
>>>>> its firmware version?
>>>>>
>>>>> -vu
>>>>>
>>>>>         
>>>>>> Hey Celine,
>>>>>>
>>>>>> Thanks for gathering all this info!  So the rdma connections work 
>>>>>> fine with everything _but_ nfsrdma.  And errno 103 indicates the 
>>>>>> connection was aborted, maybe by the server (since no failures are 
>>>>>> logged by the client).
>>>>>>
>>>>>>
>>>>>> More below:
>>>>>>
>>>>>>
>>>>>> Celine Bourde wrote:
>>>>>>           
>>>>>>> Hi Steve,
>>>>>>>
>>>>>>> This email summarizes the situation:
>>>>>>>
>>>>>>> Standard mount -> OK
>>>>>>> ---------------------
>>>>>>>
>>>>>>> [root at twind ~]# mount -o rw 192.168.0.215:/vol0 /mnt/
>>>>>>> Command works fine.
>>>>>>>
>>>>>>> rdma mount -> FAILS
>>>>>>> -----------------
>>>>>>>
>>>>>>> [root at twind ~]# mount -o rdma,port=2050 192.168.0.215:/vol0 /mnt/
>>>>>>> The command blocks! I have to press Ctrl+C to kill the process.
>>>>>>>
>>>>>>> or
>>>>>>>
>>>>>>> [root at twind ofa_kernel-1.4.1]# strace mount.nfs 192.168.0.215:/vol0 
>>>>>>> /mnt/ -o rdma,port=2050
>>>>>>> [..]
>>>>>>> fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
>>>>>>> connect(3, {sa_family=AF_INET, sin_port=htons(610), 
>>>>>>> sin_addr=inet_addr("127.0.0.1")}, 16) = 0
>>>>>>> fcntl(3, F_SETFL, O_RDWR)               = 0
>>>>>>> sendto(3, 
>>>>>>> "-3\245\357\0\0\0\0\0\0\0\2\0\1\206\270\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0"..., 
>>>>>>> 40, 0, {sa_family=AF_INET, sin_port=htons(610), 
>>>>>>> sin_addr=inet_addr("127.0.0.1")}, 16) = 40
>>>>>>> poll([{fd=3, events=POLLIN}], 1, 3000)  = 1 ([{fd=3, revents=POLLIN}])
>>>>>>> recvfrom(3, "-3\245\357\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 
>>>>>>> 8800, MSG_DONTWAIT, {sa_family=AF_INET, sin_port=htons(610), 
>>>>>>> sin_addr=inet_addr("127.0.0.1")}, [16]) = 24
>>>>>>> close(3)                                = 0
>>>>>>> mount("192.168.0.215:/vol0", "/mnt", "nfs", 0, 
>>>>>>> "rdma,port=2050,addr=192.168.0.215"
>>>>>>> ..same problem
>>>>>>>
>>>>>>> [root at twind tmp]# dmesg
>>>>>>> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 
>>>>>>> 32 ird 16
>>>>>>> rpcrdma: connection to 192.168.0.215:2050 closed (-103)
>>>>>>> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 
>>>>>>> 32 ird 16
>>>>>>> rpcrdma: connection to 192.168.0.215:2050 closed (-103)
>>>>>>> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 
>>>>>>> 32 ird 16
>>>>>>> rpcrdma: connection to 192.168.0.215:2050 closed (-103)
>>>>>>> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 
>>>>>>> 32 ird 16
>>>>>>> rpcrdma: connection to 192.168.0.215:2050 closed (-103)
>>>>>>>
>>>>>>>
>>>>>>>             
>>>>>> Is there anything logged on the server side?
>>>>>>
>>>>>> Also, can you try this again, but on both systems do this before 
>>>>>> attempting the mount:
>>>>>>
>>>>>> echo 32768 > /proc/sys/sunrpc/rpc_debug
>>>>>>
>>>>>> This will enable all the rpc trace points and add a bunch of logging 
>>>>>> to /var/log/messages.
>>>>>> Maybe that will show us something.  I think the server is aborting 
>>>>>> the connection for some reason.
>>>>>>
>>>>>> Steve.
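As a side note, the /proc toggle quoted above can also be driven with the rpcdebug tool from nfs-utils, which flips the same sunrpc debug flags (a sketch; see rpcdebug(8) for the available modules and flag names, and run as root):

```shell
# Enable all RPC trace points (same effect in spirit as the echo into
# /proc/sys/sunrpc/rpc_debug quoted above):
rpcdebug -m rpc -s all

# ... reproduce the mount attempt, then inspect /var/log/messages ...

# Clear the flags again so the logs stay readable:
rpcdebug -m rpc -c all
```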
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> general mailing list
>>>>>> general at lists.openfabrics.org
>>>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>>>>>
>>>>>> To unsubscribe, please visit 
>>>>>> http://openib.org/mailman/listinfo/openib-general
>>>>>>           
>>
>>
>>
>>   
>
>



