[ewg] nfsrdma fails to write big file,
Tom Tucker
tom at opengridcomputing.com
Wed Feb 24 16:51:31 PST 2010
Vu,
I ran the number of slots down to 8 (echo 8 > rdma_slot_table_entries)
and I can reproduce the issue now. I'm going to try setting the
allocation multiple to 5 and see if I can't prove to myself and Roland
that we've accurately computed the correct factor.
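
Roughly, the sizing I intend to test looks like this (a sketch only of
the rpcrdma_ep_create() change in verbs.c, untested; the multiplier is
the only difference from the patch quoted below):

    /* Per credit, the practical worst case: an frmr register for the
     * head, another for the pagelist, an invalidate for each of those,
     * plus the send WR itself == 5 WRs per request. */
    ep->rep_attr.cap.max_send_wr = cdata->max_requests;
    if (ia->ri_memreg_strategy == RPCRDMA_FRMR) {
            ep->rep_attr.cap.max_send_wr *= 5;
            if (ep->rep_attr.cap.max_send_wr > devattr.max_qp_wr)
                    return -EINVAL;
    }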
I think that overall a better solution might be a different credit
system; however, that's a much more substantial change than we can
tackle at this point.
Tom
Tom Tucker wrote:
> Vu,
>
> Based on the mapping code, it looks to me like the worst case is
> RPCRDMA_MAX_SEGS * 2 + 1 as the multiplier.
> However, I think that in practice, due to the way the iovs are built,
> the actual max is 5 (an frmr for the head plus one for the pagelist,
> invalidates for both, plus one WR for the send itself). Why did you
> think the max was 6?
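>
> In other words (rough arithmetic; this helper and its name are made up
> for illustration, only RPCRDMA_MAX_SEGS is from the tree):
>
>     /* Hypothetical helper: SQ depth needed for 'credits' concurrent
>      * RPCs under FRMR registration. */
>     static unsigned int frmr_sq_depth(unsigned int credits, int worst)
>     {
>             /* worst case: a register + invalidate per segment, plus one
>              * send; in practice: head + pagelist only, so 2*2 + 1 = 5 */
>             unsigned int per_rpc = worst ? RPCRDMA_MAX_SEGS * 2 + 1
>                                          : 2 * 2 + 1;
>             return credits * per_rpc;
>     }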
>
> Thanks,
> Tom
>
> Tom Tucker wrote:
>
>> Vu,
>>
>> Are you changing any of the default settings, for example rsize/wsize,
>> etc.? I'd like to reproduce this problem if I can.
>>
>> Thanks,
>>
>> Tom
>>
>> Vu Pham wrote:
>>
>>
>>> Tom,
>>>
>>> Did you make any changes to get bonnie++, dd of a 10G file, and
>>> vdbench to run concurrently and finish?
>>>
>>> I keep hitting the WQE overflow error below.
>>> I saw that most of the requests have two chunks (a 32K chunk and a
>>> some-bytes chunk), and each chunk requires frmr + invalidate WRs;
>>> however, you set ep->rep_attr.cap.max_send_wr = cdata->max_requests,
>>> and then for the frmr case you do
>>> ep->rep_attr.cap.max_send_wr *= 3; which is not enough. Moreover, you
>>> also set ep->rep_cqinit = max_send_wr/2 for the send completion
>>> signal, which causes the WQE overflow to happen faster.
>>>
>>> After applying the following patch, I have had vdbench, dd, and a copy
>>> of the 10g file running overnight.
>>>
>>> -vu
>>>
>>>
>>> --- ofa_kernel-1.5.1.orig/net/sunrpc/xprtrdma/verbs.c 2010-02-24 10:41:22.000000000 -0800
>>> +++ ofa_kernel-1.5.1/net/sunrpc/xprtrdma/verbs.c 2010-02-24 10:03:18.000000000 -0800
>>> @@ -649,8 +654,15 @@
>>> ep->rep_attr.cap.max_send_wr = cdata->max_requests;
>>> switch (ia->ri_memreg_strategy) {
>>> case RPCRDMA_FRMR:
>>> - /* Add room for frmr register and invalidate WRs */
>>> - ep->rep_attr.cap.max_send_wr *= 3;
>>> + /*
>>> + * Add room for frmr register and invalidate WRs
>>> + * Requests sometimes have two chunks, and each
>>> + * chunk requires its own frmr. The safest sizing
>>> + * is max_send_wr * 6; however, since we get send
>>> + * completions and poll fast enough, it is pretty
>>> + * safe to use max_send_wr * 4.
>>> + */
>>> + ep->rep_attr.cap.max_send_wr *= 4;
>>> if (ep->rep_attr.cap.max_send_wr > devattr.max_qp_wr)
>>> return -EINVAL;
>>> break;
>>> @@ -682,7 +694,8 @@
>>> ep->rep_attr.cap.max_recv_sge);
>>>
>>> /* set trigger for requesting send completion */
>>> - ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 /* - 1*/;
>>> + ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/4;
>>> +
>>> switch (ia->ri_memreg_strategy) {
>>> case RPCRDMA_MEMWINDOWS_ASYNC:
>>> case RPCRDMA_MEMWINDOWS:
>>>
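>>> (For context: rep_cqinit is the trigger for requesting a signaled send
>>> completion. Paraphrasing the posting path in verbs.c from memory --
>>> worth double-checking against the tree -- it does roughly:
>>>
>>>     if (DECR_CQCOUNT(ep) > 0)
>>>             send_wr.send_flags = 0;
>>>     else {
>>>             /* provider must take a send completion now and then */
>>>             INIT_CQCOUNT(ep);
>>>             send_wr.send_flags = IB_SEND_SIGNALED;
>>>     }
>>>
>>> so with the SQ four times deeper, signaling every max_send_wr/4 posts
>>> keeps completions frequent enough to recycle WQEs.)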
>>>> -----Original Message-----
>>>> From: ewg-bounces at lists.openfabrics.org [mailto:ewg-
>>>> bounces at lists.openfabrics.org] On Behalf Of Vu Pham
>>>> Sent: Monday, February 22, 2010 12:23 PM
>>>> To: Tom Tucker
>>>> Cc: linux-rdma at vger.kernel.org; Mahesh Siddheshwar;
>>>> ewg at lists.openfabrics.org
>>>> Subject: Re: [ewg] nfsrdma fails to write big file,
>>>>
>>>> Tom,
>>>>
>>>> Some more info on the problem:
>>>> 1. Running with memreg=4 (FMR), I cannot reproduce the problem.
>>>> 2. I also see a different error on the client:
>>>>
>>>> Feb 22 12:16:55 mellanox-2 rpc.idmapd[5786]: nss_getpwnam: name
>>>> 'nobody'
>>>> does not map into domain 'localdomain'
>>>> Feb 22 12:16:55 mellanox-2 kernel: QP 0x70004b: WQE overflow
>>>> Feb 22 12:16:55 mellanox-2 kernel: QP 0x6c004a: WQE overflow
>>>> Feb 22 12:16:55 mellanox-2 kernel: QP 0x6c004a: WQE overflow
>>>> Feb 22 12:16:55 mellanox-2 kernel: RPC: rpcrdma_ep_post: ib_post_send
>>>> returned -12 cq_init 48 cq_count 32
>>>> Feb 22 12:17:00 mellanox-2 kernel: RPC: rpcrdma_event_process:
>>>> send WC status 5, vend_err F5
>>>> Feb 22 12:17:00 mellanox-2 kernel: rpcrdma: connection to
>>>> 13.20.1.9:20049 closed (-103)
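>>>> (The -12 from ib_post_send is -ENOMEM: the provider refuses the post
>>>> once the send queue has no free WQEs. Schematically -- an
>>>> illustration with made-up field names, not the actual mlx4 source:
>>>>
>>>>     /* inside the provider's post_send path */
>>>>     if (qp->sq_head - qp->sq_tail + nreq > qp->sq_max_wr) {
>>>>             pr_err("QP 0x%x: WQE overflow\n", qp->qpn);
>>>>             return -ENOMEM;         /* the "returned -12" above */
>>>>     }
>>>> )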
>>>>
>>>> -vu
>>>>
>>>>> -----Original Message-----
>>>>> From: Tom Tucker [mailto:tom at opengridcomputing.com]
>>>>> Sent: Monday, February 22, 2010 10:49 AM
>>>>> To: Vu Pham
>>>>> Cc: linux-rdma at vger.kernel.org; Mahesh Siddheshwar;
>>>>> ewg at lists.openfabrics.org
>>>>> Subject: Re: [ewg] nfsrdma fails to write big file,
>>>>>
>>>>> Vu Pham wrote:
>>>>>
>>>>>> Setup:
>>>>>> 1. Linux nfsrdma client/server with OFED-1.5.1-20100217-0600,
>>>>>> ConnectX2 QDR HCAs fw 2.7.8-6, RHEL 5.2.
>>>>>> 2. Solaris nfsrdma server snv 130, ConnectX QDR HCA.
>>>>>>
>>>>>> Running vdbench on a 10g file or *dd if=/dev/zero of=10g_file bs=1M
>>>>>> count=10000*, the operation fails, the connection gets dropped, and
>>>>>> the client cannot re-establish a connection to the server.
>>>>>> After rebooting only the client, I can mount again.
>>>>>>
>>>>>> It happens with both Solaris and Linux nfsrdma servers.
>>>>>>
>>>>>> For the Linux client/server, I see the problem with memreg=5 (FRMR);
>>>>>> I don't see the problem with memreg=6 (global dma key).
>>>>>>
>>>>>>
>>>>> Awesome. This is the key I think.
>>>>>
>>>>> Thanks for the info Vu,
>>>>> Tom
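>>>>>
>>>>> (For anyone following along: memreg is xprtrdma's registration-strategy
>>>>> module parameter. From memory -- verify against
>>>>> net/sunrpc/xprtrdma/xprt_rdma.h -- the values map as:
>>>>>
>>>>>     enum rpcrdma_memreg {
>>>>>             RPCRDMA_BOUNCEBUFFERS = 0,
>>>>>             RPCRDMA_REGISTER,
>>>>>             RPCRDMA_MEMWINDOWS,
>>>>>             RPCRDMA_MEMWINDOWS_ASYNC,
>>>>>             RPCRDMA_MTHCAFMR,       /* memreg=4: FMR */
>>>>>             RPCRDMA_FRMR,           /* memreg=5: fails here */
>>>>>             RPCRDMA_ALLPHYSICAL,    /* memreg=6: global dma key, works */
>>>>>             RPCRDMA_LAST
>>>>>     };
>>>>> )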
>>>>>
>>>>>> On the Solaris server (snv 130), we see a problem decoding a write
>>>>>> request of 32K. The client sends two read chunks (a 32K chunk and a
>>>>>> 16-byte chunk); the server fails to do the rdma read on the 16-byte
>>>>>> chunk (cqe.status = 10, i.e. IB_WC_REM_ACCESS_ERR); therefore, the
>>>>>> server terminates the connection. We don't see this problem with NFS
>>>>>> version 3 on Solaris. The Solaris server runs in normal memory
>>>>>> registration mode.
>>>>>>
>>>>>> On the Linux client, I see cqe.status = 12, i.e. IB_WC_RETRY_EXC_ERR.
>>>>>>
>>>>>> I added these notes in bug #1919 (bugs.openfabrics.org) to track the
>>>>>> issue.
>>>>>>
>>>>>> thanks,
>>>>>> -vu
>
> _______________________________________________
> ewg mailing list
> ewg at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
>