[Users] infiniband rdma poor transfer bw

Yufei Ren yufren at ic.sunysb.edu
Fri Aug 31 13:26:01 PDT 2012


On Tue, Aug 28, 2012 at 12:15 PM, Gaetano Mendola <mendola at gmail.com> wrote:
> Now my active side looks like this (after the initial connection) and I did
> benchmark each phase:
>
> write(buffer) {
>   register buffer                 <== 0.19 ms
>   wait for memory region          <== 0.58 ms
>   post message receive
>   send using WITH_IMM (SIGNALED)
>   wait until the send is done     <== 12.43 ms (transferring 8 MB)
>   unregister buffer               <== 0.19 ms
> }
>
> So it seems that even zeroing out the registration/unregistration times and
> the dead time spent starving while waiting for the memory region, I'm getting:
>
> 8 MB / 12.43 ms = 643 MB/sec  (and that is still far from what I have to get).
>
> Now I have to understand whether queuing a second ibv_post_send while
> another one is in flight could improve it or not.

Enlarging the number of packets in flight should definitely improve your
application's RDMA performance. The programming style in libibverbs and
librdmacm is asynchronous almost everywhere, and keeping multiple packets
in flight is the way to fill the fat pipe. In my experience, an I/O depth
of 32 can be 100% or more faster than an I/O depth of 1. Also, for block
size, larger is not always better: a 256KB~2MB block size with an I/O
depth of 16~64 (the value can also be calculated from the bandwidth-delay
product, BDP) is sufficient to fill up a 40Gbps LAN.
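
As a rough sketch of what I mean (the DEPTH/CHUNK values, the slot layout
and the reuse of the same remote offsets are only illustrative, and error
handling is minimal; it assumes a connected QP whose max_send_wr is at
least DEPTH and a single MR covering the whole local pool):

/* Keep DEPTH RDMA writes in flight instead of one at a time. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

#define DEPTH 32            /* outstanding writes ("iodepth") */
#define CHUNK (1 << 20)     /* 1 MB per write                 */

static int post_write(struct ibv_qp *qp, struct ibv_mr *mr, char *pool,
                      int slot, uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)(pool + (size_t)slot * CHUNK),
        .length = CHUNK,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = slot;
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.wr.rdma.remote_addr = remote_addr + (uint64_t)slot * CHUNK;
    wr.wr.rdma.rkey        = rkey;
    return ibv_post_send(qp, &wr, &bad);
}

/* Prime DEPTH writes, then repost a slot every time one completes. */
int stream(struct ibv_qp *qp, struct ibv_cq *cq, struct ibv_mr *mr,
           char *pool, uint64_t remote_addr, uint32_t rkey, int total)
{
    struct ibv_wc wc[DEPTH];
    int posted = 0, done = 0;

    for (; posted < DEPTH && posted < total; posted++)
        if (post_write(qp, mr, pool, posted, remote_addr, rkey))
            return -1;

    while (done < total) {
        int n = ibv_poll_cq(cq, DEPTH, wc);
        if (n < 0)
            return -1;
        for (int i = 0; i < n; i++, done++) {
            if (wc[i].status != IBV_WC_SUCCESS)
                return -1;
            if (posted < total) {
                if (post_write(qp, mr, pool, (int)wc[i].wr_id,
                               remote_addr, rkey))
                    return -1;
                posted++;
            }
        }
    }
    return 0;
}

The point is simply that a completion only triggers the next post, so the
wire never has to drain between buffers.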

Theoretically, the application is not able to fill the pipe with only one
buffer, even if memory registration/deregistration is avoided entirely.
Regarding your code: the data transfer is already finished at the NIC level
before you get a completion event at the application level, and your
application still has to handle that asynchronous event before it can post
the next write. As a result, the network is left idle for some cycles when
there is only one buffer, so your calculation does not match reality very
accurately.
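
To make that idle window concrete, here is a sketch of the event-driven
completion path, assuming you block on a completion channel (if you busy
poll the CQ instead, the same argument applies, just with a smaller gap);
comp_chan and cq come from your existing setup:

#include <infiniband/verbs.h>

/* Everything below happens after the wire transfer has already finished;
 * with a single buffer, nothing is on the wire while this runs. */
int wait_one_completion(struct ibv_comp_channel *comp_chan, struct ibv_cq *cq)
{
    struct ibv_cq *ev_cq;
    void *ev_ctx;
    struct ibv_wc wc;
    int n;

    /* arm the CQ so the next completion generates an event */
    if (ibv_req_notify_cq(cq, 0))
        return -1;

    /* catch a completion that may already have landed before arming */
    n = ibv_poll_cq(cq, 1, &wc);
    if (n == 0) {
        /* block until the HCA delivers the completion event */
        if (ibv_get_cq_event(comp_chan, &ev_cq, &ev_ctx))
            return -1;
        ibv_ack_cq_events(ev_cq, 1);
        do {
            n = ibv_poll_cq(cq, 1, &wc);
        } while (n == 0);
    }
    if (n < 0 || wc.status != IBV_WC_SUCCESS)
        return -1;

    return 0;   /* only now can the caller repost its single buffer */
}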

>
> To reach 1500 MB/sec that 12.43 ms has to drop to around 5 ms.
>
> The BW could improve if that send_data were composed of:
>    - a preparation phase (HCA setup?)   <== this should be ~7 ms
>    - the real data send                 <== this should be the target 5 ms
> so that the preparation phase could run in parallel with a transmission
> already going on.
>
> I did try to send 8 bytes to see if there is a fixed amount of time behind
> each RDMA_WRITE, but it doesn't seem so: transferring 8 bytes I don't see
> those 7 ms of fixed time in the RDMA_WRITE.
>
> I suspect that the QP I'm getting (I'm using rdma_create_qp) is not tuned;
> I saw that ib_write_bw doesn't use rdma_create_qp but the pair
> ibv_create_qp/ibv_modify_qp. Shall I switch to creating the QP in the same
> way? I did read around that it is better to use rdma_create_qp. In the
> meantime I have also asked on linux-rdma.
>
> I'm sure that removing registration/unregistration and performing more
> RDMA_WRITEs in parallel can improve the process, but there is still
> something wrong: a loss of 900 MB/sec cannot be due to this alone.
>
> I have posted some experiments made with rsocket, and even there under some
> conditions the transfer rate is very poor: 7 Gb/sec (896 MB/sec) and
> sometimes 10 Gb/sec (1280 MB/sec). I posted my experiments on the
> linux-rdma mailing list.
>
> Gaetano
>
> On Tue, Aug 28, 2012 at 5:03 PM, David McMillen
> <davem at systemfabricworks.com> wrote:
>>
>>
>> On Tue, Aug 28, 2012 at 9:48 AM, Gaetano Mendola <mendola at gmail.com> wrote:
>>>
>>> ...
>>>
>>> Indeed my slot is a PCI-E 2.0 x4, that's why the 1500 MB/sec is what
>>> I'm expecting.
>>
>>
>> OK - with a PCIe 2.0 x4 slot you are actually doing well to reach 1500
>> MB/sec.  No other improvement can get you past that limit.
>>
>>>
>>>
>>> ...
>>>
>>> This what lspci say about that slot:
>>>
>>>
>>
>> You need to use the -vv (two v characters in a row) option to lspci to see
>> width information.
>>
>>>
>>> ...
>>>
>>> I'll try IBV_WR_RDMA_WRITE_WITH_IMM to avoid the separate "DONE" send
>>> message; as I understood it, with IBV_WR_RDMA_WRITE_WITH_IMM the target
>>> is notified, right?
>>
>>
>> Yes, when the write is finished the target will get the immediate data
>> value.  If things are otherwise going well, and with large transfers like
>> you use, it isn't a significant performance difference to just follow the
>> rdma write with a send message.  You can queue (post) both at the same time,
>> since if the write has an error the send will not happen.
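
For reference, a rough sketch of how the two work requests could be
chained in a single ibv_post_send call as Dave describes (the SGE setup is
assumed to be done elsewhere, and leaving the write unsignaled like this
assumes the QP was created with sq_sig_all = 0):

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Post an RDMA write immediately followed by a small "done" send.
 * Only the send is signaled: QP ordering guarantees the write has
 * executed before the send, so its completion covers both. */
int post_write_then_send(struct ibv_qp *qp,
                         struct ibv_sge *sge_data, struct ibv_sge *sge_msg,
                         uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_send_wr wr_write, wr_send, *bad;

    memset(&wr_write, 0, sizeof(wr_write));
    memset(&wr_send,  0, sizeof(wr_send));

    wr_write.wr_id               = 1;
    wr_write.opcode              = IBV_WR_RDMA_WRITE;
    wr_write.sg_list             = sge_data;
    wr_write.num_sge             = 1;
    wr_write.wr.rdma.remote_addr = remote_addr;
    wr_write.wr.rdma.rkey        = rkey;
    wr_write.next                = &wr_send;   /* chain the send behind it */

    wr_send.wr_id      = 2;
    wr_send.opcode     = IBV_WR_SEND;
    wr_send.sg_list    = sge_msg;
    wr_send.num_sge    = 1;
    wr_send.send_flags = IBV_SEND_SIGNALED;

    return ibv_post_send(qp, &wr_write, &bad);
}

(The receiver must still have a receive buffer posted for the send,
exactly as with the WITH_IMM variant.)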
>>
>>>
>>>
>>> ...
>>>
>>> Yes, that's an idea. I have to be sure (as is already the case) that the
>>> buffers are not continuously allocated/deallocated.
>>> I'll try to create a hash table mapping buffer -> memory region to avoid
>>> those registrations/deregistrations and I'll post what I get.
>>>
>>
>> It could make your life easier if you created a private allocation pool for
>> these buffers.  You could create a memory region to cover the entire pool,
>> and then anything allocated from it would be covered by that MR.
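
A minimal sketch of such a pool (the chunk count/size and the free-list
bookkeeping are just illustrative choices; every buffer handed out is
already covered by the one MR, so the per-transfer register/unregister
cost disappears):

#include <infiniband/verbs.h>
#include <stdlib.h>

#define POOL_CHUNKS 16
#define CHUNK_SIZE  (8u << 20)        /* 8 MB, matching the transfer size */

struct buf_pool {
    char          *base;              /* one big allocation               */
    struct ibv_mr *mr;                /* single MR covering all of it     */
    int            free_slot[POOL_CHUNKS];
    int            nfree;
};

struct buf_pool *pool_create(struct ibv_pd *pd)
{
    struct buf_pool *p = calloc(1, sizeof(*p));
    if (!p)
        return NULL;

    if (posix_memalign((void **)&p->base, 4096,
                       (size_t)POOL_CHUNKS * CHUNK_SIZE)) {
        free(p);
        return NULL;
    }

    p->mr = ibv_reg_mr(pd, p->base, (size_t)POOL_CHUNKS * CHUNK_SIZE,
                       IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
    if (!p->mr) {
        free(p->base);
        free(p);
        return NULL;
    }

    for (int i = 0; i < POOL_CHUNKS; i++)
        p->free_slot[i] = i;
    p->nfree = POOL_CHUNKS;
    return p;
}

/* Buffers returned here can be used with p->mr->lkey directly. */
char *pool_get(struct buf_pool *p)
{
    return p->nfree ? p->base + (size_t)p->free_slot[--p->nfree] * CHUNK_SIZE
                    : NULL;
}

void pool_put(struct buf_pool *p, char *buf)
{
    p->free_slot[p->nfree++] = (int)((buf - p->base) / CHUNK_SIZE);
}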
>>
>> Dave
>>
>
>
>
> --
> cpp-today.blogspot.com
> _______________________________________________
> Users mailing list
> Users at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/users

Thank you

Yufei


