[Users] infiniband rdma poor transfer bw

Gaetano Mendola mendola at gmail.com
Tue Aug 28 09:15:25 PDT 2012


Now my active side looks like this (after the initial connection), and I
benchmarked each phase:

write(buffer) {
  register buffer                 <== 0.19 ms
  wait memory region              <== 0.58 ms
  post message receive
  send using WITH_IMM (SIGNALED)
  wait until the send is done     <== 12.43 ms (transferring 8 MB)
  unregister buffer               <== 0.19 ms
}
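
For reference, those phases map onto verbs calls more or less like this
(only a sketch with names of my own: the QP/PD/CQ setup and the
remote_addr/rkey exchange of the "wait memory region" phase are assumed
to have happened already):

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>
#include <infiniband/verbs.h>

/* Sketch of one write(buffer) cycle.  pd, qp and cq are assumed to be
 * created and connected already, and remote_addr/rkey to have been
 * received in the "wait memory region" phase. */
static int write_once(struct ibv_pd *pd, struct ibv_qp *qp, struct ibv_cq *cq,
                      void *buf, size_t len,
                      uint64_t remote_addr, uint32_t rkey)
{
    /* register buffer                              (~0.19 ms here) */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr)
        return -1;

    /* send using WITH_IMM (SIGNALED) */
    struct ibv_sge sge;
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    sge.addr   = (uintptr_t)buf;
    sge.length = (uint32_t)len;
    sge.lkey   = mr->lkey;
    wr.wr_id      = 1;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_RDMA_WRITE_WITH_IMM;
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.imm_data   = htonl(1);           /* tag delivered to the receiver */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    int ret = ibv_post_send(qp, &wr, &bad_wr);

    /* wait until the send is done                  (~12.43 ms for 8 MB) */
    if (ret == 0) {
        struct ibv_wc wc;
        int n;
        do {
            n = ibv_poll_cq(cq, 1, &wc);
        } while (n == 0);
        if (n < 0 || wc.status != IBV_WC_SUCCESS)
            ret = -1;
    }

    /* unregister buffer                            (~0.19 ms here) */
    ibv_dereg_mr(mr);
    return ret;
}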

So it seems that even zeroing out the registration/unregistration times and
the dead time spent waiting for the memory region, I'm getting:

8 MB / 12.43 ms = 643 MB/sec  (and that is still far from what I need to get).

Now I have to understand whether queuing a second ibv_post_send while
another one is still in flight could improve this or not.
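
Something like the following is what I have in mind for keeping two writes
in flight (just a sketch, with the buffer split in two halves and a single
pre-registered MR covering it; whether the HCA actually overlaps them is
exactly what I want to measure):

#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Sketch: post both halves in one ibv_post_send call (chained via
 * wr.next), then reap two completions, instead of post/wait/post/wait.
 * mr must cover the whole buffer and the QP needs max_send_wr >= 2. */
static int write_two_in_flight(struct ibv_qp *qp, struct ibv_cq *cq,
                               struct ibv_mr *mr, void *buf, size_t len,
                               uint64_t remote_addr, uint32_t rkey)
{
    size_t half = len / 2;
    struct ibv_sge sge[2];
    struct ibv_send_wr wr[2], *bad_wr = NULL;
    memset(wr, 0, sizeof(wr));

    for (int i = 0; i < 2; i++) {
        sge[i].addr   = (uintptr_t)buf + i * half;
        sge[i].length = (uint32_t)((i == 0) ? half : len - half);
        sge[i].lkey   = mr->lkey;

        wr[i].wr_id      = i;
        wr[i].sg_list    = &sge[i];
        wr[i].num_sge    = 1;
        /* only the last write carries the immediate, so the receiver
         * gets a single notification for the whole buffer */
        wr[i].opcode     = (i == 1) ? IBV_WR_RDMA_WRITE_WITH_IMM
                                    : IBV_WR_RDMA_WRITE;
        wr[i].send_flags = IBV_SEND_SIGNALED;
        wr[i].wr.rdma.remote_addr = remote_addr + i * half;
        wr[i].wr.rdma.rkey        = rkey;
        wr[i].next = (i == 0) ? &wr[1] : NULL;
    }

    if (ibv_post_send(qp, &wr[0], &bad_wr))
        return -1;

    /* wait for both completions */
    int done = 0;
    while (done < 2) {
        struct ibv_wc wc;
        int n = ibv_poll_cq(cq, 1, &wc);
        if (n < 0 || (n > 0 && wc.status != IBV_WC_SUCCESS))
            return -1;
        done += n;
    }
    return 0;
}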

To reach 1500 MB/sec, that 12.43 ms has to drop to around 5 ms
(8 MB / 1500 MB/sec is about 5.3 ms).

The BW could improve if that send is actually composed of:
   - a preparation phase (HCA setup?)   <== this would be the ~7 ms
   - the real data transfer             <== this would be the target ~5 ms
because then the preparation phase could run in parallel with a transfer
already in progress.

I did try sending 8 bytes to see whether there is a fixed cost behind each
RDMA_WRITE, but it doesn't seem so: transferring 8 bytes I don't see those
7 ms of fixed time in the RDMA_WRITE.

I suspect I'm getting a QP (I'm using rdma_create_qp) that is not well
tuned. I saw that ib_write_bw doesn't use rdma_create_qp but the pair
ibv_create_qp/ibv_modify_qp. Shall I switch to creating the QP that way?
I did read around that it is better to use rdma_create_qp; in the meantime
I have asked about this on linux-rdma.
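
If I stay with rdma_create_qp, as far as I can tell I can still pass the
same ibv_qp_init_attr caps that ib_write_bw hands to ibv_create_qp, along
these lines (sketch only, cap values picked arbitrarily):

#include <string.h>
#include <rdma/rdma_cma.h>

/* Sketch: rdma_create_qp takes the same struct ibv_qp_init_attr that
 * ibv_create_qp does, so the queue caps can be tuned here too; rdma_cm
 * then performs the ibv_modify_qp state transitions during connect. */
static int create_tuned_qp(struct rdma_cm_id *id, struct ibv_pd *pd,
                           struct ibv_cq *cq)
{
    struct ibv_qp_init_attr attr;
    memset(&attr, 0, sizeof(attr));

    attr.send_cq = cq;
    attr.recv_cq = cq;
    attr.qp_type = IBV_QPT_RC;
    attr.cap.max_send_wr  = 64;   /* room for several writes in flight */
    attr.cap.max_recv_wr  = 64;
    attr.cap.max_send_sge = 1;
    attr.cap.max_recv_sge = 1;

    return rdma_create_qp(id, pd, &attr);   /* fills id->qp on success */
}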

I'm sure that removing the registration/unregistration and performing more
RDMA_WRITEs in parallel can improve the process, but there is still
something wrong elsewhere: a loss of about 900 MB/sec cannot be due to this
alone.
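
For the registration part, the plan is to register each buffer once and
reuse the MR on every write, roughly like this (a sketch of a tiny cache
with linear lookup, just to show the idea; a hash table from buffer to MR
would do the same):

#include <stddef.h>
#include <stdint.h>
#include <infiniband/verbs.h>

/* Sketch: register each buffer once, then reuse the MR on every write
 * instead of paying ibv_reg_mr/ibv_dereg_mr per transfer. */
#define MAX_CACHED 32

struct mr_cache {
    void          *buf[MAX_CACHED];
    struct ibv_mr *mr[MAX_CACHED];
    int            n;
};

static struct ibv_mr *cache_get_mr(struct mr_cache *c, struct ibv_pd *pd,
                                   void *buf, size_t len)
{
    for (int i = 0; i < c->n; i++)
        if (c->buf[i] == buf)
            return c->mr[i];            /* already registered */

    if (c->n == MAX_CACHED)
        return NULL;                    /* cache full: caller falls back */

    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (mr) {
        c->buf[c->n] = buf;
        c->mr[c->n]  = mr;
        c->n++;
    }
    return mr;
}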

I have posted some experiments made with rsocket, and even there, under
some conditions, the transfer rate is very poor: 7 Gb/sec (896 MB/sec),
sometimes 10 Gb/sec (1280 MB/sec). I posted those experiments on the
linux-rdma mailing list.

Gaetano

On Tue, Aug 28, 2012 at 5:03 PM, David McMillen
<davem at systemfabricworks.com> wrote:
>
>
> On Tue, Aug 28, 2012 at 9:48 AM, Gaetano Mendola <mendola at gmail.com> wrote:
>>
>> ...
>>
>> Indeed my slot is a PCI-E 2.0 x4, that's why 1500 MB/sec is what I'm
>> expecting.
>
>
> OK - with a PCIe 2.0 x4 slot you are actually doing well to reach 1500
> MB/sec.  Any ideas for improvement beyond that cannot work, since the slot
> itself is the limit.
>
>>
>>
>> ...
>>
>> This what lspci say about that slot:
>>
>>
>
> You need to use the -vv (two v characters in a row) option to lspci to see
> width information.
>
>>
>> ...
>>
>> I'll try IBV_WR_RDMA_WRITE_WITH_IMM to avoid the separate "DONE" send
>> message; as I understood it, with IBV_WR_RDMA_WRITE_WITH_IMM the receiver
>> is notified, right?
>
>
> Yes, when the write is finished the target will get the immediate data
> value.  If things are otherwise going well, and with large transfers like
> you use, it isn't a significant performance difference to just follow the
> rdma write with a send message.  You can queue (post) both at the same time,
> since if the write has an error the send will not happen.
>
>>
>>
>> ...
>>
>> Yes, that's an idea. I have to be sure (as is already the case) that the
>> buffers are not continuously allocated/deallocated.
>> I'll try to create a hash table buffer -> memory region to avoid those
>> registrations/deregistrations, and I'll post what I get.
>>
>>
>
> It could make your life easier if you created a private allocation pool for
> these buffers.  You could create a memory region to cover the entire pool,
> and then anything allocated from it would be covered by that MR.
>
> Dave
>



-- 
cpp-today.blogspot.com


