[Users] infiniband rdma poor transfer bw

Gaetano Mendola mendola at gmail.com
Thu Aug 23 15:51:05 PDT 2012


Hi all,
I'm sorry in advance if this is not the right mailing list for my question.

In my application I use an InfiniBand infrastructure to send a stream
of data from one server to another. To ease development I used IP over
InfiniBand, because I'm more familiar with socket programming. Until now
the performance (max bandwidth) was good enough for me (I knew I wasn't
getting the maximum achievable bandwidth), but now I need to get more
bandwidth out of that InfiniBand connection.

ib_write_bw claims that my maximum achievable bandwidth is around
1500 MB/s (I'm not getting 3000 MB/s because my card is installed in a
PCIe 2.0 x8 slot).

So far so good. I coded my communication channel using ibverbs and
RDMA, but I'm getting far less than the bandwidth I could; I'm even
getting a bit less bandwidth than with sockets, but at least my
application doesn't use any CPU power:

ib_write_bw: 1500 MB/s

sockets: 700 MB/s <= One core of my system is at 100% during this test

ibverbs+rdma: 600 MB/s <= No CPU is used at all during this test

It seems that the bottleneck is here:

// Scatter/gather element describing the local buffer to transfer.
ibv_sge sge;
sge.addr = (uintptr_t)memory_to_transfer;
sge.length = memory_to_transfer_size;
sge.lkey = memory_to_transfer_mr->lkey;

// Work request: a one-sided RDMA write into the peer's registered region.
ibv_send_wr wr;
memset(&wr, 0, sizeof(wr));
wr.wr_id = 0;
wr.opcode = IBV_WR_RDMA_WRITE;
wr.sg_list = &sge;
wr.num_sge = 1;
wr.send_flags = IBV_SEND_SIGNALED;  // ask for a completion when done
wr.wr.rdma.remote_addr = (uintptr_t)thePeerMemoryRegion.addr;
wr.wr.rdma.rkey = thePeerMemoryRegion.rkey;

ibv_send_wr *bad_wr = NULL;
if (ibv_post_send(theCommunicationIdentifier->qp, &wr, &bad_wr) != 0) {
  notifyError("Unable to ibv post send");
}
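
For completeness, memory_to_transfer_mr above comes from a plain
ibv_reg_mr done just beforehand, roughly like this (a sketch; "pd" is
the protection domain the QP was created on, and the access flags are
my guess at what matters here):

ibv_mr *memory_to_transfer_mr =
    ibv_reg_mr(pd,                      // protection domain of the QP
               memory_to_transfer,
               memory_to_transfer_size,
               IBV_ACCESS_LOCAL_WRITE); // local read is implied for the
                                        // source of an RDMA write
if (memory_to_transfer_mr == NULL) {
  notifyError("Unable to register memory region");
}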

At this point the code waits for the completion, which is:

// Wait for completion
ibv_cq *cq;
void *cq_context;

// Block until the completion event channel signals activity on the CQ.
if (ibv_get_cq_event(theCompletionEventChannel, &cq, &cq_context) != 0) {
  notifyError("Unable to get a ibv cq event");
}

ibv_ack_cq_events(cq, 1);

// Re-arm the CQ so the next completion generates another event.
if (ibv_req_notify_cq(cq, 0) != 0) {
  notifyError("Unable to get a req notify");
}

// Reap the work completion; with num_entries == 1, ibv_poll_cq can
// return at most one ibv_wc, so warn if we didn't get exactly one.
ibv_wc wc;
int myRet = ibv_poll_cq(cq, 1, &wc);
if (myRet != 1) {
  LOG(WARNING) << "Expected a single ibv_wc, got " << myRet;
}
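
In case it matters, theCompletionEventChannel and the CQ are created
once at connection setup, roughly as follows (a sketch; "ctx" stands
for my ibv_context and the CQ depth is arbitrary):

ibv_comp_channel *theCompletionEventChannel = ibv_create_comp_channel(ctx);
if (theCompletionEventChannel == NULL) {
  notifyError("Unable to create completion channel");
}

ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, theCompletionEventChannel, 0);
if (cq == NULL) {
  notifyError("Unable to create completion queue");
}

// Arm the CQ so the first completion generates an event on the channel.
if (ibv_req_notify_cq(cq, 0) != 0) {
  notifyError("Unable to request CQ notification");
}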


The time from my ibv_post_send to when ibv_get_cq_event returns an
event is 13.3 ms when transferring chunks of 8 MB, which works out to
around 600 MB/s.

To be more specific, here is what I do globally (in pseudocode):

Active Side:

post a message receive
rdma connect
wait for rdma connection event
<<at this point the transfer tx flow starts>>
start:
register memory containing the bytes to transfer
wait for the remote memory region addr/key (I wait for an ibv_wc)
send data with ibv_post_send
post a message receive
wait for the ibv_post_send completion (I wait for an ibv_wc) (this lasts 13.3 ms)
send the "DONE" message (roughly as sketched after this list)
unregister memory
goto start
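
The "DONE" message above is a zero-byte IBV_WR_SEND on the same QP,
matched by the receive the passive side posts; roughly this (a sketch):

ibv_send_wr done_wr;
memset(&done_wr, 0, sizeof(done_wr));   // num_sge stays 0: no payload
done_wr.opcode = IBV_WR_SEND;           // two-sided send, consumes the
                                        // receive posted by the peer
done_wr.send_flags = IBV_SEND_SIGNALED;

ibv_send_wr *bad_wr = NULL;
if (ibv_post_send(theCommunicationIdentifier->qp, &done_wr, &bad_wr) != 0) {
  notifyError("Unable to post DONE send");
}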

Passive Side:

post a message receive
rdma accept
wait for rdma connection event
<<at this point the transfer rx flow starts>>
start:
register memory that has to receive the bytes
send the addr/key of the registered memory (the message sketched after this list)
wait for the "DONE" message
unregister memory
post a message receive
goto start
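
And the addr/key message is essentially this pair of values, taken
straight from the ibv_mr returned by ibv_reg_mr (the field names are
mine):

struct MemoryRegionInfo {
  uint64_t addr;  // becomes thePeerMemoryRegion.addr on the active side
  uint32_t rkey;  // becomes thePeerMemoryRegion.rkey on the active side
};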

Does anyone know what I'm doing wrong? Or what I can improve? I'm not
affected by "Not Invented Here" syndrome, so I'm even open to throwing
away what I have done so far and adopting something else.

I only need a point to point contiguous transfer.


Regards
Gaetano Mendola


--
cpp-today.blogspot.com


