[Users] infiniband rdma poor transfer bw

Ira Weiny weiny2 at llnl.gov
Mon Aug 27 09:47:26 PDT 2012


Gaetano,

Yes, this is the correct list.  Did you also post a similar message to linux-rdma?  I seem to recall a similar thread there; if so, I think Sean gave some good advice and you should follow it.  If that was not you, see my response below (from limited experience).

On Fri, 24 Aug 2012 00:51:05 +0200
Gaetano Mendola <mendola at gmail.com> wrote:

> Hi all,
> I'm sorry in advance if this is not the right mailing list for my question.
> 
> In my application I use an InfiniBand infrastructure to send a stream
> of data from one server to another. To ease development I used IP over
> InfiniBand, because I'm more familiar with socket programming. Until
> now the performance (max bandwidth) was good enough for me (I knew I
> wasn't getting the maximum bandwidth achievable); now I need to get
> more bandwidth out of that InfiniBand connection.

Getting good performance out of RDMA can be tricky.  Most of the difficulty I have had (and have read/heard about) is with memory registration.

> 
> ib_write_bw claims that my max achievable bandwidth is around 1500
> MB/s (I'm not getting 3000 MB/s because my card is installed in a
> PCIe 2.0 x8 slot).
> 
> So far so good. I coded my communication channel using ibverbs and
> RDMA, but I'm getting far less than the bandwidth I could; I'm even
> getting a bit less bandwidth than with sockets, though at least my
> application doesn't use any CPU power:
> 
> ib_write_bw: 1500 MB/s
> 
> sockets: 700 MB/s <= One core of my system is at 100% during this test
> 
> ibvers+rdma: 600 MB/s <= No CPU is used at all during this test
> 
> It seems that the bottleneck is here:
> 
> ibv_sge sge;
> sge.addr = (uintptr_t)memory_to_transfer;
> sge.length = memory_to_transfer_size;
> sge.lkey = memory_to_transfer_mr->lkey;
> 
> ibv_send_wr wr;
> memset(&wr, 0, sizeof(wr));
> wr.wr_id = 0;
> wr.opcode = IBV_WR_RDMA_WRITE;

Generally, I have found RDMA READ easier to deal with than RDMA WRITE.  As you have found, when you do an RDMA WRITE there is an extra RDMA_SEND step to tell the remote side the write has completed.  If the remote side does an RDMA_READ instead, it knows the data has arrived when it sees the work completion (WC) come back on its own end.  So the only "extra" send/recv required for verbs is the initial transfer of the ETH (addr, size, rkey) information.
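For illustration, the READ version of your work request would differ in only a few fields.  This is just a sketch: `local_buffer`, `transfer_size`, `local_mr`, `theRemoteRegion`, and `qp` are placeholders for state your application already holds, not names from your code.

```c
/* Sketch: pull the data with an RDMA READ instead of pushing it with a
 * WRITE.  The reading side gets its own work completion once the data
 * has landed, so no extra "DONE" send is needed afterwards. */
struct ibv_sge sge;
sge.addr   = (uintptr_t)local_buffer;
sge.length = transfer_size;
sge.lkey   = local_mr->lkey;

struct ibv_send_wr wr, *bad_wr = NULL;
memset(&wr, 0, sizeof(wr));
wr.wr_id               = 0;
wr.opcode              = IBV_WR_RDMA_READ;     /* the key difference */
wr.sg_list             = &sge;
wr.num_sge             = 1;
wr.send_flags          = IBV_SEND_SIGNALED;
wr.wr.rdma.remote_addr = (uintptr_t)theRemoteRegion.addr;
wr.wr.rdma.rkey        = theRemoteRegion.rkey;

if (ibv_post_send(qp, &wr, &bad_wr) != 0)
    notifyError("Unable to ibv post send");
```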

> wr.sg_list = &sge;
> wr.num_sge = 1;
> wr.send_flags = IBV_SEND_SIGNALED;
> wr.wr.rdma.remote_addr = (uintptr_t)thePeerMemoryRegion.addr;
> wr.wr.rdma.rkey = thePeerMemoryRegion.rkey;
> 
> ibv_send_wr *bad_wr = NULL;
> if (ibv_post_send(theCommunicationIdentifier->qp, &wr, &bad_wr) != 0) {
>   notifyError("Unable to ibv post send");
> }
> 
> at this point the following code waits for completion:
> 
> // Wait for completion
> ibv_cq *cq;
> void* cq_context;
> if (ibv_get_cq_event(theCompletionEventChannel, &cq, &cq_context) != 0) {
>   notifyError("Unable to get a ibv cq event");
> }
> 
> ibv_ack_cq_events(cq, 1);
> 
> if (ibv_req_notify_cq(cq, 0) != 0) {
>   notifyError("Unable to get a req notify");
> }
> 
> ibv_wc wc;
> int myRet = ibv_poll_cq(cq, 1, &wc);
> if (myRet > 1) {
>   LOG(WARNING) << "Got more than a single ibv_wc, expecting one";
> }
> 
> 
> The time from my ibv_post_send until ibv_get_cq_event returns an
> event is 13.3 ms when transferring chunks of 8 MB, which works out to
> around 600 MB/s.

It looks like you are waiting for a completion before starting the next transfer?  Is that the case?  If so, that is probably not the most efficient: the link sits idle between transfers.

> 
> To specify more (in pseudocode what I do globally):
> 
> Active Side:
> 
> post a message receive
> rdma connection
> wait for rdma connection event
> <<at this point transfer tx flow starts>>
> start:
> register memory containing bytes to transfer

I believe Sean mentioned you should avoid doing memory registration in any areas of the code where performance is critical.  I agree with him.

> wait remote memory region addr/key ( I wait for a ibv_wc)
> send data with ibv_post_send
> post a message receive
> wait for ibv_post_send event ( I wait for a ibv_wc) (this lasts 13.3 ms)
> send message "DONE"
> unregister memory

This applies to unregistration of memory as well.
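As a sketch of what pre-registration might look like (assuming the application keeps a protection domain `pd` around and stages outgoing chunks through a long-lived buffer; `BUF_SIZE` and `notifyError` are placeholders):

```c
/* Sketch: register one reusable, page-aligned buffer at connection
 * setup and keep its ibv_mr for the life of the connection, instead of
 * calling ibv_reg_mr()/ibv_dereg_mr() around every transfer.  pd,
 * BUF_SIZE and notifyError are assumed from the surrounding code. */
void *buf = NULL;
if (posix_memalign(&buf, sysconf(_SC_PAGESIZE), BUF_SIZE) != 0)
    notifyError("Unable to allocate transfer buffer");

struct ibv_mr *mr = ibv_reg_mr(pd, buf, BUF_SIZE,
                               IBV_ACCESS_LOCAL_WRITE |
                               IBV_ACCESS_REMOTE_READ |
                               IBV_ACCESS_REMOTE_WRITE);
if (mr == NULL)
    notifyError("Unable to register memory region");

/* ...copy each outgoing chunk into buf and post from there; call
 * ibv_dereg_mr(mr) only at teardown... */
```

A memcpy into the staging buffer is usually far cheaper than a registration/deregistration pair per transfer.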

> goto start
> 
> Passive Side:
> 
> post a message receive
> rdma accept
> wait for rdma connection event
> <<at this point transfer rx flow starts>>
> start:
> register memory that has to receive the bytes
> send addr/key of memory registered
> wait "DONE" message
> unregister memory
> post a message receive
> goto start
> 
> Does anyone know what I'm doing wrong? Or what I can improve? I'm not
> affected by "Not Invented Here" syndrome, so I'm even open to
> throwing away what I have done until now and adopting something else.
> 
> I only need a point to point contiguous transfer.

How big is this transfer?

It may be that doing send/recv or RDMA WRITE with immediate data would work better for you.  Also, have you seen Sean's rsocket project?
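For what it's worth, "write with immediate" changes only a couple of fields relative to a plain RDMA WRITE.  A sketch, with `sequence_number` as a made-up 32-bit tag:

```c
/* Sketch: RDMA WRITE with immediate data.  The immediate value is
 * delivered to a posted receive on the passive side, which gets a
 * completion with opcode IBV_WC_RECV_RDMA_WITH_IMM -- so the separate
 * "DONE" message could be dropped entirely. */
wr.opcode   = IBV_WR_RDMA_WRITE_WITH_IMM;
wr.imm_data = htonl(sequence_number);  /* network byte order by convention */
```

Note the immediate consumes a receive work request on the passive side, so one must be posted before each transfer, much like your current "DONE" receive.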

Hope this helps,
Ira

> 
> 
> Regards
> Gaetano Mendola
> 
> 
> --
> cpp-today.blogspot.com
> _______________________________________________
> Users mailing list
> Users at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/users


-- 
Ira Weiny
Member of Technical Staff
Lawrence Livermore National Lab
925-423-8008
weiny2 at llnl.gov


