[Users] infiniband rdma poor transfer bw
Ira Weiny
weiny2 at llnl.gov
Mon Aug 27 09:47:26 PDT 2012
Gaetano,
Yes, this is the correct list. Did you also post a similar message to linux-rdma? I seem to recall a similar thread there. If so, I think Sean gave some good advice and you should follow that. If that was not you, see my responses (from limited experience) below.
On Fri, 24 Aug 2012 00:51:05 +0200
Gaetano Mendola <mendola at gmail.com> wrote:
> Hi all,
> I'm sorry in advance if this is not the right mailing list for my question.
>
> In my application I use an InfiniBand infrastructure to send a stream of data
> from one server to another. To ease development I used IP over InfiniBand,
> because I'm more familiar with socket programming. Until now the performance
> (max bandwidth) was good enough for me (I knew I wasn't getting the maximum
> achievable bandwidth), but now I need to get more bandwidth out of that
> InfiniBand connection.
Getting good performance out of RDMA can be tricky. The biggest difficulty I have had (and have read/heard about) is dealing with memory registration.
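For what it's worth, the pattern that has worked for me is to register the transfer buffers once at startup and reuse the same MR (and its lkey/rkey) for every chunk, instead of registering and unregistering around each transfer. A rough sketch only (the names and the 8 MB size are placeholders, not taken from your code):

#include <stdlib.h>
#include <unistd.h>
#include <infiniband/verbs.h>

#define BUF_SIZE (8UL * 1024 * 1024)        /* 8 MB chunks, as in your test */

/* Register the transfer buffer once, up front.  The returned MR (and its
 * lkey/rkey) can be reused for every transfer; only the data changes. */
static struct ibv_mr *setup_buffer(struct ibv_pd *pd, void **buf_out)
{
    void *buf = NULL;

    /* Page-aligned memory tends to be cheaper to pin and register. */
    if (posix_memalign(&buf, sysconf(_SC_PAGESIZE), BUF_SIZE) != 0)
        return NULL;

    struct ibv_mr *mr = ibv_reg_mr(pd, buf, BUF_SIZE,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        free(buf);
        return NULL;
    }

    *buf_out = buf;
    return mr;      /* keep this around for the lifetime of the connection */
}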
>
> ib_write_bw claims that my maximum achievable bandwidth is around 1500 MB/s
> (I'm not getting 3000 MB/s because my card is installed in a PCIe 2.0 x8 slot).
>
> So far so good. I coded my communication channel using ibverbs and RDMA, but
> I'm getting far less than the bandwidth I could get; I'm even getting a bit
> less bandwidth than with sockets, although at least my application doesn't
> use any CPU:
>
> ib_write_bw:   1500 MB/s
>
> sockets:        700 MB/s <= one core of my system is at 100% during this test
>
> ibverbs+rdma:   600 MB/s <= no CPU is used at all during this test
>
> It seems that the bottleneck is here:
>
> ibv_sge sge;
> sge.addr = (uintptr_t)memory_to_transfer;
> sge.length = memory_to_transfer_size;
> sge.lkey = memory_to_transfer_mr->lkey;
>
> ibv_send_wr wr;
> memset(&wr, 0, sizeof(wr));
> wr.wr_id = 0;
> wr.opcode = IBV_WR_RDMA_WRITE;
Generally, I have thought that RDMA READ is easier to deal with than RDMA WRITE. As you have found, when you do an RDMA WRITE there is an extra send step to tell the remote side the write has completed. If the remote side does an RDMA READ instead, it knows the data is available as soon as it sees the WC come back on its end. So the only "extra" send/recv required is the initial transfer of the buffer (addr, size, rkey) information. A minimal READ sketch follows your snippet below.
> wr.sg_list = &sge;
> wr.num_sge = 1;
> wr.send_flags = IBV_SEND_SIGNALED;
> wr.wr.rdma.remote_addr = (uintptr_t)thePeerMemoryRegion.addr;
> wr.wr.rdma.rkey = thePeerMemoryRegion.rkey;
>
> ibv_send_wr *bad_wr = NULL;
> if (ibv_post_send(theCommunicationIdentifier->qp, &wr, &bad_wr) != 0) {
> notifyError("Unable to ibv post send");
> }
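To illustrate the READ variant I mentioned above: once the producer has sent its (addr, size, rkey), the consumer pulls the data itself and needs no extra "DONE" message, because its own completion tells it the data has arrived. A hedged sketch only; remote_addr, remote_rkey and len are assumed to come from your existing exchange, they are not names from your code:

#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Consumer side: pull one chunk with an RDMA READ.  When the WC for this WR
 * comes back, the data is already sitting in local_buf. */
static int post_rdma_read(struct ibv_qp *qp, struct ibv_mr *local_mr,
                          void *local_buf, uint64_t remote_addr,
                          uint32_t remote_rkey, uint32_t len)
{
    struct ibv_sge sge;
    sge.addr   = (uintptr_t)local_buf;
    sge.length = len;
    sge.lkey   = local_mr->lkey;

    struct ibv_send_wr wr;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = 1;
    wr.opcode              = IBV_WR_RDMA_READ;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = remote_rkey;

    struct ibv_send_wr *bad_wr = NULL;
    return ibv_post_send(qp, &wr, &bad_wr);
}

Note that the producer's MR has to be registered with IBV_ACCESS_REMOTE_READ for this to work.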
>
> At this point the code that waits for the completion is:
>
> // Wait for completion
> ibv_cq *cq;
> void* cq_context;
> if (ibv_get_cq_event(theCompletionEventChannel, &cq, &cq_context) != 0) {
> notifyError("Unable to get a ibv cq event");
> }
>
> ibv_ack_cq_events(cq, 1);
>
> if (ibv_req_notify_cq(cq, 0) != 0) {
> notifyError("Unable to get a req notify");
> }
>
> ibv_wc wc;
> int myRet = ibv_poll_cq(cq, 1, &wc);
> if (myRet > 1) {
> LOG(WARNING) << "Got more than a single ibv_wc, expecting one";
> }
>
>
> The time from my ibv_post_send until ibv_get_cq_event returns an event is
> 13.3 ms when transferring chunks of 8 MB, which works out to around 600 MB/s.
It looks like you are waiting for each completion before posting the next transfer? Is that the case? That is probably not the most efficient approach.
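Something along these lines is what I mean (a rough sketch; post_chunk() and wait_one_completion() are just stand-ins for the ibv_post_send and ibv_get_cq_event/ibv_poll_cq code you already have, and DEPTH is whatever your send queue depth allows):

#define DEPTH 4                 /* chunks kept in flight; must fit the SQ depth */

void post_chunk(int idx);       /* stand-in: your ibv_post_send of chunk idx   */
void wait_one_completion(void); /* stand-in: your CQ event wait + ibv_poll_cq  */

/* Keep up to DEPTH RDMA WRITEs outstanding so the link never goes idle while
 * the CPU is waiting on a completion. */
static void stream_chunks(int nchunks)
{
    int posted = 0, completed = 0;

    while (completed < nchunks) {
        while (posted < nchunks && posted - completed < DEPTH) {
            post_chunk(posted);
            posted++;
        }
        wait_one_completion();
        completed++;
    }
}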
>
> To be more specific, here is what I do globally (in pseudocode):
>
> Active Side:
>
> post a message receive
> rdma connection
> wait for rdma connection event
> <<at this point transfer tx flow starts>>
> start:
> register memory containing bytes to transfer
I believe Sean mentioned you should avoid doing memory registration anywhere in the code where performance is critical, and I agree with him: register once up front (as in the sketch earlier in this mail) and reuse the MR.
> wait for remote memory region addr/key (I wait for an ibv_wc)
> send data with ibv_post_send
> post a message receive
> wait for the ibv_post_send completion (I wait for an ibv_wc) (this takes 13.3 ms)
> send message "DONE"
> unregister memory
This applies to unregistration of memory as well.
> goto start
>
> Passive Side:
>
> post a message receive
> rdma accept
> wait for rdma connection event
> <<at this point transfer rx flow starts>>
> start:
> register memory that has to receive the bytes
> send addr/key of memory registered
> wait "DONE" message
> unregister memory
> post a message receive
> goto start
>
> Does anyone know what I'm doing wrong? Or what I can improve? I'm not affected
> by "Not Invented Here" syndrome, so I'm even open to throwing away what I have
> done so far and adopting something else.
>
> I only need a point to point contiguous transfer.
How big is this transfer?
It may be that doing send/recv or write with immediate would work better for you. Also have you seen Sean's rsocket project?
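On the write-with-immediate point: IBV_WR_RDMA_WRITE_WITH_IMM delivers the data like a normal RDMA WRITE but also consumes a posted receive on the passive side (the WC shows up there with opcode IBV_WC_RECV_RDMA_WITH_IMM), so the separate "DONE" send goes away. A rough sketch, again assuming pre-registered memory and an addr/rkey you have already exchanged:

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>
#include <infiniband/verbs.h>

/* Active side: write the chunk and piggyback a 32-bit immediate (here a
 * sequence number).  The passive side gets a receive completion for it, so
 * no separate "DONE" message is needed -- it just has to keep a receive
 * posted for each incoming chunk. */
static int post_write_with_imm(struct ibv_qp *qp, struct ibv_sge *sge,
                               uint64_t remote_addr, uint32_t rkey,
                               uint32_t seqno)
{
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = seqno;
    wr.opcode              = IBV_WR_RDMA_WRITE_WITH_IMM;
    wr.imm_data            = htonl(seqno);   /* shows up in the remote WC */
    wr.sg_list             = sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}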
Hope this helps,
Ira
>
>
> Regards
> Gaetano Mendola
>
>
> --
> cpp-today.blogspot.com
> _______________________________________________
> Users mailing list
> Users at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/users
--
Ira Weiny
Member of Technical Staff
Lawrence Livermore National Lab
925-423-8008
weiny2 at llnl.gov