[Users] infiniband rdma poor transfer bw
Ira Weiny
weiny2 at llnl.gov
Mon Aug 27 15:19:13 PDT 2012
On Mon, 27 Aug 2012 23:21:35 +0200
Gaetano Mendola <mendola at gmail.com> wrote:
> On Mon, Aug 27, 2012 at 6:47 PM, Ira Weiny <weiny2 at llnl.gov> wrote:
> > Gaetano,
> >
> > Yes, this is the correct list. Did you also post a similar message to linux-rdma? I seem to recall a similar thread there. If so, I think Sean gave some good advice and you should follow it. If that was not you, see my response (from limited experience) below.
>
> I'll write to linux-rdma as soon as I have collected some more data from
> my experiments.
> I replied inline:
>
> > On Fri, 24 Aug 2012 00:51:05 +0200
> > Gaetano Mendola <mendola at gmail.com> wrote:
> >
> >> Hi all,
> >> I'm sorry in advance if this is not the right mailing list for my question.
> >>
> >> In my application I use an InfiniBand infrastructure to send a stream
> >> of data from one server to another. To ease development I used IP over
> >> InfiniBand, because I'm more familiar with socket programming. Until now
> >> the performance (max bandwidth) was good enough for me (I knew I wasn't
> >> getting the maximum achievable bandwidth), but now I need to get more
> >> bandwidth out of that InfiniBand connection.
> >
> > Getting good performance can be tricky with RDMA. Most of the difficulty I have had (and have read/heard about) is in dealing with memory registration.
> >
> >>
> >> ib_write_bw claims that my max achievable bandwidth is around 1500
> >> MB/s (I'm not getting 3000 MB/s because my card is installed in a
> >> PCIe 2.0 x8 slot).
> >>
> >> So far so good. I coded my communication channel using ibverbs and
> >> RDMA, but I'm getting far less than the bandwidth I could get; I'm even
> >> getting a bit less bandwidth than with sockets, though at least my
> >> application doesn't use any CPU:
> >>
> >> ib_write_bw: 1500 MB/s
> >>
> >> sockets: 700 MB/s <= One core of my system is at 100% during this test
> >>
> >> ibverbs+rdma: 600 MB/s <= No CPU is used at all during this test
> >>
> >> It seems that the bottleneck is here:
> >>
> >> ibv_sge sge;
> >> sge.addr = (uintptr_t)memory_to_transfer;
> >> sge.length = memory_to_transfer_size;
> >> sge.lkey = memory_to_transfer_mr->lkey;
> >>
> >> ibv_send_wr wr;
> >> memset(&wr, 0, sizeof(wr));
> >> wr.wr_id = 0;
> >> wr.opcode = IBV_WR_RDMA_WRITE;
> >
> > Generally, I have thought that RDMA READ is easier to deal with than RDMA WRITE. As you have found, when you do an RDMA WRITE there is an extra RDMA_SEND step to tell the remote side the write has been completed. If the remote side does an RDMA_READ then it will know the data is available when it sees the WC come back on that end. So the only "extra" send/recv required for verbs is the initial transfer of the ETH (addr, size, rkey) information.
>
> How would the "sender side" know that the reading side has finished, so
> that the buffer being read can be overwritten?
Yes, that is true, but I think the sequence is simpler. Assuming the
registration needs to occur in the loop (i.e. on some arbitrary buffer the
user passed in):
active side:
loop:
    register send buffer
    SEND ETH info         <== at this point you could actually loop,
                              "sending" more buffers
    RECV "got it" mesg    <== this could be another thread which is
                              verifying the reception of all data
    unregister buffer

passive side:
loop:
    RECV ETH info
    register recv buffer (based on ETH recv)
    RDMA READ
    unregister buffer
    SEND "got it" mesg
This is less back-and-forth messaging, since the initial "I have data to
send" message contains the ETH info; the passive side can quickly allocate
and register a buffer, read the data, and then send a single message back.
A rough sketch of the passive side is below.
But I admit I don't know your exact requirements, so this may not be what
you want or need.
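For what it's worth, here is a very rough, untested sketch of one pass of
the passive side loop above.  It assumes a PD and a connected QP already
exist, and the eth_info struct and every name in it are placeholders I made
up for illustration, not anything from your code:

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

struct eth_info {        // what the active side SENDs first (made-up layout)
    uint64_t addr;       // remote buffer address
    uint32_t rkey;       // rkey of the remote MR
    uint32_t size;       // number of bytes to RDMA READ
};

// One pass of the passive-side loop: register a local buffer and RDMA READ
// the remote data into it; the caller then SENDs "got it" and deregisters.
int rdma_read_once(ibv_pd *pd, ibv_qp *qp, void *local_buf,
                   const eth_info &eth)
{
    ibv_mr *mr = ibv_reg_mr(pd, local_buf, eth.size, IBV_ACCESS_LOCAL_WRITE);
    if (mr == NULL)
        return -1;

    ibv_sge sge;
    sge.addr   = (uintptr_t)local_buf;
    sge.length = eth.size;
    sge.lkey   = mr->lkey;

    ibv_send_wr wr;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = 1;
    wr.opcode              = IBV_WR_RDMA_READ;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = eth.addr;
    wr.wr.rdma.rkey        = eth.rkey;

    ibv_send_wr *bad_wr = NULL;
    if (ibv_post_send(qp, &wr, &bad_wr) != 0) {
        ibv_dereg_mr(mr);
        return -1;
    }

    // Wait for the READ completion on your CQ here, then SEND the "got it"
    // message and finally ibv_dereg_mr(mr); omitted to keep this short.
    return 0;
}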
>
> >> wr.sg_list = &sge;
> >> wr.num_sge = 1;
> >> wr.send_flags = IBV_SEND_SIGNALED;
> >> wr.wr.rdma.remote_addr = (uintptr_t)thePeerMemoryRegion.addr;
> >> wr.wr.rdma.rkey = thePeerMemoryRegion.rkey;
> >>
> >> ibv_send_wr *bad_wr = NULL;
> >> if (ibv_post_send(theCommunicationIdentifier->qp, &wr, &bad_wr) != 0) {
> >> notifyError("Unable to ibv post send");
> >> }
> >>
> >> at this point the next piece of code waits for completion, that is:
> >>
> >> //Wait for completion
> >> ibv_cq *cq;
> >> void* cq_context;
> >> if (ibv_get_cq_event(theCompletionEventChannel, &cq, &cq_context) != 0) {
> >> notifyError("Unable to get a ibv cq event");
> >> }
> >>
> >> ibv_ack_cq_events(cq, 1);
> >>
> >> if (ibv_req_notify_cq(cq, 0) != 0) {
> >> notifyError("Unable to get a req notify");
> >> }
> >>
> >> ibv_wc wc;
> >> int myRet = ibv_poll_cq(cq, 1, &wc);
> >> if (myRet > 1) {
> >> LOG(WARNING) << "Got more than a single ibv_wc, expecting one";
> >> }
> >>
> >>
> >> The time between my ibv_post_send and when ibv_get_cq_event returns an
> >> event is 13.3 ms when transferring chunks of 8 MB, thus achieving around 600 MB/s.
> >
> > It looks like you are waiting for a completion before doing another xfer? Is this the case? That may not be the most efficient.
>
> I have to implement the following two interfaces using InfiniBand as the
> transport layer:
>
> Sink::write(buffer)
> Source::read(buffer);
>
> Sink::write and Source::read are the last/first blocks of a pipeline,
> and the data flow potentially never ends.
>
> >>
> >> To specify more (in pseudocode what I do globally):
> >>
> >> Active Side:
> >>
> >> post a message receive
> >> rdma connection
> >> wait for rdma connection event
> >> <<at this point transfer tx flow starts>>
> >> start:
> >> register memory containing bytes to transfer
> >
> > I believe Sean mentioned you should avoid doing memory registration in any areas of the code where performance is critical. I agree with him.
>
> Well, I can register/unregister once, but that means that each time I
> have to transfer something (see my interfaces above) I have to issue a
> memcpy on the sending side and on the receiving side.
> Is a memcpy cheaper than an ibv_reg_mr/ibv_dereg_mr?
I suspect so for small messages. I have never profiled it, but there is a
fair amount of evidence for this. I wish I could find the paper I read
recently regarding efficient RDMA memory usage, sorry.
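If you do try the register-once approach, one common pattern I have seen is
to register a small pool of staging buffers up front and memcpy the user's
data into them, so the fast path never touches ibv_reg_mr().  Untested
sketch; every name below is made up:

#include <infiniband/verbs.h>
#include <stdlib.h>

enum { POOL_SIZE = 4, BUF_BYTES = 8 * 1024 * 1024 };

struct staging_buf {
    void   *data;
    ibv_mr *mr;
    bool    in_flight;   // set when posted, cleared when its WC is reaped
};

staging_buf pool[POOL_SIZE];

// Register the whole pool once at startup; a transfer then only pays a
// memcpy into a free entry instead of an ibv_reg_mr()/ibv_dereg_mr() pair.
int pool_init(ibv_pd *pd)
{
    for (int i = 0; i < POOL_SIZE; ++i) {
        pool[i].data = malloc(BUF_BYTES);
        if (pool[i].data == NULL)
            return -1;
        pool[i].mr = ibv_reg_mr(pd, pool[i].data, BUF_BYTES,
                                IBV_ACCESS_LOCAL_WRITE |
                                IBV_ACCESS_REMOTE_READ);
        if (pool[i].mr == NULL)
            return -1;
        pool[i].in_flight = false;
    }
    return 0;
}

Whether the extra memcpy or the per-buffer registration wins will depend on
your transfer size, which is why I only suspect it helps for small messages.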
>
> >> wait remote memory region addr/key ( I wait for a ibv_wc)
> >> send data with ibv_post_send
> >> post a message receive
> >> wait for ibv_post_send event ( I wait for a ibv_wc) (this lasts 13.3 ms)
> >> send message "DONE"
> >> unregister memory
> >
> > This applies to unregistration of memory as well.
> >
> >> goto start
> >>
> >> Passive Side:
> >>
> >> post a message receive
> >> rdma accept
> >> wait for rdma connection event
> >> <<at this point transfer rx flow starts>>
> >> start:
> >> register memory that has to receive the bytes
> >> send addr/key of memory registered
> >> wait "DONE" message
> >> unregister memory
> >> post a message receive
> >> goto start
> >>
> >> Does anyone know what I'm doing wrong? Or what I can improve? I'm not
> >> affected by "Not Invented Here" syndrome, so I'm even open to throwing
> >> away what I have done until now and adopting something else.
> >>
> >> I only need a point to point contiguous transfer.
> >
> > How big is this transfer?
> >
> > It may be that doing send/recv or write with immediate would work better for you. Also have you seen Sean's rsocket project?
>
> The transfers are around 1MB each time.
I suspect that is big enough that doing the registration should be more
efficient than the memcpy.
One other thing: do you need to wait for the completion of each buffer
before posting the RDMA WRITE of the next?
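If not, something along these lines might buy you bandwidth: keep several
RDMA WRITEs posted at once and reap completions in batches.  Untested
sketch; it assumes the QP has enough send WQEs, every buffer stays
registered while in flight, and the next_*()/have_more_buffers() helpers
are imaginary stand-ins for however you track pending buffers:

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

enum { MAX_INFLIGHT = 8 };

// Imaginary helpers standing in for your buffer bookkeeping.
bool      have_more_buffers();
ibv_sge  *next_sge();
uint64_t  next_remote_addr();
uint32_t  next_rkey();
uint64_t  next_id();

// Post one signaled RDMA WRITE without waiting for it to complete.
int post_write(ibv_qp *qp, ibv_sge *sge, uint64_t remote_addr,
               uint32_t rkey, uint64_t id)
{
    ibv_send_wr wr;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = id;
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    ibv_send_wr *bad_wr = NULL;
    return ibv_post_send(qp, &wr, &bad_wr);
}

// Transmit loop: keep up to MAX_INFLIGHT writes outstanding and poll the CQ
// in batches (you could also keep using the completion channel as you do).
void pump(ibv_qp *qp, ibv_cq *cq)
{
    int inflight = 0;
    while (have_more_buffers() || inflight > 0) {
        while (inflight < MAX_INFLIGHT && have_more_buffers()) {
            if (post_write(qp, next_sge(), next_remote_addr(),
                           next_rkey(), next_id()) != 0)
                break;              // report/handle the post failure
            ++inflight;
        }

        ibv_wc wc[MAX_INFLIGHT];
        int n = ibv_poll_cq(cq, MAX_INFLIGHT, wc);
        if (n < 0)
            break;                  // CQ error
        inflight -= n;              // those buffers can now be reused
    }
}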
Ira
>
> I have seen the rsocket project; I have played a bit with rstream, and what I get is this:
>
> $ ./examples/rstream -s 10.30.3.2 -S all
> name bytes xfers iters total time Gb/sec usec/xfer
> 64_lat 64 1 1m 122m 4.35s 0.24 2.17
> 128_lat 128 1 1m 244m 4.70s 0.44 2.35
> 192_lat 192 1 1m 366m 4.87s 0.63 2.44
> 256_lat 256 1 1m 488m 6.68s 0.61 3.34
> 384_lat 384 1 1m 732m 7.14s 0.86 3.57
> 512_lat 512 1 1m 976m 8.44s 0.97 4.22
> 768_lat 768 1 1m 1.4g 10.35s 1.19 5.18
> 1k_lat 1k 1 100k 195m 1.02s 1.60 5.12
> 1.5k_lat 1.5k 1 100k 292m 1.32s 1.86 6.60
> 2k_lat 2k 1 100k 390m 1.61s 2.03 8.07
> 3k_lat 3k 1 100k 585m 1.87s 2.63 9.36
> 4k_lat 4k 1 100k 781m 2.39s 2.74 11.95
> 6k_lat 6k 1 100k 1.1g 2.83s 3.47 14.15
> 8k_lat 8k 1 100k 1.5g 3.51s 3.73 17.56
> 12k_lat 12k 1 10k 234m 0.44s 4.45 22.09
> 16k_lat 16k 1 10k 312m 0.58s 4.56 28.75
> 24k_lat 24k 1 10k 468m 0.76s 5.14 38.25
> 32k_lat 32k 1 10k 625m 1.02s 5.12 51.21
> 48k_lat 48k 1 10k 937m 1.27s 6.20 63.40
> 64k_lat 64k 1 10k 1.2g 1.93s 5.43 96.63
> 96k_lat 96k 1 10k 1.8g 2.49s 6.33 124.29
> 128k_lat 128k 1 1k 250m 0.30s 7.00 149.89
> 192k_lat 192k 1 1k 375m 0.49s 6.48 242.76
> 256k_lat 256k 1 1k 500m 0.73s 5.75 364.85
> 384k_lat 384k 1 1k 750m 1.10s 5.73 549.16
> 512k_lat 512k 1 1k 1000m 1.51s 5.54 757.02
> 768k_lat 768k 1 1k 1.4g 1.68s 7.48 841.05
> 1m_lat 1m 1 100 200m 0.28s 6.05 1385.61
> 1.5m_lat 1.5m 1 100 300m 0.41s 6.20 2029.05
> 2m_lat 2m 1 100 400m 0.54s 6.27 2675.73
> 3m_lat 3m 1 100 600m 0.55s 9.13 2757.71
> 4m_lat 4m 1 100 800m 1.04s 6.45 5205.38
> 6m_lat 6m 1 100 1.1g 1.56s 6.46 7794.85
> 64_bw 64 1m 1 122m 1.38s 0.74 0.69
> 128_bw 128 1m 1 244m 0.83s 2.46 0.42
> 192_bw 192 1m 1 366m 1.42s 2.16 0.71
> 256_bw 256 1m 1 488m 1.43s 2.87 0.71
> 384_bw 384 1m 1 732m 1.46s 4.21 0.73
> 512_bw 512 1m 1 976m 1.66s 4.94 0.83
> 768_bw 768 1m 1 1.4g 2.35s 5.24 1.17
> 1k_bw 1k 100k 1 195m 0.31s 5.34 1.54
> 1.5k_bw 1.5k 100k 1 292m 0.44s 5.57 2.21
> 2k_bw 2k 100k 1 390m 0.51s 6.41 2.56
> 3k_bw 3k 100k 1 585m 0.86s 5.71 4.30
> 4k_bw 4k 100k 1 781m 1.02s 6.41 5.11
> 6k_bw 6k 100k 1 1.1g 1.53s 6.45 7.63
> 8k_bw 8k 100k 1 1.5g 2.04s 6.42 10.21
> 12k_bw 12k 10k 1 234m 0.30s 6.46 15.22
> 16k_bw 16k 10k 1 312m 0.40s 6.48 20.21
> 24k_bw 24k 10k 1 468m 0.60s 6.55 30.04
> 32k_bw 32k 10k 1 625m 0.81s 6.51 40.27
> 48k_bw 48k 10k 1 937m 1.20s 6.53 60.21
> 64k_bw 64k 10k 1 1.2g 1.60s 6.54 80.16
> 96k_bw 96k 10k 1 1.8g 2.33s 6.75 116.48
> 128k_bw 128k 1k 1 250m 0.32s 6.51 161.03
> 192k_bw 192k 1k 1 375m 0.48s 6.52 241.36
> 256k_bw 256k 1k 1 500m 0.64s 6.51 321.99
> 384k_bw 384k 1k 1 750m 0.78s 8.06 390.40
> 512k_bw 512k 1k 1 1000m 1.29s 6.52 643.09
> 768k_bw 768k 1k 1 1.4g 1.97s 6.38 986.84
> 1m_bw 1m 100 1 200m 0.26s 6.37 1316.86
> 1.5m_bw 1.5m 100 1 300m 0.27s 9.36 1343.65
> 2m_bw 2m 100 1 400m 0.53s 6.36 2638.12
> 3m_bw 3m 100 1 600m 0.80s 6.31 3988.59
> 4m_bw 4m 100 1 800m 1.07s 6.28 5341.27
> 6m_bw 6m 100 1 1.1g 1.00s 10.09 4988.12
>
> So it seems that a good buffer size is 6 MB, getting 10.09 Gb/sec
> (1291 MB/sec), and that is quite good.
>
> But when performing only the 6m test I get only 6.8 Gb/sec:
>
> $ ./examples/rstream -s 10.30.3.2 -S 6291456 -C 100
> name bytes xfers iters total time Gb/sec usec/xfer
> custom 6m 100 1 1.1g 1.48s 6.81 7395.56
>
>
> Sean told me that when running "custom" size tests the settings for the
> transfer are different. I took a look at the code, and indeed with
> "custom" tests the optimization for bw:
> val = 0;
> rs_setsockopt(rs, SOL_RDMA, RDMA_INLINE, &val, sizeof val);
> is not done, but even forcing that call the bandwidth I'm getting for the
> 6m transfer is still 6.81 Gb/sec (871 MB/sec).
>
> Gaetano
>
>
>
> > Hope this helps,
> > Ira
> >
> >>
> >>
> >> Regards
> >> Gaetano Mendola
> >>
> >>
> >> --
> >> cpp-today.blogspot.com
> >> _______________________________________________
> >> Users mailing list
> >> Users at lists.openfabrics.org
> >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/users
> >
> >
> > --
> > Ira Weiny
> > Member of Technical Staff
> > Lawrence Livermore National Lab
> > 925-423-8008
> > weiny2 at llnl.gov
>
>
>
> --
> cpp-today.blogspot.com
--
Ira Weiny
Member of Technical Staff
Lawrence Livermore National Lab
925-423-8008
weiny2 at llnl.gov