[Users] infiniband rdma poor transfer bw
Ira Weiny
weiny2 at llnl.gov
Mon Aug 27 15:19:13 PDT 2012
On Mon, 27 Aug 2012 23:21:35 +0200
Gaetano Mendola <mendola at gmail.com> wrote:
> On Mon, Aug 27, 2012 at 6:47 PM, Ira Weiny <weiny2 at llnl.gov> wrote:
> > Gaetano,
> >
> > Yes, this is the correct list. Did you also post a similar message to linux-rdma? I seem to recall a similar thread there. If so, I think Sean gave some good advice and you should follow it. If that was not you, see my response (from limited experience) below.
>
> I'll write to linux-rdma as soon as I have collected some more data from
> my experiments.
> I replied inline:
>
> > On Fri, 24 Aug 2012 00:51:05 +0200
> > Gaetano Mendola <mendola at gmail.com> wrote:
> >
> >> Hi all,
> >> I'm sorry in advance if this is not the right mailing list for my question.
> >>
> >> In my application I use an InfiniBand infrastructure to send a stream
> >> of data from one server to another. To ease development I used IP over
> >> InfiniBand, because I'm more familiar with socket programming. Until now
> >> the performance (max bandwidth) was good enough for me (I knew I wasn't
> >> getting the maximum achievable bandwidth), but now I need to get more
> >> bandwidth out of that InfiniBand connection.
> >
> > Getting good performance can be tricky with RDMA. Most of the difficulty I have had (and have read/heard about) is in dealing with memory registration.
> >
> >>
> >> ib_write_bw claims that my max achievable bandwidth is around 1500
> >> MB/s (I'm not getting 3000 MB/s because my card is installed in a
> >> PCIe 2.0 x8 slot).
> >>
> >> So far so good. I coded my communication channel using ibverbs and
> >> RDMA, but I'm getting far less than the bandwidth I could get; I'm even
> >> getting a bit less bandwidth than with sockets, though at least my
> >> application doesn't use any CPU:
> >>
> >> ib_write_bw: 1500 MB/s
> >>
> >> sockets: 700 MB/s <= One core of my system is at 100% during this test
> >>
> >> ibverbs+rdma: 600 MB/s <= No CPU is used at all during this test
> >>
> >> It seems that the bottleneck is here:
> >>
> >> ibv_sge sge;
> >> sge.addr = (uintptr_t)memory_to_transfer;
> >> sge.length = memory_to_transfer_size;
> >> sge.lkey = memory_to_transfer_mr->lkey;
> >>
> >> ibv_send_wr wr;
> >> memset(&wr, 0, sizeof(wr));
> >> wr.wr_id = 0;
> >> wr.opcode = IBV_WR_RDMA_WRITE;
> >
> > Generally, I have thought that RDMA READ is easier to deal with than RDMA WRITE. As you have found, when you do an RDMA WRITE there is an extra RDMA_SEND step to tell the remote side the write has been completed. If the remote side does an RDMA_READ then it will know the data is available when it sees the WC come back on that end. So the only "extra" send/recv required for verbs is the initial transfer of the ETH (addr, size, rkey) information.
>
> How would the "sender side" know that the reading side has finished, so
> that the buffer being read can be overwritten?
Yes, that is true, but I think the sequence is simpler. Assuming the
registration needs to occur in the loop (i.e. on some arbitrary buffer the
user passed in):
active side:
loop:
    register send buffer
    SEND ETH info         <== at this point you could actually loop,
                              "sending" more buffers
    RECV "got it" mesg    <== this could be another thread which is
                              verifying the reception of all data
    unregister buffer

passive side:
loop:
    RECV ETH info
    register recv buffer (based on ETH recv)
    RDMA READ
    unregister buffer
    SEND "got it" mesg
This is less back-and-forth messaging, since the initial "I have data to
send" message contains the ETH info; the passive side can quickly allocate
and register a buffer, read the data, and then send a single message back.
A rough sketch of the passive side is below.
But I admit I don't know your exact requirements, so this may not be what
you want or need.
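For what it's worth, here is a very rough, untested sketch of one pass of
the passive side loop above.  It assumes a PD and a connected QP already
exist, and the eth_info struct and every name in it are placeholders I made
up for illustration, not anything from your code:

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

struct eth_info {        // what the active side SENDs first (made-up layout)
    uint64_t addr;       // remote buffer address
    uint32_t rkey;       // rkey of the remote MR
    uint32_t size;       // number of bytes to RDMA READ
};

// One pass of the passive-side loop: register a local buffer and RDMA READ
// the remote data into it; the caller then SENDs "got it" and deregisters.
int rdma_read_once(ibv_pd *pd, ibv_qp *qp, void *local_buf,
                   const eth_info &eth)
{
    ibv_mr *mr = ibv_reg_mr(pd, local_buf, eth.size, IBV_ACCESS_LOCAL_WRITE);
    if (mr == NULL)
        return -1;

    ibv_sge sge;
    sge.addr   = (uintptr_t)local_buf;
    sge.length = eth.size;
    sge.lkey   = mr->lkey;

    ibv_send_wr wr;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = 1;
    wr.opcode              = IBV_WR_RDMA_READ;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = eth.addr;
    wr.wr.rdma.rkey        = eth.rkey;

    ibv_send_wr *bad_wr = NULL;
    if (ibv_post_send(qp, &wr, &bad_wr) != 0) {
        ibv_dereg_mr(mr);
        return -1;
    }

    // Wait for the READ completion on your CQ here, then SEND the "got it"
    // message and finally ibv_dereg_mr(mr); omitted to keep this short.
    return 0;
}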
>
> >> wr.sg_list = &sge;
> >> wr.num_sge = 1;
> >> wr.send_flags = IBV_SEND_SIGNALED;
> >> wr.wr.rdma.remote_addr = (uintptr_t)thePeerMemoryRegion.addr;
> >> wr.wr.rdma.rkey = thePeerMemoryRegion.rkey;
> >>
> >> ibv_send_wr *bad_wr = NULL;
> >> if (ibv_post_send(theCommunicationIdentifier->qp, &wr, &bad_wr) != 0) {
> >> notifyError("Unable to ibv post send");
> >> }
> >>
> >> at this point the next piece of code waits for completion, that is:
> >>
> >> //Wait for completion
> >> ibv_cq *cq;
> >> void* cq_context;
> >> if (ibv_get_cq_event(theCompletionEventChannel, &cq, &cq_context) != 0) {
> >> notifyError("Unable to get a ibv cq event");
> >> }
> >>
> >> ibv_ack_cq_events(cq, 1);
> >>
> >> if (ibv_req_notify_cq(cq, 0) != 0) {
> >> notifyError("Unable to get a req notify");
> >> }
> >>
> >> ibv_wc wc;
> >> int myRet = ibv_poll_cq(cq, 1, &wc);
> >> if (myRet > 1) {
> >> LOG(WARNING) << "Got more than a single ibv_wc, expecting one";
> >> }
> >>
> >>
> >> The time between my ibv_post_send and when ibv_get_cq_event returns an
> >> event is 13.3 ms when transferring chunks of 8 MB, thus achieving around 600 MB/s.
> >
> > It looks like you are waiting for a completion before doing another xfer? Is this the case? That may not be the most efficient.
>
> I have to implement the following two interfaces using InfiniBand as the
> transport layer:
>
> Sink::write(buffer)
> Source::read(buffer);
>
> Sink::write and Source::read are the last/first blocks of a pipeline,
> and the data flow potentially never ends.
>
> >>
> >> To specify more (in pseudocode what I do globally):
> >>
> >> Active Side:
> >>
> >> post a message receive
> >> rdma connection
> >> wait for rdma connection event
> >> <<at this point transfer tx flow starts>>
> >> start:
> >> register memory containing bytes to transfer
> >
> > I believe Sean mentioned you should avoid doing memory registration in any areas of the code where performance is critical. I agree with him.
>
> Well, I can register/unregister once, but that means that each time I
> have to transfer something (see my interfaces above) I have to issue a
> memcpy on the sending side and on the receiving side.
> Is a memcpy cheaper than an ibv_reg_mr/ibv_dereg_mr?
I suspect so for small messages. I have never profiled it, but there is a
fair amount of evidence for this. I wish I could find the paper I read
recently regarding efficient RDMA memory usage, sorry.
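If you do try the register-once approach, one common pattern I have seen is
to register a small pool of staging buffers up front and memcpy the user's
data into them, so the fast path never touches ibv_reg_mr().  Untested
sketch; every name below is made up:

#include <infiniband/verbs.h>
#include <stdlib.h>

enum { POOL_SIZE = 4, BUF_BYTES = 8 * 1024 * 1024 };

struct staging_buf {
    void   *data;
    ibv_mr *mr;
    bool    in_flight;   // set when posted, cleared when its WC is reaped
};

staging_buf pool[POOL_SIZE];

// Register the whole pool once at startup; a transfer then only pays a
// memcpy into a free entry instead of an ibv_reg_mr()/ibv_dereg_mr() pair.
int pool_init(ibv_pd *pd)
{
    for (int i = 0; i < POOL_SIZE; ++i) {
        pool[i].data = malloc(BUF_BYTES);
        if (pool[i].data == NULL)
            return -1;
        pool[i].mr = ibv_reg_mr(pd, pool[i].data, BUF_BYTES,
                                IBV_ACCESS_LOCAL_WRITE |
                                IBV_ACCESS_REMOTE_READ);
        if (pool[i].mr == NULL)
            return -1;
        pool[i].in_flight = false;
    }
    return 0;
}

Whether the extra memcpy or the per-buffer registration wins will depend on
your transfer size, which is why I only suspect it helps for small messages.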
>
> >> wait remote memory region addr/key ( I wait for a ibv_wc)
> >> send data with ibv_post_send
> >> post a message receive
> >> wait for ibv_post_send event ( I wait for a ibv_wc) (this lasts 13.3 ms)
> >> send message "DONE"
> >> unregister memory
> >
> > This applies to unregistration of memory as well.
> >
> >> goto start
> >>
> >> Passive Side:
> >>
> >> post a message receive
> >> rdma accept
> >> wait for rdma connection event
> >> <<at this point transfer rx flow starts>>
> >> start:
> >> register memory that has to receive the bytes
> >> send addr/key of memory registered
> >> wait "DONE" message
> >> unregister memory
> >> post a message receive
> >> goto start
> >>
> >> Does anyone know what I'm doing wrong? Or what I can improve? I'm not
> >> affected by "Not Invented Here" syndrome, so I'm even open to throwing
> >> away what I have done until now and adopting something else.
> >>
> >> I only need a point to point contiguous transfer.
> >
> > How big is this transfer?
> >
> > It may be that doing send/recv or write with immediate would work better for you. Also have you seen Sean's rsocket project?
>
> The transfers are around 1MB each time.
I suspect that is big enough that doing the registration should be more
efficient than the memcpy.
One other thing: do you need to wait for the completion of each buffer
before posting the RDMA WRITE of the next?
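If not, something along these lines might buy you bandwidth: keep several
RDMA WRITEs posted at once and reap completions in batches.  Untested
sketch; it assumes the QP has enough send WQEs, every buffer stays
registered while in flight, and the next_*()/have_more_buffers() helpers
are imaginary stand-ins for however you track pending buffers:

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

enum { MAX_INFLIGHT = 8 };

// Imaginary helpers standing in for your buffer bookkeeping.
bool      have_more_buffers();
ibv_sge  *next_sge();
uint64_t  next_remote_addr();
uint32_t  next_rkey();
uint64_t  next_id();

// Post one signaled RDMA WRITE without waiting for it to complete.
int post_write(ibv_qp *qp, ibv_sge *sge, uint64_t remote_addr,
               uint32_t rkey, uint64_t id)
{
    ibv_send_wr wr;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = id;
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    ibv_send_wr *bad_wr = NULL;
    return ibv_post_send(qp, &wr, &bad_wr);
}

// Transmit loop: keep up to MAX_INFLIGHT writes outstanding and poll the CQ
// in batches (you could also keep using the completion channel as you do).
void pump(ibv_qp *qp, ibv_cq *cq)
{
    int inflight = 0;
    while (have_more_buffers() || inflight > 0) {
        while (inflight < MAX_INFLIGHT && have_more_buffers()) {
            if (post_write(qp, next_sge(), next_remote_addr(),
                           next_rkey(), next_id()) != 0)
                break;              // report/handle the post failure
            ++inflight;
        }

        ibv_wc wc[MAX_INFLIGHT];
        int n = ibv_poll_cq(cq, MAX_INFLIGHT, wc);
        if (n < 0)
            break;                  // CQ error
        inflight -= n;              // those buffers can now be reused
    }
}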
Ira
>
> I have seen the rsocket project; I have played a bit with rstream, and what I get is this:
>
> $ ./examples/rstream -s 10.30.3.2 -S all
> name bytes xfers iters total time Gb/sec usec/xfer
> 64_lat 64 1 1m 122m 4.35s 0.24 2.17
> 128_lat 128 1 1m 244m 4.70s 0.44 2.35
> 192_lat 192 1 1m 366m 4.87s 0.63 2.44
> 256_lat 256 1 1m 488m 6.68s 0.61 3.34
> 384_lat 384 1 1m 732m 7.14s 0.86 3.57
> 512_lat 512 1 1m 976m 8.44s 0.97 4.22
> 768_lat 768 1 1m 1.4g 10.35s 1.19 5.18
> 1k_lat 1k 1 100k 195m 1.02s 1.60 5.12
> 1.5k_lat 1.5k 1 100k 292m 1.32s 1.86 6.60
> 2k_lat 2k 1 100k 390m 1.61s 2.03 8.07
> 3k_lat 3k 1 100k 585m 1.87s 2.63 9.36
> 4k_lat 4k 1 100k 781m 2.39s 2.74 11.95
> 6k_lat 6k 1 100k 1.1g 2.83s 3.47 14.15
> 8k_lat 8k 1 100k 1.5g 3.51s 3.73 17.56
> 12k_lat 12k 1 10k 234m 0.44s 4.45 22.09
> 16k_lat 16k 1 10k 312m 0.58s 4.56 28.75
> 24k_lat 24k 1 10k 468m 0.76s 5.14 38.25
> 32k_lat 32k 1 10k 625m 1.02s 5.12 51.21
> 48k_lat 48k 1 10k 937m 1.27s 6.20 63.40
> 64k_lat 64k 1 10k 1.2g 1.93s 5.43 96.63
> 96k_lat 96k 1 10k 1.8g 2.49s 6.33 124.29
> 128k_lat 128k 1 1k 250m 0.30s 7.00 149.89
> 192k_lat 192k 1 1k 375m 0.49s 6.48 242.76
> 256k_lat 256k 1 1k 500m 0.73s 5.75 364.85
> 384k_lat 384k 1 1k 750m 1.10s 5.73 549.16
> 512k_lat 512k 1 1k 1000m 1.51s 5.54 757.02
> 768k_lat 768k 1 1k 1.4g 1.68s 7.48 841.05
> 1m_lat 1m 1 100 200m 0.28s 6.05 1385.61
> 1.5m_lat 1.5m 1 100 300m 0.41s 6.20 2029.05
> 2m_lat 2m 1 100 400m 0.54s 6.27 2675.73
> 3m_lat 3m 1 100 600m 0.55s 9.13 2757.71
> 4m_lat 4m 1 100 800m 1.04s 6.45 5205.38
> 6m_lat 6m 1 100 1.1g 1.56s 6.46 7794.85
> 64_bw 64 1m 1 122m 1.38s 0.74 0.69
> 128_bw 128 1m 1 244m 0.83s 2.46 0.42
> 192_bw 192 1m 1 366m 1.42s 2.16 0.71
> 256_bw 256 1m 1 488m 1.43s 2.87 0.71
> 384_bw 384 1m 1 732m 1.46s 4.21 0.73
> 512_bw 512 1m 1 976m 1.66s 4.94 0.83
> 768_bw 768 1m 1 1.4g 2.35s 5.24 1.17
> 1k_bw 1k 100k 1 195m 0.31s 5.34 1.54
> 1.5k_bw 1.5k 100k 1 292m 0.44s 5.57 2.21
> 2k_bw 2k 100k 1 390m 0.51s 6.41 2.56
> 3k_bw 3k 100k 1 585m 0.86s 5.71 4.30
> 4k_bw 4k 100k 1 781m 1.02s 6.41 5.11
> 6k_bw 6k 100k 1 1.1g 1.53s 6.45 7.63
> 8k_bw 8k 100k 1 1.5g 2.04s 6.42 10.21
> 12k_bw 12k 10k 1 234m 0.30s 6.46 15.22
> 16k_bw 16k 10k 1 312m 0.40s 6.48 20.21
> 24k_bw 24k 10k 1 468m 0.60s 6.55 30.04
> 32k_bw 32k 10k 1 625m 0.81s 6.51 40.27
> 48k_bw 48k 10k 1 937m 1.20s 6.53 60.21
> 64k_bw 64k 10k 1 1.2g 1.60s 6.54 80.16
> 96k_bw 96k 10k 1 1.8g 2.33s 6.75 116.48
> 128k_bw 128k 1k 1 250m 0.32s 6.51 161.03
> 192k_bw 192k 1k 1 375m 0.48s 6.52 241.36
> 256k_bw 256k 1k 1 500m 0.64s 6.51 321.99
> 384k_bw 384k 1k 1 750m 0.78s 8.06 390.40
> 512k_bw 512k 1k 1 1000m 1.29s 6.52 643.09
> 768k_bw 768k 1k 1 1.4g 1.97s 6.38 986.84
> 1m_bw 1m 100 1 200m 0.26s 6.37 1316.86
> 1.5m_bw 1.5m 100 1 300m 0.27s 9.36 1343.65
> 2m_bw 2m 100 1 400m 0.53s 6.36 2638.12
> 3m_bw 3m 100 1 600m 0.80s 6.31 3988.59
> 4m_bw 4m 100 1 800m 1.07s 6.28 5341.27
> 6m_bw 6m 100 1 1.1g 1.00s 10.09 4988.12
>
> So it seems that a good buffer size is 6 MB, getting 10.09 Gb/sec
> (1291 MB/sec), and that is quite good.
>
> But when performing only the 6m test I get only 6.8 Gb/sec:
>
> $ ./examples/rstream -s 10.30.3.2 -S 6291456 -C 100
> name bytes xfers iters total time Gb/sec usec/xfer
> custom 6m 100 1 1.1g 1.48s 6.81 7395.56
>
>
> Sean told me that when running "custom" size tests the settings for the
> transfer are different. I took a look at the code, and indeed with
> "custom" tests the optimization for bw:
> val = 0;
> rs_setsockopt(rs, SOL_RDMA, RDMA_INLINE, &val, sizeof val);
> is not done, but even forcing that call the bandwidth I'm getting for the
> 6m transfer is still 6.81 Gb/sec (871 MB/sec).
>
> Gaetano
>
>
>
> > Hope this helps,
> > Ira
> >
> >>
> >>
> >> Regards
> >> Gaetano Mendola
> >>
> >>
> >> --
> >> cpp-today.blogspot.com
> >> _______________________________________________
> >> Users mailing list
> >> Users at lists.openfabrics.org
> >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/users
> >
> >
> > --
> > Ira Weiny
> > Member of Technical Staff
> > Lawrence Livermore National Lab
> > 925-423-8008
> > weiny2 at llnl.gov
>
>
>
> --
> cpp-today.blogspot.com
--
Ira Weiny
Member of Technical Staff
Lawrence Livermore National Lab
925-423-8008
weiny2 at llnl.gov