[Users] infiniband rdma poor transfer bw
Gaetano Mendola
mendola at gmail.com
Tue Aug 28 07:48:57 PDT 2012
Thank you for your ideas, my replies inline below:
On Tue, Aug 28, 2012 at 2:22 PM, David McMillen
<davem at systemfabricworks.com> wrote:
>
> I have added some other ideas inline below:
>
> On Mon, Aug 27, 2012 at 5:19 PM, Ira Weiny <weiny2 at llnl.gov> wrote:
>>
>> On Mon, 27 Aug 2012 23:21:35 +0200
>> Gaetano Mendola <mendola at gmail.com> wrote:
>>
>> > On Mon, Aug 27, 2012 at 6:47 PM, Ira Weiny <weiny2 at llnl.gov> wrote:
>> > > Gaetano,
>> > >
>> > > Yes, this is the correct list. Did you also post a similar message to
>> > > linux-rdma? I seem to recall a similar thread there. If so, I think Sean
>> > > gave some good advice and you should follow it. If that was not you, see
>> > > my response (from limited experience) below.
>> >
>> > I'll write to linux-rdma as soon as I have collected some more data from
>> > my experiments.
>> > I replied inline:
>> >
>> > > On Fri, 24 Aug 2012 00:51:05 +0200
>> > > Gaetano Mendola <mendola at gmail.com> wrote:
>> > >
>> > >> Hi all,
>> > >> I'm sorry in advance if this is not the right mailing list for my
>> > >> question.
>> > >>
>> > >> In my application I use an InfiniBand infrastructure to send a stream
>> > >> of data from one server to another. To ease development I used IP over
>> > >> InfiniBand, because I'm more familiar with socket programming. Until
>> > >> now the performance (max bandwidth) was good enough for me (I knew I
>> > >> wasn't getting the maximum achievable bandwidth); now I need to get
>> > >> more bandwidth out of that InfiniBand connection.
>> > >
>> > > Getting good performance can be tricky with RDMA. The biggest difficulty
>> > > I have had (and have read/heard about) is dealing with memory
>> > > registrations.
>> > >
>> > >>
>> > >> ib_write_bw claims that my max achievable bandwidth is around 1500
>> > >> MB/s (I'm not getting 3000 MB/s because my card is installed in a
>> > >> PCIe 2.0 x8 slot).
>
>
> This maximum of 1500 just seems wrong, although I don't know what your
> hardware is. In my experience, PCIe 2.0 8x will run close to 3000, with
> observed numbers as high as 3200 and as low as 2500. I generally expect at
> least 2800 for an aggressive application like ib_write_bw. Here are some
> common problems you might look for:
Indeed, my slot is a PCIe 2.0 x4; that's why 1500 MB/s is about what I was
expecting.
> 1) The 1500 number is what I would expect from using a PCIe slot that was
> physically able to accept an 8x card, but only implemented 4x for the
> connections. You should check the documentation for the motherboard to see
> if that is what is happening, as it is common for many motherboards to have
> a slot like this. You can also look at the output of "lspci -vv" where you
> will see a line with something like "LnkCap:" showing the width the device
> is capable of using and another line with something like "LnkSta:" showing
> the width the device is actually using. If this is happening and you have a
> true 8x slot available, you should move the card. Note that this problem
> could be on either system and it would slow down both.
Unfortunately my system has only four x16 slots and those are already used
by four GPU boards. I know I'm not getting the maximum bandwidth from that
card, but it's fine for my application.
This is what lspci says about that slot:
01:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe
2.0 5GT/s - IB QDR / 10GigE] (rev b0)
Subsystem: Mellanox Technologies Device 0036
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 256 bytes
Interrupt: pin A routed to IRQ 16
Region 0: Memory at fbb00000 (64-bit, non-prefetchable) [size=1M]
Region 2: Memory at f1800000 (64-bit, prefetchable) [size=8M]
Capabilities: <access denied>
Kernel driver in use: mlx4_core
Kernel modules: mlx4_core
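(lspci shows "Capabilities: <access denied>" here, presumably because it was
not run as root; "sudo lspci -vv -s 01:00.0" should reveal the LnkCap/LnkSta
lines you mention and confirm the x4 link width.)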
> 2) Your system may have NUMA memory issues. Look at the output of
> "numactl --hardware" and see how many nodes are available (first line). If
> there is more than 1 available node, you may be falling victim to the
> internal movement of the data across NUMA nodes. This usually shows up as
> inconsistent runs, which you have observed with the rsocket tests, so there
> may be something to this. I have seen systems with ib_write_bw test results
> that reach 3000 MB/s when positioned on the best NUMA node, and then as low
> as 1200 MB/s when running on the worst NUMA node. You can investigate this
> further by doing ib_write_bw tests using the numactl command to force a
> particular NUMA node to be used. Assuming problems may exist on both ends
> of the link, you need to run the test with "numactl --membind=0
> --cpunodebind=0 ib_write_bw -a" through "numactl --membind=N --cpunodebind=N
> ib_write_bw -a" on the server side (N being the largest node available).
> For each of the NUMA nodes on the server, you would then run the client
> using "numactl --membind=0 --cpunodebind=0 ib_write_bw -a serverip" through
> "numactl --membind=N --cpunodebind=N ib_write_bw -a serverip" for all NUMA
> nodes on the client system. It will be clear which NUMA node(s) are giving
> you the best throughput. If your application can fit within the memory and
> cpu constraints of those NUMA nodes, you can simply run your application
> under the same constraints (the node specified can be a list of nodes if
> more than one gives good results).
My system is a NUMA system with 2 nodes and the IB board is attached to node 0:
$ ./lstopo
Machine (24GB)
NUMANode L#0 (P#0 12GB)
Socket L#0 + L3 L#0 (12MB)
L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0 + PU L#0 (P#0)
L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1 + PU L#1 (P#1)
L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2 + PU L#2 (P#2)
L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3 + PU L#3 (P#3)
L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4 + PU L#4 (P#4)
L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5 + PU L#5 (P#5)
HostBridge L#0
PCIBridge
PCI 15b3:673c
Net L#0 "ib0"
Net L#1 "ib1"
OpenFabrics L#2 "mlx4_0"
PCIBridge
PCI 10de:06d1
PCIBridge
PCI 10de:06d1
PCIBridge
PCI 8086:10d3
Net L#3 "eth0"
PCIBridge
PCI 8086:10d3
Net L#4 "eth1"
PCIBridge
PCI 102b:0532
PCI 8086:3a22
Block L#5 "sda"
Block L#6 "sdb"
Block L#7 "sdc"
Block L#8 "sdd"
Block L#9 "sr0"
NUMANode L#1 (P#1 12GB)
Socket L#1 + L3 L#1 (12MB)
L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6 + PU L#6 (P#6)
L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7 + PU L#7 (P#7)
L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8 + PU L#8 (P#8)
L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9 + PU L#9 (P#9)
L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10 + PU L#10 (P#10)
L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11 + PU L#11 (P#11)
HostBridge L#7
PCIBridge
PCI 10de:06d1
PCIBridge
PCI 10de:06d1
(Both server and client have the same hardware.) I did try all four
combinations of ib_write_bw bound to either NUMA node on the server and on
the client, but nothing changes: it still reports 1500 MB/s no matter which
node it runs on.
> 3) Perhaps your link is running at DDR speed instead of QDR speed,
> although even with DDR I would expect a number above 1900 MB/s. Look at the
> output of "ibstatus" on both the server and the client. If there are switch
> links involved you should look at them as well -- "ibnetdiscover --ports"
> shows link width and speed, but you have to find the links in use in that
> output.
The link is QDR, but as I wrote above the PCIe slot is x4:
$ ibstatus
Infiniband device 'mlx4_0' port 1 status:
default gid: fe80:0000:0000:0000:0002:c903:004e:2959
base lid: 0x4
sm lid: 0x1
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 40 Gb/sec (4X QDR)
Infiniband device 'mlx4_0' port 2 status:
default gid: fe80:0000:0000:0000:0002:c903:004e:295a
base lid: 0x0
sm lid: 0x0
state: 1: DOWN
phys state: 2: Polling
rate: 70 Gb/sec (4X)
>>
>> > >>
>> > >> So far so good. I coded my communication channel using ibverbs and
>> > >> RDMA, but I'm getting far less bandwidth than I could; I'm even
>> > >> getting a bit less bandwidth than with sockets, although at least my
>> > >> application doesn't use any CPU:
>> > >>
>> > >> ib_write_bw: 1500 MB/s
>> > >>
>> > >> sockets: 700 MB/s <= One core of my system is at 100% during this
>> > >> test
>> > >>
>> > >> ibverbs+rdma: 600 MB/s <= No CPU is used at all during this test
>> > >>
>> > >> It seems that the bottleneck is here:
>> > >>
>> > >> ibv_sge sge;
>> > >> sge.addr = (uintptr_t)memory_to_transfer;
>> > >> sge.length = memory_to_transfer_size;
>> > >> sge.lkey = memory_to_transfer_mr->lkey;
>> > >>
>> > >> ibv_send_wr wr;
>> > >> memset(&wr, 0, sizeof(wr));
>> > >> wr.wr_id = 0;
>> > >> wr.opcode = IBV_WR_RDMA_WRITE;
>> > >
>> > > Generally, I have thought that RDMA READ is easier to deal with than
>> > > RDMA WRITE. As you have found, when you do a RDMA WRITE there is an extra
>> > > RDMA_SEND step to tell the remote side the write has been completed. If the
>> > > remote side does a RDMA_READ then they will know the data is available when
>> > > they see the WC come back on that end. So the only "extra" send/recv
>> > > required for verbs is the initial transfer of the ETH (addr, size, rkey)
>> > > information.
>> >
>> > How would the "sender side" know that the reading side is done, so that
>> > the buffer being read can be overwritten?
>
>
> Another choice is to use IBV_WR_RDMA_WRITE_WITH_IMM which will create a
> completion for the recipient of the data. However, in my experience there
> is the need for some kind of flow control (ready/done) messages to be sent
> in both directions using IBV_WR_SEND anyway, as Ira suggests. It isn't so
> much the use of RDMA_READ versus RDMA_WRITE as it is a concept of the client
> saying "server, go do this transaction" and the server responding with
> "transaction done". For the highest speed operations, you need to set it up
> so the client can request multiple transactions (at least two, and if disk
> transfers are involved it should be at least 0.25 seconds worth, ideally a
> whole second) before seeing a completion from the server.
I'll try IBV_WR_RDMA_WRITE_WITH_IMM to avoid the extra "DONE" send; as I
understand it, with IBV_WR_RDMA_WRITE_WITH_IMM the receiver is notified
through a completion on a posted receive, right?
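Something like this minimal sketch is what I have in mind (just a sketch:
theQp, cq, buffer, buffer_mr and thePeerMemoryRegion are placeholders from my
code, and error handling is omitted):

/* needs <infiniband/verbs.h> and <arpa/inet.h> */

/* sender side: an RDMA write that also raises a completion on the receiver */
struct ibv_sge sge;
memset(&sge, 0, sizeof(sge));
sge.addr   = (uintptr_t)buffer;
sge.length = buffer_size;
sge.lkey   = buffer_mr->lkey;

struct ibv_send_wr wr, *bad_wr = NULL;
memset(&wr, 0, sizeof(wr));
wr.opcode              = IBV_WR_RDMA_WRITE_WITH_IMM;
wr.imm_data            = htonl(sequence_number);  /* 32-bit tag delivered to the peer */
wr.sg_list             = &sge;
wr.num_sge             = 1;
wr.send_flags          = IBV_SEND_SIGNALED;
wr.wr.rdma.remote_addr = (uintptr_t)thePeerMemoryRegion.addr;
wr.wr.rdma.rkey        = thePeerMemoryRegion.rkey;
ibv_post_send(theQp, &wr, &bad_wr);

/* receiver side: a receive must be posted even though no payload arrives
   through it; the completion carries the immediate data */
struct ibv_recv_wr rwr, *bad_rwr = NULL;
memset(&rwr, 0, sizeof(rwr));      /* zero-length receive, num_sge = 0 */
ibv_post_recv(theQp, &rwr, &bad_rwr);
/* ... later, on the receiver's completion queue ... */
struct ibv_wc wc;
if (ibv_poll_cq(cq, 1, &wc) > 0 && wc.opcode == IBV_WC_RECV_RDMA_WITH_IMM) {
    uint32_t seq = ntohl(wc.imm_data);  /* the data tagged "seq" has landed */
}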
>>
>> Yes, that is true, but I think the sequence is simpler. Assuming the
>> registration needs to occur in the loop (i.e., on some random buffer the
>> user passed in):
>>
>> active side:
>>   loop:
>>     register send buffer
>>     SEND ETH info       <== at this point you could actually loop,
>>                             "sending" more buffers
>>     RECV "got it" mesg  <== this could be another thread which is
>>                             verifying the reception of all data
>>     unregister buffer
>>
>> passive side:
>>   loop:
>>     RECV ETH info
>>     register recv buffer (based on ETH recv)
>>     RDMA READ
>>     unregister buffer
>>     SEND "got it" mesg
>>
>>
>> This is less back-and-forth messaging, since the initial "I have data to
>> send" message contains the ETH info and the passive side can quickly
>> allocate and read it and then send a single message back.
>>
>> But I admit I don't know your exact requirements, so this may not be what
>> you want or need.
>>
>> >
>> > >> wr.sg_list = &sge;
>> > >> wr.num_sge = 1;
>> > >> wr.send_flags = IBV_SEND_SIGNALED;
>> > >> wr.wr.rdma.remote_addr = (uintptr_t)thePeerMemoryRegion.addr;
>> > >> wr.wr.rdma.rkey = thePeerMemoryRegion.rkey;
>> > >>
>> > >> ibv_send_wr *bad_wr = NULL;
>> > >> if (ibv_post_send(theCommunicationIdentifier->qp, &wr, &bad_wr) != 0)
>> > >> {
>> > >> notifyError("Unable to ibv_post_send");
>> > >> }
>> > >>
>> > >> At this point the code waits for the completion, that is:
>> > >>
>> > >> //Wait for completion
>> > >> ibv_cq *cq;
>> > >> void* cq_context;
>> > >> if (ibv_get_cq_event(theCompletionEventChannel, &cq, &cq_context) !=
>> > >> 0) {
>> > >> notifyError("Unable to get a ibv cq event");
>> > >> }
>> > >>
>> > >> ibv_ack_cq_events(cq, 1);
>> > >>
>> > >> if (ibv_req_notify_cq(cq, 0) != 0) {
>> > >> notifyError("Unable to get a req notify");
>> > >> }
>> > >>
>> > >> ibv_wc wc;
>> > >> int myRet = ibv_poll_cq(cq, 1, &wc);
>> > >> if (myRet > 1) {
>> > >> LOG(WARNING) << "Got more than a single ibv_wc, expecting one";
>> > >> }
>> > >>
>> > >>
>> > >> The time from my ibv_post_send to when ibv_get_cq_event returns an
>> > >> event is 13.3 ms when transferring chunks of 8 MB, which works out to
>> > >> around 600 MB/s.
>> > >
>> > > It looks like you are waiting for a completion before doing another
>> > > xfer? Is this the case? That may not be the most efficient.
>> >
>> > I have to implement the following two interfaces, using InfiniBand as
>> > the transport layer:
>> >
>> > Sink::write(buffer)
>> > Source::read(buffer);
>> >
>> > Sink::write and Source::read are the last/first blocks of a pipeline,
>> > and the data flow potentially never ends.
>> >
>
>
> Not knowing your whole application puts us at a disadvantage, but I am
> guessing that the server at the other end of the Infiniband is the largest
> potential source of variable performance. Your incoming data probably comes
> at a somewhat steady rate, and the processing done on that data collection
> node (client side?) is probably running at a steady rate as well. Your
> server has to deal with a highly variable speed device like a disk drive,
> and the Infiniband communications can potentially suffer from interference
> with other traffic. At the risk of repeating myself and what others have
> said, you need to use multiple buffers for sending the data so you can
> tolerate this variability.
Basically, server and client establish a single connection and then that
connection is used to transmit a continuous data flow (a point-to-point
connection). The TX side is the last component of a pipeline and the RX side
is the first component of another pipeline on another physical server.
The data flow is continuous, at a throttled rate.
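To tolerate that variability I am thinking of keeping several RDMA writes in
flight instead of waiting on each one. A rough sketch of what I have in mind
(RING_DEPTH, CHUNK_SIZE, ring_buf, ring_mr, peer_base_addr, peer_rkey,
have_data() and fill() are placeholders of mine, error handling omitted):

/* needs <infiniband/verbs.h>, <string.h>, <stdint.h> */
enum { RING_DEPTH = 4, CHUNK_SIZE = 8 * 1024 * 1024 };
int in_flight = 0;
uint64_t next_id = 0;

for (;;) {
    /* keep up to RING_DEPTH writes posted instead of waiting for each one */
    while (in_flight < RING_DEPTH && have_data()) {
        unsigned slot = (unsigned)(next_id % RING_DEPTH);
        fill(ring_buf[slot], CHUNK_SIZE);             /* produce the next chunk */

        struct ibv_sge sge;
        memset(&sge, 0, sizeof(sge));
        sge.addr   = (uintptr_t)ring_buf[slot];
        sge.length = CHUNK_SIZE;
        sge.lkey   = ring_mr->lkey;

        struct ibv_send_wr wr, *bad_wr = NULL;
        memset(&wr, 0, sizeof(wr));
        wr.wr_id               = next_id++;
        wr.opcode              = IBV_WR_RDMA_WRITE;   /* or _WITH_IMM as above */
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.send_flags          = IBV_SEND_SIGNALED;
        wr.wr.rdma.remote_addr = peer_base_addr + slot * CHUNK_SIZE;
        wr.wr.rdma.rkey        = peer_rkey;

        if (ibv_post_send(qp, &wr, &bad_wr) != 0)
            break;                                    /* real code: handle the error */
        ++in_flight;
    }

    struct ibv_wc wc;
    if (ibv_poll_cq(cq, 1, &wc) > 0) {                /* reap one completion, then refill */
        --in_flight;
        /* wc.wr_id says which ring slot is free to reuse */
    }
}

The receiver would still need a "done"/credit message (or an immediate) per
slot before that slot can be overwritten; the point is only that the sender
never sits idle waiting for a single completion.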
>>
>> > >>
>> > >> To specify more (in pseudocode what I do globally):
>> > >>
>> > >> Active Side:
>> > >>
>> > >> post a message receive
>> > >> rdma connection
>> > >> wait for rdma connection event
>> > >> <<at this point transfer tx flow starts>>
>> > >> start:
>> > >> register memory containing bytes to transfer
>> > >
>> > > I believe Sean mentioned you should avoid doing memory registration in
>> > > any areas of the code where performance is critical. I agree with him.
>> >
>> > Well, I can register/unregister once, but that means each time I have to
>> > transfer something (see my interfaces above) I have to issue a memcpy on
>> > the sending side and on the receiving side.
>> > Is a memcpy cheaper than an ibv_reg_mr/ibv_dereg_mr?
>>
>> I suspect so for small messages. I have never profiled it, but there is
>> much evidence of this. I wish I could find the paper I read recently
>> regarding efficient RDMA memory usage, sorry.
>
>
> It is a complicated subject. You can benchmark the
> memcpy()/memmove()/bcopy() functions to see exactly what your processor does
> (and which one works best). Modern processors easily move over 10 GB/sec
> when things are aligned and in the right place, but this can be highly
> variable depending on system architecture. The MPI people have probably
> done the most work in this area, and papers about this can be found on their
> websites. If I read the original post properly, it seems like transfers are
> around 8 MB, and I would be inclined to just do RDMA from buffers like that.
> I think I see 1 MB indicated below, and I still would be inclined to do RDMA
> and avoid possible complications of the memory copy needing to happen on a
> different CPU core.
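For what it's worth, before deciding I will probably time it on my own
machine with a quick sketch along these lines (pd, src and dst are
placeholders; a single iteration, so not a rigorous benchmark):

#include <infiniband/verbs.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

static double elapsed_ms(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

/* time one ibv_reg_mr/ibv_dereg_mr pair against one memcpy of the same size */
static void compare_once(struct ibv_pd *pd, void *src, void *dst, size_t size)
{
    struct timespec t0, t1, t2;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    struct ibv_mr *mr = ibv_reg_mr(pd, src, size,
                                   IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ);
    ibv_dereg_mr(mr);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    memcpy(dst, src, size);
    clock_gettime(CLOCK_MONOTONIC, &t2);

    /* real measurements should loop and touch the pages first */
    printf("reg+dereg: %.3f ms   memcpy: %.3f ms\n",
           elapsed_ms(t0, t1), elapsed_ms(t1, t2));
}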
>
> I am a little unclear about when rdma connections happen in this
> application. Reading the post, it seems like this is happening for each
> transfer. There is a lot of overhead setting up a connection and tearing it
> down, so I hope I did not read that correctly. Otherwise, you will see a
> significant improvement if you keep track of the connection and only make a
> connection when there is not one, and only remove the connection when it
> fails.
The connection is persistent.
> The creation and destruction of memory regions is an expensive operation.
> It cannot be done with the OS bypass, but instead the verbs library makes a
> request to the verbs driver, which contacts the HCA driver, and then sets up
> (or destroys) the memory region. The OS bypass allows millions of SEND or
> RDMA_* operations per second, while the memory region requests only run at
> thousands per second. Also, remember that a memory region involves locking
> the region's pages in memory, which can be a lengthy process in the
> operating system.
>
> One important optimization is that protection domains are associated with
> HCAs, and memory regions are associated with protection domains. This means
> that you don't need a queue pair or connection to manipulate them. If you
> can tolerate large amounts of memory locked down, which is common in these
> kinds of applications, you should just create a memory region that
> encompasses all of the memory you will be using for your various buffers. A
> more complicated version of this would be to create a memory region for each
> allocation of memory, and then to look up which memory region is associated
> with a specific buffer. I suspect that rsocket code does something like
> this.
Yes, that's an idea. I have to be sure (as is already the case) that the
buffers are not continuously allocated/deallocated.
I'll try to create a hash table mapping buffer -> memory region to avoid
those registrations/deregistrations, and I'll post what I get.
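Roughly what I have in mind (a sketch only; the class and member names are
mine, and it assumes each buffer is long-lived and always looked up with the
same size):

#include <infiniband/verbs.h>
#include <cstddef>
#include <unordered_map>

// Cache of buffer -> memory region, so each buffer is registered only once.
class MrCache {
public:
    explicit MrCache(ibv_pd* pd) : thePd(pd) {}

    ~MrCache() {
        for (auto& entry : theMap) ibv_dereg_mr(entry.second);
    }

    // Returns the MR covering (buffer, size), registering it on first use.
    ibv_mr* get(void* buffer, std::size_t size) {
        auto it = theMap.find(buffer);
        if (it != theMap.end()) return it->second;
        ibv_mr* mr = ibv_reg_mr(thePd, buffer, size,
                                IBV_ACCESS_LOCAL_WRITE |
                                IBV_ACCESS_REMOTE_READ |
                                IBV_ACCESS_REMOTE_WRITE);
        if (mr) theMap[buffer] = mr;
        return mr;
    }

private:
    ibv_pd* thePd;
    std::unordered_map<void*, ibv_mr*> theMap;
};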
> Regards,
> Dave McMillen
>
>>
>>
>> >
>> > >> wait remote memory region addr/key ( I wait for a ibv_wc)
>> > >> send data with ibv_post_send
>> > >> post a message receive
>> > >> wait for ibv_post_send event ( I wait for a ibv_wc) (this lasts 13.3
>> > >> ms)
>> > >> send message "DONE"
>> > >> unregister memory
>> > >
>> > > This applies to unregistration of memory as well.
>> > >
>> > >> goto start
>> > >>
>> > >> Passive Side:
>> > >>
>> > >> post a message receive
>> > >> rdma accept
>> > >> wait for rdma connection event
>> > >> <<at this point transfer rx flow starts>>
>> > >> start:
>> > >> register memory that has to receive the bytes
>> > >> send addr/key of memory registered
>> > >> wait "DONE" message
>> > >> unregister memory
>> > >> post a message receive
>> > >> goto start
>> > >>
>> > >> Does anyone know what I'm doing wrong? Or what I can improve?
>> > >> I'm not affected by "Not Invented Here" syndrome, so I'm even open to
>> > >> throwing away what I have done until now and adopting something else.
>> > >>
>> > >> I only need a point to point contiguous transfer.
>> > >
>> > > How big is this transfer?
>> > >
>> > > It may be that doing send/recv or write with immediate would work
>> > > better for you. Also have you seen Sean's rsocket project?
>> >
>> > The transfers are around 1MB each time.
>>
>> I suspect that is big enough that doing the registration should be more
>> efficient than the memcpy.
>>
>> One other thing. Do you need to wait for the completion of each buffer
>> before
>> posting the RDMA WRITE of the next?
>>
>> Ira
>>
>> >
>> > I have seen the rsocket project; I have played a bit with rstream and
>> > this is what I get:
>> >
>> > $ ./examples/rstream -s 10.30.3.2 -S all
>> > name        bytes   xfers   iters   total    time    Gb/sec   usec/xfer
>> > 64_lat 64 1 1m 122m 4.35s 0.24 2.17
>> > 128_lat 128 1 1m 244m 4.70s 0.44 2.35
>> > 192_lat 192 1 1m 366m 4.87s 0.63 2.44
>> > 256_lat 256 1 1m 488m 6.68s 0.61 3.34
>> > 384_lat 384 1 1m 732m 7.14s 0.86 3.57
>> > 512_lat 512 1 1m 976m 8.44s 0.97 4.22
>> > 768_lat 768 1 1m 1.4g 10.35s 1.19 5.18
>> > 1k_lat 1k 1 100k 195m 1.02s 1.60 5.12
>> > 1.5k_lat 1.5k 1 100k 292m 1.32s 1.86 6.60
>> > 2k_lat 2k 1 100k 390m 1.61s 2.03 8.07
>> > 3k_lat 3k 1 100k 585m 1.87s 2.63 9.36
>> > 4k_lat 4k 1 100k 781m 2.39s 2.74 11.95
>> > 6k_lat 6k 1 100k 1.1g 2.83s 3.47 14.15
>> > 8k_lat 8k 1 100k 1.5g 3.51s 3.73 17.56
>> > 12k_lat 12k 1 10k 234m 0.44s 4.45 22.09
>> > 16k_lat 16k 1 10k 312m 0.58s 4.56 28.75
>> > 24k_lat 24k 1 10k 468m 0.76s 5.14 38.25
>> > 32k_lat 32k 1 10k 625m 1.02s 5.12 51.21
>> > 48k_lat 48k 1 10k 937m 1.27s 6.20 63.40
>> > 64k_lat 64k 1 10k 1.2g 1.93s 5.43 96.63
>> > 96k_lat 96k 1 10k 1.8g 2.49s 6.33 124.29
>> > 128k_lat 128k 1 1k 250m 0.30s 7.00 149.89
>> > 192k_lat 192k 1 1k 375m 0.49s 6.48 242.76
>> > 256k_lat 256k 1 1k 500m 0.73s 5.75 364.85
>> > 384k_lat 384k 1 1k 750m 1.10s 5.73 549.16
>> > 512k_lat 512k 1 1k 1000m 1.51s 5.54 757.02
>> > 768k_lat 768k 1 1k 1.4g 1.68s 7.48 841.05
>> > 1m_lat 1m 1 100 200m 0.28s 6.05 1385.61
>> > 1.5m_lat 1.5m 1 100 300m 0.41s 6.20 2029.05
>> > 2m_lat 2m 1 100 400m 0.54s 6.27 2675.73
>> > 3m_lat 3m 1 100 600m 0.55s 9.13 2757.71
>> > 4m_lat 4m 1 100 800m 1.04s 6.45 5205.38
>> > 6m_lat 6m 1 100 1.1g 1.56s 6.46 7794.85
>> > 64_bw 64 1m 1 122m 1.38s 0.74 0.69
>> > 128_bw 128 1m 1 244m 0.83s 2.46 0.42
>> > 192_bw 192 1m 1 366m 1.42s 2.16 0.71
>> > 256_bw 256 1m 1 488m 1.43s 2.87 0.71
>> > 384_bw 384 1m 1 732m 1.46s 4.21 0.73
>> > 512_bw 512 1m 1 976m 1.66s 4.94 0.83
>> > 768_bw 768 1m 1 1.4g 2.35s 5.24 1.17
>> > 1k_bw 1k 100k 1 195m 0.31s 5.34 1.54
>> > 1.5k_bw 1.5k 100k 1 292m 0.44s 5.57 2.21
>> > 2k_bw 2k 100k 1 390m 0.51s 6.41 2.56
>> > 3k_bw 3k 100k 1 585m 0.86s 5.71 4.30
>> > 4k_bw 4k 100k 1 781m 1.02s 6.41 5.11
>> > 6k_bw 6k 100k 1 1.1g 1.53s 6.45 7.63
>> > 8k_bw 8k 100k 1 1.5g 2.04s 6.42 10.21
>> > 12k_bw 12k 10k 1 234m 0.30s 6.46 15.22
>> > 16k_bw 16k 10k 1 312m 0.40s 6.48 20.21
>> > 24k_bw 24k 10k 1 468m 0.60s 6.55 30.04
>> > 32k_bw 32k 10k 1 625m 0.81s 6.51 40.27
>> > 48k_bw 48k 10k 1 937m 1.20s 6.53 60.21
>> > 64k_bw 64k 10k 1 1.2g 1.60s 6.54 80.16
>> > 96k_bw 96k 10k 1 1.8g 2.33s 6.75 116.48
>> > 128k_bw 128k 1k 1 250m 0.32s 6.51 161.03
>> > 192k_bw 192k 1k 1 375m 0.48s 6.52 241.36
>> > 256k_bw 256k 1k 1 500m 0.64s 6.51 321.99
>> > 384k_bw 384k 1k 1 750m 0.78s 8.06 390.40
>> > 512k_bw 512k 1k 1 1000m 1.29s 6.52 643.09
>> > 768k_bw 768k 1k 1 1.4g 1.97s 6.38 986.84
>> > 1m_bw 1m 100 1 200m 0.26s 6.37 1316.86
>> > 1.5m_bw 1.5m 100 1 300m 0.27s 9.36 1343.65
>> > 2m_bw 2m 100 1 400m 0.53s 6.36 2638.12
>> > 3m_bw 3m 100 1 600m 0.80s 6.31 3988.59
>> > 4m_bw 4m 100 1 800m 1.07s 6.28 5341.27
>> > 6m_bw 6m 100 1 1.1g 1.00s 10.09 4988.12
>> >
>> > So it seems that a good buffer size is 6 MB, getting 10.09 Gb/sec
>> > (1291 MB/sec), and that is quite good.
>> >
>> > But running only the 6 MB test I get only 6.8 Gb/sec:
>> >
>> > $ ./examples/rstream -s 10.30.3.2 -S 6291456 -C 100
>> > name        bytes   xfers   iters   total    time    Gb/sec   usec/xfer
>> > custom 6m 100 1 1.1g 1.48s 6.81 7395.56
>> >
>> >
>> > Sean told me that when running "custom" size tests the settings for the
>> > transfer are different. I took a look at the code, and indeed with the
>> > "custom" tests the bandwidth optimization:
>> > val = 0;
>> > rs_setsockopt(rs, SOL_RDMA, RDMA_INLINE, &val, sizeof val);
>> >
>> > is not applied; but even forcing that call, the throughput I'm getting
>> > for the 6 MB transfer is still 6.81 Gb/sec (871 MB/sec).
>> >
>> > Gaetano
>> >
>> >
>> >
>> > > Hope this helps,
>> > > Ira
>> > >
>> > >>
>> > >>
>> > >> Regards
>> > >> Gaetano Mendola
>> > >>
>> > >>
>> > >> --
>> > >> cpp-today.blogspot.com
>> > >
>> > >
>> > > --
>> > > Ira Weiny
>> > > Member of Technical Staff
>> > > Lawrence Livermore National Lab
>> > > 925-423-8008
>> > > weiny2 at llnl.gov
>> >
>> >
>> >
>> > --
>> > cpp-today.blogspot.com
>>
>>
>> --
>> Ira Weiny
>> Member of Technical Staff
>> Lawrence Livermore National Lab
>> 925-423-8008
>> weiny2 at llnl.gov
>
>
--
cpp-today.blogspot.com