[Users] infiniband rdma poor transfer bw
Gaetano Mendola
mendola at gmail.com
Tue Aug 28 07:48:57 PDT 2012
Thank you for your ideas, my replies inline below:
On Tue, Aug 28, 2012 at 2:22 PM, David McMillen
<davem at systemfabricworks.com> wrote:
>
> I have added some other ideas inline below:
>
> On Mon, Aug 27, 2012 at 5:19 PM, Ira Weiny <weiny2 at llnl.gov> wrote:
>>
>> On Mon, 27 Aug 2012 23:21:35 +0200
>> Gaetano Mendola <mendola at gmail.com> wrote:
>>
>> > On Mon, Aug 27, 2012 at 6:47 PM, Ira Weiny <weiny2 at llnl.gov> wrote:
>> > > Gaetano,
>> > >
>> > > Yes, this is the correct list. Did you also post a similar message to
>> > > linux-rdma? I seem to recall a similar thread there. If so, I think Sean
>> > > gave some good advice and you should follow it. If that was not you, see
>> > > my response (from limited experience) below.
>> >
>> > I'll write to linux-rdma as soon as I have collected some more data from
>> > my experiments.
>> > I replied inline:
>> >
>> > > On Fri, 24 Aug 2012 00:51:05 +0200
>> > > Gaetano Mendola <mendola at gmail.com> wrote:
>> > >
>> > >> Hi all,
>> > >> I'm sorry in advance if this is not the right mailing list for my
>> > >> question.
>> > >>
>> > >> In my application I use an InfiniBand infrastructure to send a stream
>> > >> of data from one server to another. To ease development I used IP over
>> > >> InfiniBand, because I'm more familiar with socket programming. Until
>> > >> now the performance (max bandwidth) was good enough for me (I knew I
>> > >> wasn't getting the maximum achievable bandwidth); now I need to get
>> > >> more bandwidth out of that InfiniBand connection.
>> > >
>> > > Getting good performance can be tricky with RDMA. The biggest difficulty
>> > > I have had (and have read/heard about) is dealing with memory
>> > > registrations.
>> > >
>> > >>
>> > >> ib_write_bw claims that my max achievable bandwidth is around 1500
>> > >> MB/s (I'm not getting 3000 MB/s because my card is installed in a
>> > >> PCIe 2.0 x8 slot).
>
>
> This maximum of 1500 just seems wrong, although I don't know what your
> hardware is. In my experience, PCIe 2.0 8x will run close to 3000, with
> observed numbers as high as 3200 and as low as 2500. I generally expect at
> least 2800 for an aggressive application like ib_write_bw. Here are some
> common problems you might look for:
Indeed, my slot is a PCIe 2.0 x4; that's why 1500 MB/s is about what I was
expecting.
> 1) The 1500 number is what I would expect from using a PCIe slot that was
> physically able to accept an 8x card, but only implemented 4x for the
> connections. You should check the documentation for the motherboard to see
> if that is what is happening, as it is common for many motherboards to have
> a slot like this. You can also look at the output of "lspci -vv" where you
> will see a line with something like "LnkCap:" showing the width the device
> is capable of using and another line with something like "LnkSta:" showing
> the width the device is actually using. If this is happening and you have a
> true 8x slot available, you should move the card. Note that this problem
> could be on either system and it would slow down both.
Unfortunately my system has only four x16 slots and those are already used
by four GPU boards. I know I'm not getting the maximum bandwidth from that
card, but it's fine for my application.
This is what lspci says about that slot:
01:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe
2.0 5GT/s - IB QDR / 10GigE] (rev b0)
Subsystem: Mellanox Technologies Device 0036
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 256 bytes
Interrupt: pin A routed to IRQ 16
Region 0: Memory at fbb00000 (64-bit, non-prefetchable) [size=1M]
Region 2: Memory at f1800000 (64-bit, prefetchable) [size=8M]
Capabilities: <access denied>
Kernel driver in use: mlx4_core
Kernel modules: mlx4_core
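(lspci shows "Capabilities: <access denied>" here, presumably because it was
not run as root; "sudo lspci -vv -s 01:00.0" should reveal the LnkCap/LnkSta
lines you mention and confirm the x4 link width.)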
> 2) Your system may have NUMA memory issues. Look at the output of
> "numactl --hardware" and see how many nodes are available (first line). If
> there is more than 1 available node, you may be falling victim to the
> internal movement of the data across NUMA nodes. This usually shows up as
> inconsistent runs, which you have observed with the rsocket tests, so there
> may be something to this. I have seen systems with ib_write_bw test results
> that reach 3000 MB/s when positioned on the best NUMA node, and then as low
> as 1200 MB/s when running on the worst NUMA node. You can investigate this
> further by doing ib_write_bw tests using the numactl command to force a
> particular NUMA node to be used. Assuming problems may exist on both ends
> of the link, you need to run the test with "numactl --membind=0
> --cpunodebind=0 ib_write_bw -a" through "numactl --membind=N --cpunodebind=N
> ib_write_bw -a" on the server side (N being the largest node available).
> For each of the NUMA nodes on the server, you would then run the client
> using "numactl --membind=0 --cpunodebind=0 ib_write_bw -a serverip" through
> "numactl --membind=N --cpunodebind=N ib_write_bw -a serverip" for all NUMA
> nodes on the client system. It will be clear which NUMA node(s) are giving
> you the best throughput. If your application can fit within the memory and
> cpu constraints of those NUMA nodes, you can simply run your application
> under the same constraints (the node specified can be a list of nodes if
> more than one gives good results).
My system is a NUMA system with 2 nodes and the IB board is attached to node 0:
$ ./lstopo
Machine (24GB)
NUMANode L#0 (P#0 12GB)
Socket L#0 + L3 L#0 (12MB)
L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0 + PU L#0 (P#0)
L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1 + PU L#1 (P#1)
L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2 + PU L#2 (P#2)
L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3 + PU L#3 (P#3)
L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4 + PU L#4 (P#4)
L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5 + PU L#5 (P#5)
HostBridge L#0
PCIBridge
PCI 15b3:673c
Net L#0 "ib0"
Net L#1 "ib1"
OpenFabrics L#2 "mlx4_0"
PCIBridge
PCI 10de:06d1
PCIBridge
PCI 10de:06d1
PCIBridge
PCI 8086:10d3
Net L#3 "eth0"
PCIBridge
PCI 8086:10d3
Net L#4 "eth1"
PCIBridge
PCI 102b:0532
PCI 8086:3a22
Block L#5 "sda"
Block L#6 "sdb"
Block L#7 "sdc"
Block L#8 "sdd"
Block L#9 "sr0"
NUMANode L#1 (P#1 12GB)
Socket L#1 + L3 L#1 (12MB)
L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6 + PU L#6 (P#6)
L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7 + PU L#7 (P#7)
L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8 + PU L#8 (P#8)
L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9 + PU L#9 (P#9)
L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10 + PU L#10 (P#10)
L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11 + PU L#11 (P#11)
HostBridge L#7
PCIBridge
PCI 10de:06d1
PCIBridge
PCI 10de:06d1
(Both server and client have the same hardware.) I did try all four
combinations of ib_write_bw bound to either NUMA node on the server and on
the client, but nothing changes: it still reports 1500 MB/s no matter which
node it runs on.
> 3) Perhaps your link is running at DDR speed instead of QDR speed,
> although even with DDR I would expect a number above 1900 MB/s. Look at the
> output of "ibstatus" on both the server and the client. If there are switch
> links involved you should look at them as well -- "ibnetdiscover --ports"
> shows link width and speed, but you have to find the links in use in that
> output.
The link is QDR, but as I wrote above the PCIe slot is x4:
$ ibstatus
Infiniband device 'mlx4_0' port 1 status:
default gid: fe80:0000:0000:0000:0002:c903:004e:2959
base lid: 0x4
sm lid: 0x1
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 40 Gb/sec (4X QDR)
Infiniband device 'mlx4_0' port 2 status:
default gid: fe80:0000:0000:0000:0002:c903:004e:295a
base lid: 0x0
sm lid: 0x0
state: 1: DOWN
phys state: 2: Polling
rate: 70 Gb/sec (4X)
>>
>> > >>
>> > >> So far so good. I coded my communication channel using ibverbs and
>> > >> RDMA, but I'm getting far less bandwidth than I could; I'm even
>> > >> getting a bit less bandwidth than with sockets, although at least my
>> > >> application doesn't use any CPU:
>> > >>
>> > >> ib_write_bw: 1500 MB/s
>> > >>
>> > >> sockets: 700 MB/s <= One core of my system is at 100% during this
>> > >> test
>> > >>
>> > >> ibverbs+rdma: 600 MB/s <= No CPU is used at all during this test
>> > >>
>> > >> It seems that the bottleneck is here:
>> > >>
>> > >> ibv_sge sge;
>> > >> sge.addr = (uintptr_t)memory_to_transfer;
>> > >> sge.length = memory_to_transfer_size;
>> > >> sge.lkey = memory_to_transfer_mr->lkey;
>> > >>
>> > >> ibv_send_wr wr;
>> > >> memset(&wr, 0, sizeof(wr));
>> > >> wr.wr_id = 0;
>> > >> wr.opcode = IBV_WR_RDMA_WRITE;
>> > >
>> > > Generally, I have thought that RDMA READ is easier to deal with than
>> > > RDMA WRITE. As you have found, when you do a RDMA WRITE there is an extra
>> > > RDMA_SEND step to tell the remote side the write has been completed. If the
>> > > remote side does a RDMA_READ then they will know the data is available when
>> > > they see the WC come back on that end. So the only "extra" send/recv
>> > > required for verbs is the initial transfer of the ETH (addr, size, rkey)
>> > > information.
>> >
>> > How would the "sender side" know that the reading side is done, so that
>> > the buffer being read can be overwritten?
>
>
> Another choice is to use IBV_WR_RDMA_WRITE_WITH_IMM which will create a
> completion for the recipient of the data. However, in my experience there
> is the need for some kind of flow control (ready/done) messages to be sent
> in both directions using IBV_WR_SEND anyway, as Ira suggests. It isn't so
> much the use of RDMA_READ versus RDMA_WRITE as it is a concept of the client
> saying "server, go do this transaction" and the server responding with
> "transaction done". For the highest speed operations, you need to set it up
> so the client can request multiple transactions (at least two, and if disk
> transfers are involved it should be at least 0.25 seconds worth, ideally a
> whole second) before seeing a completion from the server.
I'll try IBV_WR_RDMA_WRITE_WITH_IMM to avoid the extra "DONE" send; as I
understand it, with IBV_WR_RDMA_WRITE_WITH_IMM the receiver is notified
through a completion on a posted receive, right?
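Something like this minimal sketch is what I have in mind (just a sketch:
theQp, cq, buffer, buffer_mr and thePeerMemoryRegion are placeholders from my
code, and error handling is omitted):

/* needs <infiniband/verbs.h> and <arpa/inet.h> */

/* sender side: an RDMA write that also raises a completion on the receiver */
struct ibv_sge sge;
memset(&sge, 0, sizeof(sge));
sge.addr   = (uintptr_t)buffer;
sge.length = buffer_size;
sge.lkey   = buffer_mr->lkey;

struct ibv_send_wr wr, *bad_wr = NULL;
memset(&wr, 0, sizeof(wr));
wr.opcode              = IBV_WR_RDMA_WRITE_WITH_IMM;
wr.imm_data            = htonl(sequence_number);  /* 32-bit tag delivered to the peer */
wr.sg_list             = &sge;
wr.num_sge             = 1;
wr.send_flags          = IBV_SEND_SIGNALED;
wr.wr.rdma.remote_addr = (uintptr_t)thePeerMemoryRegion.addr;
wr.wr.rdma.rkey        = thePeerMemoryRegion.rkey;
ibv_post_send(theQp, &wr, &bad_wr);

/* receiver side: a receive must be posted even though no payload arrives
   through it; the completion carries the immediate data */
struct ibv_recv_wr rwr, *bad_rwr = NULL;
memset(&rwr, 0, sizeof(rwr));      /* zero-length receive, num_sge = 0 */
ibv_post_recv(theQp, &rwr, &bad_rwr);
/* ... later, on the receiver's completion queue ... */
struct ibv_wc wc;
if (ibv_poll_cq(cq, 1, &wc) > 0 && wc.opcode == IBV_WC_RECV_RDMA_WITH_IMM) {
    uint32_t seq = ntohl(wc.imm_data);  /* the data tagged "seq" has landed */
}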
>>
>> Yes, that is true, but I think the sequence is simpler. Assuming the
>> registration needs to occur in the loop (i.e., on some random buffer the
>> user passed in):
>>
>> active side:
>>   loop:
>>     register send buffer
>>     SEND ETH info       <== at this point you could actually loop,
>>                             "sending" more buffers
>>     RECV "got it" mesg  <== this could be another thread which is
>>                             verifying the reception of all data
>>     unregister buffer
>>
>> passive side:
>>   loop:
>>     RECV ETH info
>>     register recv buffer (based on ETH recv)
>>     RDMA READ
>>     unregister buffer
>>     SEND "got it" mesg
>>
>>
>> This is less back-and-forth messaging, since the initial "I have data to
>> send" message contains the ETH info and the passive side can quickly
>> allocate and read it and then send a single message back.
>>
>> But I admit I don't know your exact requirements, so this may not be what
>> you want or need.
>>
>> >
>> > >> wr.sg_list = &sge;
>> > >> wr.num_sge = 1;
>> > >> wr.send_flags = IBV_SEND_SIGNALED;
>> > >> wr.wr.rdma.remote_addr = (uintptr_t)thePeerMemoryRegion.addr;
>> > >> wr.wr.rdma.rkey = thePeerMemoryRegion.rkey;
>> > >>
>> > >> ibv_send_wr *bad_wr = NULL;
>> > >> if (ibv_post_send(theCommunicationIdentifier->qp, &wr, &bad_wr) != 0)
>> > >> {
>> > >> notifyError("Unable to ibv_post_send");
>> > >> }
>> > >>
>> > >> At this point the code waits for the completion, that is:
>> > >>
>> > >> //Wait for completion
>> > >> ibv_cq *cq;
>> > >> void* cq_context;
>> > >> if (ibv_get_cq_event(theCompletionEventChannel, &cq, &cq_context) !=
>> > >> 0) {
>> > >> notifyError("Unable to get a ibv cq event");
>> > >> }
>> > >>
>> > >> ibv_ack_cq_events(cq, 1);
>> > >>
>> > >> if (ibv_req_notify_cq(cq, 0) != 0) {
>> > >> notifyError("Unable to get a req notify");
>> > >> }
>> > >>
>> > >> ibv_wc wc;
>> > >> int myRet = ibv_poll_cq(cq, 1, &wc);
>> > >> if (myRet > 1) {
>> > >> LOG(WARNING) << "Got more than a single ibv_wc, expecting one";
>> > >> }
>> > >>
>> > >>
>> > >> The time from my ibv_post_send to when ibv_get_cq_event returns an
>> > >> event is 13.3 ms when transferring chunks of 8 MB, which works out to
>> > >> around 600 MB/s.
>> > >
>> > > It looks like you are waiting for a completion before doing another
>> > > xfer? Is this the case? That may not be the most efficient.
>> >
>> > I have to implement the following two interfaces, using InfiniBand as
>> > the transport layer:
>> >
>> > Sink::write(buffer)
>> > Source::read(buffer);
>> >
>> > Sink::write and Source::read are the last/first blocks of a pipeline,
>> > and the data flow potentially never ends.
>> >
>
>
> Not knowing your whole application puts us at a disadvantage, but I am
> guessing that the server at the other end of the Infiniband is the largest
> potential source of variable performance. Your incoming data probably comes
> at a somewhat steady rate, and the processing done on that data collection
> node (client side?) is probably running at a steady rate as well. Your
> server has to deal with a highly variable speed device like a disk drive,
> and the Infiniband communications can potentially suffer from interference
> with other traffic. At the risk of repeating myself and what others have
> said, you need to use multiple buffers for sending the data so you can
> tolerate this variability.
Basically, server and client establish a single connection and then that
connection is used to transmit a continuous data flow (a point-to-point
connection). The TX side is the last component of a pipeline and the RX side
is the first component of another pipeline on another physical server.
The data flow is continuous, at a throttled rate.
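To tolerate that variability I am thinking of keeping several RDMA writes in
flight instead of waiting on each one. A rough sketch of what I have in mind
(RING_DEPTH, CHUNK_SIZE, ring_buf, ring_mr, peer_base_addr, peer_rkey,
have_data() and fill() are placeholders of mine, error handling omitted):

/* needs <infiniband/verbs.h>, <string.h>, <stdint.h> */
enum { RING_DEPTH = 4, CHUNK_SIZE = 8 * 1024 * 1024 };
int in_flight = 0;
uint64_t next_id = 0;

for (;;) {
    /* keep up to RING_DEPTH writes posted instead of waiting for each one */
    while (in_flight < RING_DEPTH && have_data()) {
        unsigned slot = (unsigned)(next_id % RING_DEPTH);
        fill(ring_buf[slot], CHUNK_SIZE);             /* produce the next chunk */

        struct ibv_sge sge;
        memset(&sge, 0, sizeof(sge));
        sge.addr   = (uintptr_t)ring_buf[slot];
        sge.length = CHUNK_SIZE;
        sge.lkey   = ring_mr->lkey;

        struct ibv_send_wr wr, *bad_wr = NULL;
        memset(&wr, 0, sizeof(wr));
        wr.wr_id               = next_id++;
        wr.opcode              = IBV_WR_RDMA_WRITE;   /* or _WITH_IMM as above */
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.send_flags          = IBV_SEND_SIGNALED;
        wr.wr.rdma.remote_addr = peer_base_addr + slot * CHUNK_SIZE;
        wr.wr.rdma.rkey        = peer_rkey;

        if (ibv_post_send(qp, &wr, &bad_wr) != 0)
            break;                                    /* real code: handle the error */
        ++in_flight;
    }

    struct ibv_wc wc;
    if (ibv_poll_cq(cq, 1, &wc) > 0) {                /* reap one completion, then refill */
        --in_flight;
        /* wc.wr_id says which ring slot is free to reuse */
    }
}

The receiver would still need a "done"/credit message (or an immediate) per
slot before that slot can be overwritten; the point is only that the sender
never sits idle waiting for a single completion.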
>>
>> > >>
>> > >> To specify more (in pseudocode what I do globally):
>> > >>
>> > >> Active Side:
>> > >>
>> > >> post a message receive
>> > >> rdma connection
>> > >> wait for rdma connection event
>> > >> <<at this point transfer tx flow starts>>
>> > >> start:
>> > >> register memory containing bytes to transfer
>> > >
>> > > I believe Sean mentioned you should avoid doing memory registration in
>> > > any areas of the code where performance is critical. I agree with him.
>> >
>> > Well, I can register/unregister once, but that means each time I have to
>> > transfer something (see my interfaces above) I have to issue a memcpy on
>> > the sending side and on the receiving side.
>> > Is a memcpy cheaper than an ibv_reg_mr/ibv_dereg_mr?
>>
>> I suspect so for small messages. I have never profiled it, but there is
>> much evidence of this. I wish I could find the paper I read recently
>> regarding efficient RDMA memory usage, sorry.
>
>
> It is a complicated subject. You can benchmark the
> memcpy()/memmove()/bcopy() functions to see exactly what your processor does
> (and which one works best). Modern processors easily move over 10 GB/sec
> when things are aligned and in the right place, but this can be highly
> variable depending on system architecture. The MPI people have probably
> done the most work in this area, and papers about this can be found on their
> websites. If I read the original post properly, it seems like transfers are
> around 8 MB, and I would be inclined to just do RDMA from buffers like that.
> I think I see 1 MB indicated below, and I still would be inclined to do RDMA
> and avoid possible complications of the memory copy needing to happen on a
> different CPU core.
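For what it's worth, before deciding I will probably time it on my own
machine with a quick sketch along these lines (pd, src and dst are
placeholders; a single iteration, so not a rigorous benchmark):

#include <infiniband/verbs.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

static double elapsed_ms(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

/* time one ibv_reg_mr/ibv_dereg_mr pair against one memcpy of the same size */
static void compare_once(struct ibv_pd *pd, void *src, void *dst, size_t size)
{
    struct timespec t0, t1, t2;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    struct ibv_mr *mr = ibv_reg_mr(pd, src, size,
                                   IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ);
    ibv_dereg_mr(mr);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    memcpy(dst, src, size);
    clock_gettime(CLOCK_MONOTONIC, &t2);

    /* real measurements should loop and touch the pages first */
    printf("reg+dereg: %.3f ms   memcpy: %.3f ms\n",
           elapsed_ms(t0, t1), elapsed_ms(t1, t2));
}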
>
> I am a little unclear about when rdma connections happen in this
> application. Reading the post, it seems like this is happening for each
> transfer. There is a lot of overhead setting up a connection and tearing it
> down, so I hope I did not read that correctly. Otherwise, you will see a
> significant improvement if you keep track of the connection and only make a
> connection when there is not one, and only remove the connection when it
> fails.
The connection is persistent.
> The creation and destruction of memory regions is an expensive operation.
> It cannot be done with the OS bypass, but instead the verbs library makes a
> request to the verbs driver, which contacts the HCA driver, and then sets up
> (or destroys) the memory region. The OS bypass allows millions of SEND or
> RDMA_* operations per second, while the memory region requests only run at
> thousands per second. Also, remember that a memory region involves locking
> the region's pages in memory, which can be a lengthy process in the
> operating system.
>
> One important optimization is that protection domains are associated with
> HCAs, and memory regions are associated with protection domains. This means
> that you don't need a queue pair or connection to manipulate them. If you
> can tolerate large amounts of memory locked down, which is common in these
> kinds of applications, you should just create a memory region that
> encompasses all of the memory you will be using for your various buffers. A
> more complicated version of this would be to create a memory region for each
> allocation of memory, and then to look up which memory region is associated
> with a specific buffer. I suspect that rsocket code does something like
> this.
Yes, that's an idea. I have to be sure (as is already the case) that the
buffers are not continuously allocated/deallocated.
I'll try to create a hash table mapping buffer -> memory region to avoid
those registrations/deregistrations, and I'll post what I get.
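Roughly what I have in mind (a sketch only; the class and member names are
mine, and it assumes each buffer is long-lived and always looked up with the
same size):

#include <infiniband/verbs.h>
#include <cstddef>
#include <unordered_map>

// Cache of buffer -> memory region, so each buffer is registered only once.
class MrCache {
public:
    explicit MrCache(ibv_pd* pd) : thePd(pd) {}

    ~MrCache() {
        for (auto& entry : theMap) ibv_dereg_mr(entry.second);
    }

    // Returns the MR covering (buffer, size), registering it on first use.
    ibv_mr* get(void* buffer, std::size_t size) {
        auto it = theMap.find(buffer);
        if (it != theMap.end()) return it->second;
        ibv_mr* mr = ibv_reg_mr(thePd, buffer, size,
                                IBV_ACCESS_LOCAL_WRITE |
                                IBV_ACCESS_REMOTE_READ |
                                IBV_ACCESS_REMOTE_WRITE);
        if (mr) theMap[buffer] = mr;
        return mr;
    }

private:
    ibv_pd* thePd;
    std::unordered_map<void*, ibv_mr*> theMap;
};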
> Regards,
> Dave McMillen
>
>>
>>
>> >
>> > >> wait remote memory region addr/key ( I wait for a ibv_wc)
>> > >> send data with ibv_post_send
>> > >> post a message receive
>> > >> wait for ibv_post_send event ( I wait for a ibv_wc) (this lasts 13.3
>> > >> ms)
>> > >> send message "DONE"
>> > >> unregister memory
>> > >
>> > > This applies to unregistration of memory as well.
>> > >
>> > >> goto start
>> > >>
>> > >> Passive Side:
>> > >>
>> > >> post a message receive
>> > >> rdma accept
>> > >> wait for rdma connection event
>> > >> <<at this point transfer rx flow starts>>
>> > >> start:
>> > >> register memory that has to receive the bytes
>> > >> send addr/key of memory registered
>> > >> wait "DONE" message
>> > >> unregister memory
>> > >> post a message receive
>> > >> goto start
>> > >>
>> > >> Does anyone know what I'm doing wrong? Or what I can improve?
>> > >> I'm not affected by "Not Invented Here" syndrome, so I'm even open to
>> > >> throwing away what I have done until now and adopting something else.
>> > >>
>> > >> I only need a point to point contiguous transfer.
>> > >
>> > > How big is this transfer?
>> > >
>> > > It may be that doing send/recv or write with immediate would work
>> > > better for you. Also have you seen Sean's rsocket project?
>> >
>> > The transfers are around 1MB each time.
>>
>> I suspect that is big enough that doing the registration should be more
>> efficient than the memcpy.
>>
>> One other thing. Do you need to wait for the completion of each buffer
>> before
>> posting the RDMA WRITE of the next?
>>
>> Ira
>>
>> >
>> > I have seen the rsocket project; I have played a bit with rstream and
>> > this is what I get:
>> >
>> > $ ./examples/rstream -s 10.30.3.2 -S all
>> > name        bytes   xfers   iters   total    time    Gb/sec   usec/xfer
>> > 64_lat 64 1 1m 122m 4.35s 0.24 2.17
>> > 128_lat 128 1 1m 244m 4.70s 0.44 2.35
>> > 192_lat 192 1 1m 366m 4.87s 0.63 2.44
>> > 256_lat 256 1 1m 488m 6.68s 0.61 3.34
>> > 384_lat 384 1 1m 732m 7.14s 0.86 3.57
>> > 512_lat 512 1 1m 976m 8.44s 0.97 4.22
>> > 768_lat 768 1 1m 1.4g 10.35s 1.19 5.18
>> > 1k_lat 1k 1 100k 195m 1.02s 1.60 5.12
>> > 1.5k_lat 1.5k 1 100k 292m 1.32s 1.86 6.60
>> > 2k_lat 2k 1 100k 390m 1.61s 2.03 8.07
>> > 3k_lat 3k 1 100k 585m 1.87s 2.63 9.36
>> > 4k_lat 4k 1 100k 781m 2.39s 2.74 11.95
>> > 6k_lat 6k 1 100k 1.1g 2.83s 3.47 14.15
>> > 8k_lat 8k 1 100k 1.5g 3.51s 3.73 17.56
>> > 12k_lat 12k 1 10k 234m 0.44s 4.45 22.09
>> > 16k_lat 16k 1 10k 312m 0.58s 4.56 28.75
>> > 24k_lat 24k 1 10k 468m 0.76s 5.14 38.25
>> > 32k_lat 32k 1 10k 625m 1.02s 5.12 51.21
>> > 48k_lat 48k 1 10k 937m 1.27s 6.20 63.40
>> > 64k_lat 64k 1 10k 1.2g 1.93s 5.43 96.63
>> > 96k_lat 96k 1 10k 1.8g 2.49s 6.33 124.29
>> > 128k_lat 128k 1 1k 250m 0.30s 7.00 149.89
>> > 192k_lat 192k 1 1k 375m 0.49s 6.48 242.76
>> > 256k_lat 256k 1 1k 500m 0.73s 5.75 364.85
>> > 384k_lat 384k 1 1k 750m 1.10s 5.73 549.16
>> > 512k_lat 512k 1 1k 1000m 1.51s 5.54 757.02
>> > 768k_lat 768k 1 1k 1.4g 1.68s 7.48 841.05
>> > 1m_lat 1m 1 100 200m 0.28s 6.05 1385.61
>> > 1.5m_lat 1.5m 1 100 300m 0.41s 6.20 2029.05
>> > 2m_lat 2m 1 100 400m 0.54s 6.27 2675.73
>> > 3m_lat 3m 1 100 600m 0.55s 9.13 2757.71
>> > 4m_lat 4m 1 100 800m 1.04s 6.45 5205.38
>> > 6m_lat 6m 1 100 1.1g 1.56s 6.46 7794.85
>> > 64_bw 64 1m 1 122m 1.38s 0.74 0.69
>> > 128_bw 128 1m 1 244m 0.83s 2.46 0.42
>> > 192_bw 192 1m 1 366m 1.42s 2.16 0.71
>> > 256_bw 256 1m 1 488m 1.43s 2.87 0.71
>> > 384_bw 384 1m 1 732m 1.46s 4.21 0.73
>> > 512_bw 512 1m 1 976m 1.66s 4.94 0.83
>> > 768_bw 768 1m 1 1.4g 2.35s 5.24 1.17
>> > 1k_bw 1k 100k 1 195m 0.31s 5.34 1.54
>> > 1.5k_bw 1.5k 100k 1 292m 0.44s 5.57 2.21
>> > 2k_bw 2k 100k 1 390m 0.51s 6.41 2.56
>> > 3k_bw 3k 100k 1 585m 0.86s 5.71 4.30
>> > 4k_bw 4k 100k 1 781m 1.02s 6.41 5.11
>> > 6k_bw 6k 100k 1 1.1g 1.53s 6.45 7.63
>> > 8k_bw 8k 100k 1 1.5g 2.04s 6.42 10.21
>> > 12k_bw 12k 10k 1 234m 0.30s 6.46 15.22
>> > 16k_bw 16k 10k 1 312m 0.40s 6.48 20.21
>> > 24k_bw 24k 10k 1 468m 0.60s 6.55 30.04
>> > 32k_bw 32k 10k 1 625m 0.81s 6.51 40.27
>> > 48k_bw 48k 10k 1 937m 1.20s 6.53 60.21
>> > 64k_bw 64k 10k 1 1.2g 1.60s 6.54 80.16
>> > 96k_bw 96k 10k 1 1.8g 2.33s 6.75 116.48
>> > 128k_bw 128k 1k 1 250m 0.32s 6.51 161.03
>> > 192k_bw 192k 1k 1 375m 0.48s 6.52 241.36
>> > 256k_bw 256k 1k 1 500m 0.64s 6.51 321.99
>> > 384k_bw 384k 1k 1 750m 0.78s 8.06 390.40
>> > 512k_bw 512k 1k 1 1000m 1.29s 6.52 643.09
>> > 768k_bw 768k 1k 1 1.4g 1.97s 6.38 986.84
>> > 1m_bw 1m 100 1 200m 0.26s 6.37 1316.86
>> > 1.5m_bw 1.5m 100 1 300m 0.27s 9.36 1343.65
>> > 2m_bw 2m 100 1 400m 0.53s 6.36 2638.12
>> > 3m_bw 3m 100 1 600m 0.80s 6.31 3988.59
>> > 4m_bw 4m 100 1 800m 1.07s 6.28 5341.27
>> > 6m_bw 6m 100 1 1.1g 1.00s 10.09 4988.12
>> >
>> > So it seems that a good buffer size is 6 MB, getting 10.09 Gb/sec
>> > (1291 MB/sec), and that is quite good.
>> >
>> > But running only the 6 MB test I get only 6.8 Gb/sec:
>> >
>> > $ ./examples/rstream -s 10.30.3.2 -S 6291456 -C 100
>> > name        bytes   xfers   iters   total    time    Gb/sec   usec/xfer
>> > custom 6m 100 1 1.1g 1.48s 6.81 7395.56
>> >
>> >
>> > Sean told me that when running "custom" size tests the settings for the
>> > transfer are different. I took a look at the code, and indeed with the
>> > "custom" tests the bandwidth optimization:
>> > val = 0;
>> > rs_setsockopt(rs, SOL_RDMA, RDMA_INLINE, &val, sizeof val);
>> >
>> > is not applied; but even forcing that call, the throughput I'm getting
>> > for the 6 MB transfer is still 6.81 Gb/sec (871 MB/sec).
>> >
>> > Gaetano
>> >
>> >
>> >
>> > > Hope this helps,
>> > > Ira
>> > >
>> > >>
>> > >>
>> > >> Regards
>> > >> Gaetano Mendola
>> > >>
>> > >>
>> > >> --
>> > >> cpp-today.blogspot.com
>> > >
>> > >
>> > > --
>> > > Ira Weiny
>> > > Member of Technical Staff
>> > > Lawrence Livermore National Lab
>> > > 925-423-8008
>> > > weiny2 at llnl.gov
>> >
>> >
>> >
>> > --
>> > cpp-today.blogspot.com
>>
>>
>> --
>> Ira Weiny
>> Member of Technical Staff
>> Lawrence Livermore National Lab
>> 925-423-8008
>> weiny2 at llnl.gov
>
>
--
cpp-today.blogspot.com