<div><br></div>This message is getting too long for reposting inline, so I have included some comments here.<div><div><br></div><div>First, I would be remiss if I did not suggest trying to use MPI.  A significant amount of work has gone into performance optimization for the various MPI packages.  Also, they are quite flexible about the communications path used, so if your environment changes it is likely that it will still work well.  Most MPI programs are written to be a single image that runs in parallel on multiple cores, but this is not a requirement.  Each piece (rank) of an MPI job can be a completely different program if desired.  Furthermore, there are functions that allow for client-server style of operation.  If you google for "mpi client server" you will find a number of ideas, or let me know and I can give you more information.</div>

<div><br></div></div><div>Back to commenting on the existing posts:</div><div><br></div><div>Your maximum of 1500 MB/s just seems wrong, although I don't know what your hardware is.  In my experience with QDR links, a PCIe 2.0 8x slot will run close to 3000 MB/s, with observed numbers as high as 3200 MB/s and as low as 2500 MB/s.  I generally expect at least 2800 MB/s from an aggressive application like ib_write_bw.  Here are some common problems you might look for:</div>

<div><br></div><div>   1) The 1500 number is what I would expect from using a PCIe slot that was physically able to accept an 8x card, but only implemented 4x for the connections.  You should check the documentation for the motherboard to see if that is what is happening, as it is common for many motherboards to have a slot like this.  You can also look at the output of "lspci -vv" where you will see a line with something like "LnkCap:" showing the width the device is capable of using and another line with something like "LnkSta:" showing the width the device is actually using.  If this is happening and you have a true 8x slot available, you should move the card.  Note that this problem could be on only one system and it would slow down both.</div>

<div><br></div><div>  2) Your system may have NUMA memory issues.  Look at the output of "numactl --hardware" and see how many nodes are available (first line).  If there is more than 1 available node, you may be falling victim to the internal movement of the data across NUMA nodes.  This usually shows up as inconsistent runs, which you have observed with the rsocket tests, so there may be something to this.  I have seen systems with ib_write_bw test results that reach 3000 MB/s when positioned on the best NUMA node, and then as low as 1200 MB/s when running on the worst NUMA node.  You can investigate this further by doing ib_write_bw tests using the numactl command to force a particular NUMA node to be used.  Assuming problems may exist on both ends of the link, you need to run the test with "numactl --membind=0 --cpunodebind=0 ib_write_bw -a" through "numactl --membind=N --cpunodebind=N ib_write_bw -a" on the server side (N being the largest node available).  For each of the NUMA nodes on the server, you would then run the client using  "numactl --membind=0 --cpunodebind=0 ib_write_bw -a serverip" through "numactl --membind=N --cpunodebind=N ib_write_bw -a serverip" for all NUMA nodes on the client system.  It will be clear which NUMA node(s) are giving you the best throughput.  If your application can fit within the memory and cpu constraints of those NUMA nodes, you can simply run your application under the same constraints (the node specified can be a list of nodes if more than one gives good results).</div>

<div><br></div><div>  3) Perhaps your link is running at DDR speed instead of QDR speed, although even with DDR I would expect a number above 1900 MB/s.  Look at the output of "ibstatus" on both the server and the client.  If there are switch links involved you should look at them as well -- "ibnetdiscover --ports" shows link width and speed, but you have to find the links in use in that output.</div>
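<div><br></div><div>The first two checks above can be scripted.  What follows is only a rough sketch: the function names link_widths and numactl_sweep are my own, the sample text is an illustrative excerpt rather than output captured from a real system, and the node counts are assumptions you would replace with what "numactl --hardware" reports.  The first function compares the "LnkCap:" and "LnkSta:" widths for one device, and the second lists every server/client numactl pairing to run.</div>

```python
import re

# Illustrative excerpt of one device's "lspci -vv" output (not a real capture).
SAMPLE = """\
        LnkCap: Port #8, Speed 5GT/s, Width x8, ASPM L0s
        LnkSta: Speed 5GT/s, Width x4, TrErr- Train- SlotClk+
"""


def link_widths(lspci_text):
    """Return (capable_lanes, actual_lanes), or None if either line is missing."""
    cap = re.search(r"LnkCap:.*?Width x(\d+)", lspci_text)
    sta = re.search(r"LnkSta:.*?Width x(\d+)", lspci_text)
    if cap is None or sta is None:
        return None
    return int(cap.group(1)), int(sta.group(1))


def numactl_sweep(server_nodes, client_nodes, server_ip):
    """List (server command, client command) pairs covering every node pairing."""
    runs = []
    for s in range(server_nodes):
        for c in range(client_nodes):
            server_cmd = f"numactl --membind={s} --cpunodebind={s} ib_write_bw -a"
            client_cmd = (
                f"numactl --membind={c} --cpunodebind={c} ib_write_bw -a {server_ip}"
            )
            runs.append((server_cmd, client_cmd))
    return runs


if __name__ == "__main__":
    # A card capable of x8 but running at x4 matches the 1500 MB/s symptom.
    print(link_widths(SAMPLE))
    for server_cmd, client_cmd in numactl_sweep(2, 2, "serverip"):
        print(server_cmd, "|", client_cmd)
```

<div>If the two widths differ, the slot is the first thing to look at.  The sweep only prints the commands; you still have to start each server instance before its matching client.</div>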

<div><br></div><div>With respect to the question about one side knowing that the other side is done, another choice is to use IBV_WR_RDMA_WRITE_WITH_IMM, which will create a completion for the recipient of the data.  However, in my experience you still need some kind of flow control (ready/done) messages sent in both directions using IBV_WR_SEND anyway, as Ira suggests.  It isn't so much the use of RDMA_READ versus RDMA_WRITE as it is the concept of the client saying "server, go do this transaction" and the server responding with "transaction done".  For the highest speed operations, you need to set it up so the client can request multiple transactions (at least two, and if disk transfers are involved it should be at least 0.25 seconds' worth, ideally a whole second) before seeing a completion from the server.</div>
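<div><br></div><div>To make that sizing rule concrete, here is a small arithmetic sketch.  The function name transactions_in_flight is hypothetical, and the bandwidth and transfer size you feed it are assumptions about your setup; the 3000 MB/s and 8 MB figures below are just the numbers mentioned elsewhere in this thread.</div>

```python
import math


def transactions_in_flight(bandwidth_mb_s, transfer_mb, seconds=0.25):
    """Transactions the client should be able to request before a completion.

    Covers `seconds` worth of data at the given link rate, never fewer than two.
    """
    depth = math.ceil(bandwidth_mb_s * seconds / transfer_mb)
    return max(2, depth)


if __name__ == "__main__":
    for window in (0.25, 1.0):
        depth = transactions_in_flight(3000, 8, window)
        # Each outstanding transaction needs its own registered buffer.
        print(window, "s:", depth, "transfers,", depth * 8, "MB of buffers")
```

<div>The buffer total is the amount of memory that stays locked down, so this is worth checking against what the systems can spare.</div>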

<div><br></div><div>Not knowing your whole application puts us at a disadvantage, but I am guessing that the server at the other end of the InfiniBand link is the largest potential source of variable performance.  Your incoming data probably arrives at a somewhat steady rate, and the processing done on that data collection node (client side?) is probably running at a steady rate as well.  Your server has to deal with a highly variable speed device like a disk drive, and the InfiniBand communications can potentially suffer from interference with other traffic.  At the risk of repeating myself and what others have said, you need to use multiple buffers for sending the data so you can tolerate this variability.</div>

<div><br></div><div><div>With respect to the question of copying data between buffers or dealing with the overhead of memory registration, it is a complicated subject.  You can benchmark the memcpy()/memmove()/bcopy() functions to see exactly what your processor does (and which one works best), but this will change with each hardware platform.  Modern processors easily move over 10 GB/s when things are aligned and in the right place, but this can be highly variable depending on system architecture.  The MPI people probably have done the most work in this area, and papers about this can be found on their websites.  If I read the original post properly, it seems like transfers are around 8 MB, and I would be inclined to just do RDMA from buffers like that.  I think I see 1 MB indicated below, and I still would be inclined to do RDMA and avoid possible complications of the memory copy needing to happen in a different CPU core.</div>

<div><br></div><div>I am a little unclear about when RDMA connections happen in this application.  Reading the post, it seems like this is happening for each transfer.  There is a lot of overhead in setting up a connection and tearing it down, so I hope I did not read that correctly.  If I did, you will see a significant improvement if you keep track of the connection, open one only when none exists, and remove it only when it fails.</div>
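<div><br></div><div>The bookkeeping for that is small.  This is a sketch only: ConnectionCache and the connect callable passed to it are hypothetical stand-ins for whatever your application uses to establish an RDMA connection, and nothing here is taken from the original code.</div>

```python
class ConnectionCache:
    """Keep one connection per peer; connect only when none is cached."""

    def __init__(self, connect):
        # `connect` is a hypothetical callable: peer name -> connection object.
        self._connect = connect
        self._conns = {}

    def get(self, peer):
        """Return the cached connection, creating it on first use."""
        conn = self._conns.get(peer)
        if conn is None:
            conn = self._connect(peer)
            self._conns[peer] = conn
        return conn

    def drop(self, peer):
        """Forget a connection after it has failed; the next get() reconnects."""
        self._conns.pop(peer, None)


if __name__ == "__main__":
    made = []

    def fake_connect(peer):
        made.append(peer)
        return ("conn", peer, len(made))

    cache = ConnectionCache(fake_connect)
    for _ in range(3):
        cache.get("server")       # three transfers, one connection setup
    print(len(made))
```

<div>The transfer path calls get() every time and calls drop() only from its error handling, so the setup cost is paid once per failure rather than once per transfer.</div>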

<div><br></div><div>The creation and destruction of memory regions is an expensive operation.  It cannot be done with the OS bypass, but instead the verbs library makes a request to the verbs driver, which contacts the HCA driver, and then sets up (or destroys) the memory region.  The OS bypass allows millions of SEND or RDMA_* operations per second, while the memory region requests only run at thousands per second.  Also, remember that a memory region involves locking the region's pages in memory, which can be a lengthy process in the operating system.</div>

<div><br></div><div>One important optimization is that protection domains are associated with HCAs, and memory regions are associated with protection domains.  This means that you don't need a queue pair or connection to manipulate them.  If you can tolerate large amounts of memory locked down, which is common in these kinds of applications, you should just create a memory region that encompasses all of the memory you will be using for your various buffers.  A more complicated version of this would be to create a memory region for each allocation of memory, and then to look up which memory region is associated with a specific buffer.  I suspect that the rsocket code does something like this.</div>
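<div><br></div><div>For the per-allocation variant, the lookup can be a sorted table keyed by start address.  This is a sketch under my own assumptions: RegionTable is a hypothetical name, regions are assumed not to overlap, and the handles are placeholder strings standing in for whatever your registration call returns.  I have not checked how rsocket actually does it.</div>

```python
import bisect


class RegionTable:
    """Map a buffer address range to the registered region that contains it."""

    def __init__(self):
        self._starts = []    # sorted region start addresses
        self._regions = []   # parallel list of (start, length, handle)

    def add(self, start, length, handle):
        """Record a region; regions are assumed not to overlap."""
        i = bisect.bisect_left(self._starts, start)
        self._starts.insert(i, start)
        self._regions.insert(i, (start, length, handle))

    def find(self, addr, size):
        """Return the handle whose region fully covers the buffer, else None."""
        i = bisect.bisect_right(self._starts, addr) - 1
        if i == -1:
            return None      # buffer starts before every registered region
        start, length, handle = self._regions[i]
        if addr + size > start + length:
            return None      # buffer runs past the end of the region
        return handle


if __name__ == "__main__":
    table = RegionTable()
    table.add(0x8000, 0x2000, "mr_b")
    table.add(0x1000, 0x1000, "mr_a")
    print(table.find(0x1800, 0x100))
    print(table.find(0x1F00, 0x200))
```

<div>A None result means the buffer is not covered by any single region, which is where you would fall back to registering it or copying into a registered buffer.</div>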

<div><br></div><div>Regards,</div><div>    Dave McMillen</div></div><div><br></div><div> </div>