[Users] infiniband rdma poor transfer bw

David McMillen davem at systemfabricworks.com
Tue Aug 28 07:26:57 PDT 2012


This message is getting too long for reposting inline, so I have included
some comments here.

First, I would be remiss if I did not suggest trying to use MPI.  A
significant amount of work has gone into performance optimization for the
various MPI packages.  They are also quite flexible about the
communications path used, so if your environment changes, your code is
likely to keep working well.  Most MPI programs are written to be a single
image that runs in parallel on multiple cores, but this is not a
requirement.  Each piece (rank) of an MPI job can be a completely different
program if desired.  Furthermore, there are functions that allow for
client-server style of operation.  If you google for "mpi client server"
you will find a number of ideas, or let me know and I can give you more
information.
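
For example, a minimal client/server sketch using the standard MPI
dynamic process calls (MPI_Open_port / MPI_Comm_accept /
MPI_Comm_connect) looks roughly like this; how the port name travels
from server to client (here just printed and passed on the command
line) is up to your environment, and the structure is only an
illustration:

    /* Minimal MPI client/server sketch using dynamic process management.
     * Run the server first, then start the client with the printed port
     * name as its argument. */
    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        char port_name[MPI_MAX_PORT_NAME];
        MPI_Comm inter;

        MPI_Init(&argc, &argv);
        if (argc > 1 && strcmp(argv[1], "server") == 0) {
            MPI_Open_port(MPI_INFO_NULL, port_name);
            printf("port name: %s\n", port_name);  /* give this to the client */
            MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
            /* ... MPI_Recv data from the client over 'inter' ... */
            MPI_Comm_disconnect(&inter);
            MPI_Close_port(port_name);
        } else if (argc > 1) {
            strncpy(port_name, argv[1], MPI_MAX_PORT_NAME);
            MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
            /* ... MPI_Send data to the server over 'inter' ... */
            MPI_Comm_disconnect(&inter);
        }
        MPI_Finalize();
        return 0;
    }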

Back to commenting on the existing posts:

Your maximum of 1500 MB/s just seems wrong, although I don't know what
your hardware is.  In my experience with QDR links, PCIe 2.0 8x will run
close to 3000 MB/s, with observed numbers as high as 3200 and as low as
2500.  I generally expect at least 2800 MB/s from an aggressive
application like ib_write_bw.  Here are some common problems you might
look for:

   1) The 1500 number is what I would expect from a PCIe slot that
physically accepts an 8x card but only wires 4 lanes.  Check the
motherboard documentation to see whether that is what is happening; many
motherboards have a slot like this.  You can also look at the output of
"lspci -vv", where a "LnkCap:" line shows the width the device is capable
of using and a "LnkSta:" line shows the width it is actually using.  If
this is happening and you have a true 8x slot available, move the card.
Note that this problem on only one system is enough to slow down the
transfer for both.

   2) Your system may have NUMA memory issues.  Look at the output of
"numactl --hardware" and see how many nodes are available (first line).
If there is more than one node, you may be paying for internal movement
of the data across NUMA nodes.  This usually shows up as inconsistent
runs, which you have observed with the rsocket tests, so there may be
something to it.  I have seen systems where ib_write_bw reaches 3000 MB/s
when pinned to the best NUMA node and drops to 1200 MB/s on the worst
one.  You can investigate further by running ib_write_bw under numactl to
force a particular NUMA node.  Assuming problems may exist on both ends
of the link, run the server side as "numactl --membind=0 --cpunodebind=0
ib_write_bw -a" through "numactl --membind=N --cpunodebind=N ib_write_bw
-a" (N being the largest node available).  For each server NUMA node,
run the client as "numactl --membind=0 --cpunodebind=0 ib_write_bw -a
serverip" through "numactl --membind=N --cpunodebind=N ib_write_bw -a
serverip" for all NUMA nodes on the client system.  It will be clear
which NUMA node(s) give you the best throughput.  If your application
fits within the memory and cpu constraints of those NUMA nodes, you can
simply run it under the same constraints (the node specified can be a
list of nodes if more than one gives good results).

   3) Perhaps your link is running at DDR speed instead of QDR speed,
although even with DDR I would expect something above 1900 MB/s.  Look at
the output of "ibstatus" on both the server and the client.  If there are
switch links involved, you should look at them as well -- "ibnetdiscover
--ports" shows link width and speed, but you have to find the links in
use in that output.

With respect to the question about one side knowing that the other side
is done, another choice is to use IBV_WR_RDMA_WRITE_WITH_IMM, which
generates a completion for the recipient of the data (the recipient must
have a receive work request posted to consume it).  However, in my
experience you end up needing some kind of flow control (ready/done)
messages sent in both directions with IBV_WR_SEND anyway, as Ira
suggests.  It isn't so much RDMA_READ versus RDMA_WRITE as it is the
client saying "server, go do this transaction" and the server responding
with "transaction done".  For the highest speed, set it up so the client
can have multiple transactions outstanding (at least two, and if disk
transfers are involved at least 0.25 seconds worth, ideally a whole
second) before it has to wait for a completion from the server.
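
As a sketch of that pattern, here is roughly what the two sides look
like with libibverbs.  It assumes an already-connected RC queue pair, a
registered buffer, and the peer's address/rkey exchanged out of band;
the helper names and the idea of using the immediate value as a buffer
index are just illustration:

    /* Sender: push one buffer with RDMA write and tag it with an
     * immediate value so the receiver gets a completion for it. */
    #include <infiniband/verbs.h>
    #include <arpa/inet.h>
    #include <stdint.h>

    static int post_write_with_imm(struct ibv_qp *qp, struct ibv_mr *mr,
                                   void *buf, uint32_t len,
                                   uint64_t remote_addr, uint32_t rkey,
                                   uint32_t buf_index)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf,
            .length = len,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr = {
            .wr_id      = buf_index,
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_RDMA_WRITE_WITH_IMM,
            .send_flags = IBV_SEND_SIGNALED,
        };
        struct ibv_send_wr *bad;

        wr.imm_data            = htonl(buf_index);  /* which buffer just landed */
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;
        return ibv_post_send(qp, &wr, &bad);
    }

    /* Receiver: each WRITE_WITH_IMM consumes one posted receive and
     * shows up as IBV_WC_RECV_RDMA_WITH_IMM when the CQ is polled;
     * ntohl(wc.imm_data) then says which buffer is ready. */
    static int post_imm_receive(struct ibv_qp *qp)
    {
        struct ibv_recv_wr wr = { .wr_id = 0, .sg_list = NULL, .num_sge = 0 };
        struct ibv_recv_wr *bad;

        return ibv_post_recv(qp, &wr, &bad);  /* no data buffer needed here */
    }

To keep the pipeline full, the client keeps at least two of these writes
outstanding, and the server posts a fresh receive (plus a "transaction
done" IBV_WR_SEND, if you use one) every time it finishes a buffer.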

Not knowing your whole application puts us at a disadvantage, but I am
guessing that the server at the other end of the InfiniBand link is the
largest potential source of variable performance.  Your incoming data
probably arrives at a fairly steady rate, and the processing done on the
data collection node (client side?) probably runs at a steady rate as
well.  Your server, on the other hand, has to deal with a highly variable
device like a disk drive, and the InfiniBand communications can suffer
from interference with other traffic.  At the risk of repeating myself
and what others have said, you need to use multiple buffers for sending
the data so you can tolerate this variability.

With respect to the question of copying data between buffers versus
paying the overhead of memory registration, it is a complicated subject.
You can benchmark the memcpy()/memmove()/bcopy() functions to see exactly
what your processor does (and which one works best), but this will change
with each hardware platform.  Modern processors easily move over 10 GB/s
when things are aligned and in the right place, but the number is highly
variable depending on system architecture.  The MPI people have probably
done the most work in this area, and papers about it can be found on
their websites.  If I read the original post properly, transfers are
around 8 MB, and I would be inclined to just do RDMA from buffers of that
size.  I think I see 1 MB indicated below, and I would still be inclined
to do RDMA and avoid the possible complication of the memory copy landing
on a different CPU core.
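
If you want a number for your own hardware, a quick sketch like the one
below gives a rough memcpy() rate; the 8 MB size and iteration count are
arbitrary choices, and running it under numactl will show the NUMA
effects mentioned above:

    /* Rough memcpy() bandwidth check.  Buffer size and repeat count are
     * arbitrary; pin the process with numactl to see NUMA effects. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    int main(void)
    {
        size_t len = 8UL * 1024 * 1024;   /* 8 MB, like the transfers discussed */
        int iters = 1000;
        char *src = malloc(len), *dst = malloc(len);
        if (!src || !dst)
            return 1;
        memset(src, 1, len);              /* touch pages so they are really allocated */
        memset(dst, 2, len);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < iters; i++)
            memcpy(dst, src, len);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("memcpy: %.1f MB/s\n", (double)len * iters / secs / 1e6);
        free(src);
        free(dst);
        return 0;
    }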

I am a little unclear about when the rdma connections happen in this
application.  Reading the post, it sounds like a connection is made for
each transfer.  There is a lot of overhead in setting up a connection and
tearing it down, so I hope I misread that.  If not, you will see a
significant improvement if you keep track of the connection, establish
one only when none exists, and tear it down only when it fails.
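
For what it is worth, here is a rough sketch of that caching, using the
librdmacm rdma_create_ep() convenience path; the helper names, queue
sizes, and error handling are placeholders for illustration only:

    /* Cache one client connection instead of reconnecting per transfer. */
    #include <rdma/rdma_cma.h>
    #include <string.h>

    static struct rdma_cm_id *cached_id;  /* NULL until first use or after failure */

    static struct rdma_cm_id *get_connection(const char *host, const char *port)
    {
        if (cached_id)
            return cached_id;             /* reuse the live connection */

        struct rdma_addrinfo hints, *res;
        struct ibv_qp_init_attr attr;
        memset(&hints, 0, sizeof hints);
        hints.ai_port_space = RDMA_PS_TCP;
        memset(&attr, 0, sizeof attr);
        attr.cap.max_send_wr = attr.cap.max_recv_wr = 16;
        attr.cap.max_send_sge = attr.cap.max_recv_sge = 1;
        attr.qp_type = IBV_QPT_RC;

        if (rdma_getaddrinfo(host, port, &hints, &res))
            return NULL;
        if (rdma_create_ep(&cached_id, res, NULL, &attr)) {
            rdma_freeaddrinfo(res);
            cached_id = NULL;
            return NULL;
        }
        rdma_freeaddrinfo(res);
        if (rdma_connect(cached_id, NULL)) {
            rdma_destroy_ep(cached_id);
            cached_id = NULL;
            return NULL;
        }
        return cached_id;
    }

    static void drop_connection(void)     /* call only when an operation fails */
    {
        if (!cached_id)
            return;
        rdma_disconnect(cached_id);
        rdma_destroy_ep(cached_id);
        cached_id = NULL;
    }

The point is simply that get_connection() is cheap on every call after
the first one, and drop_connection() only runs when a transfer actually
fails.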

The creation and destruction of memory regions is an expensive
operation.  It cannot be done with the OS bypass; instead the verbs
library makes a request to the verbs driver, which contacts the HCA
driver, and then sets up (or destroys) the memory region.  The OS bypass
allows millions of SEND or RDMA_* operations per second, while memory
region requests only run at thousands per second.  Also, remember that
creating a memory region involves locking the region's pages in memory,
which can be a lengthy process in the operating system.

One important optimization follows from the fact that protection domains
are associated with HCAs, and memory regions are associated with
protection domains.  This means you don't need a queue pair or connection
to manipulate them.  If you can tolerate large amounts of memory being
locked down, which is common in these kinds of applications, you can
simply create one memory region that encompasses all of the memory you
will be using for your various buffers.  A more complicated version of
this is to create a memory region for each allocation of memory and then
look up which memory region is associated with a specific buffer.  I
suspect the rsocket code does something like this.
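
As a rough sketch of the simple version, assuming a single 64 MB pool
carved into transfer buffers (the size, alignment, and access flags here
are arbitrary choices):

    /* Register one large buffer pool once, at startup, and hand out
     * slices of it for transfers; only mr->lkey/rkey plus offsets are
     * needed afterwards, so no registration happens on the data path. */
    #include <infiniband/verbs.h>
    #include <stdlib.h>

    #define POOL_SIZE (64UL * 1024 * 1024)   /* e.g. 8 buffers of 8 MB */

    struct buffer_pool {
        void          *base;
        struct ibv_mr *mr;
    };

    static int pool_init(struct buffer_pool *p, struct ibv_pd *pd)
    {
        if (posix_memalign(&p->base, 4096, POOL_SIZE))  /* page alignment helps pinning */
            return -1;
        p->mr = ibv_reg_mr(pd, p->base, POOL_SIZE,
                           IBV_ACCESS_LOCAL_WRITE |
                           IBV_ACCESS_REMOTE_WRITE |
                           IBV_ACCESS_REMOTE_READ);
        if (!p->mr) {
            free(p->base);
            return -1;
        }
        return 0;
    }

    /* Any address inside [base, base + POOL_SIZE) can go into an SGE
     * with p->mr->lkey, or be exposed to the peer through p->mr->rkey. */
    static void pool_destroy(struct buffer_pool *p)
    {
        ibv_dereg_mr(p->mr);    /* unpins the pages */
        free(p->base);
    }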

Regards,
    Dave McMillen