[ewg] IPoIB to Ethernet routing performance

matthieu hautreux matthieu.hautreux at gmail.com
Fri Dec 17 11:03:08 PST 2010


2010/12/17 Roland Dreier <rdreier at cisco.com>

>  > This may be due to the fact that the IB MTU is 2048. Every 1500-byte
>  > packet is padded to 2048 bytes before being sent on the wire, so you're
>  > losing roughly 25% bandwidth compared to an IPoIB MTU which is a
>  > multiple of 2048.
>
> This isn't true.  IB packets are only padded to a multiple of 4 bytes.
>
> However there's no point in using IPoIB connected mode to pass packets
> smaller than the IB MTU -- you might as well use datagram mode.
>
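
Indeed, with 4-byte alignment the padding overhead is negligible. A quick
sketch of the arithmetic (the pad4 helper is just for illustration, not
anything from the driver):

#include <stdio.h>

/* Round a payload length up to the next multiple of 4 bytes,
 * matching the alignment described above. */
static unsigned pad4(unsigned len)
{
        return (len + 3) & ~3u;
}

int main(void)
{
        unsigned mtu  = 1500;             /* Ethernet-sized IPoIB packet */
        unsigned wire = pad4(mtu);
        printf("payload %u -> on-wire %u bytes, overhead %.2f%%\n",
               mtu, wire, 100.0 * (wire - mtu) / wire);
        /* 1500 is already a multiple of 4, so the overhead is 0.00%;
         * even the worst case (3 pad bytes) stays under 0.2%, nowhere
         * near 25%. So padding cannot explain the bandwidth loss. */
        return 0;
}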

We are using InfiniBand as an HPC cluster interconnect network, and our
compute nodes use it to exchange data over IPoIB with an MTU of 65520, do
RDMA MPI communications, and access Lustre filesystems. On top of that, some
nodes are connected to both the IB interconnect and an external Ethernet
network. These nodes act as IP routers and enable the compute nodes to
access site-wide resources (home directories over NFS, LDAP, ...). The
compute nodes use IPoIB with the large MTU to reach the router nodes, so we
get very good performance as long as we communicate only with the routers.
However, as soon as the compute nodes talk to the external Ethernet world,
TCP path MTU discovery automatically reduces the effective IPoIB MTU to
1500, the Ethernet MTU, and we hit this 4.6 Gbit/s wall.
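
Just to quantify that wall: a back-of-the-envelope sketch (the 4.6 Gbit/s
figure is our measurement; everything else is plain arithmetic) of the
packet rate the routers must sustain at that MTU:

#include <stdio.h>

int main(void)
{
        double gbps  = 4.6;         /* observed throughput wall      */
        double bytes = 1500.0;      /* path MTU after PMTU discovery */
        double pps   = gbps * 1e9 / 8.0 / bytes;
        printf("%.1f Gbit/s at %.0f-byte packets = %.0f packets/s\n",
               gbps, bytes, pps);
        /* ~383,000 packets/s, a regime where per-packet software
         * costs (single CQ, single MSI vector) can plausibly
         * dominate over raw link bandwidth. */
        return 0;
}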

Using datagram mode in our scenario is not an option, as it would reduce
IPoIB performance inside the cluster. What we'd rather have is an ipoib_cm
that handles small packets better. Do you think this limitation is an HCA
hardware limitation (number of packets per second) or a software limitation
(number of packets processed per second)? I would think it is a software
limitation, since better results are achieved in datagram mode with the same
1500-byte MTU. IPoIB in connected mode seems to use a single completion
queue with a single MSI vector for all the queue pairs it creates to
communicate. Perhaps multiplying the number of completion queues and MSI
vectors could help spread/parallelize the load and get better results, as in
the sketch below. What is your feeling about that?
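
To illustrate the kind of spreading we have in mind, here is a minimal
user-space sketch with libibverbs (the CQ depth and the number of CQs are
made-up parameters; ipoib_cm itself is kernel code, so this only
demonstrates the per-vector CQ idea, not an actual patch):

#include <stdio.h>
#include <infiniband/verbs.h>

#define NUM_CQS 4   /* hypothetical: how many CQs to spread load over */

int main(void)
{
        int num_dev;
        struct ibv_device **devs = ibv_get_device_list(&num_dev);
        if (!devs || num_dev == 0) {
                fprintf(stderr, "no IB devices found\n");
                return 1;
        }

        struct ibv_context *ctx = ibv_open_device(devs[0]);
        if (!ctx)
                return 1;

        struct ibv_cq *cqs[NUM_CQS];
        for (int i = 0; i < NUM_CQS; i++) {
                /* Spread CQs across the HCA's completion vectors so
                 * their interrupts can be serviced on different CPUs. */
                int vec = i % ctx->num_comp_vectors;
                cqs[i] = ibv_create_cq(ctx, 256, NULL, NULL, vec);
                if (!cqs[i]) {
                        fprintf(stderr, "ibv_create_cq failed (vec %d)\n",
                                vec);
                        return 1;
                }
                printf("CQ %d bound to completion vector %d\n", i, vec);
        }
        /* Each QP would then attach to one of these CQs, instead of
         * all QPs sharing one CQ (and hence one MSI vector) as
         * connected-mode IPoIB appears to do today. */

        for (int i = 0; i < NUM_CQS; i++)
                ibv_destroy_cq(cqs[i]);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
}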

Regards,
Matthieu


>
>  - R.
>