<div class="gmail_quote">2010/12/17 Roland Dreier <span dir="ltr"><<a href="mailto:rdreier@cisco.com">rdreier@cisco.com</a>></span><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

<div class="im"> >   This may be due to the fact that the IB MTU is 2048. Every 1500-byte packet<br>

 > is padded to 2048 bytes before being sent over the wire, so you're losing<br>

 > roughly 25% bandwidth compared to an IPoIB MTU which is a multiple of 2048.<br>

<br>

</div>This isn't true.  IB packets are only padded to a multiple of 4 bytes.<br>

<br>

However there's no point in using IPoIB connected mode to pass packets<br>

smaller than the IB MTU -- you might as well use datagram mode.<br></blockquote><div><br>We use InfiniBand as an HPC cluster interconnect. Our compute nodes use it to exchange data over IPoIB with an MTU of 65520, to run RDMA MPI communications, and to access Lustre filesystems. In addition, some nodes are connected to both the IB interconnect and an external Ethernet network. These nodes act as IP routers and let compute nodes reach central site resources (home directories over NFS, LDAP, ...). Compute nodes use IPoIB with a large MTU to reach the router nodes, so we get very good performance when we only communicate with the routers. However, as soon as the compute nodes communicate with the external Ethernet world, TCP path MTU discovery automatically reduces the packet size on the IPoIB side to 1500, the Ethernet MTU, and we hit this 4.6 Gbit/s wall.<br>
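<br>For reference, the padding cost discussed in the quoted exchange can be checked with a few lines of arithmetic. This is only a rough sketch: it looks at a bare 1500-byte packet, ignores the IPoIB and IB headers, and the helper names <code>padded_size</code> and <code>padding_overhead</code> are made up for the illustration.<br>

```python
def padded_size(payload_bytes, boundary):
    """Round payload_bytes up to the next multiple of boundary."""
    return -(-payload_bytes // boundary) * boundary


def padding_overhead(payload_bytes, boundary):
    """Fraction of the padded size that consists of padding."""
    padded = padded_size(payload_bytes, boundary)
    return (padded - payload_bytes) / padded


# Assumption: a bare 1500-byte packet, headers not counted.
PAYLOAD = 1500

# Hypothetical case from the quoted claim: padding up to the 2048-byte IB MTU.
overhead_2048 = padding_overhead(PAYLOAD, 2048)

# Case described in the reply: padding to a multiple of 4 bytes only.
overhead_4 = padding_overhead(PAYLOAD, 4)

print(f"pad to 2048: {overhead_2048:.1%} of the bytes would be padding")
print(f"pad to 4:    {overhead_4:.1%} of the bytes would be padding")
```

Under these assumptions, padding to 2048 bytes would waste about a quarter of the bytes, while padding to a multiple of 4 costs nothing for a 1500-byte packet, so padding alone should not explain the wall.<br>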

<br>Using datagram mode is not an option in our scenario, as it would reduce IPoIB performance inside the cluster. What we would rather have is an ipoib_cm that handles small packets better. Do you think this is an HCA hardware limitation (number of packets per second) or a software limitation (number of packets processed per second)? I would think it is a software limitation, since better results are achieved in datagram mode with the same 1500-byte MTU. IPoIB in connected mode seems to use a single completion queue with a single MSI vector for all the queue pairs it creates. Perhaps increasing the number of completion queues and MSI vectors could help spread and parallelize the load and give better results. What do you think?<br>
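<br>To put the packets-per-second question in numbers, here is a rough back-of-the-envelope sketch. It assumes the 4.6 Gbit/s figure counts only the bytes of the IP packets (1 Gbit = 10^9 bits) and ignores all header overhead; the function name <code>packets_per_second</code> is just for illustration.<br>

```python
def packets_per_second(throughput_bits_per_s, packet_bytes):
    """Packets needed per second to sustain a throughput at a given packet size."""
    return throughput_bits_per_s / (packet_bytes * 8)


# Assumption: the observed wall, taken as 4.6e9 bits per second of IP packets.
WALL_BITS_PER_S = 4.6e9

pps_1500 = packets_per_second(WALL_BITS_PER_S, 1500)    # Ethernet-sized packets
pps_65520 = packets_per_second(WALL_BITS_PER_S, 65520)  # connected-mode MTU

# How many more packets the stack must process at the small packet size.
ratio = pps_1500 / pps_65520

print(f"1500-byte packets:  {pps_1500:,.0f} packets/s")
print(f"65520-byte packets: {pps_65520:,.0f} packets/s")
print(f"ratio: {ratio:.2f}x")
```

At the same throughput, 1500-byte packets require roughly 44 times as many packets per second as 65520-byte packets, which is why the per-packet cost (hardware or software) matters so much here.<br>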

<br>Regards,<br>Matthieu<br><br><br>In fact, we really need to use IPoIB connected mode as <br> </div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">


<font color="#888888"><br>

 - R.<br>

</font></blockquote></div><br>