[ofa-general] IPoIB connected vs datagram

Aaron Knister aaron.knister at gmail.com
Thu Aug 27 08:41:40 PDT 2009


Thanks for the reply!

Good to know about the "true" MTU vs. the synthetic MTU; I wasn't aware of
that.

The NFS mounts are over TCP, and the read/write sizes (rsize/wsize) are both set to 32768.
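(For reference, the options a client actually negotiated can be
double-checked by grepping /proc/mounts on that client, e.g.:

[root at headnode ~]# grep thumper /proc/mounts

and looking for the transport and the rsize=/wsize= values in the options
field. The hostname pattern is just an example.)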

I don't have any routes on the IB fabric that I know of; a traceroute seemed
to verify this. I used tracepath to show the MTU information between the
two hosts. On the second attempt it looks like it "discovered" the correct
MTU:

[root at headnode ~]# tracepath thumper1-ib
 1:  headnode (10.0.1.1)                       0.133ms pmtu 65520
 1:  thumper1-ib (10.0.1.245)                0.161ms reached
     Resume: pmtu 2044 hops 1 back 1
[root at headnode ~]# tracepath thumper1-ib
 1:  headnode (10.0.1.1)                       0.122ms pmtu 2044
 1:  thumper1-ib (10.0.1.245)                0.121ms reached
     Resume: pmtu 2044 hops 1 back 1

We rebooted the InfiniBand switch, which cleared up the NFS issues for now.
The one thing I noticed after the reboot was that the Solaris storage servers
were back in the multicast group (saquery -m). It's definitely an odd
situation...
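
(For anyone who wants to check the same thing: saquery -m dumps the
MCMemberRecords the SM knows about, so something like

[root at headnode ~]# saquery -m

run from a node that can reach the SM should show the thumpers' port
GIDs once they have rejoined the IPoIB broadcast group.)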

Thanks again for your help

On Thu, Aug 27, 2009 at 11:26 AM, Nifty Tom Mitchell <niftyompi at niftyegg.com> wrote:

> On Thu, Aug 27, 2009 at 08:30:52AM -0400, Aaron Knister wrote:
> >
> > Hi!
> >
> > I'm having some strange problems on an InfiniBand fabric at work. We
> > have upwards of 30 nodes running OFED 1.4 with DDR HCAs and a Cisco 7012
> > IB switch. There are also several Sun "thumpers" running Solaris that
> > are also connected to the InfiniBand fabric, however their HCAs are only
> > SDR. There are several 20-odd-terabyte NFS filesystems exported from the
> > thumpers and mounted on the compute nodes over IPoIB (we're not using
> > NFS RDMA). OpenSM is running on the head node and all of the compute
> > nodes for redundancy's sake. Things were running OK until yesterday, when
> > a user crashed the head node by sucking up all of its memory; at the
> > time, the head node's subnet manager was in the master state. A
> > different node quickly picked up subnet management until the head node
> > was rebooted, at which point the head node became the subnet master again.
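> >
> > (For what it's worth, which SM is currently the master can be checked
> > with the infiniband-diags utility sminfo, e.g.:
> >
> > [root at headnode ~]# sminfo
> >
> > which prints the master's LID, GUID, priority, and state.)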
> >
> > Since logging back in to the cluster after rebooting the head node, the
> > NFS mounts from the thumpers have been hanging periodically all over the
> > place. I know that two of the thumpers and their NFS exports are being
> > hit with an aggregate of about 120MB/s of NFS traffic from about 30 or
> > so compute nodes, so I'm sure that's not helping things; however, one of
> > the other thumpers, which has no active jobs hitting its exports,
> > periodically shows NFS server "not responding" messages on the
> > clients/compute nodes. I checked the log files for the past week; these
> > NFS "server not responding" messages all started after the head node
> > crashed yesterday. From what I've been told, every time this happens the
> > only fix is to reboot the switch.
> >
> > Of course, any general debugging suggestions would be appreciated, but I
> > have a few specific questions regarding IPoIB and connected vs. datagram
> > mode. All of the compute nodes and the head node (running OFED 1.4) are
> > using "connected mode" for IPoIB:
> >
> > [root at headnode ~]# cat /sys/class/net/ib0/mode
> > connected
> >
> > and the MTU of the interface is 65520.
> >
> > I don't know how to determine whether the Solaris systems (the thumpers)
> > are using connected mode, but their MTUs are 2044, which leads me to
> > believe they're probably not. I cannot log into these machines as I
> > don't manage them, but is there a way to determine the IPoIB MTU using
> > an ib* utility? Or am I misunderstanding IPoIB such that that
> > information wouldn't be useful?
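> >
> > (The closest I've found so far is querying a port's PortInfo by LID,
> > e.g.:
> >
> > [root at headnode ~]# smpquery portinfo <lid>
> >
> > which reports fields like MtuCap and NeighborMTU, but as far as I can
> > tell that only shows the IB link MTU (2K/4K), not the IPoIB interface
> > MTU or the connected/datagram setting, which are host-local.)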
> >
> > And lastly, I recall that with TCP over Ethernet, if you have the MTU
> > set to, say, 9000 and try to sling data to a box with an MTU of 1500,
> > you get some weird performance hits. Is it likely that the compute
> > nodes' use of the larger MTU + connected mode, paired with the thumpers'
> > much smaller MTU + probably datagram mode, could be causing timeouts
> > under heavy load? Does anybody think that setting the compute/head nodes
> > to datagram mode and subsequently dropping the MTU to 2044 would help my
> > situation?
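> >
> > (If so, I assume the switch on each Linux node would look something
> > like the following, possibly with the interface quiesced first; the
> > 2044 value assumes a 2K link MTU:
> >
> > echo datagram > /sys/class/net/ib0/mode
> > ifconfig ib0 mtu 2044
> > )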
> >
> > Again, any suggestions are greatly appreciated, and thanks in advance
> > for any replies!
>
> Look at the MTU choices again.
> With InfiniBand the "true" MTU is fixed at 2K (or 4K) and is often limited
> to 2K by the switch firmware. Larger MTUs are thus synthetic and force
> software to assemble and disassemble the transfers. On a fabric, the large
> MTU for IPoIB works well because the fabric is quite reliable. When data
> is routed to another network with a smaller MTU, software needs to
> assemble and disassemble the fragments. Fragmentation can be expensive;
> dropped bits plus fragmentation are a major performance hit. Normal MTU
> discovery should make fragmentation go away.
>
> Ethernet jumbo packets (larger than 1500 bytes) are real on the wire.
> This is not the case on IB, where the link MTU is fixed.
>
> Is the NFS over UDP or TCP?
> What are the NFS read/write sizes set to?
>
> Double-check routes (traceroute). Dynamic routes and mixed MTUs are a
> tangle. The minimum MTU for a route can be discovered with ping and the
> do-not-fragment flag, as long as ICMP packets are not filtered.
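> For example, on Linux (the payload size is illustrative: 2044 minus 28
> bytes of IP + ICMP headers):
>
>    ping -M do -s 2016 thumper1-ib
>
> Walk -s up until the ping fails with a fragmentation-needed error; the
> largest size that succeeds, plus 28, is the path MTU.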
>
> --
>        T o m  M i t c h e l l
>        Found me a new hat, now what?
>
>