[ofa-general] IPoIB connected vs datagram
Nifty Tom Mitchell
niftyompi at niftyegg.com
Thu Aug 27 08:26:34 PDT 2009
On Thu, Aug 27, 2009 at 08:30:52AM -0400, Aaron Knister wrote:
>
> Hi!
>
> I'm having some strange problems on an InfiniBand fabric at work. We
> have upwards of 30 nodes running OFED 1.4 with DDR HCAs and a Cisco 7012
> IB switch. There are also several Sun "thumpers" running Solaris
> connected to the InfiniBand fabric, but their HCAs are only SDR. Several
> 20-odd-terabyte NFS mounts are exported from the thumpers and mounted on
> the compute nodes over IPoIB (we're not using NFS RDMA). OpenSM is
> running on the head node and on all of the compute nodes for
> redundancy's sake. Things were running OK until yesterday, when a user
> crashed the head node by sucking up all of its memory; at the time the
> head node's subnet manager was in the master state. A different node
> quickly picked up subnet management until the head node was rebooted, at
> which point the head node became the subnet master.
>
> Since logging back into the cluster after rebooting the head node, the
> NFS mounts from the thumpers have been hanging periodically all over the
> place. I know that two of the thumpers and their NFS exports are being
> hit with an aggregate of about 120 MB/s of NFS traffic from about 30 or
> so compute nodes, so I'm sure that's not helping things. However, one of
> the other thumpers, which has no active jobs hitting its exports,
> periodically shows an NFS server "not responding" message on the
> clients/compute nodes. I checked the log files for the past week; these
> NFS server "not responding" messages all started after the head node
> crash yesterday. From what I've been told, every time this happens the
> only fix is to reboot the switch.
>
> Of course, any general debugging suggestions would be appreciated, but I
> have a few specific questions regarding IPoIB and connected vs. datagram
> mode. All of the compute nodes and the head node (running OFED 1.4) are
> using "connected mode" for IPoIB:
>
> [root@headnode ~]# cat /sys/class/net/ib0/mode
> connected
>
> and the MTU of the interface is 65520.
>
> I don't know how to determine whether the Solaris systems (the thumpers)
> are using connected mode, but their MTUs are 2044, which leads me to
> believe they're probably not. I cannot log into these machines as I
> don't manage them, but is there a way to determine the IPoIB MTU using
> an ib* utility? Or am I misunderstanding IPoIB such that this
> information wouldn't be useful?
>
> And lastly, I recall that with TCP over Ethernet, if you have the MTU
> set to, say, 9000 and try to sling data to a box with an MTU of 1500,
> you get some weird performance hits. Is it likely that the compute
> nodes' use of the larger MTU + connected mode, paired with the thumpers'
> much smaller MTU + probably datagram mode, could be causing timeouts
> under heavy load? Does anybody think that setting the compute/head nodes
> to datagram mode and subsequently dropping the MTU to 2044 would help my
> situation?
>
> Again, any suggestions are greatly appreciated, and thanks in advance
> for any replies!
Look at the MTU choices again.

With InfiniBand the "true" link MTU is fixed at 2K (or 4K) and is often
limited to 2K by the switch firmware. Larger IPoIB MTUs are therefore
synthetic and force software to assemble and disassemble the transfers.
Inside the fabric the large IPoIB MTU works well because the fabric is
quite reliable. When data is routed to another network with a smaller
MTU, software has to fragment and reassemble the packets, and
fragmentation can be expensive; dropped packets combined with
fragmentation are a major performance hit. Normal path MTU discovery
should make fragmentation go away.
Ethernet jumbo frames (larger than 1500 bytes) are real on the wire.
That is not the case on IB, where the link MTU is fixed.
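The real link MTU each port negotiated is visible with ibv_devinfo
(from libibverbs-utils), and infiniband-diags can query a remote port
over the fabric, the thumpers' HCAs included, given its LID. Roughly
(field names from memory; 2048 is what I would expect behind a
2K-limited switch):

[root@headnode ~]# ibv_devinfo | grep -i mtu
                        max_mtu:                4096 (5)
                        active_mtu:             2048 (4)
[root@headnode ~]# smpquery portinfo <remote LID> 1 | grep -i mtu

Keep in mind that is the IB link MTU, not the IPoIB interface MTU, but
it is the largest packet that ever goes on the wire.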
Is the NFS traffic over UDP or TCP?
What are the NFS read/write sizes (rsize/wsize) set to?
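On the Linux clients both of those show up in the mount options, e.g.
(the node name is just an example):

[root@compute01 ~]# nfsstat -m
[root@compute01 ~]# grep nfs /proc/mounts

Look for proto=tcp (or udp) and the rsize=/wsize= values on each
thumper mount.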
Double-check the routes (traceroute). Dynamic routes and mixed MTUs are a tangle.
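Something like (IP placeholders, obviously):

[root@compute01 ~]# traceroute -n <thumper IP>
[root@compute01 ~]# ip route get <thumper IP>

The thumpers should be a single hop away on the ib0 subnet; anything
else means the traffic is detouring through an interface with a
different MTU.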
The minimum MTU along a route can be discovered with ping and the
do-not-fragment flag, as long as ICMP packets are not filtered.
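With Linux ping that is the -M do option. If the thumpers really sit
at a 2044 IPoIB MTU, then 2044 minus the 20-byte IP header and 8-byte
ICMP header leaves 2016 bytes of payload, so roughly:

[root@compute01 ~]# ping -M do -s 2016 <thumper IP>   # should get replies
[root@compute01 ~]# ping -M do -s 2017 <thumper IP>   # should fail: frag needed / message too long

If the smaller probe also hangs, fragmentation is not your problem.

And if you do decide to try datagram mode on the OFED nodes, it is the
same sysfs knob you already looked at (I would test on one idle node
first; the driver should cap the MTU at 2044 on its own, but setting it
explicitly does not hurt):

[root@compute01 ~]# echo datagram > /sys/class/net/ib0/mode
[root@compute01 ~]# ip link set ib0 mtu 2044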
--
T o m M i t c h e l l
Found me a new hat, now what?