[ofa-general] IPoIB connected vs datagram
Aaron Knister
aaron.knister at gmail.com
Thu Aug 27 05:30:52 PDT 2009
Hi!
I'm having some strange problems on an InfiniBand fabric at work. We
have upwards of 30 nodes running OFED 1.4 with DDR HCAs and a Cisco 7012
IB switch. There are also several Sun "thumpers" running Solaris
connected to the InfiniBand fabric, although their HCAs are only SDR.
Several 20-odd-terabyte NFS filesystems are exported from the thumpers
and mounted on the compute nodes over IPoIB (we're not using NFS/RDMA).
OpenSM is running on the head node and on all of the compute nodes for
redundancy's sake. Things were running OK until yesterday, when a user
crashed the head node by sucking up all of its memory; at the time, the
head node's subnet manager was in the master state. A different node
quickly picked up subnet management until the head node was rebooted,
at which point the head node became the subnet master again.
Since logging back in to the cluster after rebooting the head node, the
NFS mounts from the thumpers have been hanging periodically all over the
place. I know that two of the thumpers and their NFS exports are being
hit with an aggregate of about 120 MB/s of NFS traffic from about 30 or
so compute nodes, so I'm sure that's not helping things. However, one of
the other thumpers, which has no active jobs hitting its exports,
periodically triggers NFS server "not responding" messages on the
clients/compute nodes. I checked the log files for the past week: these
"NFS server not responding" messages all started with the head node
crash yesterday. From what I've been told, every time this happens the
only fix is to reboot the switch.
Of course, any general debugging suggestions would be appreciated, but I
have a few specific questions regarding IPoIB and connected vs. datagram
mode.
All of the compute nodes and the head node (running OFED 1.4) are using
connected mode for IPoIB ->

[root@headnode ~]# cat /sys/class/net/ib0/mode
connected

and the MTU of the interface is 65520.
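For reference, I'm just reading the MTU straight out of sysfs on the Linux
side (ip/ifconfig report the same number):

[root@headnode ~]# cat /sys/class/net/ib0/mtu
65520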
I don't know how to determine whether the Solaris systems (the thumpers)
are using connected mode, but their MTUs are 2044, which leads me to
believe they're probably not. I can't log into those machines since I
don't manage them, but is there a way to determine the IPoIB MTU using
an ib* utility? Or am I misunderstanding IPoIB such that that
information wouldn't be useful?
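My (possibly wrong) understanding is that in datagram mode the IPoIB MTU is
the IB link MTU minus the 4-byte IPoIB encapsulation header, so 2044 would
line up with a 2048 link MTU. The best I've come up with so far is to query
the thumper's port from one of the Linux nodes, though I think that only
shows the link MTU, not what the Solaris IPoIB driver is actually doing.
The LID/port below are placeholders:

# find the thumper's LID in the fabric topology
[root@headnode ~]# ibnetdiscover | grep -i thumper
# dump the port attributes for that LID/port; NeighborMTU/MtuCap should show
# the IB link MTU (2048 there would at least be consistent with an IPoIB MTU
# of 2044 on the thumper)
[root@headnode ~]# smpquery portinfo <lid> <port>

Is that a reasonable way to go about it?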
And lastly, I recall that with TCP over Ethernet, if you have the MTU
set to, say, 9000 and try to sling data to a box with an MTU of 1500,
you can take some weird performance hits. Is it likely that the compute
nodes' use of the larger MTU plus connected mode, paired with the
thumpers' much smaller MTU and (probably) datagram mode, could be
causing timeouts under heavy load? Does anybody think that setting the
compute/head nodes to datagram mode and subsequently dropping the MTU to
2044 would help my situation?
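If that's worth trying, this is roughly what I had in mind on each node.
I believe the mode file is writable at runtime (though I may need to down
the interface first), and SET_IPOIB_CM in /etc/infiniband/openib.conf is my
guess at how to make it persistent:

# switch ib0 to datagram mode and drop the MTU to match the thumpers
[root@headnode ~]# ifconfig ib0 down
[root@headnode ~]# echo datagram > /sys/class/net/ib0/mode
[root@headnode ~]# ifconfig ib0 mtu 2044 up
# and (I think) set SET_IPOIB_CM=no in /etc/infiniband/openib.conf so the
# change survives a restart of openibd

Please correct me if that's not the right way to flip the mode.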
Again, any suggestions are greatly appreciated, and thanks in advance
for any replies!
-Aaron