[ofa-general] IPoIB connected vs datagram
Aaron Knister
aaron.knister at gmail.com
Thu Aug 27 05:30:52 PDT 2009
Hi!
I'm having some strange problems on an InfiniBand fabric at work. We
have upwards of 30 nodes running OFED 1.4 with DDR HCAs and a Cisco 7012
IB switch. There are also several Sun "thumpers" running Solaris
connected to the InfiniBand fabric, although their HCAs are only SDR.
Several 20-odd-terabyte NFS filesystems are exported from the thumpers
and mounted on the compute nodes over IPoIB (we're not using NFS/RDMA).
OpenSM is running on the head node and on all of the compute nodes for
redundancy's sake. Things were running OK until yesterday, when a user
crashed the head node by sucking up all of its memory; at the time, the
head node's subnet manager was in the master state. A different node
quickly picked up subnet management until the head node was rebooted,
at which point the head node became the subnet master again.
Since logging back in to the cluster after rebooting the head node, the
NFS mounts from the thumpers have been hanging periodically all over the
place. I know that two of the thumpers and their NFS exports are being
hit with an aggregate of about 120 MB/s of NFS traffic from about 30 or
so compute nodes, so I'm sure that's not helping things. However, one of
the other thumpers, which has no active jobs hitting its exports,
periodically triggers NFS server "not responding" messages on the
clients/compute nodes. I checked the log files for the past week: these
"NFS server not responding" messages all started with the head node
crash yesterday. From what I've been told, every time this happens the
only fix is to reboot the switch.
Of course, any general debugging suggestions would be appreciated, but I
have a few specific questions regarding IPoIB and connected vs. datagram
mode.
All of the compute nodes and the head node (running OFED 1.4) are using
connected mode for IPoIB ->

[root@headnode ~]# cat /sys/class/net/ib0/mode
connected

and the MTU of the interface is 65520.
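For reference, I'm just reading the MTU straight out of sysfs on the Linux
side (ip/ifconfig report the same number):

[root@headnode ~]# cat /sys/class/net/ib0/mtu
65520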
I don't know how to determine whether the Solaris systems (the thumpers)
are using connected mode, but their MTUs are 2044, which leads me to
believe they're probably not. I can't log into those machines since I
don't manage them, but is there a way to determine the IPoIB MTU using
an ib* utility? Or am I misunderstanding IPoIB such that that
information wouldn't be useful?
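My (possibly wrong) understanding is that in datagram mode the IPoIB MTU is
the IB link MTU minus the 4-byte IPoIB encapsulation header, so 2044 would
line up with a 2048 link MTU. The best I've come up with so far is to query
the thumper's port from one of the Linux nodes, though I think that only
shows the link MTU, not what the Solaris IPoIB driver is actually doing.
The LID/port below are placeholders:

# find the thumper's LID in the fabric topology
[root@headnode ~]# ibnetdiscover | grep -i thumper
# dump the port attributes for that LID/port; NeighborMTU/MtuCap should show
# the IB link MTU (2048 there would at least be consistent with an IPoIB MTU
# of 2044 on the thumper)
[root@headnode ~]# smpquery portinfo <lid> <port>

Is that a reasonable way to go about it?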
And lastly, I recall that with TCP over Ethernet, if you have the MTU
set to, say, 9000 and try to sling data to a box with an MTU of 1500,
you can take some weird performance hits. Is it likely that the compute
nodes' use of the larger MTU plus connected mode, paired with the
thumpers' much smaller MTU and (probably) datagram mode, could be
causing timeouts under heavy load? Does anybody think that setting the
compute/head nodes to datagram mode and subsequently dropping the MTU to
2044 would help my situation?
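If that's worth trying, this is roughly what I had in mind on each node.
I believe the mode file is writable at runtime (though I may need to down
the interface first), and SET_IPOIB_CM in /etc/infiniband/openib.conf is my
guess at how to make it persistent:

# switch ib0 to datagram mode and drop the MTU to match the thumpers
[root@headnode ~]# ifconfig ib0 down
[root@headnode ~]# echo datagram > /sys/class/net/ib0/mode
[root@headnode ~]# ifconfig ib0 mtu 2044 up
# and (I think) set SET_IPOIB_CM=no in /etc/infiniband/openib.conf so the
# change survives a restart of openibd

Please correct me if that's not the right way to flip the mode.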
Again, any suggestions are greatly appreciated, and thanks in advance
for any replies!
-Aaron