From mrobbert at gmail.com  Wed Jan 30 16:43:15 2019
From: mrobbert at gmail.com (Michael Robbert)
Date: Thu, 31 Jan 2019 00:43:15 -0000
Subject: [Users] IPoIB ping timeouts
Message-ID: <CAEftmKScxCs8Hx2nYZzcyTvF8jFToaP_P362rNyRyHdMw_fX2A@mail.gmail.com>

I observed a very odd problem on our cluster today and I'm not sure how
best to troubleshoot it. We have a heterogeneous HPC cluster with a little
over 200 nodes. Most nodes are x86 (several generations), and most are
running CentOS 6.7 with the OS-provided OFED stack installed. All the IB
switches are from Mellanox, and most of the HCAs are from Mellanox (again
several generations). This setup has been running fine for years.

Today we started having problems with GPFS (which uses verbs on the IB
fabric), and I was able to track it down to the fact that a small subset of
nodes couldn't ping a small subset of other nodes, and only on the IB
subnet. This only became obvious because one of the nodes was our login
node, and GPFS was expelling it from the cluster, so the filesystem was
disappearing. I discovered that all of the compute nodes that had lost ping
to the login node were involved in a single job. When I killed that job,
ping between the nodes on the IB subnet started working again.

I copied the job data and started it on another set of nodes while
periodically monitoring them with ping. I did sporadically see ping stop
working to 2 of the 6 nodes that I was running on. I'm not sure if this is
relevant or not, but in all the cases that I observed, the compute nodes
that saw this problem have the very old InfiniHost III HCA.

When this problem happens I am able to log in to the node over the Ethernet
interface and don't see anything wrong. The CPU is busy, but the job is
configured not to use all the cores, so there are plenty of idle cycles;
system memory isn't starved; and the ARP tables for the IB subnet look
fine. I was even able to run tcpdump on the compute node and noted that it
was not seeing any packets coming from a host that wasn't able to ping it.
If I pinged the login node from the compute node, the login node saw the
ping requests and sent replies, but the replies never arrived at the
compute node. There was, however, other traffic from other hosts on the
fabric that appeared to be working fine.
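
For reference, the periodic ping monitoring described above could be
scripted roughly as follows. This is only a minimal Python sketch, not the
exact commands that were run: the function names are made up for
illustration, and it assumes the Linux iputils `ping`, where `-c` sets the
packet count and `-W` the reply timeout in seconds. It prints a line only
when a host changes between reachable and unreachable, which makes
sporadic losses easier to line up with job activity.

```python
import subprocess
import time
from datetime import datetime


def ping_once(host, timeout_s=2):
    """Return True if a single ICMP echo to `host` is answered in time."""
    # Assumption: Linux iputils ping (-c = count, -W = reply timeout in seconds).
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def state_changes(previous, current):
    """Return (host, was_up, is_up) for each host whose reachability changed."""
    return [
        (host, previous.get(host), up)
        for host, up in sorted(current.items())
        if previous.get(host) != up
    ]


def monitor(hosts, interval_s=10, rounds=None, probe=ping_once, sleep=time.sleep):
    """Probe all hosts every interval; print and collect only the transitions.

    `rounds=None` loops until interrupted. `probe` and `sleep` are injectable
    so the loop can be exercised without touching the network.
    """
    previous = {}
    events = []
    done = 0
    while rounds is None or done < rounds:
        current = {host: probe(host) for host in hosts}
        for host, was_up, is_up in state_changes(previous, current):
            stamp = datetime.now().isoformat(timespec="seconds")
            print(f"{stamp} {host}: {was_up} -> {is_up}", flush=True)
            events.append((host, was_up, is_up))
        previous = current
        done += 1
        sleep(interval_s)
    return events
```

Calling `monitor()` with the IPoIB addresses of the nodes in the job (and,
for comparison, their Ethernet addresses) would show whether the loss is
confined to the IB subnet. On the first round every host is reported once,
with `None` as its previous state.
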

So, after that long description, what I'm wondering is: what can I look for
that might be causing these nodes to lose IPoIB connectivity while some
particular code is running on them? Is there a particular lower-level
connectivity checker that I should use? Are there other things to check at
the OS level to see if some resource is full?
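
As one possible form of the OS-level check asked about above, counters
could be snapshotted while ping works and again while it is failing, then
compared. The Python sketch below is a guess at a starting point, not a
known diagnosis. The helper names are hypothetical, and the directories it
would be pointed at are assumptions: `/sys/class/net/ib0/statistics`
(assuming the IPoIB interface is named `ib0`) and
`/sys/class/infiniband/<device>/ports/<port>/counters`, with the device and
port filled in for the HCA in question.

```python
from pathlib import Path


def read_counters(directory):
    """Return {file name: integer value} for every numeric file in `directory`."""
    counters = {}
    for path in sorted(Path(directory).glob("*")):
        try:
            counters[path.name] = int(path.read_text().strip())
        except (OSError, ValueError):
            # Skip subdirectories, unreadable files and non-numeric entries.
            continue
    return counters


def changed(before, after):
    """Return {name: delta} for counters whose value differs between snapshots."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }
```

Counters that keep growing only while ping is failing (for example drop or
error counters) would at least narrow down where the replies are being
lost.
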
Any pointers in the right direction would be much appreciated.
Thanks,
Mike