From mrobbert at gmail.com Wed Jan 30 16:43:15 2019
From: mrobbert at gmail.com (Michael Robbert)
Date: Thu, 31 Jan 2019 00:43:15 -0000
Subject: [Users] IPoIB ping timeouts
Message-ID:

I observed a very odd problem on our cluster today and I'm not sure how best to troubleshoot it.

We have a heterogeneous HPC cluster with a little over 200 nodes. Most nodes are x86 (several generations), and most are running CentOS 6.7 with the OS OFED stack installed. All the IB switches are from Mellanox, and most of the HCAs are from Mellanox (again, several generations). This setup has been running fine for years.

Today we started having problems with GPFS (which uses verbs on the IB fabric), and I was able to track it down to the fact that a small subset of nodes couldn't ping a small subset of other nodes, and only on the IB subnet. This only became obvious because one of the nodes was our login node and GPFS was expelling it from the cluster, so the filesystem was disappearing.

I discovered that all of the compute nodes that were seeing this loss of ping to the login node were involved with a single job. When I killed that job, ping between the nodes on the IB subnet started working again. I copied the job data and started it on another set of nodes while periodically monitoring them with ping. I did sporadically see ping stop working to 2 of the 6 nodes that I was running on. I'm not sure if this is relevant or not, but in all cases that I observed, the compute nodes that saw this problem have the very old InfiniHost III HCA.

When this problem happens I am able to log in to the node over the Ethernet interface and don't see anything wrong. The CPU is busy, but the job is configured to not use all the cores so there are plenty of idle cycles, system memory isn't starved, and the ARP tables for the IB subnet look fine.

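The periodic ping monitoring described above could be sketched roughly as follows. This is only an illustration, not the procedure actually used on the cluster: it assumes Linux iputils `ping` (the `-c` and `-W` options), and the function names `ping_once`, `find_transitions`, and `monitor` are hypothetical. It prints a timestamped line only when a host changes between reachable and unreachable, which makes sporadic losses easier to line up with job activity.

```python
import subprocess
import time
from datetime import datetime


def ping_once(host, timeout_s=2):
    """Return True if one ICMP echo to host is answered (assumes Linux iputils ping)."""
    # -c 1: send a single request; -W: seconds to wait for the reply
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def find_transitions(previous, current):
    """Return (host, now_reachable) pairs whose state differs from the previous round.

    Hosts not seen before are assumed to have been reachable, so only
    failures are reported on the first round.
    """
    return [
        (host, up)
        for host, up in sorted(current.items())
        if previous.get(host, True) != up
    ]


def monitor(hosts, interval_s=10, rounds=None, probe=ping_once):
    """Probe every host each interval and print a line on each state change.

    rounds=None runs until interrupted; probe is injectable so the loop
    can be exercised without sending real pings.
    """
    state = {}
    done = 0
    while rounds is None or done < rounds:
        current = {host: probe(host) for host in hosts}
        stamp = datetime.now().isoformat(timespec="seconds")
        for host, up in find_transitions(state, current):
            print(f"{stamp} {host} {'reachable' if up else 'UNREACHABLE'}")
        state = current
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval_s)
    return state
```

Calling `monitor([...])` with the IPoIB addresses of the nodes in the job (placeholders to be filled in locally) runs until interrupted with Ctrl-C.
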
I was even able to run tcpdump on the node and noted that it was not seeing any packets coming from a host that wasn't able to ping it. If I pinged from the compute node to the login node, the login node saw the ping requests and sent replies, but the replies never arrived at the compute node. There was, however, other traffic from other hosts on the fabric that appeared to be working fine.

So, after that long description, what I'm wondering is: what can I look for that might be causing these nodes to lose IPoIB connectivity while some particular code is running on them? Is there a particular lower-level connectivity checker that I should use? Are there other things to check at the OS level to see if some resource is full? Any pointers in the right direction would be much appreciated.

Thanks,
Mike