[ofa-general] hca sma non-responsive but link still Active
Chas Williams (CONTRACTOR)
chas at cmf.nrl.navy.mil
Thu Dec 4 09:06:27 PST 2008
if i load the attached module on my host, the link winds up in a curious
state. the intent of the module is to duplicate a particular type of
kernel hang that blocks all the cpus from handling any work.
what happens is that the sma stops responding:
# ibportstate 90 1
ibportstate: iberror: failed: smp query nodeinfo failed
but the switch port on the other end of the link still reports a valid
state:
# ibportstate 70 18
PortInfo:
# Port info: Lid 70 port 18
LinkState:.......................Active
PhysLinkState:...................LinkUp
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps
LinkSpeedEnabled:................2.5 Gbps
LinkSpeedActive:.................2.5 Gbps
ibwarn: [6758] _do_madrpc: recv failed: Connection timed out
ibportstate: iberror: failed: smp query nodeinfo failed
we believe that the link layer is handled entirely in the firmware
which has no idea that the sma part in the kernel has gone to sleep.
the periodic light sweeps by the opensm don't seem to discover this
problem either.
this type of failure tends to make the ib utilities that scan the network
run rather slowly. ibdiagnet does indeed spot this broken host, but
perhaps the sm could be extended to attempt to do something about this
host, like reset the switch port? should it really require manual
intervention to clear this error?
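
To make the suggestion above concrete, here is a minimal sketch of the check an SM could apply per host port: the peer switch port still reports Active, yet the host's SMA has missed several consecutive queries. This is not opensm code. The struct, the function names, and the threshold of 3 are all hypothetical, chosen only for illustration; the link state numbers follow the PortInfo LinkState encoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* PortInfo LinkState values as encoded on the wire. */
enum link_state {
	LINK_DOWN   = 1,
	LINK_INIT   = 2,
	LINK_ARMED  = 3,
	LINK_ACTIVE = 4
};

/* Hypothetical bookkeeping an SM might keep for each host port. */
struct host_port {
	enum link_state peer_link_state; /* state reported by the switch port */
	unsigned int missed_queries;     /* consecutive SMA query timeouts */
};

/* Assumed threshold: how many sweeps in a row may time out. */
#define WEDGE_THRESHOLD 3

/* Record the outcome of one SMA query made during a sweep. */
void record_sma_query(struct host_port *p, bool answered)
{
	assert(p != NULL);
	if (answered)
		p->missed_queries = 0;
	else
		p->missed_queries++;
}

/* True when the link looks healthy from the switch side but the
 * SMA behind it has been silent for WEDGE_THRESHOLD sweeps. */
bool looks_wedged(const struct host_port *p)
{
	assert(p != NULL);
	return p->peer_link_state == LINK_ACTIVE &&
	       p->missed_queries >= WEDGE_THRESHOLD;
}
```

A port for which looks_wedged() returns true would be a candidate for whatever recovery the SM chooses, such as resetting the switch port. Requiring several consecutive misses is a guess at avoiding false positives from a single dropped MAD.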

/* doom.c -- reliably wedge an smp kernel
 *
 * build:
 *	echo 'obj-m += doom.o' > Makefile
 *	make -C /lib/modules/`uname -r`/build M=`pwd`
 *
 * usage:
 *	insmod doom.ko
 */

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/spinlock.h>
#include <linux/smp.h>

static void wedge(void *data)
{
	unsigned long flags;
	spinlock_t lock;

	printk(KERN_ERR "goodbye cruel world...\n");
	spin_lock_init(&lock);
	spin_lock_irqsave(&lock, flags);
	while (1)
		/* do nothing */;
}

static int __init doom_init(void)
{
	int i;

	for_each_possible_cpu(i) {
		if (i != smp_processor_id())
			smp_call_function_single(i, wedge, 0, 0, 0);
	}
	smp_call_function_single(smp_processor_id(), wedge, 0, 0, 0);

	return 0;
}

module_init(doom_init);
MODULE_AUTHOR("chas williams <chas at cmf.nrl.navy.mil>");
MODULE_DESCRIPTION("wedge the kernel but good");
MODULE_LICENSE("GPL");