[ofa-general] hca sma non-responsive but link still Active

Chas Williams (CONTRACTOR) chas at cmf.nrl.navy.mil
Thu Dec 4 09:06:27 PST 2008


if i load the attached module on my host, the link winds up in a curious
state.  the intent of the module is to duplicate a particular type of
kernel hang that blocks all the cpus from handling any work.

what happens is that the sma stops responding:

	# ibportstate  90 1
	ibportstate: iberror: failed: smp query nodeinfo failed

but the switch port on the other end of the link still reports a valid
state:

	# ibportstate  70 18
	PortInfo:
	# Port info: Lid 70 port 18
	LinkState:.......................Active
	PhysLinkState:...................LinkUp
	LinkWidthSupported:..............1X or 4X
	LinkWidthEnabled:................1X or 4X
	LinkWidthActive:.................4X
	LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps
	LinkSpeedEnabled:................2.5 Gbps
	LinkSpeedActive:.................2.5 Gbps
	ibwarn: [6758] _do_madrpc: recv failed: Connection timed out
	ibportstate: iberror: failed: smp query nodeinfo failed

we believe that the link layer is handled entirely in the firmware,
which has no idea that the sma part in the kernel has gone to sleep.
the periodic light sweeps by opensm don't seem to discover this
problem either.
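
put another way, the two observations above (link state as seen by the
switch port, and whether the host sma answers an smp query) combine into
three cases.  a minimal sketch in plain userspace c, only to make the
distinction concrete; the enum and function names are hypothetical and do
not come from opensm or libibmad:

```c
#include <stdbool.h>

/* hypothetical names for illustration; not part of opensm or libibmad */
enum port_health {
	PORT_DOWN,       /* peer switch port does not report Active */
	PORT_HEALTHY,    /* link is Active and the host sma answers */
	PORT_SMA_WEDGED  /* link is Active but smp queries time out */
};

/*
 * peer_link_active: what the switch end reports (e.g. lid 70 port 18)
 * sma_replied:      whether an smp query to the host (e.g. lid 90) succeeded
 */
static enum port_health classify_port(bool peer_link_active, bool sma_replied)
{
	if (!peer_link_active)
		return PORT_DOWN;
	return sma_replied ? PORT_HEALTHY : PORT_SMA_WEDGED;
}
```

the host described here lands in the third case: classify_port(true, false)
gives PORT_SMA_WEDGED.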

this type of failure tends to make the ib utilities that scan the network
run rather slowly.  ibdiagnet does indeed spot this broken host, but
perhaps the sm could be extended to attempt to do something about this
host, like reset the switch port?  should it really require manual
intervention to clear this error?
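
as a rough illustration of what such an extension might look like, here is
a sketch of a per-port counter that asks for a reset only after several
consecutive sweeps in which the link was Active but the sma did not reply.
this is not existing opensm behaviour; the struct, the function and the
threshold of 3 are all assumptions made up for the example:

```c
#include <stdbool.h>

#define MAX_MISSED_SWEEPS 3	/* assumed threshold, not an opensm setting */

/* hypothetical per-port bookkeeping kept by the sm */
struct port_watch {
	unsigned int missed;	/* consecutive sweeps with no sma reply */
};

/*
 * call once per sweep for each host-facing switch port.
 * returns true when the caller should reset the switch port.
 */
static bool port_watch_update(struct port_watch *w,
			      bool link_active, bool sma_replied)
{
	/* a down link or a working sma clears the count */
	if (!link_active || sma_replied) {
		w->missed = 0;
		return false;
	}

	if (++w->missed < MAX_MISSED_SWEEPS)
		return false;

	w->missed = 0;		/* start counting again after the reset */
	return true;
}
```

requiring several misses in a row keeps a single dropped mad from bouncing
a healthy port.  by hand, the equivalent action would be the reset
operation of ibportstate on the switch side, e.g. "ibportstate 70 18 reset".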

/* doom.c -- reliably wedge an smp kernel 
 *
 * build:
 *        echo 'obj-m   += doom.o' > Makefile
 *        make -C /lib/modules/`uname -r`/build M=`pwd`
 *
 * usage:
 *        insmod doom.ko
 */

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/spinlock.h>
#include <linux/smp.h>

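/* runs on the target cpu: take a private lock with interrupts off, then spin */
static void wedge(void *data)
{
	unsigned long flags;
	spinlock_t lock;

	printk(KERN_ERR "goodbye cruel world...\n");

	/* irqsave disables local interrupts, so nothing interrupts the loop below */
	spin_lock_init(&lock);
	spin_lock_irqsave(&lock, flags);

	while (1)
		/* do nothing */;
}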

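static int __init doom_init(void)
{
	int i;

	/*
	 * wedge every other cpu first, without waiting for the call to
	 * finish.  the arguments here are (cpu, func, info, nonatomic, wait);
	 * kernels from 2.6.27 on dropped the nonatomic argument.
	 */
	for_each_possible_cpu(i) {
		if (i != smp_processor_id())
			smp_call_function_single(i, wedge, NULL, 0, 0);
	}

	/* finally wedge the cpu we are running on */
	smp_call_function_single(smp_processor_id(), wedge, NULL, 0, 0);

	return 0;
}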

module_init(doom_init);

MODULE_AUTHOR("chas williams <chas at cmf.nrl.navy.mil>");
MODULE_DESCRIPTION("wedge the kernel but good");
MODULE_LICENSE("GPL");


