[openib-general] crash in mthca soon after loading drivers

Sean Hefty mshefty at ichips.intel.com
Wed Dec 8 17:12:16 PST 2004


Sean Hefty wrote:
> I'm getting the following bug in mthca when loading the drivers (core, 
> mad, and mthca).  The system is attached to a fabric with opensm running 
> on top of the Mellanox gold software stack.  I hit this when running 
> with the tip of openib.  Any help would be, well, helpful.
> 
> - Sean
> 
> 
> Dec  8 14:53:47 mshefty-linux2 kernel: kernel BUG at 
> drivers/infiniband/hw/mthca/mthca_cmd.c:328!

I still need to spend more time investigating this, but looking at 
mthca_cmd_wait():

	if (down_interruptible(&dev->cmd.event_sem))
		return -EINTR;

	spin_lock(&dev->cmd.context_lock);
	BUG_ON(dev->cmd.free_head < 0);
	context = &dev->cmd.context[dev->cmd.free_head];
	dev->cmd.free_head = context->next;
	spin_unlock(&dev->cmd.context_lock);

	...snip...

	wait_for_completion(&context->done);

	***** possible race here *****

	...snip...
out:
	spin_lock(&dev->cmd.context_lock);
	context->next = dev->cmd.free_head;
	dev->cmd.free_head = context - dev->cmd.context;
	spin_unlock(&dev->cmd.context_lock);

There appears to be a race here: event_sem can be incremented (in 
mthca_cmd_complete()) before free_head has been updated.  A second 
call to mthca_cmd_wait() could then get the semaphore but find the 
free list empty (free_head < 0), which would trip the BUG_ON() above.  
In my case, max_cmd is set to 1.
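
To make the suspected interleaving concrete, here is a small userspace 
model that replays it one step at a time.  This is only a sketch and not 
the driver code: struct cmd_model, the model_*() helpers, and 
replay_suspected_race() are hypothetical names, the semaphore is a plain 
counter, and the whole thing assumes the semaphore really is released 
before the context goes back on the free list, which I have not verified.

```c
#include <assert.h>

#define MAX_CMD 1	/* matches the max_cmd = 1 case described above */

/* Hypothetical stand-in for the relevant fields of dev->cmd. */
struct cmd_model {
	int event_sem;		/* count of the counting semaphore */
	int free_head;		/* index of first free context, -1 if none */
	int next[MAX_CMD];	/* per-context free-list link */
};

static void model_init(struct cmd_model *m)
{
	int i;

	m->event_sem = MAX_CMD;
	m->free_head = 0;
	for (i = 0; i < MAX_CMD; ++i)
		m->next[i] = (i + 1 < MAX_CMD) ? i + 1 : -1;
}

/* Non-blocking down(): returns 1 if the semaphore was acquired. */
static int model_try_down(struct cmd_model *m)
{
	if (m->event_sem <= 0)
		return 0;
	--m->event_sem;
	return 1;
}

/* Pop a context index; -1 is where the driver would hit BUG_ON(). */
static int model_get_context(struct cmd_model *m)
{
	int idx = m->free_head;

	if (idx < 0)
		return -1;
	m->free_head = m->next[idx];
	return idx;
}

/* Push a context index back onto the free list. */
static void model_put_context(struct cmd_model *m, int idx)
{
	m->next[idx] = m->free_head;
	m->free_head = idx;
}

/*
 * Replay the suspected interleaving.  Returns the context index seen by
 * the second caller; -1 means it found the free list empty.
 */
int replay_suspected_race(void)
{
	struct cmd_model m;
	int ctx_a, ctx_b, got;

	model_init(&m);

	/* Caller A: takes the semaphore and the only context. */
	got = model_try_down(&m);
	assert(got);
	ctx_a = model_get_context(&m);
	assert(ctx_a == 0);

	/* Completion path: semaphore released while A still owns ctx_a. */
	++m.event_sem;

	/* Caller B: gets the semaphore before A reaches "out:". */
	got = model_try_down(&m);
	assert(got);
	ctx_b = model_get_context(&m);

	/* Caller A finally returns its context, too late for B. */
	model_put_context(&m, ctx_a);
	return ctx_b;
}
```

In this model the second caller holds the semaphore but gets no context, 
which corresponds to free_head < 0 at the BUG_ON().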

I need to verify whether this is indeed what is happening and, if so, 
work out how to fix it.
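
One ordering that might close the window, if the analysis holds, is to 
release the semaphore only after the context is back on the free list.  
The sketch below compares the two orderings in a single-context 
userspace model.  Everything in it is an assumption for illustration: 
struct one_ctx_model and b_sees_empty_list() are hypothetical names, the 
semaphore is a plain counter, and none of this has been tried against 
the driver.

```c
#include <assert.h>

/* Hypothetical single-context model (max_cmd = 1). */
struct one_ctx_model {
	int sem;	/* semaphore count */
	int free_head;	/* 0 when the context is free, -1 when taken */
};

/*
 * Replay caller A finishing a command while caller B tries to start one
 * at the worst possible moment.  If release_after_put is nonzero, the
 * semaphore is released only after the context is back on the free
 * list.  Returns 1 if B holds the semaphore with an empty free list.
 */
int b_sees_empty_list(int release_after_put)
{
	struct one_ctx_model m = { 1, 0 };
	int b_failed = 0;

	/* Caller A: down() and take the only context. */
	--m.sem;
	m.free_head = -1;

	if (!release_after_put)
		++m.sem;	/* released before the context is returned */

	/* Caller B tries to start here, before A returns its context. */
	if (m.sem > 0) {
		--m.sem;
		if (m.free_head < 0)
			b_failed = 1;	/* driver would hit BUG_ON() */
	}

	/* Caller A returns the context to the free list. */
	m.free_head = 0;

	if (release_after_put)
		++m.sem;	/* released only once the list is refilled */

	/* Sanity check: the context always ends up back on the list. */
	assert(m.free_head == 0);
	return b_failed;
}
```

With release_after_put set to 0 the second caller gets the semaphore and 
finds the list empty; with it set to 1 the second caller cannot get the 
semaphore until the context has been returned.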

- Sean



