[ofw] crash while disabling HCA on multihome machine

Leonid Keller leonid at mellanox.co.il
Thu Mar 5 07:16:47 PST 2009


Crash:
    BugCheck 18, {bad0b0b0, fffffa800a3f7a90, 2, ffffffffffffffff}
The reference count of an object is illegal for the current state of the
object.
 
Setup: 
    Two HCAs, IB full stack + the patch, removing the registration HCA
with IBAL.
The problem doesn't happen without WinVerbs and WinMad.
 
Reproduce:
    1. Disable/enable HCA0. 
    2. Disable/enable HCA0 again (optional).
    3. Disable/enable HCA1.
 
Quick Analysis:
 
0: kd> !analyze -v
 
REFERENCE_BY_POINTER (18)
Arguments:
Arg1: 00000000bad0b0b0, Object type of the object whose reference count
is being lowered
Arg2: fffffa800a3f7a90, Object whose reference count is being lowered
Arg3: 0000000000000002, Reserved
Arg4: ffffffffffffffff, Reserved

The object being dereferenced illegally is a device object of IBBUS.SYS:
 
0: kd> !devobj fffffa800a3f7a90
Device object (fffffa800a3f7a90) is for:
  \Driver\ibbus DriverObject fffffa800a3f65d0

The corrupted reference count is PointerCount:
 
0: kd> !object fffffa800a3f7a90
Object: fffffa800a3f7a90  Type: (bad0b0b0) 
    ObjectHeader: fffffa800a3f7a60 (old version)
    HandleCount: 0  PointerCount: 4294967295    /* it's -1 */
    Directory Object: fffffa800a4ab740  Name: 
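
The "it's -1" annotation above can be checked with a few lines of C. This
is only a sketch: it assumes the debugger printed the counter as an
unsigned 32-bit number, and the helper names as_signed_count and
dereference_once are made up for illustration, they are not part of any
driver or of the debugger.

```c
#include <stdint.h>

/* Interpret a counter printed as unsigned 32-bit (assumption) as the
   signed value it represents in two's complement. */
static int32_t as_signed_count(uint32_t printed)
{
    if (printed <= (uint32_t)INT32_MAX)
        return (int32_t)printed;
    /* Portable conversion for values above INT32_MAX. */
    return -(int32_t)(UINT32_MAX - printed) - 1;
}

/* One dereference too many: decrementing a count of 0 wraps around. */
static uint32_t dereference_once(uint32_t count)
{
    return count - 1u;
}
```

With these helpers, as_signed_count(4294967295u) yields -1, and
dereference_once(0u) yields 4294967295, i.e. the printed value is what one
extra dereference of an object with a zero count looks like.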

More analysis:
I suspect that either WinVerbs or WinMad references the wrong IBBUS
device object. My guess is that it is WinMad.
 
To see this, do the following.
Reboot the machine (with two cards), break into the debugger and look at
the device stacks:
 
HCA0:
3: kd> !devstack 0xfffffa800a2cc060
  !DevObj   !DrvObj            !DevExt   ObjectName
  fffffa800a3ede20  \Driver\WinMad     fffffa800a3ec390  
  fffffa800a3ebc70  \Driver\WinVerbs   fffffa800a3eadb0  
  fffffa800a3ea040  \Driver\ibbus      fffffa800a3ea190  
  fffffa800a3e9460  \Driver\mlx4_hca   fffffa800a3e95b0  
> fffffa800a2cc060  \Driver\mlx4_bus   fffffa800a2caec0  00000055

Look at the PointerCount of IBBUS0 (the IBBUS device for HCA0): it is 2.
 
3: kd> !object fffffa800a3ea040  
Object: fffffa800a3ea040  Type: (fffffa8006a22840) Device
    ObjectHeader: fffffa800a3ea010 (old version)
    HandleCount: 0  PointerCount: 2
 
 
The PointerCount of IBBUS1 (the IBBUS device for HCA1), however, is 4.
 
HCA1:
3: kd> !devstack 0xfffffa8008b2b950
  !DevObj   !DrvObj            !DevExt   ObjectName
  fffffa800a3e9e20  \Driver\WinMad     fffffa800a3e5390  
  fffffa800a3df800  \Driver\WinVerbs   fffffa800a3e7570  
  fffffa800a3e3600  \Driver\ibbus      fffffa800a3e3750  ibal
  fffffa800a3e4040  \Driver\mlx4_hca   fffffa800a3e4190  
> fffffa8008b2b950  \Driver\mlx4_bus   fffffa8008b2b390  00000054

3: kd> !object fffffa800a3e3600
Object: fffffa800a3e3600  Type: (fffffa8006a22840) Device
    ObjectHeader: fffffa800a3e35d0 (old version)
    HandleCount: 0  PointerCount: 4
 
What happens while reproducing the crash?
When you disable HCA0, it decrements IBBUS1's PointerCount to 3.
When you then disable HCA1, IBBUS1's PointerCount becomes -1 and you get
bugcheck 0x18.
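
The arithmetic behind that sequence can be replayed in a toy model. This
is a sketch, not driver code: struct model_dev, model_release and
model_replay are hypothetical names, and the model assumes (this is not
confirmed by the traces above) that a normal disable of HCA1 releases all
4 references that IBBUS1 holds after boot.

```c
#include <stdint.h>

/* Hypothetical model: a device object reduced to its pointer count. */
struct model_dev {
    int32_t pointer_count;
};

/* Drop n references and return the resulting count. */
static int32_t model_release(struct model_dev *dev, int32_t n)
{
    dev->pointer_count -= n;
    return dev->pointer_count;
}

/* Replay the reported sequence for IBBUS1 and return the final count. */
static int32_t model_replay(void)
{
    struct model_dev ibbus1 = { 4 };  /* PointerCount observed after boot */

    /* Reported: disabling HCA0 takes IBBUS1 from 4 to 3. */
    model_release(&ibbus1, 1);

    /* Assumption: disabling HCA1 releases the 4 references it expects. */
    return model_release(&ibbus1, 4);
}
```

Under that assumption model_replay() returns -1: the one reference taken
away while disabling HCA0 is exactly the one that is missing when HCA1 is
torn down.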
 
I didn't have time to continue the investigation.
Maybe you can look into it?