[ofw] RE: Disconnection problem and AL reference

James Yang jyang at xsigo.com
Wed Oct 15 12:30:41 PDT 2008


Hi Fab,

After I call cm_dreq(), its callback is never called, and the callbacks for the work items are never called either. After cm_dreq() times out, I call destroy_qp(), which returns a successful status. But the reference count of the QP is always 1.

Which area should I look at to find the problem?

Thanks,
James


-----Original Message-----
From: Fab Tillier [mailto:ftillier at windows.microsoft.com] 
Sent: Wednesday, October 15, 2008 10:23 AM
To: James Yang; ofw at lists.openfabrics.org
Subject: RE: Disconnection problem and AL reference

Hi James,

When you see this hang, is your QP object still alive? What is its reference count? If the cm_dreq function isn't returning, that is likely where the problem lies. I'd suggest looking at why.
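
To make the reference-count question concrete, here is a minimal toy model in C of how one leftover reference on a child object can keep its parent from being destroyed. It is only a sketch under stated assumptions: the names toy_obj_t, toy_obj_init, toy_obj_ref, toy_obj_deref and toy_obj_try_destroy are hypothetical, nothing here is taken from the AL sources, and the real stack may track references differently.

```c
#include <assert.h>
#include <stddef.h>

/* Toy reference-counted object. Hypothetical model, not AL code. */
typedef struct toy_obj {
    struct toy_obj *parent;  /* a live child pins its parent with one reference */
    int ref_cnt;             /* references currently held on this object */
} toy_obj_t;

/* Initialise an object and take one reference on its parent, if any. */
static void toy_obj_init(toy_obj_t *o, toy_obj_t *parent)
{
    o->parent = parent;
    o->ref_cnt = 0;
    if (parent != NULL)
        parent->ref_cnt++;
}

/* Take a reference, e.g. for work that is still outstanding. */
static void toy_obj_ref(toy_obj_t *o)
{
    o->ref_cnt++;
}

/* Drop a reference that was taken earlier. */
static void toy_obj_deref(toy_obj_t *o)
{
    assert(o->ref_cnt > 0);
    o->ref_cnt--;
}

/* Returns 1 if the destroy can complete now, 0 if a synchronous destroy
 * would have to keep waiting because references are still held. */
static int toy_obj_try_destroy(toy_obj_t *o)
{
    if (o->ref_cnt != 0)
        return 0;
    if (o->parent != NULL) {
        toy_obj_deref(o->parent);  /* release the pin on the parent */
        o->parent = NULL;
    }
    return 1;
}
```

In this model, a QP object created under an AL object and left with one unreleased reference makes toy_obj_try_destroy() return 0 for both objects, and the parent's count stays at 1 until the child's last reference is dropped.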

-Fab

>From: ofw-bounces at lists.openfabrics.org
>[mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of James Yang
>Sent: Tuesday, October 14, 2008 3:02 PM
>To: ofw at lists.openfabrics.org
>Subject: [ofw] Disconnection problem and AL reference
>
>Hi,
>
>Our driver product is based on WinOF1.1. Recently I saw a problem where Windows cannot shut down. The procedure and observations are as follows:
>
>Install the driver, then reboot the system while there is still some traffic going on.
>
>We do the following in our driver, and everything seems to work until the reboot.
>
>*         create_cq(): one completion queue for receive and one for send, each with its callback function set
>
>*         create_qp() with the queues created above, and set the initial state to IB_QPS_INIT
>
>*         cm_req() with the QP and correct connection path
>
>*         post_recv() with 100 packet buffers for receiving data
>
>*         post_send() when necessary
>
>
>Receive and send work fine, with the respective callbacks invoked whenever there is data activity.
>
>At a certain point during shutdown, we call cm_dreq() to initiate a disconnect, but the 100 receive work items are never released and the callback functions are never called. If we go on to destroy the QP, the IB stack cannot finish its cleanup because an extra reference is still held. A message similar to the following line shows up in the debug build:
>
>[AL]print_al_obj() !ERROR!: AL object fffffadf379c8280(AL_OBJ_TYPE_H_AL),
>
>
>It seems the AL handle we opened can't be destroyed. But I suspect we may already be in a bad state before that.
>
>WinDbg stack; this is on an x64 Win2003 server:
>        fffffadf`2664e880 fffff800`01027682 nt!KiSwapContext+0x85
>        fffffadf`2664ea00 fffff800`0102828e nt!KiSwapThread+0x3c9
>        fffffadf`2664ea60 fffffadf`25ac7a3d nt!KeWaitForSingleObject+0x5a6
>        fffffadf`2664eae0 fffffadf`25b5fca8 ibbus!cl_event_wait_on+0x11d [c:\windows-openib\src\winib-1176g\core\complib\kernel\cl_event.c @ 59]
>        fffffadf`2664eb40 fffffadf`25b0013b ibbus!sync_destroy_obj+0x228 [c:\windows-openib\src\winib-1176g\core\al\al_common.c @ 513]
>        fffffadf`2664ebb0 fffffadf`25a1f8c7 ibbus!ib_close_al+0x3bb [c:\windows-openib\src\winib-1176g\core\al\al.c @ 89]
>        fffffadf`2664ec10 fffffadf`25a1b23f MyDriver!IBAccessLayer::Close+0x77
>
>The AL handle ref_cnt is 1 here.
>
>Can anyone shed some light on this? Is this a known issue which is fixed in WinOF2.0 or is it an unknown problem?
>
>Thanks,
>James
>



