[ofw] System shutdown failure - living on the edge

Smith, Stan stan.smith at intel.com
Wed Apr 15 17:22:31 PDT 2009


Hello,
  I've been seeing a system shutdown failure on and off for a few weeks. The failure occurs because the HCA driver is already partially shut down (dev == NULL) when an AL async processing thread, unaware that a system shutdown is in progress, decides it's time to send/forward a MAD on QP 1/2.
This failure occurs most frequently in multi-HCA systems, although on rare occasions I have seen it in a single-HCA system.

How did we get here?

During system shutdown, Windows does not call the device removal functions used in disable/enable processing; it only calls the PnP power routines. It is the device removal routines that shut down IBAL once all HCAs are removed, so on the shutdown path IBAL is never shut down.
Ibbus fdo_set_power() basically passes the POWER IRP down the device stack to the HCA.
Consequently, the IBAL async processing threads keep merrily processing MADs.
On occasion, an IBAL special QP thread will attempt to send/forward a MAD while or after the HCA driver handles the POWER IRP... boom, from dev being NULL.

What suggestions do you have for repairing this, so that the IBAL async threads are shut down before the POWER IRP is passed down to the HCA driver?
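
To make the question concrete, here is a minimal user-mode sketch of one possible direction: a gate that the async workers must pass through before calling into the HCA, and that the power path closes and drains before forwarding the POWER IRP. Everything in it (mad_gate_t, gate_enter(), gate_leave(), gate_close()) is a hypothetical name made up for illustration, not existing IBAL or complib code, and it uses C11 atomics in place of the kernel primitives a real driver would use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical gate; nothing here exists in IBAL today. */
typedef struct {
    atomic_bool closing;    /* set once the power-down path has started */
    atomic_int  in_flight;  /* workers currently inside the HCA driver */
} mad_gate_t;

static void gate_init(mad_gate_t *g)
{
    atomic_init(&g->closing, false);
    atomic_init(&g->in_flight, 0);
}

/*
 * Worker side: call before sending/forwarding a MAD.
 * Returns false if shutdown has begun, in which case the MAD must be
 * dropped or failed without touching the HCA.
 * The counter is bumped BEFORE the flag is checked, so gate_close()
 * can never miss a worker that saw the gate as open.
 */
static bool gate_enter(mad_gate_t *g)
{
    atomic_fetch_add(&g->in_flight, 1);
    if (atomic_load(&g->closing)) {
        atomic_fetch_sub(&g->in_flight, 1);
        return false;
    }
    return true;
}

/* Worker side: call after the HCA call returns. */
static void gate_leave(mad_gate_t *g)
{
    int prev = atomic_fetch_sub(&g->in_flight, 1);
    assert(prev > 0);   /* every leave must pair with a successful enter */
    (void)prev;
}

/*
 * Power path: call before passing the POWER IRP down the stack.
 * Blocks new sends, then waits for the ones already in progress.
 * A real driver would wait on an event here instead of spinning.
 */
static void gate_close(mad_gate_t *g)
{
    atomic_store(&g->closing, true);
    while (atomic_load(&g->in_flight) != 0)
        ;   /* drain */
}
```

The idea would be for the local MAD send path to bracket its call into the HCA with gate_enter()/gate_leave(), and for the ibbus power handler to call gate_close() before forwarding the IRP. Whether the right place for this is the special QP service, the async proc manager, or somewhere else is exactly what I'm unsure about.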

Thanks,

Stan.

WinDbg output from a system shutdown crash (dev == NULL):

STACK_TEXT:
f5ffcba8 f59c278f 00000000 00000002 00000000 mthca!mthca_alloc_mailbox+0x2c [f:\openib-windows-svn\latest\gen1\trunk\hw\mthca\kernel\mthca_cmd.c @ 581]
f5ffcc08 f59c7650 00000000 00000000 00000000 mthca!mthca_MAD_IFC+0x1f [f:\openib-windows-svn\latest\gen1\trunk\hw\mthca\kernel\mthca_cmd.c @ 1753]
f5ffcc68 f59b1401 00000000 00000000 fed35701 mthca!mthca_process_mad+0x3a0 [f:\openib-windows-svn\latest\gen1\trunk\hw\mthca\kernel\mthca_mad.c @ 254]
f5ffccc4 f5cf5a3b ff764e10 82031801 fd23e658 mthca!mlnx_local_mad+0x1d1 [f:\openib-windows-svn\latest\gen1\trunk\hw\mthca\kernel\hca_verbs.c @ 1591]
f5ffccf4 f5cae250 81cf5cd0 ff250401 fd23e658 ibbus!al_local_mad+0x6bb [f:\openib-windows-svn\latest\gen1\trunk\core\al\al_mad.c @ 3252]
f5ffcd50 f5cadf0f fd957000 ff250468 00000001 ibbus!fwd_local_mad+0x270 [f:\openib-windows-svn\latest\gen1\trunk\core\al\kernel\al_smi.c @ 2289]
f5ffcd70 f5c99e56 fd979b08 00000001 fd979b08 ibbus!send_local_mad_cb+0xff [f:\openib-windows-svn\latest\gen1\trunk\core\al\kernel\al_smi.c @ 2449]
f5ffcd8c f5c9a029 8202a008 8202a008 00000000 ibbus!__cl_async_proc_worker+0x96 [f:\openib-windows-svn\latest\gen1\trunk\core\complib\cl_async_proc.c @ 153]
f5ffcda0 f5c9b34f 8202a008 f5ffcddc 80920833 ibbus!__cl_thread_pool_routine+0x59 [f:\openib-windows-svn\latest\gen1\trunk\core\complib\cl_threadpool.c @ 67]
f5ffcdac 80920833 82155930 00000000 00000000 ibbus!__thread_callback+0x2f [f:\openib-windows-svn\latest\gen1\trunk\core\complib\kernel\cl_thread.c @ 49]

FOLLOWUP_IP:
mthca!mthca_alloc_mailbox+2c [f:\trunk\hw\mthca\kernel\mthca_cmd.c @ 581]
f59ba7ac 8b8874020000    mov     ecx,dword ptr [eax+274h]

FAULTING_SOURCE_CODE:
   577:         mailbox = kmalloc(sizeof *mailbox, gfp_mask);
   578:         if (!mailbox)
   579:                 return ERR_PTR(-ENOMEM);
   580:
>  581:         mailbox->buf = pci_pool_alloc(dev->cmd.pool, gfp_mask, &mailbox->dma);
   582:         if (!mailbox->buf) {
   583:                 kfree(mailbox);
   584:                 return ERR_PTR(-ENOMEM);
   585:         }
   586:


