[ofw] RE: HPC head-node slow down when OpenSM is started on the head-node (RC4,svn.1686).
Smith, Stan
stan.smith at intel.com
Fri Oct 24 13:00:11 PDT 2008
Hello All,
Below are the results of snooping around with the debugger connected to the head-node while it is operating in the OpenSM-induced slow-down mode.
The interesting item is the call to ipoib_port_up() with p_port == 0, which looks to be a problem; a clobbered stack?
The captured windbg story is attached.
Possible results of not holding the port lock from __bcast_cb()?
Please advise on further debug.
Stan.
nt!DbgBreakPointWithStatus
nt!wctomb+0x4cbf
nt!KeUpdateSystemTime+0x21f (TrapFrame @ fffffa60`022e9840)
nt!KeReleaseInStackQueuedSpinLock+0x2d
nt!KeDelayExecutionThread+0x72c
ipoib!ipoib_port_up(struct _ipoib_port * p_port = 0x00000000`00000000, struct _ib_pnp_port_rec * p_pnp_rec = 0xfffffa60`024be780)+0x79 [f:\openib-windows-svn\wof2-0\rc4\trunk\ulp\ipoib\kernel\ipoib_port.c @ 5186]
ipoib!__ipoib_pnp_cb(struct _ib_pnp_rec * p_pnp_rec = 0xfffffa60`024be780)+0x20d [f:\openib-windows-svn\wof2-0\rc4\trunk\ulp\ipoib\kernel\ipoib_adapter.c @ 678]
ibbus!__pnp_notify_user(struct _al_pnp * p_reg = 0xfffffa80`05262d90, struct _al_pnp_context * p_context = 0xfffffa60`024be110, struct _al_pnp_ca_event * p_event_rec = 0xfffffa80`08b65108)+0x17b [f:\openib-windows-svn\wof2-0\rc4\trunk\core\al\kernel\al_pnp.c @ 557]
ibbus!__pnp_process_port_forward(struct _al_pnp_ca_event * p_event_rec = 0x00000000`00000000)+0xa6 [f:\openib-windows-svn\wof2-0\rc4\trunk\core\al\kernel\al_pnp.c @ 1279]
ibbus!__pnp_check_ports(struct _al_ci_ca * p_ci_ca = 0xfffffa80`04bcc8c0, struct _ib_ca_attr * p_old_ca_attr = 0x00000000`00000001)+0x14d [f:\openib-windows-svn\wof2-0\rc4\trunk\core\al\kernel\al_pnp.c @ 1371]
ibbus!__pnp_check_events(struct _cl_async_proc_item * p_item = 0xfffffa80`04bc3e98)+0x171 [f:\openib-windows-svn\wof2-0\rc4\trunk\core\al\kernel\al_pnp.c @ 1566]
ibbus!__cl_async_proc_worker(void * context = 0xfffffa80`04bc3d60)+0x61 [f:\openib-windows-svn\wof2-0\rc4\trunk\core\complib\cl_async_proc.c @ 153]
ibbus!__cl_thread_pool_routine(void * context = 0xfffffa80`04bc3860)+0x41 [f:\openib-windows-svn\wof2-0\rc4\trunk\core\complib\cl_threadpool.c @ 66]
ibbus!__thread_callback(struct _cl_thread * p_thread = 0x00380031`00430032)+0x28 [f:\openib-windows-svn\wof2-0\rc4\trunk\core\complib\kernel\cl_thread.c @ 49]
nt!ProbeForRead+0xbd3
nt!_misaligned_access+0x4f6
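As an aside, here is a minimal, self-contained sketch of the guard being discussed above - validate the port pointer in the PnP callback before handing it to the port-up path. All type and function names below are simplified stand-ins for illustration, not the actual RC4 ipoib structures:

/* Hypothetical sketch: check the PnP callback's context before use,
 * so a NULL port pointer breaks immediately instead of being passed
 * into the port-up path.  Names are stand-ins, not the real driver. */
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

typedef struct port    { int lid; }          port_t;
typedef struct pnp_rec { port_t *context; }  pnp_rec_t;

/* Stand-in for ipoib_port_up(): assumes p_port is never NULL. */
static void port_up( port_t *p_port )
{
    printf( "port up, lid = %d\n", p_port->lid );
}

/* Stand-in for __ipoib_pnp_cb(): validate the context first. */
static int pnp_cb( pnp_rec_t *p_rec )
{
    port_t *p_port = p_rec ? p_rec->context : NULL;

    if( p_port == NULL )
    {
        /* A clobbered stack or a racing teardown lands here instead
         * of faulting (or spinning) inside port_up(). */
        assert( p_port != NULL );
        return -1;
    }
    port_up( p_port );
    return 0;
}

int main( void )
{
    port_t    port = { 7 };
    pnp_rec_t good = { &port };
    pnp_rec_t bad  = { NULL };

    pnp_cb( &good );   /* normal path */
    pnp_cb( &bad );    /* trips the assert in a debug build */
    return 0;
}

If ipoib_port_up() really is being entered with p_port == 0, a CL_ASSERT or an explicit check like the one above near the top of the call chain would at least turn the bad call into an immediate break-in rather than a silent dereference.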
Smith, Stan wrote:
> Gentlemen,
> The HPC head-node slow-down is back with a vengeance... instead of
> only 24% of available processor cycles we are now up to 31%. Needless
> to say, the system is unusable. Along the path of attaching a debugger
> I've learned the slow-down is triggered only by OpenSM, although I was
> able to shut down OpenSM and the slow-down persisted.
>
> The story...
>
> A functional 15-node HPC system without OpenSM on the head-node; the
> SM is supplied by a SilverStorm switch with an embedded SM.
> On the head-node, run Server Manager and change the OpenSM startup
> type from 'Disabled' to 'Manual' (an equivalent command-line sequence
> is sketched after these steps).
> Disconnect the SilverStorm IB switch, as it's daisy-chained to the
> Mellanox switch which connects all HPC nodes; at this point no SM is
> running on the fabric.
> From the head-node Server Manager, 'Start' OpenSM.
> Wait 20 seconds or so, then pop open the Task Manager Performance
> view - notice the large % of CPU utilization.
> Once the system starts running slowly, from the head-node Server
> Manager, 'Stop' OpenSM.
> CPU utilization is still high.
> Reconnect SilverStorm switch + SM.
> CPU utilization is still high?
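For reference (per the note in the steps above), the same sequence can be driven from a command window; the service short name opensm is an assumption here, so confirm it with the query first:

  sc query opensm
      (confirm the service name and current state)
  sc config opensm start= demand
      (startup type 'Disabled' -> 'Manual')
  sc start opensm
      ('Start' in Server Manager)
  sc stop opensm
      ('Stop' in Server Manager)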
> Going to the head-node debugger, breaking in and dumping processes
> and threads revealed little useful info.
> Debugger command suggestions? (A few generic ones are listed below.)
> Will try a checked version of ipoib.sys tomorrow.
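On the debugger-command question: a few generic kernel-debugger commands that often show where kernel-mode CPU is going; nothing here is specific to ipoib or ibbus, and <address> is a placeholder:

  !running           which threads are currently on each processor
  !stacks 2          call-stack summary for every kernel thread
  !process 0 7       every process, with its threads, in full detail
  !thread <address>  full detail for one suspicious thread
  kb                 call stack of the current thread at the break-in

The !stacks output in particular tends to make a spinning worker thread (such as the ibbus async-proc threads in the trace above) stand out.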
>
> Stan.
>
> BTW, I did see a shutdown BSOD with a minidump that showed
> ipoib!__cl_asynch_processor( 0 ) being the faulting call.
> Dereferencing the *context is what caused the BSOD.
>
>
> Smith, Stan wrote:
>> Hello,
>>
>> The good news is OpenSM is working nicely on all WinOS flavors. The
>> not-so-good news is that OpenSM on the HPC head-node consumes 25% of
>> the system; plain win2k8 works fine running OpenSM.
>>
>> On our 15-node HPC cluster, if pre_RC4 OpenSM is started during a
>> WinOF install, or if OpenSM is started on the head-node after the
>> WinOF install (OpenSM not started during the install), right-clicking
>> the task-bar network icon and selecting Network and Sharing Center
>> fails to reach the Network and Sharing Center.
>> The best we see is the Network and Sharing Center GUI window popping
>> open and remaining blank (white). The rest of the system is functional
>> in that command windows are OK and the Start menu is OK, but you are
>> certain to hang a window if you access the network via a GUI
>> interface. A command window can set the IPoIB IPv4 address via netsh
>> (set address), and ipconfig works (see the example below).
>> <Ctrl-Alt-Del> -> resource manager shows about 25% of the system
>> (4 cores) running in the NT kernel, followed by network services. I'm
>> guessing massive numbers of system calls from a driver?
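For completeness, the command-window form referred to above would look something like the lines below; the interface name and addresses are placeholders, not values from this cluster:

  netsh interface ip set address name="Local Area Connection 3" static 192.168.10.1 255.255.255.0
  ipconfig /all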
>>
>> We first started noticing similar behavior with RC2. Starting OpenSM
>> during an install always failed (it caused the system slow-down),
>> although if you started OpenSM after the install, the head-node was
>> OK. RC3 behaved likewise.
>> With pre_RC4 (svn.1661) the head-node now slows down whether OpenSM
>> is started after the install or during the WinOF install.
>>
>> Again, all other WinOS flavors work fine with OpenSM started during
>> the install or afterwards. HPC works fine with the SilverStorm
>> switch's embedded SM. I strongly suspect the HPC head-node would work
>> fine if OpenSM were run from another Windows/Linux system.
>>
>> Thoughts or suggestions on further diagnosis as to why running OpenSM
>> causes the HPC head-node such a slow-down? Part of the story may have
>> something to do with the number of HPC compute nodes.
>>
>> Any chance you could run OpenSM on your HPC head node to see if you
>> see similar behavior?
>>
>> Thanks,
>>
>> Stan.
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: slowdown-1.TXT
URL: <http://lists.openfabrics.org/pipermail/ofw/attachments/20081024/2369e84e/attachment.ksh>