[ofw] RE: HPC head-node slow down when OpenSM is started on the head-node (RC4, svn.1691).

Leonid Keller leonid at mellanox.co.il
Tue Oct 28 01:45:31 PDT 2008


 see inline

> -----Original Message-----
> From: Smith, Stan [mailto:stan.smith at intel.com] 
> Sent: Tuesday, October 28, 2008 2:11 AM
> To: Tzachi Dar; Leonid Keller; Fab Tillier
> Cc: Ishai Rabinovitz; ofw at lists.openfabrics.org
> Subject: RE: HPC head-node slow down when OpenSM is started 
> on the head-node (RC4,svn.1691).
> 
> Tzachi Dar wrote:
> > Hi,
> >
> > This is a bug that can come from a few causes (see below), and we
> > have been able to reproduce it here, although we are still seeing
> > different issues. If you can find out how to reproduce it without
> > another SM, that would be great.
> 
> The easy way to reproduce the HPC head-node failure is to do 
> an install of WinOF RC4 and select the 'OpenSM Started' 
> feature. As soon as the .msi installation is finished, you 
> are in the slow-down mode.
> 
> 
> >
> > As to your problem: generally speaking, the code is stuck at
> > ipoib_port_up. There is an assumption that ipoib_port_up is called
> > after ipoib_port_down. It is 99% likely that you have found a flow
> > in which this is not the case. On port down we close the QPs so that
> > all packets that have been placed for receive are freed.
> 
> Is there a possible race condition where Receive buffers can 
> be posted before the port_up() PNP call fires?
> 
> Actually, the code almost seems incorrect w.r.t. recv_mgr.depth, in
> that the original code waits for recv_mgr.depth == 0, whereas the
> initial startup value is p_port->p_adapter->params.rq_depth?
> 
> >>         while( p_port->recv_mgr.depth || p_port->send_mgr.depth )
> >>                 cl_thread_suspend( 0 );
> 

(Leonid) As far as I see, params.rq_depth is the maximum value for
recv_mgr.depth. The receive queue gets refilled in __recv_mgr_repost()
up to this depth.

> becomes
>         // wait for sends to finish and all recv buffers reposted.
>         while( (p_port->recv_mgr.depth != p_port->p_adapter->params.rq_depth)
>                         || p_port->send_mgr.depth )
>                 cl_thread_suspend( 0 );
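For readers following the depth discussion, here is a minimal,
self-contained C model of the invariant both wait loops rely on (this
is not the driver source; the names and the exact increment/decrement
placement are assumptions): port-up fills the receive queue toward
rq_depth, and port-down flushes it back to zero.

    /* Toy model of p_port->recv_mgr.depth accounting, as described
     * above: reposting raises depth toward params.rq_depth; closing
     * the QP flushes every posted work request, driving depth back to
     * 0 so the original "while( depth ) cl_thread_suspend(0);" wait
     * can terminate.
     */
    #include <assert.h>

    #define RQ_DEPTH 512                    /* params.rq_depth in the quoted code */

    static int recv_depth;                  /* models recv_mgr.depth */

    static void repost_receives( void )     /* models __recv_mgr_repost() */
    {
            while( recv_depth < RQ_DEPTH )
                    recv_depth++;           /* one buffer posted to the RQ */
    }

    static void flush_receives( void )      /* models closing the QP on port down */
    {
            while( recv_depth > 0 )
                    recv_depth--;           /* one WR completes with a flush error */
    }

    int main( void )
    {
            repost_receives();                      /* port up */
            assert( recv_depth == RQ_DEPTH );       /* the 512 Stan observed */
            flush_receives();                       /* port down */
            assert( recv_depth == 0 );              /* original wait loop can exit */
            return 0;
    }

If port_down never ran (the flow Tzachi suspects), depth stays pinned
at rq_depth and the original loop spins forever, which matches the
observed hang.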
> 
> >
> > Taking one step up, it seems that we are having a problem in the
> > plug and play mechanism (IBAL). In general you are being called from
> > the function __ipoib_pnp_cb. This function is quite complicated, as
> > it takes into account the current state of ipoib, and also the
> > different events.
> >
> > In order for me to have more information, please send me a log with
> > printing at the following places. Please add the following print at
> > __ipoib_pnp_cb (as soon as possible):
> >       IPOIB_PRINT( TRACE_LEVEL_ERROR, IPOIB_DBG_PNP,
> >               ("p_pnp_rec->pnp_event = 0x%x (%s) object state %s\n",
> >               p_pnp_rec->pnp_event, ib_get_pnp_event_str( p_pnp_rec->pnp_event ),
> >               ib_get_pnp_event_str( Adapter->state )) );
> >
> > On the exit from this function please also print the same line (I
> > want to see the state changes).
> >
> > Please also add prints in ipoib_port_up and ipoib_port_down.
> >
> > And send us the log. I hope that I'll be able to figure out what is
> > going on there.
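For concreteness, a hypothetical consolidated version of the requested
instrumentation (the function signature and placement are assumptions;
the IPOIB_PRINT style is copied from the snippet above, and the same
one-line marker would go at the top of ipoib_port_up and
ipoib_port_down):

    /* Hypothetical entry/exit prints for __ipoib_pnp_cb, per the
     * request above; assumes the ipoib build environment.
     */
    static ib_api_status_t
    __ipoib_pnp_cb(
            IN              ib_pnp_rec_t*           p_pnp_rec )
    {
            ib_api_status_t status;

            IPOIB_PRINT( TRACE_LEVEL_ERROR, IPOIB_DBG_PNP,
                    ("ENTER __ipoib_pnp_cb: pnp_event = 0x%x (%s)\n",
                    p_pnp_rec->pnp_event,
                    ib_get_pnp_event_str( p_pnp_rec->pnp_event )) );

            status = IB_SUCCESS;    /* ... existing event dispatch ... */

            IPOIB_PRINT( TRACE_LEVEL_ERROR, IPOIB_DBG_PNP,
                    ("EXIT  __ipoib_pnp_cb: pnp_event = 0x%x (%s)\n",
                    p_pnp_rec->pnp_event,
                    ib_get_pnp_event_str( p_pnp_rec->pnp_event )) );
            return status;
    }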
> 
> Will do so.
> 
> 
> >
> > And again a simple repro will help (probably even more).
> >
> > Thanks
> > Tzachi
> >
> >
> >
> >> -----Original Message-----
> >> From: Smith, Stan [mailto:stan.smith at intel.com]
> >> Sent: Monday, October 27, 2008 10:14 PM
> >> To: Smith, Stan; Tzachi Dar; Leonid Keller; Fab Tillier
> >> Cc: Ishai Rabinovitz; ofw at lists.openfabrics.org
> >> Subject: RE: HPC head-node slow down when OpenSM is started on the 
> >> head-node (RC4,svn.1691).
> >>
> >> Hello,
> >>  After further debug operations with source mods, I see the
> >> offending call to ipoib_port_up() with a valid p_port pointer.
> >> Where I see the failure is in ipoib_port.c @ ~line 1584:
> >>
> >>         /* Wait for all work requests to get flushed. */
> >>         while( p_port->recv_mgr.depth || p_port->send_mgr.depth )
> >>                 cl_thread_suspend( 0 );
> >>
> >> recv_mgr.depth == 512, send_mgr.depth == 0
> >>
> >> Thoughts on why recv_mgr.depth would be so high?
> >> What is preventing the recv mgr from processing work requests?
> >>
> >>
> >> Bus_pnp.c:
> >>   ExAcquireFastMutexUnsafe() --> ExAcquireFastMutex() + ExRelease....
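For reference, a minimal sketch of what that swap means (standard NT
kernel fast-mutex semantics; the function and mutex below are made up
for illustration, not taken from Bus_pnp.c):

    /* ExAcquireFastMutex raises IRQL to APC_LEVEL itself, blocking APC
     * delivery for the duration; the Unsafe variants assume the caller
     * has already done so. Swapping Unsafe for the safe form, as
     * described above, removes that assumption.
     */
    #include <ntddk.h>

    static FAST_MUTEX g_example_mutex;  /* hypothetical; initialized once
                                           with ExInitializeFastMutex() */

    VOID example_locked_work( VOID )
    {
            ExAcquireFastMutex( &g_example_mutex );   /* was ...Unsafe() */
            /* ... critical section ... */
            ExReleaseFastMutex( &g_example_mutex );   /* was ...Unsafe() */
    }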
> >>
> >> Ipoib_port.c add_local locking added
> >>
> >> --- ipoib_port.c        2008-10-27 10:18:44.882358400 -0700
> >> +++ ipoib_port.c.new    2008-10-27 13:06:11.021042300 -0700
> >> @@ -5303,9 +5303,7 @@
> >>  }
> >>
> >>         /* __endpt_mgr_insert expects *one* reference to be held. */
> >> -       cl_atomic_inc( &p_port->endpt_rdr );
> >> -       status = __endpt_mgr_insert( p_port, p_port->p_adapter->params.conf_mac, p_endpt );
> >> -       cl_atomic_dec( &p_port->endpt_rdr );
> >> +       status = __endpt_mgr_insert_locked( p_port, p_port->p_adapter->params.conf_mac, p_endpt );
> >>         if( status != IB_SUCCESS )
> >>         {
> >>                 IPOIB_PRINT_EXIT( TRACE_LEVEL_ERROR, IPOIB_DBG_ERROR,
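__endpt_mgr_insert_locked appears to be Stan's local addition; one
plausible shape for such a wrapper, inferred purely from the removed
lines of the diff (the signature and semantics are assumptions, not
the actual change):

    /* Hypothetical wrapper: encapsulates the endpt_rdr reference pair
     * so callers of __endpt_mgr_insert cannot forget it. Inferred from
     * the diff above, not from Stan's source.
     */
    static ib_api_status_t
    __endpt_mgr_insert_locked(
            IN              ipoib_port_t* const             p_port,
            IN              const mac_addr_t                mac,
            IN              ipoib_endpt_t* const            p_endpt )
    {
            ib_api_status_t status;

            cl_atomic_inc( &p_port->endpt_rdr );    /* reader reference held */
            status = __endpt_mgr_insert( p_port, mac, p_endpt );
            cl_atomic_dec( &p_port->endpt_rdr );    /* reference dropped */
            return status;
    }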
> >>
> >>
> >> Stan.
> >>
> >> Smith, Stan wrote:
> >>> Hello All,
> >>>   Below are the results of snooping around with the debugger
> >>> connected to the head-node, which is operating in the
> >>> OpenSM-induced slow-down mode.
> >>>
> >>> An interesting item is the call to ipoib_port_up() with p_port ==
> >>> 0, which looks to be a problem; clobbered stack?
> >>> The captured windbg story is attached.
> >>>
> >>> Possible results of not holding the port lock from __bcast_cb()?
> >>>
> >>> Please advise on further debug.
> >>>
> >>> Stan.
> >>>
> >>> nt!DbgBreakPointWithStatus
> >>> nt!wctomb+0x4cbf
> >>> nt!KeUpdateSystemTime+0x21f (TrapFrame @ fffffa60`022e9840)
> >>> nt!KeReleaseInStackQueuedSpinLock+0x2d
> >>> nt!KeDelayExecutionThread+0x72c
> >>> ipoib!ipoib_port_up(struct _ipoib_port * p_port = 0x00000000`00000000,
> >>>   struct _ib_pnp_port_rec * p_pnp_rec = 0xfffffa60`024be780)+0x79
> >>>   [f:\openib-windows-svn\wof2-0\rc4\trunk\ulp\ipoib\kernel\ipoib_port.c @ 5186]
> >>> ipoib!__ipoib_pnp_cb(struct _ib_pnp_rec * p_pnp_rec = 0xfffffa60`024be780)+0x20d
> >>>   [f:\openib-windows-svn\wof2-0\rc4\trunk\ulp\ipoib\kernel\ipoib_adapter.c @ 678]
> >>> ibbus!__pnp_notify_user(struct _al_pnp * p_reg = 0xfffffa80`05262d90,
> >>>   struct _al_pnp_context * p_context = 0xfffffa60`024be110,
> >>>   struct _al_pnp_ca_event * p_event_rec = 0xfffffa80`08b65108)+0x17b
> >>>   [f:\openib-windows-svn\wof2-0\rc4\trunk\core\al\kernel\al_pnp.c @ 557]
> >>> ibbus!__pnp_process_port_forward(struct _al_pnp_ca_event * p_event_rec = 0x00000000`00000000)+0xa6
> >>>   [f:\openib-windows-svn\wof2-0\rc4\trunk\core\al\kernel\al_pnp.c @ 1279]
> >>> ibbus!__pnp_check_ports(struct _al_ci_ca * p_ci_ca = 0xfffffa80`04bcc8c0,
> >>>   struct _ib_ca_attr * p_old_ca_attr = 0x00000000`00000001)+0x14d
> >>>   [f:\openib-windows-svn\wof2-0\rc4\trunk\core\al\kernel\al_pnp.c @ 1371]
> >>> ibbus!__pnp_check_events(struct _cl_async_proc_item * p_item = 0xfffffa80`04bc3e98)+0x171
> >>>   [f:\openib-windows-svn\wof2-0\rc4\trunk\core\al\kernel\al_pnp.c @ 1566]
> >>> ibbus!__cl_async_proc_worker(void * context = 0xfffffa80`04bc3d60)+0x61
> >>>   [f:\openib-windows-svn\wof2-0\rc4\trunk\core\complib\cl_async_proc.c @ 153]
> >>> ibbus!__cl_thread_pool_routine(void * context = 0xfffffa80`04bc3860)+0x41
> >>>   [f:\openib-windows-svn\wof2-0\rc4\trunk\core\complib\cl_threadpool.c @ 66]
> >>> ibbus!__thread_callback(struct _cl_thread * p_thread = 0x00380031`00430032)+0x28
> >>>   [f:\openib-windows-svn\wof2-0\rc4\trunk\core\complib\kernel\cl_thread.c @ 49]
> >>> nt!ProbeForRead+0xbd3
> >>> nt!_misaligned_access+0x4f6
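Given the NULL p_port in the top ipoib frame above, a purely
illustrative guard (hypothetical, and not a fix; the real defect is
whichever upstream path produced a port event with no port context)
might look like this:

    /* Hypothetical sanity check at the top of ipoib_port_up(); it only
     * turns the silent hang into a loud, debuggable failure. Signature
     * is taken from the stack trace above.
     */
    void
    ipoib_port_up(
            IN              ipoib_port_t* const             p_port,
            IN              const ib_pnp_port_rec_t* const  p_pnp_rec )
    {
            if( !p_port )
            {
                    IPOIB_PRINT( TRACE_LEVEL_ERROR, IPOIB_DBG_ERROR,
                            ("ipoib_port_up: NULL p_port, pnp_event = 0x%x\n",
                            p_pnp_rec->pnp_event) );
                    CL_ASSERT( p_port );
                    return;
            }
            /* ... normal port-up processing ... */
    }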
> >>>
> >>>
> >>>
> >>>
> >>> Smith, Stan wrote:
> >>>> Gentlemen,
> >>>>   The HPC head-node slow down is back with a vengeance... instead
> >>>> of only 24% of available processor cycles we are now up to 31%.
> >>>> Needless to say, the system is unusable. Along the path of
> >>>> attaching a debugger I've learned the slow-down is only triggered
> >>>> by OpenSM, not sustained by it, as I was able to shut down OpenSM
> >>>> and the slow-down persisted.
> >>>>
> >>>> The story...
> >>>>
> >>>> A functional 15 node HPC system without OpenSM on the head-node;
> >>>> the SM is supplied by a SilverStorm switch with embedded SM.
> >>>> On the head-node, run Server Manager, changing OpenSM startup from
> >>>> 'Disabled' to 'Manual'. Disconnect the SilverStorm IB switch, as
> >>>> it's daisy-chained to the Mellanox switch which connects all HPC
> >>>> nodes; at this point no SM is running on the fabric. From the
> >>>> head-node Server Manager, 'Start' OpenSM. Wait 20 seconds or so,
> >>>> pop open the Task Manager Performance view - notice the large %
> >>>> of CPU utilization.
> >>>> Once the system starts running slow... from the head-node Server
> >>>> Manager, 'Stop' OpenSM. CPU utilization is still high.
> >>>> Reconnect the SilverStorm switch + SM.
> >>>> CPU utilization is still high?
> >>>> Going to the head-node debugger, breaking in and showing processes
> >>>> and threads revealed little useful info?
> >>>> Debugger command suggestions?
> >>>> Will try a checked version of ipoib.sys tomorrow.
> >>>>
> >>>> Stan.
> >>>>
> >>>> BTW, I did see a shutdown BSOD with a minidump that showed 
> >>>> ipoib!__cl_asynch_processor( 0 ) being the faulting call.
> >>>> Dereferencing the *context is what caused the BSOD.
> >>>>
> >>>>
> >>>> Smith, Stan wrote:
> >>>>> Hello,
> >>>>>
> >>>>> The good news is OpenSM is working nicely on all WinOS flavors.
> >>>>> The not-so-good news is that OpenSM on the HPC head-node consumes
> >>>>> 25% of the system; Win2k8 works fine running OpenSM.
> >>>>>
> >>>>> On our 15 node HPC cluster, if pre_RC4 OpenSM is started during
> >>>>> a WinOF install, or if OpenSM is started on the head-node after
> >>>>> the WinOF install (OpenSM not started during the install), then
> >>>>> right-clicking the task-bar network icon and selecting Network
> >>>>> and Sharing Center fails to reach the Network and Sharing
> >>>>> manager. The best we see is the NSM GUI window pops open and
> >>>>> remains blank (white). The rest of the system is functional in
> >>>>> that command windows are OK and the start menu is OK, but you are
> >>>>> certain to hang a window if you access the network via a GUI
> >>>>> interface. A command window can set the IPoIB IPv4 address via
> >>>>> netsh set address, and ipconfig works? <Ctrl-Alt-Del> -> resource
> >>>>> manager shows about 25% of the system (4 cores) is running the NT
> >>>>> kernel, followed by network services. I'm guessing massive
> >>>>> amounts of system calls from a driver?
> >>>>>
> >>>>> We first started noticing similar behavior with RC2. Starting
> >>>>> OpenSM during an install always failed (caused system slow-down),
> >>>>> although if you started OpenSM after the install, the head-node
> >>>>> was OK. RC3 behaved likewise. With pre_RC4 (svn.1661) the
> >>>>> head-node now slows down when OpenSM is started after the install
> >>>>> or if OpenSM is started during the WinOF install.
> >>>>>
> >>>>> Again, all other WinOS flavors work fine with OpenSM started
> >>>>> during the install or afterwards. HPC works fine with the
> >>>>> SilverStorm embedded-SM switch. I strongly suspect the HPC
> >>>>> head-node would work fine if OpenSM were run from another
> >>>>> Windows/Linux system.
> >>>>>
> >>>>> Thoughts or suggestions on further diagnosis as to why running
> >>>>> OpenSM causes the HPC head-node such a slow-down? Part of the
> >>>>> story may have something to do with the number of HPC compute
> >>>>> nodes.
> >>>>>
> >>>>> Any chance you could run OpenSM on your HPC head node to see if 
> >>>>> you see similar behavior?
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Stan.
> 
> 


