[openib-general] Unreliable OpenSM failover

Sat Dec 9 04:12:39 PST 2006

On Fri, 2006-12-08 at 21:25, Venkatesh Babu wrote:
> Hal Rosenstock wrote:
> 
> >Was this the same scenario or something different ?
> >  
> >
> I had killed the previous OpenSM instance. So I lost that information.
> It is the same OpenSM failover issue and using the exact same setup and 
> scripts to reproduce. It is another instance of the problem.
> 
> >So your OUI is 0x005045 ? That appears to be registered to Rioworks. Is
> >that right ?
> >
> >  
> >
>   Yes, that is right. They are the OUI vendors for the IB HCAs.
> 
> >Does this correspond to when node 2 SM goes down, SM comes up, or
> >something else ? 
> >  
> >
>   I don't know the exact sequence when this message is displayed. All I 
> can say is that it was the last message printed by OpenSM. I am not 
> rebooting node 1 or killing its OpenSM. It stays up the whole time.
>   I have a script that reboots node 2 every couple of minutes. It 
> stops rebooting if it finds one of these conditions:
> 1.  SM1 on port1 is master but SM2 on port2 is not master
> 2. SM2 on port2 is master but SM1 on port1 is not master
> 3. Port1/2 is not ACTIVE
> 4. Port1/2's sm_lid/port lid is zero

Understood.
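
For reference, the four stop conditions quoted above can be sketched as a
single check. This is only an illustration of the logic as described, not
the actual test script (which was not posted). The PortStatus fields and
the stop_reason() helper are hypothetical names, and how the values are
collected from each port is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PortStatus:
    """Hypothetical per-port snapshot; field names are assumptions."""
    state: str          # e.g. "ACTIVE" or "INIT"
    sm_lid: int         # LID of the SM this port points at
    port_lid: int       # LID assigned to this port
    sm_is_master: bool  # whether the local SM on this port is master


def stop_reason(port1: PortStatus, port2: PortStatus) -> Optional[str]:
    """Return why the reboot loop should stop, or None to keep going."""
    # Conditions 1 and 2: one SM is master while the other is not.
    if port1.sm_is_master and not port2.sm_is_master:
        return "SM1 on port1 is master but SM2 on port2 is not master"
    if port2.sm_is_master and not port1.sm_is_master:
        return "SM2 on port2 is master but SM1 on port1 is not master"
    for name, port in (("port1", port1), ("port2", port2)):
        # Condition 3: a port is not ACTIVE.
        if port.state != "ACTIVE":
            return f"{name} is not ACTIVE"
        # Condition 4: sm_lid or port lid is zero.
        if port.sm_lid == 0 or port.port_lid == 0:
            return f"{name} has a zero sm_lid or port lid"
    return None
```

With illustrative values, a port left in INIT with both SMs reporting
master would trip condition 3, while two ACTIVE ports with non-zero LIDs
and matching master status would let the loop continue.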

>   I am capturing all of this output at the end of the test, when the 
> script was terminated.
> 
> >Not sure why OpenSM decides to exit (due to this error which should be
> >recoverable). It then fails to exit (hangs) as the other threads are not
> >terminated. 
> >
> >Is osm_exit_flag set ? I presume it is but would like verification.
> >What are the thread_state values of the various threads ?
> >  
> >
>   Unfortunately someone powered off Node1 while I was debugging, so I 
> cannot find this out.
> 
>   On Node2 :
> (gdb) p osm_exit_flag
> $1 = 0

I was interested in the one on Node1 when it appeared to be trying to
exit (which it shouldn't be but is) and the other threads don't seem to
terminate.

>   How do I find out the thread_state value ?

It's a variable in the SM structure (in the SM thread).

> >>Node 2:
> >>======
> >>    
> >>
> >
> >Is this when node 2 comes back up and SM is restarted on both ports or
> >is it after the SM is stopped on port 2 ?
> >
> >  
> >
>    As I said earlier, this is the snapshot from when the script stopped 
> rebooting, as I described above.
> 
> >>                port:   2
> >>                        state:                  PORT_INIT (2)
> >>                        max_mtu:                2048 (4)
> >>                        active_mtu:             2048 (4)
> >>                        sm_lid:                 4
> >>    
> >>
> >
> >This port still points at the SM on node 1, right ?
> >  
> >
>    Yes that is right.
> 
> >  
> >
> >>                        port_lid:               2
> >>                        port_lmc:               0x00
> >>
> >>
> >>0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID: 
> >>0xffffffffffff0000 : 0x0000000000000000 from port 0x0050450148ba0002
> >>Dec 07 11:29:03 817752 [0000] -> Exiting SM
> >>    
> >>
> >
> >You stopped this SM, right ?
> >  
> >
>   No I didn't stop the SM.
> 
> >>[root@localhost ~]#
> >>[root@localhost ~]# gdb /var/log/opensm2.log 6867
> >>    
> >>
> >
> >Why gdb this node's SM ? I'm not following you.
> >
> >Should point at executable not log.
> >  
> >
>   You are right. It is a cut and paste error.

One more thing:

When you upgraded to OFED 1.2, did you build and install the management
libraries (libibcommon, libibumad are important here and libibmad for
diags) ?

-- Hal

> 
>    VBabu