[openib-general] Unreliable OpenSM failover

Venkatesh Babu venkatesh.babu at 3leafnetworks.com
Fri Dec 8 18:25:20 PST 2006



Hal Rosenstock wrote:

>Was this the same scenario or something different ?
>  
>
I had killed the previous OpenSM instance, so I lost that information.
It is the same OpenSM failover issue, reproduced with exactly the same setup 
and scripts. It is another instance of the problem. 

>So your OUI is 0x005045 ? That appears to be registered to Rioworks. Is
>that right ?
>
>  
>
  Yes, that is right. They are the vendor of the IB HCAs.

>Does this correspond to when node 2 SM goes down, SM comes up, or
>something else ? 
>  
>
  I don't know the exact sequence of events when this message is displayed. 
All I can say is that it was the last message printed by OpenSM. I am not 
rebooting node 1 or killing its OpenSM. Node 1 stays unchanged throughout.
  I have a script that reboots node 2 every couple of minutes. It will 
stop rebooting if it finds one of these conditions:
1. SM1 on port1 is master but SM2 on port2 is not master
2. SM2 on port2 is master but SM1 on port1 is not master
3. Port1 or port2 is not ACTIVE
4. The sm_lid or port_lid of port1 or port2 is zero
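
  As a rough sketch only, the stop conditions above could be written like 
this in Python. This is not the actual script. The PortStatus fields, the 
stop_reasons name and the "PORT_ACTIVE" state string are assumptions made 
for illustration, and how each value is collected (for example from 
ibv_devinfo output) is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PortStatus:
    """Hypothetical snapshot of one HCA port and the SM bound to it."""
    state: str          # assumed to look like "PORT_ACTIVE" or "PORT_INIT"
    sm_lid: int
    port_lid: int
    sm_is_master: bool  # True if the SM running on this port is master


def stop_reasons(port1: PortStatus, port2: PortStatus) -> List[str]:
    """Return the stop conditions that currently hold (empty list = keep rebooting)."""
    reasons = []
    # Conditions 1 and 2: the two SMs disagree about being master.
    if port1.sm_is_master and not port2.sm_is_master:
        reasons.append("SM1 on port1 is master but SM2 on port2 is not")
    if port2.sm_is_master and not port1.sm_is_master:
        reasons.append("SM2 on port2 is master but SM1 on port1 is not")
    for name, port in (("port1", port1), ("port2", port2)):
        # Condition 3: the port is not ACTIVE.
        if port.state != "PORT_ACTIVE":
            reasons.append(f"{name} is not ACTIVE")
        # Condition 4: sm_lid or port_lid is zero.
        if port.sm_lid == 0 or port.port_lid == 0:
            reasons.append(f"{name} has a zero sm_lid or port_lid")
    return reasons
```

  A reboot loop would call stop_reasons() after each reboot of node 2 and 
stop as soon as the returned list is non-empty.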

  I captured all of this output at the end of the test, when the 
script terminated.

>Not sure why OpenSM decides to exit (due to this error which should be
>recoverable). It then fails to exit (hangs) as the other threads are not
>terminated. 
>
>Is osm_exit_flag set ? I presume it is but would like verification.
>What are the thread_state values of the various threads ?
>  
>
  Unfortunately someone powered off Node1 while I was debugging, so I 
cannot find this out.

  On Node2 :
(gdb) p osm_exit_flag
$1 = 0

  How do I find out the thread_state value?

>>Node 2:
>>======
>>    
>>
>
>Is this when node 2 comes back up and SM is restarted on both ports or
>is it after the SM is stopped on port 2 ?
>
>  
>
   As I said earlier, this is the snapshot taken when the script stopped 
rebooting, as I described above.

>>                port:   2
>>                        state:                  PORT_INIT (2)
>>                        max_mtu:                2048 (4)
>>                        active_mtu:             2048 (4)
>>                        sm_lid:                 4
>>    
>>
>
>This port still points at the SM on node 1, right ?
>  
>
   Yes that is right.

>  
>
>>                        port_lid:               2
>>                        port_lmc:               0x00
>>
>>
>>0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID: 
>>0xffffffffffff0000 : 0x0000000000000000 from port 0x0050450148ba0002
>>Dec 07 11:29:03 817752 [0000] -> Exiting SM
>>    
>>
>
>You stopped this SM, right ?
>  
>
  No, I didn't stop the SM.

>>[root@localhost ~]#
>>[root@localhost ~]# gdb /var/log/opensm2.log 6867
>>    
>>
>
>Why gdb this node's SM ? I'm not following you.
>
>Should point at executable not log.
>  
>
  You are right. It was a cut-and-paste error.

   VBabu



