[openib-general] Unreliable OpenSM failover
Venkatesh Babu
venkatesh.babu at 3leafnetworks.com
Fri Dec 8 18:25:20 PST 2006
Hal Rosenstock wrote:
>Was this the same scenario or something different ?
>
>
I had killed the previous OpenSM instance. So I lost that information.
It is the same OpenSM failover issue, reproduced with the exact same
setup and scripts. It is another instance of the problem.
>So your OUI is 0x005045 ? That appears to be registered to Rioworks. Is
>that right ?
>
>
>
Yes, that is right. Rioworks is the vendor whose OUI the IB HCAs carry.
>Does this correspond to when node 2 SM goes down, SM comes up, or
>something else ?
>
>
I don't know the exact point in the sequence at which this message is
displayed. All I can say is that it was the last message printed by
OpenSM. I am not rebooting node 1 or killing its OpenSM. Node 1 stays up
throughout the test.
I have a script that reboots node 2 every couple of minutes. It stops
rebooting when it finds one of these conditions:
1. SM1 on port1 is master but SM2 on port2 is not master
2. SM2 on port2 is master but SM1 on port1 is not master
3. Port1/2 is not ACTIVE
4. Port1/2's sm_lid/port lid is zero
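The four stop conditions above can be sketched roughly as follows. This is only an illustration, not the actual test script: the `PortStatus` type, the `stop_reason` function, and the `sm_is_master` field are hypothetical names, and the sketch assumes an ACTIVE port reports the state string `PORT_ACTIVE`, in the same style as the `PORT_INIT` shown in the port dump. How the master state of each SM is obtained is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PortStatus:
    """Hypothetical snapshot of one port on node 2."""
    state: str          # assumed format, e.g. "PORT_ACTIVE" or "PORT_INIT"
    sm_lid: int         # sm_lid as reported for the port
    port_lid: int       # port_lid as reported for the port
    sm_is_master: bool  # assumption: True if the SM on this port is master


def stop_reason(port1: PortStatus, port2: PortStatus) -> Optional[str]:
    """Return the first stop condition that holds, or None to keep rebooting."""
    # Conditions 1 and 2: the two SMs disagree about being master.
    if port1.sm_is_master and not port2.sm_is_master:
        return "1: SM1 on port1 is master but SM2 on port2 is not"
    if port2.sm_is_master and not port1.sm_is_master:
        return "2: SM2 on port2 is master but SM1 on port1 is not"
    # Condition 3: either port is not ACTIVE.
    for number, port in ((1, port1), (2, port2)):
        if port.state != "PORT_ACTIVE":
            return f"3: port{number} is not ACTIVE ({port.state})"
    # Condition 4: either port has a zero sm_lid or port lid.
    for number, port in ((1, port1), (2, port2)):
        if port.sm_lid == 0 or port.port_lid == 0:
            return f"4: port{number} has a zero sm_lid or port lid"
    return None
```

A reboot loop would call `stop_reason` after each reboot and stop as soon as it returns something other than `None`.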
I captured all of this output at the end of the test, when the
script terminated.
>Not sure why OpenSM decides to exit (due to this error which should be
>recoverable). It then fails to exit (hangs) as the other threads are not
>terminated.
>
>Is osm_exit_flag set ? I presume it is but would like verification.
>What are the thread_state values of the various threads ?
>
>
Unfortunately someone powered off Node1 while I was debugging, so I
cannot find this out.
On Node2 :
(gdb) p osm_exit_flag
$1 = 0
How do I find out the thread_state value ?
>>Node 2:
>>======
>>
>>
>
>Is this when node 2 comes back up and SM is restarted on both ports or
>is it after the SM is stopped on port 2 ?
>
>
>
As I said earlier, this is the snapshot taken when the script stopped
rebooting, as I described above.
>> port: 2
>> state: PORT_INIT (2)
>> max_mtu: 2048 (4)
>> active_mtu: 2048 (4)
>> sm_lid: 4
>>
>>
>
>This port still points at the SM on node 1, right ?
>
>
Yes that is right.
>
>
>> port_lid: 2
>> port_lmc: 0x00
>>
>>
>>0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID:
>>0xffffffffffff0000 : 0x0000000000000000 from port 0x0050450148ba0002
>>Dec 07 11:29:03 817752 [0000] -> Exiting SM
>>
>>
>
>You stopped this SM, right ?
>
>
No I didn't stop the SM.
>>[root at localhost ~]#
>>[root at localhost ~]# gdb /var/log/opensm2.log 6867
>>
>>
>
>Why gdb this node's SM ? I'm not following you.
>
>Should point at executable not log.
>
>
You are right. It is a cut and paste error.
VBabu