[openib-general] Unreliable OpenSM failover
Hal Rosenstock
halr at voltaire.com
Sat Dec 9 04:12:39 PST 2006
On Fri, 2006-12-08 at 21:25, Venkatesh Babu wrote:
> Hal Rosenstock wrote:
>
> >Was this the same scenario or something different ?
> >
> >
> I had killed the previous OpenSM instance. So I lost that information.
> It is the same OpenSM failover issue, reproduced with the exact same setup
> and scripts. It is another instance of the problem.
>
> >So your OUI is 0x005045 ? That appears to be registered to Rioworks. Is
> >that right ?
> >
> >
> >
> Yes, that is right. That OUI belongs to the vendor of the IB HCAs.
>
> >Does this correspond to when node 2 SM goes down, SM comes up, or
> >something else ?
> >
> >
> I don't know the exact sequence when this message is displayed. All I
> can say is that it was the last message printed by the OpenSM. I am not
> rebooting node 1 or killing its OpenSM; that side stays constant.
> I have a script to reboot node 2 every couple of minutes. It will
> stop rebooting if it finds one of these conditions:
> 1. SM1 on port1 is master but SM2 on port2 is not master
> 2. SM2 on port2 is master but SM1 on port1 is not master
> 3. Port1/2 is not ACTIVE
> 4. Port1/2's sm_lid/port lid is zero
Understood.
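
For reference, the four stop conditions listed above can be written as a
single check. This is only a minimal sketch, not the actual test script:
the PortStatus type, its field names, and the "PORT_ACTIVE" state string
are assumptions made for illustration, and gathering the values from the
nodes is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PortStatus:
    """Hypothetical snapshot of one port; field names are illustrative."""
    sm_is_master: bool  # whether the SM bound to this port reports MASTER
    state: str          # port state string, assumed "PORT_ACTIVE" when up
    sm_lid: int
    port_lid: int


def stop_reason(port1: PortStatus, port2: PortStatus) -> Optional[str]:
    """Return why the reboot loop should stop, or None to keep rebooting."""
    # Conditions 1 and 2: one SM is master while the other is not.
    if port1.sm_is_master and not port2.sm_is_master:
        return "SM1 on port1 is master but SM2 on port2 is not"
    if port2.sm_is_master and not port1.sm_is_master:
        return "SM2 on port2 is master but SM1 on port1 is not"
    for index, port in enumerate((port1, port2), start=1):
        # Condition 3: the port is not ACTIVE.
        if port.state != "PORT_ACTIVE":
            return f"port{index} is not ACTIVE ({port.state})"
        # Condition 4: sm_lid or port lid is zero.
        if port.sm_lid == 0 or port.port_lid == 0:
            return f"port{index} has a zero sm_lid or port lid"
    return None
```

With the port 2 values quoted further down (state PORT_INIT, sm_lid 4,
port_lid 2), such a check would stop on the third condition.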
> I captured all of this output at the end of the test, when the
> script terminated.
>
> >Not sure why OpenSM decides to exit (due to this error which should be
> >recoverable). It then fails to exit (hangs) as the other threads are not
> >terminated.
> >
> >Is osm_exit_flag set ? I presume it is but would like verification.
> >What are the thread_state values of the various threads ?
> >
> >
> Unfortunately someone powered off Node1 while I was debugging, so I
> cannot find this out.
>
> On Node2 :
> (gdb) p osm_exit_flag
> $1 = 0
I was interested in the one on Node1 when it appeared to be trying to
exit (which it shouldn't be but is) and the other threads don't seem to
terminate.
> How do I find out the thread_state value ?
It's a variable in the SM structure (in the SM thread).
> >>Node 2:
> >>======
> >>
> >>
> >
> >Is this when node 2 comes back up and SM is restarted on both ports or
> >is it after the SM is stopped on port 2 ?
> >
> >
> >
> As I said earlier, this is the snapshot taken when the script stopped
> rebooting, as I described above.
>
> >> port: 2
> >> state: PORT_INIT (2)
> >> max_mtu: 2048 (4)
> >> active_mtu: 2048 (4)
> >> sm_lid: 4
> >>
> >>
> >
> >This port still points at the SM on node 1, right ?
> >
> >
> Yes that is right.
>
> >
> >
> >> port_lid: 2
> >> port_lmc: 0x00
> >>
> >>
> >>0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID:
> >>0xffffffffffff0000 : 0x0000000000000000 from port 0x0050450148ba0002
> >>Dec 07 11:29:03 817752 [0000] -> Exiting SM
> >>
> >>
> >
> >You stopped this SM, right ?
> >
> >
> No I didn't stop the SM.
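
The truncated log line quoted above carries two component masks. Assuming
the first value (0x...10083) is the mask received in the request and the
second is the one OpenSM expected, the missing bits can be worked out as
in the sketch below. The field names attached to the bit positions are an
assumption based on the MCMemberRecord component mask ordering and should
be checked against the IBA specification before relying on them.

```python
# Values copied from the quoted log line; which one is "received" is an
# assumption, since the start of that log line is cut off.
RECEIVED = 0x0000000000010083
EXPECTED = 0x00000000000130C7

# Assumed MCMemberRecord field names for the bits of interest.
ASSUMED_FIELD_NAMES = {2: "Q_Key", 6: "TClass", 12: "SL", 13: "FlowLabel"}


def missing_bits(received: int, expected: int) -> list:
    """Return the bit positions set in expected but clear in received."""
    missing = expected & ~received
    return [bit for bit in range(64) if (missing >> bit) & 1]


for bit in missing_bits(RECEIVED, EXPECTED):
    print(bit, ASSUMED_FIELD_NAMES.get(bit, "unknown"))
```

Under those assumptions the request is missing bits 2, 6, 12 and 13 of
the expected mask.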
>
> >>[root@localhost ~]#
> >>[root@localhost ~]# gdb /var/log/opensm2.log 6867
> >>
> >>
> >
> >Why gdb this node's SM ? I'm not following you.
> >
> >Should point at executable not log.
> >
> >
> You are right. It is a cut and paste error.
One more thing:
When you upgraded to OFED 1.2, did you build and install the management
libraries (libibcommon, libibumad are important here and libibmad for
diags) ?
-- Hal
>
> VBabu