[openib-general] Unreliable OpenSM failover
Hal Rosenstock
halr at voltaire.com
Fri Dec 8 16:48:13 PST 2006
On Fri, 2006-12-08 at 19:30, Venkatesh Babu wrote:
> Hal Rosenstock wrote:
>
> >And the two switches are not connected to each other, right ?
> >
> >
> Yes, the switches are not connected.
>
> >Do you set a different subnet prefix (other than the default on one) ?
> >Not sure if this matters yet in OpenIB but it might.
> >
> >
> I don't know how to set the subnet prefix.
In opensm.opts file:
# Subnet prefix used on this subnet
subnet_prefix 0xfe80000000000000
(that's the default one)
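For example, to give one of the two subnets a non-default prefix, the
line in that SM's opensm.opts might look like the fragment below. The
value shown is only an illustrative assumption; what matters is that it
differs from the prefix used on the other subnet.

```
# Subnet prefix used on this subnet
subnet_prefix 0xfe80000000000001
```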
> So it may be the default one.
>
> >That's the main thread. It's in the following loop:
> >
> > while( !osm_exit_flag ) {
> >     if (opt.console)
> >         osm_console(&osm);
> >     else
> >         cl_thread_suspend( 10000 );
> >
> >     if (osm_hup_flag) {
> >         osm_hup_flag = 0;
> >         /* a HUP signal should only start a new heavy sweep */
> >         osm.subn.force_immediate_heavy_sweep = TRUE;
> >         osm_opensm_sweep( &osm );
> >     }
> > }
> >
> >What about the other threads ? What are they doing ?
> >
> >
> Yes. I got this. It was in this loop. I didn't realize there are
> other OpenSM threads running. I need to find that out.
OK.
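In case the pattern in the quoted loop is unfamiliar: the signal handler
only sets a flag, and the main thread polls that flag and starts the
sweep itself. Below is a minimal standalone sketch of that pattern. It
is not OpenSM code; hup_flag, start_heavy_sweep() and poll_once() are
hypothetical names made up for the illustration.

```c
#include <signal.h>

/* Set from the signal handler, cleared by the main loop. */
static volatile sig_atomic_t hup_flag = 0;

/* Hypothetical stand-in for kicking off a heavy sweep. */
int sweeps_started = 0;
void start_heavy_sweep(void)
{
    sweeps_started++;
}

/* The handler does nothing except record that SIGHUP arrived. */
static void on_hup(int signo)
{
    (void)signo;
    hup_flag = 1;
}

/* Install the SIGHUP handler; returns 0 on success, -1 on failure. */
int install_handlers(void)
{
    return signal(SIGHUP, on_hup) == SIG_ERR ? -1 : 0;
}

/*
 * One pass of the main loop body: if a HUP was seen, clear the flag
 * and start a sweep. Returns 1 if a sweep was started, 0 otherwise.
 */
int poll_once(void)
{
    if (hup_flag) {
        hup_flag = 0;
        start_heavy_sweep();
        return 1;
    }
    return 0;
}
```

A thread sitting in such a loop is only waiting for signals, so seeing
it there in a debugger says little about where the sweep itself is
stuck.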
> >I wouldn't expect that given the problem you're hitting. The SUBNET UP
> >only occurs once the heavy sweep is completed. That's not happening.
> >
> >-- Hal
> >
> >
> Is the heavy sweep supposed to happen after the failover ?
The standby, after determining that the master is non-responsive, will go
back to discovering, but in your configuration it will find no other SM
and will go to master. I think it got that far.
Once it transitions to master, it does a heavy sweep to configure the
subnet. Something is stopping that from completing. I'm not sure what is
going wrong.
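The sequence described above can be sketched as a small state machine.
This is a simplified illustration, not OpenSM's implementation: the
type and function names are hypothetical, and the rule "another SM was
found, so go to standby" stands in for the real master election.

```c
#include <stdbool.h>

/* Hypothetical state names for illustration only. */
typedef enum {
    SM_DISCOVERING,
    SM_STANDBY,
    SM_MASTER
} sm_state_t;

typedef struct {
    sm_state_t state;
    bool heavy_sweep_done; /* true once the subnet has been configured */
    bool subnet_up;        /* reported only after the heavy sweep ends */
} sm_t;

/* The standby stopped hearing from the master: rediscover. */
void sm_master_unresponsive(sm_t *sm)
{
    if (sm->state == SM_STANDBY)
        sm->state = SM_DISCOVERING;
}

/* Discovery finished; with no other SM present, become master. */
void sm_discovery_done(sm_t *sm, bool other_sm_found)
{
    if (sm->state != SM_DISCOVERING)
        return;
    sm->state = other_sm_found ? SM_STANDBY : SM_MASTER;
    sm->heavy_sweep_done = false;
    sm->subnet_up = false;
}

/* The new master finished its heavy sweep: the subnet is up. */
void sm_heavy_sweep_complete(sm_t *sm)
{
    if (sm->state == SM_MASTER) {
        sm->heavy_sweep_done = true;
        sm->subnet_up = true;
    }
}
```

In these terms, your SM appears to reach SM_MASTER, but the step that
corresponds to sm_heavy_sweep_complete() never happens, which is why
SUBNET UP is never reported.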
> Is there any documentation on the OpenSM architecture and design ?
Just the code AFAIK. You can read the SM and SA sections of IBA volume 1
for what an SM is supposed to do.
-- Hal
> VBabu