[PATCH] opensm/opensm/osm_lid_mgr.c: set "send_set" when setting rereg bit (Was: Re: [ofa-general] Nodes dropping out of IPoIB mcast group due to a temporary node soft lockup.)

Ira Weiny weiny2 at llnl.gov
Mon Apr 28 09:19:23 PDT 2008


On Sun, 27 Apr 2008 11:47:54 +0300
Or Gerlitz <ogerlitz at voltaire.com> wrote:

> Ira Weiny wrote:
> >
> > I did not get any output with multicast_debug_level!  
> why would you? From the node's point of view nothing has happened 
> (the exact param name is mcast_debug_level)
> >
> > Here is a patch which fixes the problem.  (At least with the partial sub-nets
> > configuration I explained before.)  I will have to verify this fixes the problem
> > I originally reported.
> OK, good. Does this problem exist in the released OpenSM? If yes, what 
> would be the trigger for the SM to "really discover" (i.e., do a PortInfo 
> Set on) this sub-fabric, and how much time would it take to reach this 
> trigger, worst case?

Yes, this is in the currently released version of OpenSM, AFAICT.  The trigger
is: the single link separating the partial subnet will come up, and that trap
will cause OpenSM to resweep.  I believe this will happen on the next resweep
cycle, which by default occurs every 10 seconds (but this is configurable).  I
don't think there is an issue with allowing OpenSM to resweep as designed.
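
For reference, the fix boils down to something like this (a minimal sketch,
not the literal diff; the surrounding condition is illustrative, not the
exact context of the change).  In the PortInfo setup path of osm_lid_mgr.c,
p_pi is the PortInfo attribute being built and send_set is the local flag
that decides whether a PortInfo Set actually gets issued:

/* Sketch only.  When OpenSM decides the port must reregister (e.g. on
 * the first master sweep), it sets the client reregistration bit in
 * the PortInfo attribute it is building: */
if (p_mgr->p_subn->first_time_master_sweep == TRUE) {
	ib_port_info_set_client_rereg(p_pi, 1);
	/* The fix: force the PortInfo Set to actually go on the wire.
	 * Without this, the rereg bit is written into the local copy of
	 * the attribute but no Set is sent, so the node never sees it. */
	send_set = TRUE;
}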

> 
> The failure configuration you have set up to reproduce the problem is 
> quite atypical, I think.

I agree.  I made a patch that turns off the processing of MADs in the kernel,
to test my original theory that the node was not responding to MADs.  Using
this patch I was able to verify that if a node stops responding, OpenSM does
send the rereg when the node comes back.
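
That kernel patch amounts to something like the following (a rough sketch
only: drop_mads is an invented knob, the hook point is the MAD receive
completion path in drivers/infiniband/core/mad.c, and ib_free_recv_mad() is
the helper the MAD core uses to release a received MAD):

/* Hypothetical sketch of the test patch: unconditionally discard
 * incoming MADs so the node appears unresponsive to the SM. */
static int drop_mads = 1;		/* hypothetical toggle */
module_param(drop_mads, int, 0644);

/* ... at the top of the MAD receive completion handler, before any
 * registered agent gets to see the MAD: */
	if (drop_mads) {
		ib_free_recv_mad(mad_recv_wc);	/* release the received MAD */
		return;
	}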

See my next email response to Sasha concerning the original issue.

Ira

>
> Since common Clos etc. topologies don't have a 1:n blocking nature, 
> failure of such a link would cause a re-route etc. by the SM, which would 
> not (and should not) be noticed by the nodes (I hope I am not falling 
> into another problem here...)
> 
> Or.


