[ofa-general] Re: Lost out-of-svc trap notifications during SM handover

Sasha Khapyorsky sashak at voltaire.com
Sun Nov 11 09:33:35 PST 2007


Hi Lan,

On 19:01 Fri 09 Nov, Lan Tran wrote:
> 
> I'm seeing a problem with missing out-of-svc trap notifications when a Master SM port is disabled. I'm looking into it now, but any pointers or ideas about what might be going on or how to resolve it would be much appreciated!
> 
> I am subscribing to out-of-service trap events (i.e. trap 65) and registering my own callback. When I disable an IB port of a remote node that is running the Standby SM, my trap callback is called, as expected. But when I disable the IB port of the remote node that is running the Master SM, my trap 65 callback is never called. From the opensm logs, this seems to be what happens:
> 1) I disable the port running the Master SM
> 2) SM handover starts
>    --> during the Standby SM's heavy sweep, osm_drop_mgr_process() detects that the old Master SM port is down, but at this point there are no subscribers to inform because they are all subscribed with the old Master SM
>    --> the Standby SM enters the Master SM state, so it is now the new Master SM
> 3) Several seconds later, I subscribe with the new Master SM for trap 65 notification (I do this whenever I receive an IB_EVENT_CLIENT_REREGISTER event), but this is too late, as the report notice for the dropped old Master SM port had already occurred earlier.

Right, that is how things work now. A standby OpenSM doesn't track subnet
changes, so it will not send any notices on its first sweep after becoming
master (an OpenSM making the master->standby transition does send them,
but in your case its port is disconnected).

> It seems I need to somehow make sure that I have already subscribed for trap 65 notification with the to-be new Master SM when it decides to report that the old Master SM port has gone down. Not quite sure if this is possible though :)

This will not help. OpenSM doesn't send in/out of service traps on its
first sweep. I don't see an easy solution here - we would need to replicate
the SM and SA databases somehow.

OTOH even then a trap can be lost due to transmission errors, etc.
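
To make the sequence above concrete, here is a rough toy model in Python.
It is only a sketch of the behavior described in this thread, not OpenSM
code: ToySM, sweep(), simulate_handover() and the port names are all
made up for illustration, and the "no notices on first sweep" rule is
reduced to a single flag. It assumes the new master does detect the
dropped port (as the logs quoted above suggest) but reports nothing for it.

```python
from dataclasses import dataclass, field

TRAP_OUT_OF_SERVICE = 65  # trap number named in the thread


@dataclass
class ToySM:
    """Hypothetical stand-in for an SM/SA pair; not the OpenSM implementation."""
    name: str
    known_ports: set = field(default_factory=set)
    subscribers: list = field(default_factory=list)  # callbacks held by this SA only
    first_sweep_done: bool = False  # assumption: notices are suppressed until True

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def sweep(self, live_ports):
        """Detect dropped ports; notify subscribers except on the first sweep."""
        live_ports = set(live_ports)
        dropped = self.known_ports - live_ports
        self.known_ports = live_ports
        if self.first_sweep_done:
            for port in sorted(dropped):
                for callback in self.subscribers:
                    callback(TRAP_OUT_OF_SERVICE, port)
        self.first_sweep_done = True
        return dropped


def simulate_handover(subscribe_early=False):
    """Replay steps 1-3 from the report; returns (dropped on first sweep, traps seen)."""
    received = []

    def on_trap(trap, port):
        received.append((trap, port))

    ports = {"master-port", "node-b", "node-c"}
    old_master = ToySM("old-master", known_ports=set(ports), first_sweep_done=True)
    old_master.subscribe(on_trap)  # this subscription dies with the old master

    standby = ToySM("standby", known_ports=set(ports))
    if subscribe_early:
        standby.subscribe(on_trap)  # the proposed workaround

    # Steps 1-2: the master port is disabled and the standby takes over.
    dropped = standby.sweep(ports - {"master-port"})

    # Step 3: the client re-registers several seconds later.
    if not subscribe_early:
        standby.subscribe(on_trap)

    # A later drop is reported normally.
    standby.sweep({"node-c"})
    return dropped, received
```

In this model simulate_handover() and simulate_handover(subscribe_early=True)
both detect the drop of "master-port" but deliver only (65, "node-b"), which
is why subscribing with the to-be master ahead of time does not help.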

Sasha


