Re: [ofa-general] OpenSM and trap 128.
    Eli Dorfman (Voltaire) 
    dorfman.eli at gmail.com
       
    Sun Mar 29 05:18:31 PDT 2009
    
    
  
Nicolas Morey Chaisemartin wrote:
> Hi,
> 
> While setting up a new cluster, we noticed a problem with OpenSM.
> As usual, there were some cabling problems while wiring up the cluster, but one of the cables was changing state (OFF/ON) over 10 000 thousands times per second and sending a trap 128 to OpenSM each time.
> As a result, OpenSM was constantly rerouting the whole interconnect (every 10s or so).
> Fixing the cable will solve our problem, but I still think something should be done about this.
> 
> Though OpenSM's behaviour was OK, it was really difficult to find where the performance problems came from.
> All our diagnostic tools (mostly based on infiniband-diags) failed to see the problem.
> The infiniband-diags commands failed toward the faulty port, but it was hard to say whether the port was faulty or whether the failures were due to high load on the SM and dropped VL15 messages.
> 
> I was thinking about a solution:
> When we receive a trap 128 (and it triggers a heavy sweep), we check the faulty GUID, LID or port GUID.
> If the last heavy sweep was triggered by the same faulty port, we wait twice the last waiting time before forcing the new heavy sweep.
> If it's another source or another reason, we force the heavy sweep right away and set the waiting time to 0.
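
The back-off proposed above could be sketched roughly as follows. This is only an illustration in Python, not OpenSM code: the class name, the method name, the base delay (needed because doubling a waiting time of 0 stays 0) and the cap are all assumptions made for the sketch.

```python
class SweepBackoff:
    """Hypothetical helper: how long to wait before a trap-triggered heavy sweep."""

    def __init__(self, base_delay=10.0, max_delay=640.0):
        self.base_delay = base_delay  # assumed first non-zero wait, in seconds
        self.max_delay = max_delay    # assumed upper bound, in seconds
        self.last_source = None       # whatever identifies the trap source
        self.delay = 0.0              # waiting time used for the last sweep

    def next_delay(self, source):
        """Return the wait (seconds) before sweeping for a trap from `source`."""
        if source != self.last_source:
            # Another source or reason: sweep right away, reset the wait.
            self.last_source = source
            self.delay = 0.0
        elif self.delay == 0.0:
            # First repeat from the same source: start backing off.
            self.delay = self.base_delay
        else:
            # Further repeats: wait twice the last waiting time, up to the cap.
            self.delay = min(self.delay * 2, self.max_delay)
        return self.delay
```

A caller would ask `next_delay(source)` for each trap and postpone the heavy sweep by the returned number of seconds. What `source` can actually be depends on what the trap identifies.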
Note that trap 128 is generated by a switch to report that one of its ports has changed state.
The GUID/LID of the changed port is not reported in the trap.
You can set the sweep_on_trap option in opensm.conf to FALSE.
This should stop OpenSM from starting a heavy sweep on every trap.
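
For reference, the setting would look roughly like this in the options file. The file location depends on how OpenSM was installed, so treat it as an assumption, and the comment lines are mine, not text from the shipped file.

```
# opensm.conf (location depends on the installation)
# Do not start a heavy sweep every time a trap is received.
sweep_on_trap FALSE
```

OpenSM would most likely need to be restarted, or otherwise made to re-read its options, before the change takes effect.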
> 
> This way, a different problem will still trigger a heavy sweep asap, but if a single faulty link keeps triggering it, OpenSM will sweep less and less often, since those sweeps are pretty useless.
> 
> It should solve this case but there may still be a problem when more ports have the same problem...
> 
> Any ideas on how to manage this?
> An ignore mask on traps? (Ignore traps for one specific problem for x seconds if they happen too often.)
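
An ignore mask of that kind might look something like the sketch below. It is a Python illustration under stated assumptions, not anything that exists in OpenSM: the class name, the thresholds, and the choice of key are hypothetical. Since trap 128 does not report the changed port, the key would have to be something the trap does identify, such as the issuing switch.

```python
from collections import deque


class TrapIgnoreMask:
    """Hypothetical filter: ignore a trap source for a while if it fires too often."""

    def __init__(self, max_traps=10, window=1.0, ignore_for=30.0):
        self.max_traps = max_traps    # traps tolerated per window (assumed)
        self.window = window          # window length in seconds (assumed)
        self.ignore_for = ignore_for  # the "x seconds" of the proposal (assumed)
        self.recent = {}              # key -> timestamps of recent traps
        self.ignored_until = {}       # key -> time at which the mask expires

    def should_handle(self, key, now):
        """Return True if a trap from `key` at time `now` should be acted on."""
        if now < self.ignored_until.get(key, 0.0):
            return False  # this source is currently masked
        times = self.recent.setdefault(key, deque())
        times.append(now)
        # Forget traps that have fallen out of the window.
        while times and now - times[0] > self.window:
            times.popleft()
        if len(times) > self.max_traps:
            # Too many traps in the window: mask this source for a while.
            self.ignored_until[key] = now + self.ignore_for
            times.clear()
            return False
        return True
```

Because the mask is per key, several flapping sources are handled independently, while a trap from any other source is still acted on immediately.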
> 
> Thanks
> 
> Nicolas
    
    