[ofa-general] OpenSM and trap 128.

Thu Mar 26 07:56:08 PDT 2009

In message <49CB7723.9080104 at ext.bull.net>, Nicolas Morey Chaisemartin writes:
>While setting up a new cluster, we noticed a problem with OpenSM.
>As usual, there were some cable problems while plugging in the cluster, but
>one of the cables was changing state (OFF/ON) over 10 000 times per second
>and sending a trap 128 to OpenSM each time.
>As a result, OpenSM is constantly rerouting the whole interconnect (every
>10s or so).
>Fixing the cable will solve our problem, but I still think something should
>be done about this.

We have seen the same problem here locally. Sending this many traps per
second seems to be a violation of the spec.

>I was thinking about a solution:
>When we receive a trap 128 (and it triggers a heavy sweep), we check the
>faulty GUID, LID or port GUID.
>If the last heavy sweep was triggered by the same faulty port, we wait twice
>the last waiting time before forcing the new heavy sweep.
>If it's another source or another reason, we force the heavy sweep right
>away and set the waiting time to 0.
>
>This way, different problems will still trigger a heavy sweep ASAP, but if
>only one faulty link triggers it, OpenSM will sweep less and less often, as
>those sweeps are pretty useless.
>
>It should solve this case, but there may still be a problem when more ports
>have the same problem...
>
>Any idea on a way to manage this?
>An ignore mask on traps? (Ignore traps for one specific problem for x seconds
>if they happen too often.)
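
For reference, the backoff proposed above (double the wait when the same
port triggers consecutive heavy sweeps, reset to 0 for any other source)
could be sketched roughly as follows. This is only an illustration in
Python, not OpenSM code: the class name, the initial_wait seed (needed
because doubling 0 stays 0) and the max_wait cap are assumptions added
here, and the source key is assumed to be a port GUID string.

```python
class SweepBackoff:
    """Delay heavy sweeps that keep being triggered by the same port.

    Hypothetical sketch; initial_wait and max_wait are assumptions,
    not part of the original proposal.
    """

    def __init__(self, initial_wait=1.0, max_wait=300.0):
        self.initial_wait = initial_wait  # first non-zero wait, in seconds
        self.max_wait = max_wait          # upper bound on the wait
        self.last_source = None           # port that triggered the last sweep
        self.wait = 0.0                   # wait applied to the last sweep

    def delay_for(self, source):
        """Return the seconds to wait before the sweep caused by `source`."""
        if source is not None and source == self.last_source:
            # Same faulty port again: wait twice the previous waiting time.
            doubled = self.wait * 2 if self.wait > 0 else self.initial_wait
            self.wait = min(self.max_wait, doubled)
        else:
            # Another source or reason: sweep right away, reset the wait.
            self.last_source = source
            self.wait = 0.0
        return self.wait


backoff = SweepBackoff(initial_wait=1.0, max_wait=8.0)
same_port = [backoff.delay_for("guid-A") for _ in range(6)]
other_port = backoff.delay_for("guid-B")
back_to_first = backoff.delay_for("guid-A")
```

Because only the last source is remembered, two flapping ports that
alternate reset each other's wait to 0 every time, which is the
multi-port weakness mentioned in the quoted message.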

Our solution was a custom patch (which might have made it into the OpenSM
distribution) called 'babbling_port_policy'. It attempted to disable the
port in question.
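
To illustrate the general idea of catching a babbling port (this is not
the actual patch, whose details are not given here), one could count
traps per port in a sliding time window and stop reacting to a port once
it exceeds a threshold. The sketch below is in Python with hypothetical
names and thresholds (max_traps, window, mute_for); it only ignores the
port for a while, and where it returns False a real subnet manager might
instead try to disable the port.

```python
from collections import defaultdict, deque


class BabblingPortDetector:
    """Ignore traps from a port that sends too many in a short window.

    Hypothetical sketch; all names and defaults are assumptions.
    """

    def __init__(self, max_traps=10, window=1.0, mute_for=60.0):
        self.max_traps = max_traps  # traps tolerated per window
        self.window = window        # window length in seconds
        self.mute_for = mute_for    # how long to ignore a babbling port
        self._recent = defaultdict(deque)  # port -> recent trap timestamps
        self._muted_until = {}             # port -> time the mute expires

    def should_process(self, port, now):
        """Return True if a trap from `port` at time `now` should be handled."""
        if now < self._muted_until.get(port, 0.0):
            return False  # port is currently being ignored
        times = self._recent[port]
        times.append(now)
        # Drop timestamps that have fallen out of the window.
        while times and now - times[0] > self.window:
            times.popleft()
        if len(times) > self.max_traps:
            # Too many traps: mute the port (or disable it, in a real SM).
            self._muted_until[port] = now + self.mute_for
            times.clear()
            return False
        return True


detector = BabblingPortDetector(max_traps=3, window=1.0, mute_for=10.0)
burst = [detector.should_process("guid-A", t) for t in (0.0, 0.1, 0.2, 0.3, 0.4)]
healthy = detector.should_process("guid-B", 0.4)
after_mute = detector.should_process("guid-A", 11.0)
```

Keeping the state per port means several babbling ports are handled
independently, while traps from healthy ports still get through.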