[ofa-general] OpenSM and trap 128.
Chas Williams (CONTRACTOR)
chas at cmf.nrl.navy.mil
Thu Mar 26 07:56:08 PDT 2009
In message <49CB7723.9080104 at ext.bull.net>,Nicolas Morey Chaisemartin writes:
>We've noticed while setting up a new cluster a problem with OpenSM.
>As usual, there are some cable problems while plugging the cluster but one of
>the cables was changing state over 10 000 times per second (OFF/ON) and
>sending a trap 128 to OpenSM each time.
>Therefore, OpenSM is constantly rerouting the whole interconnect (every 10s or
> so).
>Fixing the cable will solve our problem, but I still think something should be
> done about this.
we have seen the same problem here locally. it seems to be a violation of
the spec to send this many traps per second.
>I was thinking about a solution:
>When receiving a 128 trap (and it triggers a heavy sweep) we check the faulty
>GUID, lid or port guid.
>If the last heavy sweep was triggered by the same faulty port, we wait twice
>the last waiting time before forcing the new heavy sweep.
>If it's another source or another reason, we force the heavy sweep right then
>and set the waiting time to 0.
>
>This way, different problems will still trigger a heavy sweep asap but if only
>one faulty link triggers it, it'll sweep less and less often as it is pretty
>useless.
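
The doubling scheme proposed above could be sketched roughly as follows. This
is only an illustration, not OpenSM code: the struct, the function name and
both constants are hypothetical. Since doubling a waiting time of 0 stays at
0, the sketch assumes a first non-zero delay of 1 second, and it also assumes
a cap so the delay cannot grow without bound.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state kept by the SM; none of these names come from OpenSM. */
struct sweep_backoff {
    uint64_t last_port_guid; /* port that triggered the previous heavy sweep */
    unsigned wait_secs;      /* delay applied before that sweep */
    int      have_last;      /* 0 until the first trap has been seen */
};

#define BACKOFF_FIRST_SECS 1u   /* assumed first non-zero delay */
#define BACKOFF_MAX_SECS   600u /* assumed upper bound on the delay */

/*
 * Returns the number of seconds to wait before forcing the heavy sweep
 * requested by a trap 128 from port_guid, and updates the state.
 */
static unsigned sweep_backoff_next(struct sweep_backoff *b, uint64_t port_guid)
{
    assert(b != NULL);

    if (b->have_last && b->last_port_guid == port_guid) {
        /* Same faulty port as last time: double the previous delay. */
        if (b->wait_secs == 0)
            b->wait_secs = BACKOFF_FIRST_SECS;
        else if (b->wait_secs >= BACKOFF_MAX_SECS / 2)
            b->wait_secs = BACKOFF_MAX_SECS;
        else
            b->wait_secs *= 2;
    } else {
        /* Different source: sweep right away and reset the delay. */
        b->last_port_guid = port_guid;
        b->wait_secs = 0;
        b->have_last = 1;
    }
    return b->wait_secs;
}
```

With a zero-initialized state, repeated traps from one port would be delayed
by 0, 1, 2, 4, ... seconds, and a trap from any other port resets the delay
to 0. As noted in the proposal, two ports flapping alternately would defeat
this, because each trap then looks like a new source.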
>
>It should solve this case but there may still be a problem when more ports
>have the same problem...
>
>Any idea on a way to manage this?
>An ignore mask on traps? (ignore traps for 1 specific problem for x seconds if
> they happen too often)
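
The ignore-mask idea could look something like the sketch below, with one
filter instance kept per source port (and per trap type if desired). Again
this is hypothetical and not taken from OpenSM: the names, the window length,
the threshold and the ignore period ("x seconds") are all assumed values, and
the caller is assumed to pass a non-decreasing timestamp in seconds.

```c
#include <stdint.h>

/* Hypothetical per-source filter; names and limits are illustrative only. */
struct trap_filter {
    uint64_t window_start; /* start of the current counting window, seconds */
    unsigned count;        /* traps seen in the current window */
    uint64_t ignore_until; /* traps are dropped while now < ignore_until */
};

#define TRAP_WINDOW_SECS    10u /* assumed counting window */
#define TRAP_MAX_PER_WINDOW 5u  /* assumed threshold for "too often" */
#define TRAP_IGNORE_SECS    60u /* assumed ignore period ("x seconds") */

/* Returns 1 if the trap should be acted on, 0 if it should be ignored. */
static int trap_filter_accept(struct trap_filter *f, uint64_t now)
{
    /* Still inside an ignore period: drop the trap. */
    if (now < f->ignore_until)
        return 0;

    /* Start a new counting window once the old one has expired. */
    if (now - f->window_start >= TRAP_WINDOW_SECS) {
        f->window_start = now;
        f->count = 0;
    }

    /* Too many traps in this window: begin ignoring this source. */
    if (++f->count > TRAP_MAX_PER_WINDOW) {
        f->ignore_until = now + TRAP_IGNORE_SECS;
        f->count = 0;
        return 0;
    }
    return 1;
}
```

Because the state is per source, several babbling ports are each throttled
independently, which the single "last faulty port" scheme does not handle.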
our solution was a custom patch (that might have made it into the opensm
distribution) called 'babbling_port_policy'. it attempted to disable the
port in question.