[ofa-general] OpenSM and trap 128.

Thu Mar 26 05:37:55 PDT 2009

Hi,

While setting up a new cluster, we noticed a problem with OpenSM.
As usual, there were some cable problems while cabling the cluster, but one of the cables was changing state (OFF/ON) more than 10,000 times per second and sending a trap 128 to OpenSM each time.
As a result, OpenSM was constantly rerouting the whole interconnect (every 10 s or so).
Fixing the cable will solve our problem, but I still think something should be done about this.

Although OpenSM's behaviour was correct, it was really difficult to find where the performance problems came from.
All our diagnostic tools (mostly based on infiniband-diags) failed to show the problem.
The infiniband-diags commands failed against the faulty port, but it was hard to tell whether the port itself was faulty or whether the failures were due to high load on the SM and dropped VL15 messages.

I was thinking about a solution:
When a trap 128 is received and is about to trigger a heavy sweep, we check where it came from (GUID, LID or port GUID).
If the last heavy sweep was triggered by the same faulty port, we wait twice the previous waiting time before forcing the new heavy sweep.
If it comes from another source or has another cause, we force the heavy sweep right away and reset the waiting time to 0.

This way, a different problem will still trigger a heavy sweep as soon as possible, but if a single faulty link keeps triggering it, OpenSM will sweep less and less often, since those sweeps are pretty useless.
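
To make the idea concrete, here is a rough sketch of the waiting-time logic described above. It is written in Python only for illustration and is not OpenSM code: the class name, `base_delay` and `max_delay` are assumptions of mine. Since doubling a waiting time of 0 stays 0, the sketch assumes a small non-zero first wait, and it also assumes a cap so the wait cannot grow forever.

```python
class SweepBackoff:
    """Decide how long to delay a heavy sweep requested by a trap 128.

    Hypothetical helper, not part of OpenSM.
    """

    def __init__(self, base_delay=1.0, max_delay=300.0):
        self.base_delay = base_delay  # assumed first non-zero wait, in seconds
        self.max_delay = max_delay    # assumed upper bound on the wait
        self.last_source = None       # source (e.g. port GUID) of the last trigger
        self.delay = 0.0              # wait applied to the last trigger

    def next_delay(self, source):
        """Return the wait in seconds before sweeping for this trigger."""
        if source is not None and source == self.last_source:
            # Same faulty port as last time: double the previous wait.
            if self.delay == 0:
                self.delay = self.base_delay
            else:
                self.delay = min(self.delay * 2, self.max_delay)
        else:
            # Another source or another reason: sweep right away.
            self.delay = 0.0
        self.last_source = source
        return self.delay


backoff = SweepBackoff()
delays = [backoff.next_delay(src) for src in ["A", "A", "A", "A", "B", "A"]]
print(delays)  # [0.0, 1.0, 2.0, 4.0, 0.0, 0.0]
```

Port "A" is swept immediately the first time, then after 1 s, 2 s and 4 s, and a trap from port "B" resets the wait to 0.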

This should solve our case, but there may still be a problem when several ports have the same problem at once...

Any idea on a way to manage this?
An ignore mask on traps? (Ignore the traps for one specific problem for x seconds if they happen too often.)
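
The ignore mask could look roughly like the sketch below, again in Python for illustration only. All names and thresholds (`max_traps`, `window`, `ignore_for`) are made up, and `source` stands for whatever identifies the problem, for example a (port GUID, trap number) pair. The caller passes the current time explicitly, for instance from `time.monotonic()`.

```python
from collections import defaultdict, deque


class TrapIgnoreMask:
    """Ignore traps from a source for a while if they arrive too often.

    Hypothetical helper, not part of OpenSM.
    """

    def __init__(self, max_traps=5, window=10.0, ignore_for=60.0):
        self.max_traps = max_traps        # traps tolerated per window
        self.window = window              # length of the window, in seconds
        self.ignore_for = ignore_for      # the "x seconds" of the mask
        self.recent = defaultdict(deque)  # source -> timestamps of recent traps
        self.ignored_until = {}           # source -> time at which the mask expires

    def should_handle(self, source, now):
        """Return True if the trap should be processed, False if it is ignored."""
        if now < self.ignored_until.get(source, 0.0):
            return False
        times = self.recent[source]
        times.append(now)
        # Forget traps that are older than the window.
        while times and now - times[0] > self.window:
            times.popleft()
        if len(times) > self.max_traps:
            # Too many traps in the window: mask this source for a while.
            self.ignored_until[source] = now + self.ignore_for
            times.clear()
            return False
        return True


mask = TrapIgnoreMask(max_traps=3, window=10.0, ignore_for=60.0)
results = [mask.should_handle("port1", t) for t in [0, 1, 2, 3, 4, 70]]
print(results)  # [True, True, True, False, False, True]
```

Because the state is kept per source, several faulty ports would each be masked independently, while traps from healthy ports still go through.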

Thanks

Nicolas