[ofa-general] OpenSM and trap 128.

Nicolas Morey Chaisemartin nicolas.morey-chaisemartin at ext.bull.net
Thu Mar 26 08:20:57 PDT 2009


Hal Rosenstock wrote:
> 
>> Fixing the cable will solve our problem, but I still think something should be done about this.
>>
>> Though OpenSM's behaviour was OK, it was really difficult to find where the performance problems came from.
> 
> There should be some log messages about the trap rate being exceeded.
> Were they not present? Which OpenSM version?

The only messages we had were the events logged on trap reception (so really a lot of them).
However, we didn't check the log before spending quite some time trying to understand where the performance loss could come from.
OpenSM is git head + Bull_patches on top.

> 
>> All our diagnostic tools (mostly using infiniband-diags) failed to see the problem.
>> The infiniband-diags commands failed toward the faulty port, but it was hard to say whether the port was faulty or whether the failures were due to high load on the SM and dropped VL15 messages.
> 
> Yes, the only thing you would observe is VL15 drops via perfquery. The
> SM is the one which should be logging the trap originator which is the
> way to diagnose this issue.
> 

It actually is logging the originator, though the port number is missing from the log message.
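
To illustrate why the port number matters, here is a minimal sketch of per-originator trap counting. It assumes each trap event can be reduced to a source LID, a port number and a timestamp; the class name, window and threshold are hypothetical, and none of this is OpenSM code or its log format.

```python
from collections import defaultdict, deque


class TrapRateMonitor:
    """Count traps per (source LID, port) over a sliding time window."""

    def __init__(self, window_s=10.0, threshold=50):
        self.window_s = window_s      # window length in seconds (assumed value)
        self.threshold = threshold    # traps per window considered "noisy" (assumed value)
        self._seen = defaultdict(deque)

    def record(self, lid, port, now):
        """Record one trap; return True if this originator is over the threshold."""
        q = self._seen[(lid, port)]
        q.append(now)
        # Drop timestamps that have fallen out of the window.
        while q and now - q[0] > self.window_s:
            q.popleft()
        return len(q) > self.threshold

    def noisy_originators(self):
        """Originators over the threshold as of their last recorded trap."""
        return sorted(k for k, q in self._seen.items() if len(q) > self.threshold)
```

With only the LID as key, every port of the same device would be merged into one counter, which is the situation the current log message leaves us in.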

Nicolas


