[openib-general] Re: Another opensm problem ?

Hal Rosenstock halr at voltaire.com
Sun Sep 25 02:22:52 PDT 2005


On Sat, 2005-09-24 at 16:43, Eitan Zahavi wrote:
> Well, if this is the case then OpenSM might stop responding due to the following features:
> 1. We had in the past cases where bad hardware continuously flooded the SM with Traps.
>     To protect against this kind of DOS attack we have implemented an adaptive filter in
>     the SM trap receiver:
>     If the exact same trap is received continuously from the same source more than 10 times
>     (with no more than 5 sec between the traps), the traps are considered a DOS and are ignored.
>     Please see osm_trap_rcv.c for details.
> 2. The way IB switches work is that each time one of their ports changes state they:
>     a. Set the "change bit" in the SwitchInfo
>     b. Send a trap 128 to the SM. But Trap 128 does not carry the changed port number.
> 
> So under a test case like you describe what can happen:
> 1. The SM decides to ignore trap 128 from the switch as more than 5 connect/reconnect sequences
>     happen with not enough "quiet" time to recover.
> 2. The SwitchInfo ChangeBit is sampled during the OSM light sweep. There is a race between the
>     reading of the change bit and the clearing of it. If the connect and disconnect happen very fast,
>     the change bit set by the re-connect can be cleared by the clear started by the disconnect.
> 
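The race in point 2 can be illustrated with a toy model. This is only a sketch under stated assumptions, not OpenSM code: the Switch class, the run() helper and the exact ordering of "read bit, rediscover, clear bit" are hypothetical simplifications of the light sweep described above.

```python
class Switch:
    """Toy model: one port plus the SwitchInfo change bit (hypothetical)."""

    def __init__(self):
        self.port_up = True
        self.change_bit = False

    def set_port(self, up: bool) -> None:
        self.port_up = up
        self.change_bit = True   # any port state change sets the bit


def run(reconnect_before_clear: bool):
    """Return (SM's view of the port, actual port state) after two light sweeps."""
    sw = Switch()
    sm_view = sw.port_up

    sw.set_port(False)                # disconnect sets the change bit
    if sw.change_bit:                 # light sweep 1 reads the bit...
        sm_view = sw.port_up          # ...and rediscovers: port is down
        if reconnect_before_clear:
            sw.set_port(True)         # reconnect lands before the clear
        sw.change_bit = False         # clear issued because of the disconnect
    if not reconnect_before_clear:
        sw.set_port(True)             # reconnect lands after the clear

    if sw.change_bit:                 # light sweep 2
        sm_view = sw.port_up
        sw.change_bit = False
    return sm_view, sw.port_up
```

With run(False) the second sweep sees the bit and the SM's view matches the port. With run(True) the clear wipes the bit set by the reconnect, so the second sweep sees nothing and the SM still believes the port is down.
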
> It is easy to see in the log file if the SM did ignore traps. Run with -V and look for:
> grep "Continuously received this trap" /var/log/osm.log

This is what is happening.

So the policy is 5 reconnect sequences without coming up? What counts
as not enough "quiet" time for recovery? Is this settable?
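
For reference, the rule as described in point 1 of the quoted text can be sketched as follows. This is a minimal illustration built only from that description (more than 10 identical traps from the same source, with no more than 5 seconds between them); it is not the code in osm_trap_rcv.c, and the names TrapFilter, MAX_REPEATS and MAX_GAP_SEC are hypothetical.

```python
from dataclasses import dataclass, field

# Both thresholds come from the description in this thread; the names are
# hypothetical and are not OpenSM identifiers.
MAX_REPEATS = 10      # identical traps tolerated before ignoring
MAX_GAP_SEC = 5.0     # a longer gap between traps resets the count


@dataclass
class TrapFilter:
    # (source LID, trap number) -> (consecutive count, time of last trap)
    seen: dict = field(default_factory=dict)

    def accept(self, src_lid: int, trap_num: int, now: float) -> bool:
        """Return False when the trap should be ignored as a flood."""
        key = (src_lid, trap_num)
        count, last = self.seen.get(key, (0, None))
        if last is not None and now - last <= MAX_GAP_SEC:
            count += 1
        else:
            count = 1   # first trap, or a quiet gap longer than MAX_GAP_SEC
        self.seen[key] = (count, now)
        return count <= MAX_REPEATS
```

In this sketch a source that keeps sending the same trap stays filtered until it is quiet for longer than MAX_GAP_SEC, which is why fast connect/reconnect cycles on one switch could keep trap 128 ignored.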

> (for some reason I did not get any log attachments with this thread - otherwise I would
> do some analysis on it too).

I will forward separately. This was too big for the list.

-- Hal





