[openib-general] Re: Another opensm problem ?
Hal Rosenstock
halr at voltaire.com
Sun Sep 25 02:22:52 PDT 2005
On Sat, 2005-09-24 at 16:43, Eitan Zahavi wrote:
> Well, if this is the case then OpenSM might stop responding due to the following features:
> 1. In the past we had cases where bad hardware continuously flooded the SM with traps.
> To protect against this kind of DoS attack we have implemented an adaptive filter in
> the SM trap receiver:
> If the exact same trap is received continuously from the same source more than 10 times
> (with no more than 5 sec between the traps), the traps are considered a DoS and are ignored.
> Please see osm_trap_rcv.c for details.
> 2. The way IB switches work is that each time one of their ports changes state they:
> a. Set the "change bit" in the SwitchInfo
> b. Send a trap 128 to the SM. But Trap 128 does not carry the changed port number.
>
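To make the policy in point 1 concrete, here is a minimal sketch of such a filter as described above. It is not the code in osm_trap_rcv.c: the class, method, and parameter names are made up for illustration, and the thresholds (10 repeats, 5 sec gap) are simply the numbers quoted in the text. Whether the real filter keeps counting while it is ignoring traps is an assumption on my part.

```python
class TrapFilter:
    """Hypothetical sketch of the adaptive trap filter described above.

    Not OpenSM code; names and structure are assumptions for illustration.
    """

    def __init__(self, max_repeats=10, max_gap_sec=5.0):
        self.max_repeats = max_repeats    # "more than 10 times"
        self.max_gap_sec = max_gap_sec    # "no more than 5 sec between the traps"
        self._state = {}                  # (source, trap_num) -> (last_time, run_length)

    def accept(self, source, trap_num, now):
        """Return True if the trap should be processed, False if it is ignored."""
        key = (source, trap_num)
        last = self._state.get(key)
        if last is not None and now - last[0] <= self.max_gap_sec:
            run_length = last[1] + 1      # same trap again, within the gap
        else:
            run_length = 1                # enough quiet time: start a fresh run
        self._state[key] = (now, run_length)
        return run_length <= self.max_repeats
```

With one trap 128 per second from the same switch, the first 10 calls to accept() return True and the following ones return False, until more than 5 sec pass without that trap from that source.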
> So under a test case like you describe what can happen:
> 1. The SM decides to ignore trap 128 from the switch as more than 5 connect/reconnect sequences
> happen without enough "quiet" time in between to recover.
> 2. The SwitchInfo ChangeBit is sampled during the OSM light sweep. There is a race between the
> reading of the change bit and the clearing of it. If the disconnect and reconnect happen very fast,
> the change bit set by the reconnect can be wiped by the clear that was started for the disconnect.
>
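The interleaving in point 2 can be sketched with a toy model. This is only an illustration of the ordering described above, assuming a single change bit per switch and separate read and clear steps; the Switch class and the helper names are hypothetical and do not correspond to OpenSM or firmware code.

```python
class Switch:
    """Toy model of a switch with one SwitchInfo change bit (hypothetical)."""

    def __init__(self):
        self.change_bit = False
        self.port_up = True

    def port_event(self, up):
        # Any port state change sets the change bit.
        self.port_up = up
        self.change_bit = True


def light_sweep_read(sw):
    # SM samples the change bit during the light sweep.
    return sw.change_bit


def light_sweep_clear(sw):
    # SM clears the change bit after having seen it set.
    sw.change_bit = False


sw = Switch()
sw.port_event(up=False)        # disconnect: bit is set
seen = light_sweep_read(sw)    # SM reads the bit (set by the disconnect)
sw.port_event(up=True)         # reconnect arrives before the clear lands
light_sweep_clear(sw)          # clear started for the disconnect wipes the bit

# The port is up again, but nothing on the switch records the reconnect.
missed_reconnect = sw.port_up and not sw.change_bit
```

In this ordering the SM did see the bit once, but the change caused by the reconnect leaves no trace, so if trap 128 is also being ignored the next light sweep has nothing to react to.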
> It is easy to see in the log file if the SM did ignore traps. Run with -V and look for:
> grep "Continuously received this trap" /var/log/osm.log
This is what is happening.
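To get a count rather than the raw lines, something like the following would do. It only looks for the message text quoted above and assumes nothing else about the log format; the function name is made up.

```python
def count_ignored_traps(lines, marker="Continuously received this trap"):
    """Count log lines that contain the 'ignored trap' message.

    `lines` is any iterable of strings, e.g. an open log file.
    """
    return sum(1 for line in lines if marker in line)
```

For example, open /var/log/osm.log (the path from the grep above) and pass the file object to count_ignored_traps().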
So the policy is 5 reconnect sequences without coming up? What counts as
not enough quiet time for recovery? Is this settable?
> (for some reason I did not get any log attachments with this thread - otherwise I would
> do some analysis on it too).
I will forward separately. This was too big for the list.
-- Hal