[openib-general] Re: Another opensm problem ?

Sat Sep 24 13:43:30 PDT 2005

Hi Viswa and Hal,

I have read through the thread and have a few comments.

But first let me see if I understand the test run correctly. The test is as follows:
1. OpenSM starts up configuring the subnet.
2. Then the user tears off a cable and reconnects it to a different port of a switch.
3. The SM is supposed to bring up the new connection.
4. Step 2 is repeated until the SM stops responding.

Well, if this is the case then OpenSM might stop responding due to the following features:
1. In the past we had cases where bad hardware continuously flooded the SM with traps.
    To protect against this kind of DoS attack we have implemented an adaptive filter in
    the SM trap receiver:
    If the exact same trap is received continuously from the same source more than 10 times
    (with no more than 5 sec between the traps), the traps are considered a DoS and are ignored.
    Please see osm_trap_rcv.c for details.
2. The way IB switches work is that each time one of their ports changes state they:
    a. Set the "change bit" in the SwitchInfo
    b. Send a trap 128 to the SM. But trap 128 does not carry the changed port number.
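
The filtering rule in item 1 can be sketched roughly as follows. This is a minimal
illustration in Python, not the code in osm_trap_rcv.c: the class and parameter names
are hypothetical, the "exact same trap" is approximated by a (source, trap number)
pair, and I assume that ignored traps still extend the streak.

```python
from dataclasses import dataclass

MAX_REPEATS = 10     # "more than 10 times" from the description above
MAX_GAP_SEC = 5.0    # "no more than 5 sec between the traps"


@dataclass
class _Streak:
    last_seen: float
    count: int


class TrapFilter:
    """Hypothetical filter: drops a trap once the same (source, trap) pair
    has repeated more than MAX_REPEATS times without a quiet gap."""

    def __init__(self):
        self._streaks = {}

    def accept(self, source_lid, trap_num, now):
        key = (source_lid, trap_num)
        streak = self._streaks.get(key)
        if streak is None or now - streak.last_seen > MAX_GAP_SEC:
            # First sighting, or a quiet period long enough to reset the streak.
            self._streaks[key] = _Streak(last_seen=now, count=1)
            return True
        # Assumption: ignored traps also refresh the timestamp.
        streak.last_seen = now
        streak.count += 1
        return streak.count <= MAX_REPEATS


def demo():
    f = TrapFilter()
    # Twelve trap 128s from the same source, one second apart.
    results = [f.accept(source_lid=3, trap_num=128, now=float(t)) for t in range(12)]
    # A gap longer than MAX_GAP_SEC resets the streak.
    after_quiet = f.accept(source_lid=3, trap_num=128, now=20.0)
    return results, after_quiet
```

In demo() the first ten traps are accepted, the next two are ignored, and the trap
that arrives after the quiet gap is accepted again.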

So under a test case like the one you describe, the following can happen:
1. The SM decides to ignore trap 128 from the switch, as more than 5 connect/reconnect sequences
    happen without enough "quiet" time to recover.
2. The SwitchInfo ChangeBit is sampled during the OSM light sweep. There is a race between the
    reading of the change bit and the clearing of it. If the disconnect and reconnect happen very
    fast, the change bit set by the reconnect can be cleared by the clear that the disconnect started.
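
The race in item 2 can be illustrated with a toy replay of events. This is only a
sketch in Python under my own simplifications: the Switch class, the event names and
the single shared change bit are hypothetical stand-ins, not OpenSM or switch firmware
code.

```python
class Switch:
    """Toy model of a switch with a single SwitchInfo change bit."""

    def __init__(self):
        self.change_bit = False

    def port_state_changed(self):
        self.change_bit = True


def light_sweep_sees_change(events):
    """Replay an ordered list of events and return, for each 'read',
    whether the SM saw the change bit set.

    Hypothetical event names:
      'port'  - a port changes state (the switch sets the bit)
      'read'  - the SM light sweep reads the bit
      'clear' - the SM's clear request reaches the switch
    """
    sw = Switch()
    seen = []
    for ev in events:
        if ev == "port":
            sw.port_state_changed()
        elif ev == "read":
            seen.append(sw.change_bit)
        elif ev == "clear":
            sw.change_bit = False
        else:
            raise ValueError(f"unknown event: {ev}")
    return seen


# Disconnect and reconnect far apart: each change is read before the next one.
SLOW = ["port", "read", "clear", "port", "read", "clear"]
# Reconnect lands between the read and the clear started by the disconnect.
FAST = ["port", "read", "port", "clear", "read"]
```

With SLOW both reads see the bit set. With FAST the second read sees it cleared, so
the reconnect goes unnoticed by the light sweep.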

It is easy to see in the log file if the SM did ignore traps. Run with -V and look for:
grep "Continuously received this trap" /var/log/osm.log

(for some reason I did not get any log attachments with this thread - otherwise I would
do some analysis on it too).

Anyway, if the SM does not perform a heavy sweep (due to the above), it is very likely to keep
polling, with no success, the non-existent node that was previously attached to a switch port.

So testing of cable tear-off and reconnect should be done with at least 10 seconds of recovery time.
Also, you could try sending kill -HUP to the OpenSM process and see whether the full sweep you
trigger is able to bring all ports up.

Viswa, with all that said, it is very possible you are experiencing a bug in OpenSM, and we
want to encourage your efforts to find those. With your help, and that of others, we will be
able to flush them out.

Thanks

Eitan

Hal Rosenstock wrote:
> On Fri, 2005-09-23 at 14:57, Hal Rosenstock wrote:
> 
>>On Fri, 2005-09-23 at 13:50, Viswanath Krishnamurthy wrote:
>>
>>>- After 7-8 iterations, I ran into a weird problem, where opensm was
>>>showing the HCA as UNKNOWN. The port
>>>never came up to ACTIVE state.  The unplugged and replugged into
>>>different slots, the port remained in INIT
>>>state.
>>
>>Mellanox    : SW : 12 : INI :      :     : 2048 : 1x  : 2.5 : 0002c9010d26e780 : UNKNOWN
>>
>>OpenSM thinks that either there is no physical port on the other end of
>>the link or it is not "valid" (GUID non 0). Obviously it is there as the
>>port state is INIT so the physical link came up which requires the
>>remote end to be there.
> 
> 
> From the log you sent, this is exactly what is happening.
> Sep 23 10:07:23 451191 [B7751BB0] -> osm_drop_mgr_process: Checking port
> 0x0002c9010d26e780.
> Sep 23 10:07:23 451209 [B7751BB0] -> osm_drop_mgr_process: Checking port
> 0x0002c90200400cfd.
> Sep 23 10:07:23 451226 [B7751BB0] -> osm_drop_mgr_process: ERR 0108:
> Unknown remote side for node 0x0002c9010d26e780 port 20. Adding to light
> sweep sampling list.
> Sep 23 10:07:23 451251 [B7751BB0] -> Directed Path Dump of 1 hop path:
>                                 Path = [0][1]
> Sep 23 10:07:23 451267 [B7751BB0] -> osm_drop_mgr_process: ]
> 
> So look in osm_drop_mgr.c line 707:
> Can you enhance the log display to see which is failing: 
> osm_physp_is_valid(p_physp) or osm_physp_get_remote(p_physp) ? 
> 
> Also, it appears to keep light sweeping this port but whichever switch
> port it is on, it does not respond. Not sure where the problem is. It
> could be on the outgoing side of the switch (we could run diags against
> the switch and various ports; I would be curious what they return when
> the subnet is in this broken state) or on the HCA. However, the fact
> that restarting opensm made it go away without touching anything else
> makes this appear otherwise.
> 
> 
>>One other note is that it appears to have come up as 1x. Is that what
>>should happen ? 
> 
> 
> -- Hal