[openib-general] Re: Another opensm problem ?

Viswanath Krishnamurthy viswa.krish at gmail.com
Mon Sep 26 10:00:45 PDT 2005


Hi Eitan,

I see that message in the log.

-Viswa


On 9/24/05, Eitan Zahavi <eitan at mellanox.co.il> wrote:
>
> Hi Viswa and Hal,
>
> I have read through the thread and have a few comments.
>
> But first let me see if I understand the test run correctly. The test is
> as follows:
> 1. OpenSM starts up configuring the subnet.
> 2. Then the user tears off a cable and connects it to the other side port of
> a switch
> 3. The SM is supposed to bring up the new connection
> 4. Step 2 is repeated until the SM stops responding.
>
> Well, if this is the case then OpenSM might stop responding due to the
> following features:
> 1. In the past we had cases where bad hardware continuously flooded the SM
> with Traps.
> To protect against this kind of DOS attack we have implemented an adaptive
> filter in the SM trap receiver:
> If the exact same trap is received continuously from the same source more than
> 10 times (with no more than 5 sec between the traps), the traps are considered
> DOS and are ignored.
> Please see osm_trap_rcv.c for details.
> 2. The way IB switches work is that each time one of their ports changes
> state they:
> a. Set the "change bit" in the SwitchInfo
> b. Send a trap 128 to the SM. But Trap 128 does not carry the changed port
> number.
>
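A minimal sketch of the kind of adaptive filter described above, written in Python rather than OpenSM's C. It is not the code in osm_trap_rcv.c. The class name, the method name and the choice to key on (source LID, trap number) are assumptions made for illustration; only the thresholds (more than 10 repeats, no more than 5 sec apart) come from the description.

```python
from dataclasses import dataclass, field

MAX_REPEATS = 10     # from the mail: "more than 10 times"
MAX_GAP_SEC = 5.0    # from the mail: "no more than 5 sec between the traps"


@dataclass
class TrapFilter:
    """Hypothetical filter; keying on (source LID, trap number) is an assumption."""
    # (source_lid, trap_num) -> (repeat count in current burst, time of last trap)
    seen: dict = field(default_factory=dict)

    def accept(self, source_lid: int, trap_num: int, now: float) -> bool:
        """Return True if the trap should be processed, False if it is ignored."""
        key = (source_lid, trap_num)
        count, last = self.seen.get(key, (0, None))
        if last is not None and now - last <= MAX_GAP_SEC:
            count += 1   # still inside a burst of closely spaced traps
        else:
            count = 1    # a quiet period elapsed, so a new burst starts
        # Ignored traps also refresh the timestamp, so a steady flood stays blocked.
        self.seen[key] = (count, now)
        return count <= MAX_REPEATS
```

With one trap per second from the same source, the first 10 are accepted and later ones are dropped until more than 5 seconds pass without that trap. That is why a fast plug/unplug loop can look like a flood to a filter of this shape.
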
> So under a test case like you describe what can happen:
> 1. The SM decides to ignore trap 128 from the switch because more than 5
> connect/reconnect sequences happen without enough "quiet" time to recover.
> 2. The SwitchInfo ChangeBit is sampled during the OSM light sweep. There is a
> race between the reading of the change bit and the clearing of it. If the
> connect and disconnect happen very fast, the change bit set by the reconnect
> can be cleared by the clear started by the disconnect.
>
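The race on the change bit can be illustrated with a toy replay of events. This is a sketch under simplifying assumptions, not a model of the real SM or switch firmware: the event names and the `replay` helper are made up, and the SM is reduced to "read the bit, then clear it later".

```python
def replay(events):
    """Replay events against a one-bit 'switch'; return how many times the SM saw the bit set."""
    change_bit = False      # stands in for the SwitchInfo change bit
    clear_pending = False   # SM has read the bit and will clear it later
    changes_seen = 0
    for ev in events:
        if ev == "port_change":      # switch side: any port state change sets the bit
            change_bit = True
        elif ev == "sm_read":        # SM side: light sweep samples the bit
            if change_bit:
                changes_seen += 1
                clear_pending = True
        elif ev == "sm_clear":       # SM side: clear issued for the earlier read
            if clear_pending:
                change_bit = False
                clear_pending = False
    return changes_seen


# Enough recovery time: each change is read and cleared before the next one.
slow = ["port_change", "sm_read", "sm_clear",
        "port_change", "sm_read", "sm_clear"]

# Fast reconnect: the second change lands between the read and the clear,
# so the clear wipes it out and the next read finds nothing.
fast = ["port_change", "sm_read", "port_change", "sm_clear", "sm_read"]
```

Here `replay(slow)` sees both changes while `replay(fast)` sees only one, which matches the case where the reconnect goes unnoticed until something forces a full sweep.
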
> It is easy to see in the log file if the SM did ignore traps. Run with -V
> and look for:
> grep "Continuously received this trap" /var/log/osm.log
>
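For repeated test runs, the same check can be scripted. A small sketch that counts matching lines; the function name is made up, and the only thing taken from the mail is the message text being searched for.

```python
NEEDLE = "Continuously received this trap"  # message text quoted in the mail


def count_ignored_traps(log_lines):
    """Count log lines that report a trap being ignored as repeated."""
    return sum(1 for line in log_lines if NEEDLE in line)
```

It accepts any iterable of lines, so an open file object for /var/log/osm.log can be passed directly.
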
> (for some reason I did not get any log attachments with this thread -
> otherwise I would
> do some analysis on it too).
>
> Anyway, if the SM does not perform a heavy sweep (due to the above), it is
> very likely that it will continue to poll the non-existent node that was
> previously attached to a switch port, with no success.
>
> So testing of cable tear off and reconnect should be done with at least 10
> seconds recovery time.
> Also you could try sending kill -HUP to the OpenSM process and see if the
> full sweep you start
> is able to bring all ports up.
>
> Viswa, with all that said, it is very possible you are experiencing a bug
> in OpenSM and we
> want to encourage your effort finding those. With your, and others, help
> we will be able to
> flush them out.
>
> Thanks
>
> Eitan
>
> Hal Rosenstock wrote:
> > On Fri, 2005-09-23 at 14:57, Hal Rosenstock wrote:
> >
> >>On Fri, 2005-09-23 at 13:50, Viswanath Krishnamurthy wrote:
> >>
> >>>- After 7-8 iterations, I ran into a weird problem, where opensm was
> >>>showing the HCA as UNKNOWN. The port
> >>>never came up to ACTIVE state. When unplugged and replugged into
> >>>different slots, the port remained in INIT
> >>>state.
> >>
> >>Mellanox : SW : 12 : INI : : : 2048 : 1x : 2.5 : 0002c9010d26e780 : UNKNOWN
> >>
> >>OpenSM thinks that either there is no physical port on the other end of
> >>the link or it is not "valid" (GUID non 0). Obviously it is there as the
> >>port state is INIT so the physical link came up which requires the
> >>remote end to be there.
> >
> >
> > From the log you sent, this is exactly what is happening.
> > Sep 23 10:07:23 451191 [B7751BB0] -> osm_drop_mgr_process: Checking port 0x0002c9010d26e780.
> > Sep 23 10:07:23 451209 [B7751BB0] -> osm_drop_mgr_process: Checking port 0x0002c90200400cfd.
> > Sep 23 10:07:23 451226 [B7751BB0] -> osm_drop_mgr_process: ERR 0108: Unknown remote side for node 0x0002c9010d26e780 port 20. Adding to light sweep sampling list.
> > Sep 23 10:07:23 451251 [B7751BB0] -> Directed Path Dump of 1 hop path:
> > Path = [0][1]
> > Sep 23 10:07:23 451267 [B7751BB0] -> osm_drop_mgr_process: ]
> >
> > So look in osm_drop_mgr.c line 707:
> > Can you enhance the log display to see which is failing:
> > osm_physp_is_valid(p_physp) or osm_physp_get_remote(p_physp) ?
> >
> > Also, it appears to keep light sweeping this port but whichever switch
> > port it is on, it does not respond. Not sure where the problem is. It
> > could be on the outgoing side of the switch (we could run diags against
> > the switch and various ports; I would be curious what they return when
> > the subnet is in this broken state) or on the HCA. However, the fact
> > that restarting opensm made it go away without touching anything else
> > makes this appear otherwise.
> >
> >
> >>One other note is that it appears to have come up as 1x. Is that what
> >>should happen ?
> >
> >
> > -- Hal
> >
>
>