[ewg] OpenSM 1.5.4 Boot Problem

Hal Rosenstock hal at dev.mellanox.co.il
Wed Dec 14 06:03:40 PST 2011


Hi,

On 12/13/2011 2:35 PM, Hector Abrach wrote:
> Hello,
> 
> I have a boot problem with OpenSM

Are you saying it is the switch that is being booted, rather than OpenSM?

What is OpenSM running on, and in what environment?

> the problem occurs rarely and
> started to occur when we started using a new Mellanox MT1118X03342 switch.
> The problem occurs during the discovery phase within state_mgr_sweep_hop_1.
> 
> However, I discovered that the actual cause is that
> qp0_mads_outstanding occasionally stalls at 1.

Is it stuck, or does it get updated properly after timeout/retry?

> In file osm_vl15intf.c, function vl15_poller checks the rfifo and,
> if the qlist still has items, calls vl15_send_mad,
> which later on triggers the signal.
> With the current default setting of 4 for OSM_DEFAULT_SMP_MAX_ON_WIRE, I
> noticed that cl_qlist_end reaches zero before
> stats->qp0_mads_outstanding does. This causes a stall in
> cl_event_wait_on. The rfifo always reaches 0 when there are 4
> qp0_mads_outstanding; however, when it fails, it always fails when there
> is 1 qp0_mads_outstanding.

Is some (request) SMP that OpenSM sent timing out (not being responded to)?

> Have you seen this failure? By the way, I see this failure once every 15
> reboots approximately.
> 
> I discovered that changing OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 fixes the
> problem.

What exactly do you mean by "fixes the problem"? I'm not sure I
understand what the problem is yet.

-- Hal

> My guess is that there is a race condition when the switch sends 4 SMPs
> in parallel. Also, this failure only appears to occur at reboot. Another
> solution, which is not acceptable, is adding a delay in the process:
> with the delay, the failure goes away. It is as if the switch needed
> more time to do something.
> 
> I would really appreciate your help and insight.
> Thank you
> 
> Hector Abrach