[ewg] OpenSM 1.5.4 Boot Problem

Hector Abrach HAbrach at TMRIUSA.COM
Tue Dec 13 11:35:19 PST 2011


Hello,

I have an intermittent boot problem with OpenSM. It occurs rarely and 
started to occur when we began using a new Mellanox MT1118X03342 switch.
The problem occurs during the discovery phase, within 
state_mgr_sweep_hop_1.

However, I discovered that the underlying cause is that 
qp0_mads_outstanding occasionally stalls at 1.

In osm_vl15intf.c, the function vl15_poller checks the rfifo and, if the 
qlist still has items, calls vl15_send_mad, which later triggers the 
signal.
With the current default setting of 4 for OSM_DEFAULT_SMP_MAX_ON_WIRE, I 
noticed that cl_qlist_end reaches zero before stats->qp0_mads_outstanding 
does. This causes a stall in cl_event_wait_on. The rfifo always reaches 0 
when there are 4 qp0_mads_outstanding; however, when it fails, it always 
fails when there is 1 qp0_mads_outstanding.
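
The accounting described above can be illustrated with a small 
single-threaded model. This is only a sketch under stated assumptions, not 
OpenSM code: window_t, fill_window, run_sweep and the lost_seq parameter 
are hypothetical names, and the model simply assumes that one response is 
never accounted for. It does not explain why a response would go missing. 
It shows that once the send queue is empty and one request is never 
retired, the outstanding count stays at 1 and a waiter blocked on it would 
never be woken.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a bounded "on the wire" window. Not OpenSM code. */
typedef struct {
    int queued;       /* requests still waiting to be sent */
    int outstanding;  /* requests sent but not yet accounted for */
    int max_on_wire;  /* window size, analogous to OSM_DEFAULT_SMP_MAX_ON_WIRE */
} window_t;

/* Send queued requests until the window is full or the queue is empty. */
static int fill_window(window_t *w)
{
    int sent = 0;
    while (w->queued > 0 && w->outstanding < w->max_on_wire) {
        w->queued--;
        w->outstanding++;
        sent++;
    }
    assert(w->outstanding <= w->max_on_wire);
    return sent;
}

/*
 * Run one sweep of `total` requests. `lost_seq` is the 1-based index of the
 * response that is assumed never to be accounted for (0 means none is lost).
 * Returns the final outstanding count: 0 means the sweep drained cleanly,
 * nonzero means a waiter on that counter would block forever.
 */
static int run_sweep(int total, int max_on_wire, int lost_seq)
{
    window_t w = { total, 0, max_on_wire };
    int seq = 0;
    bool lost_pending = false;

    for (;;) {
        fill_window(&w);

        /* Requests that can still complete, excluding the lost one. */
        int completable = w.outstanding - (lost_pending ? 1 : 0);
        if (completable == 0)
            break;  /* either fully drained or stalled */

        seq++;
        if (seq == lost_seq) {
            lost_pending = true;  /* response never decrements the counter */
            continue;
        }
        w.outstanding--;  /* normal response: retire one request */
    }
    return w.outstanding;
}
```

In this model, run_sweep(8, 4, 0) drains to 0, while a sweep with any 
single lost response ends with the queue empty and the counter stuck at 1.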

Have you seen this failure? I see it approximately once every 15 
reboots.

I discovered that changing OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 fixes the 
problem.

My guess is that there is a race condition when the switch sends 4 SMPs in 
parallel. Also, this failure only appears to occur at reboot. Adding a 
delay in the process also makes the failure go away, as if the switch 
needed more time to do something, but that is not an acceptable solution.

I would really appreciate your help and insight.
Thank you

Hector Abrach


