[ewg] OpenSM 1.5.4 Boot Problem
Hector Abrach
HAbrach at TMRIUSA.COM
Tue Dec 13 11:35:19 PST 2011
Hello,
I have a boot problem with OpenSM. The problem occurs rarely, and it first
appeared when we started using a new Mellanox MT1118X03342 switch.
The problem occurs during the discovery phase within
state_mgr_sweep_hop_1.
However, I discovered that the actual cause is that
qp0_mads_outstanding occasionally stalls at 1.
In file osm_vl15intf.c, the function vl15_poller checks the rfifo and, if
the qlist still has items, calls vl15_send_mad, which later on triggers
the signal.
With the current default setting of 4 for OSM_DEFAULT_SMP_MAX_ON_WIRE, I
noticed that cl_qlist_end reaches zero before stats->qp0_mads_outstanding
does. This causes a stall in cl_event_wait_on. The rfifo always reaches 0
when there are 4 qp0_mads_outstanding; however, when the failure occurs, it
always occurs with 1 qp0_mads_outstanding.
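
To make the symptom above concrete, here is a minimal single-threaded sketch
of a window-limited sender. It is not OpenSM code: the struct, the function
names, and the drop_nth parameter are all hypothetical, and the "one
completion never arrives" behavior is an assumption made purely for
illustration. It only shows how a queue can drain to empty while the
outstanding counter stays at 1, which is the state in which a wait with no
timeout would never return. It does not model the race itself, nor why a
window of 1 avoids it.

```c
#include <stdbool.h>

/* Hypothetical, simplified model; none of these names come from OpenSM. */
struct model {
    int fifo;         /* requests still queued (stands in for the rfifo) */
    int outstanding;  /* requests sent but not yet completed */
    int max_on_wire;  /* send window, in the role of OSM_DEFAULT_SMP_MAX_ON_WIRE */
    int in_flight;    /* completions that will actually arrive */
};

/* Poller step: send queued requests while the window allows it. */
static void poll_once(struct model *m, int drop_nth, int *sent_total)
{
    while (m->fifo > 0 && m->outstanding < m->max_on_wire) {
        m->fifo--;
        m->outstanding++;
        (*sent_total)++;
        /* Assumption: request number drop_nth never gets a completion. */
        if (*sent_total != drop_nth)
            m->in_flight++;
    }
}

/* Completion step: one arriving completion frees one window slot. */
static bool complete_once(struct model *m)
{
    if (m->in_flight == 0)
        return false;
    m->in_flight--;
    m->outstanding--;
    return true;
}

/*
 * Returns the final outstanding count. 0 means every request completed;
 * any other value means nothing more will arrive, so a poller waiting
 * without a timeout would block forever.
 * drop_nth == 0 means no completion is lost.
 */
static int run_sweep(int requests, int max_on_wire, int drop_nth)
{
    struct model m = { requests, 0, max_on_wire, 0 };
    int sent_total = 0;

    do {
        poll_once(&m, drop_nth, &sent_total);
    } while (complete_once(&m));

    return m.outstanding;
}
```

In this model, run_sweep(10, 4, 0) ends with 0 outstanding, while
run_sweep(10, 4, 7) ends with the queue empty and 1 outstanding.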
Have you seen this failure? By the way, I see it approximately once every
15 reboots.
I discovered that changing OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 fixes the
problem.
My guess is that there is a race condition when the switch sends 4 SMPs in
parallel. Also, this failure only appears to occur at reboot. Another
workaround, which is not acceptable, is to add a delay in the process: with
the delay, the failure goes away. It is as if the switch needed more time
to do something.
I would really appreciate your help and insight.
Thank you
Hector Abrach