<font size=2 face="sans-serif">Alex,</font>
<br>
<br><tt><font size=2>> Few more questions.<br>
> Does this happen to you only when you try to shut down the OpenSM
on reboot?</font></tt>
<br>
<br><font size=2 face="sans-serif">Our servers have no local hard drive,
so they boot remotely. When I run the reboot script, OpenSM does not shut
down cleanly (could this affect the switch?). The boot sequence itself is
the same every time. The problem occurs while the system is coming up;
for OpenSM specifically, it occurs in the DISCOVERING state.</font>
<br><tt><font size=2><br>
> What is the host cpu architecture? x86/x86_64/ppc?</font></tt>
<br>
<br><tt><font size=2>The hardware is x86_64, but QNX is only a 32-bit OS, so
we are effectively running in 32-bit mode.</font></tt>
<br><tt><font size=2>Thanks,</font></tt>
<br>
<br><tt><font size=2>Hector Abrach<br>
</font></tt>
<br>
<br>
<br>
<br>
<table width=100%>
<tr valign=top>
<td><font size=1 color=#5f5f5f face="sans-serif">From:</font>
<td><font size=1 face="sans-serif">Alex Netes <alexne@mellanox.com></font>
<tr valign=top>
<td><font size=1 color=#5f5f5f face="sans-serif">To:</font>
<td><font size=1 face="sans-serif">Hal Rosenstock <hal@dev.mellanox.co.il>,
Hector Abrach <HAbrach@TMRIUSA.COM></font>
<tr>
<td valign=top><font size=1 color=#5f5f5f face="sans-serif">Cc:</font>
<td><font size=1 face="sans-serif">"ewg@lists.openfabrics.org"
<ewg@lists.openfabrics.org></font>
<tr valign=top>
<td><font size=1 color=#5f5f5f face="sans-serif">Date:</font>
<td><font size=1 face="sans-serif">12/16/2011 03:15 AM</font>
<tr valign=top>
<td><font size=1 color=#5f5f5f face="sans-serif">Subject:</font>
<td><font size=1 face="sans-serif">RE: [ewg] OpenSM 1.5.4 Boot Problem</font></table>
<br>
<hr noshade>
<br>
<br>
<br><tt><font size=2>Hi Hector,<br>
<br>
Few more questions.<br>
Does this happen to you only when you try to shut down the OpenSM on reboot?<br>
What is the host cpu architecture? x86/x86_64/ppc?<br>
<br>
<br>
> -----Original Message-----<br>
> From: ewg-bounces@lists.openfabrics.org [</font></tt><a href="mailto:ewg-bounces@lists.openfabrics.org"><tt><font size=2>mailto:ewg-bounces@lists.openfabrics.org</font></tt></a><tt><font size=2>] On Behalf Of Hal Rosenstock<br>
> Sent: Thursday, December 15, 2011 9:06 PM<br>
> To: Hector Abrach<br>
> Cc: ewg@lists.openfabrics.org<br>
> Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem<br>
> <br>
> Hector,<br>
> <br>
> On 12/15/2011 12:49 PM, Hector Abrach wrote:<br>
> > Hal,<br>
> ><br>
> > Thank you for the response. To address your questions:<br>
> ><br>
> >> So the switch stays up and the servers (including the one
OpenSM is<br>
> >> on) is rebooted, right ?<br>
> ><br>
> > Right.<br>
> ><br>
> >> Do the servers run QNX rather than Linux ? Are you saying
all OpenSM<br>
> >> code is the same as stock OpenSM 3.3.12 (OFED 1.5.4-rc3)
?<br>
> ><br>
> > Yes, all 7 servers run QNX. The OpenSM code is 99% the same,
the only<br>
> > changes I had to make were made to some #define libraries.<br>
> > The big changes were made for the driver, not so much OpenSM.<br>
> <br>
> I would think there are also changes for porting of complib to QNX.
Do you<br>
> use osm_vendor_ibumad.c as the OpenSM vendor layer ?<br>
> <br>
> > I'm using IBNet 1.3.<br>
> <br>
> What's IBNet 1.3 ? I'm not familiar with that.<br>
> <br>
> > OpenSM always runs on the same one server, the others don't run
it.<br>
> <br>
> Understood.<br>
> <br>
> >> Is the topology the 7 servers and the 1 switch and if you
use other<br>
> >> switches you don't see this issue ?<br>
> ><br>
> > That's correct, the topology is 7 servers and 1 switch. We typically<br>
> > use fewer servers (4) for our application but the problem is more<br>
> > easily reproducible with more servers so we have a 7 server setup
with<br>
> > 1 switch. We don't have a great selection of switches but I know
our<br>
> > previous switch did not cause this problem. Our intention is
to go to<br>
> > production with this new switch but we can't release until we
find an<br>
> > acceptable solution.<br>
> ><br>
> >> I can see the responses but not the requests. What verbosity
level did<br>
> >> you use ?<br>
> ><br>
> > I ran OpenSM with level -D 0x06 (error, info, verbose). I don't
want<br>
> > to use -D 0xFF because I know that it makes the problem disappear.<br>
> <br>
> I think -D 0x23 (error, info, frames) would do the trick...<br>
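</font></tt>
<br><font size=2 face="sans-serif">For reference, the -D values discussed here are bitmasks. The sketch below decodes one, assuming the bit assignments listed in the opensm(8) man page (0x01 ERROR, 0x02 INFO, 0x04 VERBOSE, 0x08 DEBUG, 0x10 FUNCS, 0x20 FRAMES, 0x40 ROUTING). The struct, table, and function names are hypothetical and are not part of OpenSM; the values should be checked against osm_log.h in the tree being built.</font>
<pre>

```c
#include <stddef.h>
#include <stdio.h>

/* One log-level bit and its name (assumed from the opensm(8) -D table). */
struct log_bit {
    unsigned mask;
    const char *name;
};

static const struct log_bit LOG_BITS[] = {
    {0x01, "ERROR"}, {0x02, "INFO"},   {0x04, "VERBOSE"}, {0x08, "DEBUG"},
    {0x10, "FUNCS"}, {0x20, "FRAMES"}, {0x40, "ROUTING"},
};

/* Hypothetical helper: writes the names of the set bits into out,
 * joined by '+', and returns how many names were written. */
int describe_log_flags(unsigned flags, char *out, size_t len)
{
    int count = 0;
    size_t used = 0;

    if (len == 0)
        return 0;
    out[0] = '\0';

    for (size_t i = 0; i < sizeof LOG_BITS / sizeof LOG_BITS[0]; i++) {
        if (!(flags & LOG_BITS[i].mask))
            continue;
        int n = snprintf(out + used, len - used, "%s%s",
                         count ? "+" : "", LOG_BITS[i].name);
        if (n < 0 || (size_t)n >= len - used)
            break; /* buffer too small: stop rather than overrun */
        used += (size_t)n;
        count++;
    }
    return count;
}
```

</pre>
<font size=2 face="sans-serif">Under those assumptions, describe_log_flags(0x23, ...) yields ERROR+INFO+FRAMES, while 0x06 decodes to INFO+VERBOSE, so including ERROR as well would take 0x07.</font>
<br><tt><font size=2>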
> <br>
> > -------------------------<br>
> ><br>
> > In summary:<br>
> > 1. The system gets stuck in osm_vendor_ibumad.c -><br>
> > umad_receiver() -> "for(;;)" but keeps running properly in<br>
> > main.c -> osm_manager_loop().<br>
> > 2. If I use -D 0xFF, the problem goes away completely.<br>
> > 3. If I use an OSM_DEFAULT_SMP_MAX_ON_WIRE of 1 instead of any other<br>
> > value, the problem goes away completely.<br>
> > 4. The failure always occurs with 1 qp0_mads_outstanding<br>
> > remaining.<br>
> > What do you think could be wrong?<br>
> > Do you think the driver could be the problem?<br>
> <br>
> Yes. A likely suspect, which may be missing and causing this issue, is the<br>
> timeout/retry code for MAD transactions (built into the kernel MAD layer in<br>
> Linux): if the timeouts/retries are exhausted, it triggers a send error<br>
> (callback). Is that implemented ?<br>
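</font></tt>
<br><font size=2 face="sans-serif">To illustrate the kind of logic being asked about, here is a minimal sketch of a per-request timeout/retry tracker. It is not the Linux kernel MAD code nor any OpenSM or QNX API; every name in it is hypothetical. The point is the last branch: if a port of the MAD layer never reaches the equivalent of MAD_SEND_ERROR, a request whose response is lost is never completed, and a counter such as qp0_mads_outstanding could stay at 1.</font>
<pre>

```c
/* What the transport should do with one outstanding request. */
enum mad_action { MAD_WAIT, MAD_RESEND, MAD_SEND_ERROR };

/* Hypothetical bookkeeping for a single request awaiting its response. */
struct pending_mad {
    unsigned timeout_ms;   /* time allowed per attempt */
    unsigned retries_left; /* resends still permitted */
    unsigned deadline_ms;  /* when the current attempt expires */
};

/* Arm the tracker when the request is first sent. */
void pending_mad_start(struct pending_mad *m, unsigned now_ms,
                       unsigned timeout_ms, unsigned retries)
{
    m->timeout_ms = timeout_ms;
    m->retries_left = retries;
    m->deadline_ms = now_ms + timeout_ms;
}

/* Called periodically while no response has arrived.
 * Clock wrap-around is ignored to keep the sketch short. */
enum mad_action pending_mad_tick(struct pending_mad *m, unsigned now_ms)
{
    if (now_ms < m->deadline_ms)
        return MAD_WAIT;
    if (m->retries_left == 0)
        return MAD_SEND_ERROR; /* caller reports the failure upward */
    m->retries_left--;
    m->deadline_ms = now_ms + m->timeout_ms;
    return MAD_RESEND;
}
```

</pre>
<font size=2 face="sans-serif">With a 200 ms timeout and 3 retries, a request sent at t=0 is resent at 200, 400 and 600 ms and reported as a send error at 800 ms, at which point the sender can decrement its outstanding count.</font>
<br><tt><font size=2>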
> <br>
> However, I don't have a good explanation for why you see this now
and not<br>
> before with your other switches but maybe that's not important.<br>
> <br>
> > What debug command should I use to see the sent requests?<br>
> <br>
> See above.<br>
> <br>
> -- Hal<br>
> <br>
> > Thank you<br>
> ><br>
> > Hector Abrach<br>
> ><br>
> ><br>
> ><br>
> ><br>
> > From:
Hal Rosenstock <hal@dev.mellanox.co.il><br>
> > To:
Hector Abrach <HAbrach@TMRIUSA.COM><br>
> > Cc:
ewg@lists.openfabrics.org<br>
> > Date:
12/14/2011 08:23 PM<br>
> > Subject:
Re: [ewg] OpenSM 1.5.4 Boot Problem<br>
> ><br>
> ><br>
> > ----------------------------------------------------------------------<br>
> > --<br>
> ><br>
> ><br>
> ><br>
> > Hector,<br>
> ><br>
> > On 12/14/2011 1:41 PM, Hector Abrach wrote:<br>
> >> Hal,<br>
> >><br>
> >> Sorry for the multiple emails, but I was thinking that it may be a<br>
> >> "freeze/stall" rather than a timeout. One reason is that it<br>
> >> doesn't send an error message; it is as if the log completely dies.<br>
> ><br>
> > So nothing interesting in the log...<br>
> ><br>
> >> However, in<br>
> >> file osm_vendor_ibumad.c under function umad_receiver there
is an<br>
> >> infinite loop "for(;;)" which seems to die when
I get to that<br>
> >> previously discussed vl15_poller. I checked to see if it
breaks out<br>
> >> of the loop but it doesn't seem to.<br>
> ><br>
> > It never breaks out of that loop except when OpenSM is shutting
down.<br>
> > That's the basic receive loop.<br>
> ><br>
> > -- Hal<br>
> ><br>
> >> I'm not sure if this may be an additional hint.<br>
> >> Thank you<br>
> >><br>
> >> Hector Abrach<br>
> >><br>
> >><br>
> >> From:
Hector Abrach <HAbrach@TMRIUSA.COM><br>
> >> To:
Hal Rosenstock <hal@dev.mellanox.co.il><br>
> >> Cc:
ewg@lists.openfabrics.org<br>
> >> Date:
12/14/2011 11:15 AM<br>
> >> Subject:
Re: [ewg] OpenSM 1.5.4 Boot Problem<br>
> >> Sent by:
ewg-bounces@lists.openfabrics.org<br>
> >><br>
> >><br>
> >> ---------------------------------------------------------------------<br>
> >> ---<br>
> >><br>
> >><br>
> >><br>
> >> Hal,<br>
> >><br>
> >> Thank you very much for the support, I am the same person
from the<br>
> >> gmail account so I will respond through here.<br>
> >><br>
> >> Attached is a picture of the switch serial number:<br>
> >><br>
> >><br>
> >><br>
> >> I am indeed using OFED 1.5.4-rc3. My experiment consists
of a 7<br>
> >> server system which I reboot via a script over and over again.<br>
> >> Technically speaking the switch is not being powered off
or<br>
> >> physically rebooted. My server system is what is being rebooted.
I am<br>
> >> running OpenSM on one of the 7 servers. This means I'm constantly<br>
> >> shutting down and rebooting OpenSM. I am running OpenSM on QNX, but we<br>
> >> did not have this problem until we upgraded to this switch.<br>
> >><br>
> >> The problem is that in roughly 1 out of every 15 of these remote reboots, OpenSM<br>
> >> stalls or times out because stats->qp0_mads_outstanding does not reach<br>
> >> zero. Please excuse my ignorance, as I'm relatively new to this, but<br>
> >> how do I verify whether it is a timeout problem or a stall?<br>
> >><br>
> >> You also mentioned that you'd like to see the verbose output of<br>
> >> OpenSM; however, when I run in verbose mode I don't see the problem.<br>
> >> It appears that the verbose logging adds enough delay to give the<br>
> >> switch time to do whatever it needs to do, so the problem does not<br>
> >> occur. This is the last output I see when the problem occurs:<br>
> >><br>
> >><br>
> >><br>
> >> -------------------------------------------------<br>
> >> OpenSM 3.3.12<br>
> >> Command Line Arguments:<br>
> >> Log file max size is 5 MBytes<br>
> >> Log File: /tmp/opensm.log<br>
> >> -------------------------------------------------<br>
> >> OpenSM 3.3.12<br>
> >><br>
> >> Entering DISCOVERING state<br>
> >><br>
> >> Using default GUID 0x2c9020023277d<br>
> >><br>
> >><br>
> >><br>
> >> The problem occurs in function osm_vl15intf.c -> vl15_poller
in the<br>
> >> else statement.<br>
> >><br>
> >> if (p_madw != (osm_madw_t *) cl_qlist_end(p_fifo)) {<br>
> >> OSM_LOG(p_vl->p_log, OSM_LOG_DEBUG,<br>
> >> "Servicing p_madw = %p\n",
p_madw);<br>
> >> if (osm_log_is_active(p_vl->p_log,
OSM_LOG_FRAMES))<br>
> >> osm_dump_dr_smp(p_vl->p_log,<br>
> >> osm_madw_get_smp_ptr(p_madw),<br>
> >> OSM_LOG_FRAMES);<br>
> >><br>
> >> vl15_send_mad(p_vl, p_madw);<br>
> >> } else<br>
> >> /*<br>
> >> The VL15 FIFO is empty,
so we have nothing left to do.<br>
> >> */<br>
> >> status = cl_event_wait_on(&p_vl->signal,<br>
> >> EVENT_NO_TIMEOUT,
TRUE);<br>
> >><br>
> >> It won't move forward from the cl_event_wait_on in this line of code.<br>
> >> There are other locations it won't move forward from, such as<br>
> >> wait_for_pending_transactions in the do_sweep function, but I<br>
> >> believe those to be a side effect of the problem I'm describing.<br>
> >><br>
> >> When you mention what is my timeout, I'm guessing you refer
to<br>
> >> max_smps_timeout which is used in the second while loop within<br>
> >> vl15_poller? For this setting I am using the default which
is defined<br>
> >> in osm_subnet.c as:<br>
> >><br>
> >> p_opt->transaction_timeout = OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC;<br>
> >> p_opt->transaction_retries = OSM_DEFAULT_RETRY_COUNT;<br>
> >> p_opt->max_smps_timeout = 1000 * p_opt->transaction_timeout<br>
> >> *p_opt->transaction_retries;<br>
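</font></tt>
<br><font size=2 face="sans-serif">As a worked example of the expression above: assuming the stock defaults are OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC = 200 and OSM_DEFAULT_RETRY_COUNT = 3 (values recalled from osm_base.h and not confirmed in this thread), max_smps_timeout comes to 600000. The factor of 1000 suggests the result is in microseconds, i.e. about 0.6 s, but that reading is also an assumption. The macro and helper names below are hypothetical.</font>
<pre>

```c
#include <stdint.h>

/* Assumed defaults; verify against osm_base.h in your tree. */
#define ASSUMED_TRANS_TIMEOUT_MILLISEC 200u
#define ASSUMED_RETRY_COUNT 3u

/* Mirrors the expression quoted from osm_subnet.c:
 *   1000 * transaction_timeout * transaction_retries
 * The factor of 1000 is read here as a ms -> us conversion (an assumption). */
uint32_t max_smps_timeout_us(uint32_t timeout_ms, uint32_t retries)
{
    return 1000u * timeout_ms * retries;
}
```

</pre>
<font size=2 face="sans-serif">With the assumed defaults, max_smps_timeout_us(200, 3) evaluates to 600000.</font>
<br><tt><font size=2>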
> >><br>
> >> Would you explain to me what are the advantages or disadvantages
of<br>
> >> OSM_DEFAULT_SMP_MAX_ON_WIRE? Does this parameter change my<br>
> bandwidth<br>
> >> performance at all?<br>
> >><br>
> >> I noticed that when using the default setting of 4 I get
into the<br>
> >> else of the above if statement when there are 4 qp0_mads_outstanding.<br>
> >> I noticed that if I change OSM_DEFAULT_SMP_MAX_ON_WIRE to
1 I don't<br>
> >> get the failure I'm mentioning at all. Partly (I think) because
I<br>
> >> don't enter the else in the if statement until there is 1<br>
> qp0_mads_outstanding.<br>
> >><br>
> >> I hope this explains the problem well enough and it may be
a time out<br>
> >> problem but I'd like to understand why the problem is occurring.<br>
> >> Thank you very much,<br>
> >><br>
> >> Hector Abrach<br>
> >><br>
> >> From:
Hal Rosenstock <hal@dev.mellanox.co.il><br>
> >> To:
Hector Abrach <HAbrach@TMRIUSA.COM><br>
> >> Cc:
ewg@lists.openfabrics.org<br>
> >> Date:
12/14/2011 08:03 AM<br>
> >> Subject:
Re: [ewg] OpenSM 1.5.4 Boot Problem<br>
> >><br>
> >><br>
> >><br>
> >> ---------------------------------------------------------------------<br>
> >> ---<br>
> >><br>
> >><br>
> >><br>
> >> Hi,<br>
> >><br>
> >> On 12/13/2011 2:35 PM, Hector Abrach wrote:<br>
> >>> Hello,<br>
> >>><br>
> >>> I have a boot problem with OpenSM<br>
> >><br>
> >> Are you saying the switch is booted rather than OpenSM ?<br>
> >><br>
> >> What is the OpenSM running on and in what environment ?<br>
> >><br>
> >>> the problem occurs rarely and<br>
> >>> started to occur when we started using a new Mellanox
MT1118X03342<br>
> switch.<br>
> >>> The problem occurs during the discovery phase within<br>
> >> state_mgr_sweep_hop_1.<br>
> >>><br>
> >>> However, I discovered that the actual location is because
the<br>
> >>> qp0_mads_outstanding stalls at 1 occasionally.<br>
> >><br>
> >> Is it stuck or after timeout/retry does this get updated
properly ?<br>
> >><br>
> >>> Within file osm_vl15intf.c, function vl15_poller checks the<br>
> >>> rfifo, and if the qlist still has items it calls<br>
> >>> vl15_send_mad, which later on triggers the signal.<br>
> >>> With the current default setting of 4 for<br>
> >>> OSM_DEFAULT_SMP_MAX_ON_WIRE I noticed that cl_qlist_end
reaches<br>
> zero<br>
> >>> before<br>
> >>> stats->qp0_mads_outstanding does. This causes a stall
in<br>
> >>> cl_event_wait_on. The rfifo always reaches 0 when there
are 4<br>
> >>> qp0_mads_outstanding; however, when it fails, it always fails when<br>
> >>> there is 1 qp0_mads_outstanding.<br>
> >><br>
> >> Is some (request) SMP that OpenSM sent timing out (not being<br>
> >> responded<br>
> > to) ?<br>
> >><br>
> >>> Have you seen this failure? By the way, I see this failure
once<br>
> >>> every 15 reboots approximately.<br>
> >>><br>
> >>> I discovered that changing OSM_DEFAULT_SMP_MAX_ON_WIRE
to 1<br>
> fixes<br>
> >>> the problem.<br>
> >><br>
> >> What do you mean exactly by fixes the problem ? I'm not sure
I<br>
> >> understand what the problem is yet.<br>
> >><br>
> >> -- Hal<br>
> >><br>
> >>> My guess is that there is a race condition when the switch
sends 4<br>
> >>> SMPs in parallel. Also, this failure only appears to
occur at<br>
> >>> reboot. Another workaround, which is not acceptable, is to add a<br>
> >>> delay to the process; the failure then goes away. It is as if the switch<br>
> >>> needed more time to do something.<br>
> >>><br>
> >>> I would really appreciate your help and insight.<br>
> >>> Thank you<br>
> >>><br>
> >>> Hector Abrach<br>
> >>><br>
> ___________________________________________________________________<br>
> _<br>
> >>> __ This email has been scanned by the Symantec Email
Security.cloud<br>
> >>> service.<br>
> >>> For more information please visit _http://www.symanteccloud.com_<br>
> >> <</font></tt><a href=http://www.symanteccloud.com/><tt><font size=2>http://www.symanteccloud.com/</font></tt></a><tt><font size=2>><br>
> >>><br>
> ___________________________________________________________________<br>
> _<br>
> >>> __<br>
> >>><br>
> >>><br>
> >>> _______________________________________________<br>
> >>> ewg mailing list<br>
> >>> ewg@lists.openfabrics.org<br>
> >>> _http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg_<br>
> >><br>
> >><br>
> >><br>
> >> [attachment "2011-12-13_10-18-25_182.jpg" deleted by Hector Abrach/Software/TMRU]<br>
> ><br>
> ><br>
> ><br>
> <br>
> _______________________________________________<br>
> ewg mailing list<br>
> ewg@lists.openfabrics.org<br>
> </font></tt><a href="http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg"><tt><font size=2>http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg</font></tt></a><tt><font size=2><br>
<br>
______________________________________________________________________<br>
This email has been scanned by the Symantec Email Security.cloud service.<br>
For more information please visit </font></tt><a href=http://www.symanteccloud.com/><tt><font size=2>http://www.symanteccloud.com</font></tt></a><tt><font size=2><br>
______________________________________________________________________<br>
</font></tt>
<br>
<br clear="both">