[ewg] OpenSM 1.5.4 Boot Problem
Hal Rosenstock
hal at dev.mellanox.co.il
Wed Dec 14 18:23:09 PST 2011
Hector,
On 12/14/2011 1:41 PM, Hector Abrach wrote:
> Hal,
>
> Sorry for the multiple emails, but I was thinking that it may be a
> "freeze/stall" rather than a timeout. One reason is that it doesn't
> produce an error message; it is as if the log completely dies.
So nothing interesting in the log...
> However, in
> file osm_vendor_ibumad.c under function umad_receiver there is an
> infinite loop "for(;;)" which seems to die when I get to that previously
> discussed vl15_poller. I checked to see if it breaks out of the loop but
> it doesn't seem to.
It never breaks out of that loop except when OpenSM is shutting down.
That's the basic receive loop.
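
For readers following the thread, the shape of such a receive loop can be sketched as below. This is a minimal illustration, not the code in osm_vendor_ibumad.c: the receiver struct, the recv_fn callback, and the fake_recv test source are all hypothetical names introduced here. What it tries to show is that a failed or empty receive just goes back to the top of the loop, and only the shutdown flag ends it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a blocking receive call; returns < 0 on failure. */
typedef int (*recv_fn)(void *ctx);

struct receiver {
	volatile bool shutting_down;	/* set by whoever stops the SM */
	unsigned handled;		/* number of MADs dispatched so far */
};

/* Runs until shutting_down is set; a failed receive does not end the loop. */
unsigned receive_loop(struct receiver *r, recv_fn do_recv, void *ctx)
{
	assert(r != NULL && do_recv != NULL);
	for (;;) {
		if (r->shutting_down)
			break;		/* the only exit from the loop */
		if (do_recv(ctx) < 0)
			continue;	/* error or nothing received: try again */
		r->handled++;		/* dispatch to the upper layer would go here */
	}
	return r->handled;
}

/* Demo source: fails on odd-numbered calls, requests shutdown at `limit`. */
struct fake_source {
	struct receiver *r;
	int calls;
	int limit;
};

int fake_recv(void *ctx)
{
	struct fake_source *s = ctx;

	if (++s->calls >= s->limit)
		s->r->shutting_down = true;
	return (s->calls % 2) ? -1 : 0;
}

/* Convenience wrapper: run the loop against the fake source. */
unsigned demo_receive_loop(int limit)
{
	struct receiver r = { false, 0 };
	struct fake_source s = { &r, 0, limit };

	return receive_loop(&r, fake_recv, &s);
}
```

Seen this way, the loop not breaking is expected behavior; a hang would show up as the receive call itself never returning, not as the loop failing to exit.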
-- Hal
> I'm not sure if this may be an additional hint.
> Thank you
>
> Hector Abrach
>
>
> From: Hector Abrach <HAbrach at TMRIUSA.COM>
> To: Hal Rosenstock <hal at dev.mellanox.co.il>
> Cc: ewg at lists.openfabrics.org
> Date: 12/14/2011 11:15 AM
> Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem
> Sent by: ewg-bounces at lists.openfabrics.org
>
>
> ------------------------------------------------------------------------
>
>
>
> Hal,
>
> Thank you very much for the support, I am the same person from the gmail
> account so I will respond through here.
>
> Attached is a picture of the switch serial number:
>
>
>
> I am indeed using OFED 1.5.4-rc3. My experiment consists of a 7 server
> system which I reboot via a script over and over again. Technically
> speaking the switch is not being powered off or physically rebooted. My
> server system is what is being rebooted. I am running OpenSM on one of
> the 7 servers. This means I'm constantly shutting down and rebooting
> OpenSM. I am running OpenSM on QNX but we have not had this problem
> until we decided to upgrade to this switch.
>
> The problem is that in roughly 1 out of 15 of these remote reboots, OpenSM
> stalls or times out because stats->qp0_mads_outstanding does not reach
> zero. Please excuse my ignorance, as I'm relatively new at this, but how
> do I verify whether it is a timeout problem or a stall?
>
> You also mentioned that you'd like to see the verbose output of OpenSM;
> however, when I run in verbose mode I don't see the problem. It appears
> that the verbose logging delays things long enough to give the switch
> time to do whatever it needs to do, so the problem does not occur. But
> this is the last output I see when the problem occurs:
>
>
>
> -------------------------------------------------
> OpenSM 3.3.12
> Command Line Arguments:
> Log file max size is 5 MBytes
> Log File: /tmp/opensm.log
> -------------------------------------------------
> OpenSM 3.3.12
>
> Entering DISCOVERING state
>
> Using default GUID 0x2c9020023277d
>
>
>
> The problem occurs in function osm_vl15intf.c -> vl15_poller in the else
> statement.
>
>     if (p_madw != (osm_madw_t *) cl_qlist_end(p_fifo)) {
>         OSM_LOG(p_vl->p_log, OSM_LOG_DEBUG,
>                 "Servicing p_madw = %p\n", p_madw);
>         if (osm_log_is_active(p_vl->p_log, OSM_LOG_FRAMES))
>             osm_dump_dr_smp(p_vl->p_log,
>                             osm_madw_get_smp_ptr(p_madw),
>                             OSM_LOG_FRAMES);
>
>         vl15_send_mad(p_vl, p_madw);
>     } else
>         /*
>            The VL15 FIFO is empty, so we have nothing left to do.
>          */
>         status = cl_event_wait_on(&p_vl->signal,
>                                   EVENT_NO_TIMEOUT, TRUE);
>
> Execution won't move forward from the cl_event_wait_on call in this code.
> There are other locations that it also won't move forward from, such as
> wait_for_pending_transactions in the do_sweep function, but I believe
> that to be a side effect of the problem I'm describing.
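
One way to tell a stall from a timeout is to bound the wait and look at what comes back. The sketch below is a stand-in built on plain pthreads, not complib's cl_event_wait_on: the event struct, the event_* helpers, and the wait_result values are hypothetical names introduced here. With a finite timeout the caller gets a distinct WAIT_TIMEOUT result and can log the outstanding counter each time it expires, whereas a wait with no timeout simply never returns if the signal never comes.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

enum wait_result { WAIT_SIGNALED, WAIT_TIMEOUT, WAIT_ERROR };

/* Hypothetical auto-reset event: a flag guarded by a mutex and condvar. */
struct event {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool signaled;
};

void event_init(struct event *e)
{
	pthread_mutex_init(&e->lock, NULL);
	pthread_cond_init(&e->cond, NULL);
	e->signaled = false;
}

void event_signal(struct event *e)
{
	pthread_mutex_lock(&e->lock);
	e->signaled = true;
	pthread_cond_signal(&e->cond);
	pthread_mutex_unlock(&e->lock);
}

/* Wait up to timeout_ms; report whether we were signaled or timed out. */
enum wait_result event_wait_ms(struct event *e, long timeout_ms)
{
	struct timespec deadline;
	enum wait_result res = WAIT_SIGNALED;
	int rc = 0;

	assert(timeout_ms >= 0);
	/* pthread_cond_timedwait takes an absolute CLOCK_REALTIME deadline. */
	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += timeout_ms / 1000;
	deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
	if (deadline.tv_nsec >= 1000000000L) {
		deadline.tv_sec++;
		deadline.tv_nsec -= 1000000000L;
	}

	pthread_mutex_lock(&e->lock);
	while (!e->signaled && rc == 0)	/* loop guards against spurious wakeups */
		rc = pthread_cond_timedwait(&e->cond, &e->lock, &deadline);
	if (e->signaled)
		e->signaled = false;	/* consume the signal (auto-reset) */
	else
		res = (rc == ETIMEDOUT) ? WAIT_TIMEOUT : WAIT_ERROR;
	pthread_mutex_unlock(&e->lock);
	return res;
}
```

If the bounded wait keeps returning WAIT_TIMEOUT while the outstanding count stays the same, that points to a stall; if the count eventually drops after a few expirations, it was a slow timeout path.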
>
> When you ask what my timeout is, I'm guessing you are referring to
> max_smps_timeout, which is used in the second while loop within
> vl15_poller? For this setting I am using the default, which is defined in
> osm_subnet.c as:
>
> p_opt->transaction_timeout = OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC;
> p_opt->transaction_retries = OSM_DEFAULT_RETRY_COUNT;
> p_opt->max_smps_timeout = 1000 * p_opt->transaction_timeout
> *p_opt->transaction_retries;
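
To make the units in that formula concrete, here is a small worked version. The helper name max_smps_timeout_us is hypothetical, and the inputs used in the example (200 ms and 3 retries) are an assumption about what the two OSM_DEFAULT_* macros expand to; check osm_base.h in your tree before relying on them. The factor of 1000 converts the millisecond transaction timeout to microseconds.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Same arithmetic as the quoted osm_subnet.c lines:
 * milliseconds * 1000 * retries = microseconds.
 */
uint32_t max_smps_timeout_us(uint32_t transaction_timeout_ms,
			     uint32_t transaction_retries)
{
	uint64_t us = 1000ULL * transaction_timeout_ms * transaction_retries;

	assert(us <= UINT32_MAX);	/* treated as a 32-bit value in this sketch */
	return (uint32_t)us;
}
```

Under the assumed inputs, max_smps_timeout_us(200, 3) gives 600000 microseconds, i.e. 0.6 seconds.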
>
> Would you explain to me what are the advantages or disadvantages of
> OSM_DEFAULT_SMP_MAX_ON_WIRE? Does this parameter change my bandwidth
> performance at all?
>
> I noticed that when using the default setting of 4 I get into the else
> of the above if statement when there are 4 qp0_mads_outstanding. I
> noticed that if I change OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 I don't get
> the failure I'm mentioning at all. Partly (I think) because I don't
> enter the else in the if statement until there is 1 qp0_mads_outstanding.
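
The interaction between the FIFO and the on-wire limit described above can be modeled with a toy simulation. This is deliberately not the OpenSM implementation: vl15_model, poller_step, response_arrived, and simulate are hypothetical names, and the model has no retry or timeout handling, which in the real code would normally release the slot of a request that gets no response. It only illustrates how a single response that never arrives leaves the outstanding count stuck at 1 once the FIFO has drained.

```c
#include <assert.h>

/* Toy model of a send window; not the OpenSM data structures. */
struct vl15_model {
	int fifo;		/* MADs still queued for sending */
	int on_wire;		/* MADs sent and awaiting a response */
	int max_on_wire;	/* window size, e.g. 4 or 1 */
};

/* One poller step: returns 1 if a MAD was sent, 0 if the poller must wait. */
int poller_step(struct vl15_model *m)
{
	assert(m->on_wire <= m->max_on_wire);
	if (m->fifo > 0 && m->on_wire < m->max_on_wire) {
		m->fifo--;
		m->on_wire++;
		return 1;
	}
	return 0;	/* FIFO empty or window full: block on the signal */
}

/* A response frees one slot; a lost response never gets here. */
void response_arrived(struct vl15_model *m)
{
	if (m->on_wire > 0)
		m->on_wire--;
}

/* Send `queued` MADs with `lost` responses dropped; return what stays outstanding. */
int simulate(int queued, int max_on_wire, int lost)
{
	struct vl15_model m = { queued, 0, max_on_wire };
	int stuck = 0;	/* requests whose response will never arrive */

	for (;;) {
		while (poller_step(&m))
			;
		if (m.on_wire == stuck)
			break;		/* nothing left that can complete */
		if (lost > 0) {
			lost--;
			stuck++;	/* this response is dropped */
		} else {
			response_arrived(&m);
		}
	}
	return m.on_wire;
}
```

In this model simulate(10, 4, 0) drains to 0 outstanding, while simulate(10, 4, 1) ends with 1 outstanding and an empty FIFO, which is the state where a wait with no timeout would block forever.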
>
> I hope this explains the problem well enough. It may be a timeout
> problem, but I'd like to understand why it is occurring.
> Thank you very much,
>
> Hector Abrach
>
> From: Hal Rosenstock <hal at dev.mellanox.co.il>
> To: Hector Abrach <HAbrach at TMRIUSA.COM>
> Cc: ewg at lists.openfabrics.org
> Date: 12/14/2011 08:03 AM
> Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem
>
>
>
> ------------------------------------------------------------------------
>
>
>
> Hi,
>
> On 12/13/2011 2:35 PM, Hector Abrach wrote:
>> Hello,
>>
>> I have a boot problem with OpenSM
>
> Are you saying the switch is booted rather than OpenSM ?
>
> What is the OpenSM running on and in what environment ?
>
>> the problem occurs rarely and
>> started to occur when we started using a new Mellanox MT1118X03342 switch.
>> The problem occurs during the discovery phase within state_mgr_sweep_hop_1.
>>
>> However, I discovered that the actual cause is that
>> qp0_mads_outstanding occasionally stalls at 1.
>
> Is it stuck or after timeout/retry does this get updated properly ?
>
>> Within file osm_vl15intf.c, function vl15_poller checks the
>> rfifo, and if the qlist still has items it calls vl15_send_mad,
>> which later on triggers the signal.
>> With the current default setting of 4 for OSM_DEFAULT_SMP_MAX_ON_WIRE I
>> noticed that the rfifo runs empty (the cl_qlist_end check) before
>> stats->qp0_mads_outstanding reaches zero. This causes a stall in
>> cl_event_wait_on. The rfifo always reaches 0 when there are 4
>> qp0_mads_outstanding; however, when it fails, it always fails when there is
>> 1 qp0_mads_outstanding.
>
> Is some (request) SMP that OpenSM sent timing out (not being responded to) ?
>
>> Have you seen this failure? By the way, I see this failure once every 15
>> reboots approximately.
>>
>> I discovered that changing OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 fixes the
>> problem.
>
> What do you mean exactly by fixes the problem ? I'm not sure I
> understand what the problem is yet.
>
> -- Hal
>
>> My guess is that there is a race condition when the switch sends 4 SMPs
>> in parallel. Also, this failure only appears to occur at reboot. Another
>> workaround, which is not acceptable, is adding a delay in the process:
>> with it the failure goes away, as if the switch needed more time to do
>> something.
>>
>> I would really appreciate your help and insight.
>> Thank you
>>
>> Hector Abrach
>> ______________________________________________________________________
>> This email has been scanned by the Symantec Email Security.cloud service.
>> For more information please visit _http://www.symanteccloud.com_
> <http://www.symanteccloud.com/>
>> ______________________________________________________________________
>>
>>
>> _______________________________________________
>> ewg mailing list
>> ewg at lists.openfabrics.org
>> _http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg_
>
>
> [attachment "2011-12-13_10-18-25_182.jpg" deleted by Hector Abrach/Software/TMRU]