[ewg] OpenSM 1.5.4 Boot Problem

Wed Dec 14 18:23:09 PST 2011

Hector,

On 12/14/2011 1:41 PM, Hector Abrach wrote:
> Hal,
> 
> Sorry for the multiple emails, but I was thinking that it may be a
> "freeze/stall" rather than a timeout. One reason is that it doesn't
> send an error message; it is as if the log completely dies.

So nothing interesting in the log...

> However, in
> file osm_vendor_ibumad.c, in function umad_receiver, there is an
> infinite loop "for (;;)" which seems to die when I reach the previously
> discussed vl15_poller. I checked to see if it breaks out of the loop but
> it doesn't seem to.

It never breaks out of that loop except when OpenSM is shutting down.
That's the basic receive loop.

-- Hal

> I'm not sure if this may be an additional hint.
> Thank you
> 
> Hector Abrach
> 
> 
> From: 	Hector Abrach <HAbrach at TMRIUSA.COM>
> To: 	Hal Rosenstock <hal at dev.mellanox.co.il>
> Cc: 	ewg at lists.openfabrics.org
> Date: 	12/14/2011 11:15 AM
> Subject: 	Re: [ewg] OpenSM 1.5.4 Boot Problem
> Sent by: 	ewg-bounces at lists.openfabrics.org
> 
> 
> ------------------------------------------------------------------------
> 
> 
> 
> Hal,
> 
> Thank you very much for the support, I am the same person from the gmail
> account so I will respond through here.
> 
> Attached is a picture of the switch serial number:
> 
> 
> 
> I am indeed using OFED 1.5.4-rc3. My experiment consists of a 7 server
> system which I reboot via a script over and over again. Technically
> speaking the switch is not being powered off or physically rebooted. My
> server system is what is being rebooted. I am running OpenSM on one of
> the 7 servers. This means I'm constantly shutting down and rebooting
> OpenSM. I am running OpenSM on QNX but we have not had this problem
> until we decided to upgrade to this switch.
> 
> The problem is that in roughly 1 out of every 15 of these remote reboots,
> OpenSM stalls or times out because stats->qp0_mads_outstanding did not
> reach zero. Please excuse my ignorance, as I'm relatively new at this,
> but how do I verify whether it is a timeout problem or a stall?
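
One rough way to tell the two apart is to record when the outstanding
counter last changed and compare that age with the retry budget. The
sketch below is only an illustration: the enum, the function name, and
the idea of using timeout * retries as the budget are assumptions made
for the example and are not part of OpenSM.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical diagnostic states; none of these names exist in OpenSM. */
enum wait_state { WAIT_IDLE, WAIT_PENDING, WAIT_STALLED };

/*
 * Classify a wait on outstanding QP0 MADs.
 *   outstanding     - current value of the outstanding counter
 *   since_change_ms - time since the counter last changed
 *   timeout_ms      - per-transaction timeout
 *   retries         - retry count per transaction
 */
enum wait_state classify_wait(int32_t outstanding, uint64_t since_change_ms,
			      uint32_t timeout_ms, uint32_t retries)
{
	/* Assumed upper bound on how long one request may stay pending. */
	uint64_t budget_ms = (uint64_t)timeout_ms * retries;

	if (outstanding <= 0)
		return WAIT_IDLE;
	if (since_change_ms <= budget_ms)
		return WAIT_PENDING;	/* timeout/retry may still resolve it */
	return WAIT_STALLED;		/* nothing expired it: looks like a stall */
}
```

If the counter sits at a non-zero value well past the budget and no
timeout is ever logged, that points to a stall rather than a timeout.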
> 
> You also mentioned that you'd like to see the verbose output of OpenSM;
> however, when I run in verbose mode I don't see the problem. It appears
> that the verbose output slows things down enough to give the switch time
> to do whatever it needs to do, so the problem does not occur. But this
> is the last thing I see when the problem occurs:
> 
> 
> 
> -------------------------------------------------
> OpenSM 3.3.12
> Command Line Arguments:
> Log file max size is 5 MBytes
> Log File: /tmp/opensm.log
> -------------------------------------------------
> OpenSM 3.3.12
> 
> Entering DISCOVERING state
> 
> Using default GUID 0x2c9020023277d
> 
> 
> 
> The problem occurs in function vl15_poller (file osm_vl15intf.c), in the
> else branch.
> 
> if (p_madw != (osm_madw_t *) cl_qlist_end(p_fifo)) {
>         OSM_LOG(p_vl->p_log, OSM_LOG_DEBUG,
>                 "Servicing p_madw = %p\n", p_madw);
>         if (osm_log_is_active(p_vl->p_log, OSM_LOG_FRAMES))
>                 osm_dump_dr_smp(p_vl->p_log,
>                                 osm_madw_get_smp_ptr(p_madw),
>                                 OSM_LOG_FRAMES);
> 
>         vl15_send_mad(p_vl, p_madw);
> } else
>         /*
>          * The VL15 FIFO is empty, so we have nothing left to do.
>          */
>         status = cl_event_wait_on(&p_vl->signal,
>                                   EVENT_NO_TIMEOUT, TRUE);
> 
> It won't move forward from the cl_event_wait_on call in this code.
> There are other locations, such as wait_for_pending_transactions in the
> do_sweep function, that it also won't move forward from, but I believe
> those to be a side effect of the problem I'm describing.
> 
> When you ask what my timeout is, I'm guessing you refer to
> max_smps_timeout, which is used in the second while loop within
> vl15_poller? For this setting I am using the default, which is defined in
> osm_subnet.c as:
> 
> p_opt->transaction_timeout = OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC;
> p_opt->transaction_retries = OSM_DEFAULT_RETRY_COUNT;
> p_opt->max_smps_timeout = 1000 * p_opt->transaction_timeout *
>                           p_opt->transaction_retries;
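
To make the units concrete: the factor of 1000 turns the millisecond
transaction timeout into microseconds. A small worked sketch follows.
The function name is made up for the example, and the inputs of 200 ms
and 3 retries are what I believe the two defaults to be, so please
check osm_base.h in your tree before relying on them.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Same expression as the quoted osm_subnet.c lines:
 * 1000 * timeout (ms) * retries, which gives a value in microseconds.
 * Hypothetical helper, not an OpenSM function.
 */
uint32_t max_smps_timeout_usec(uint32_t timeout_ms, uint32_t retries)
{
	assert(retries == 0 || timeout_ms <= UINT32_MAX / 1000u / retries);
	return 1000u * timeout_ms * retries;
}
```

With those example inputs the result is 600000 usec, i.e. 0.6 seconds
before the poller would give up waiting for a slot on the wire.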
> 
> Would you explain the advantages and disadvantages of
> OSM_DEFAULT_SMP_MAX_ON_WIRE? Does this parameter change my bandwidth
> performance at all?
> 
> I noticed that with the default setting of 4, I get into the else
> branch of the above if statement when there are 4 qp0_mads_outstanding.
> If I change OSM_DEFAULT_SMP_MAX_ON_WIRE to 1, I don't get the failure
> I'm describing at all, partly (I think) because I don't enter the else
> branch until there is 1 qp0_mads_outstanding.
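
The effect of the window size can be pictured with a toy model. This is
a simplified sketch and not the OpenSM code: the struct, its fields,
and the function names are all invented for the illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a send window; names are illustrative, not OpenSM's. */
struct smp_window {
	int queued;		/* requests still waiting in the FIFO */
	int on_wire;		/* requests sent but not yet completed */
	int max_on_wire;	/* window size, e.g. 4 or 1 */
};

/* Send one queued request if the window has room; true if sent. */
bool window_try_send(struct smp_window *w)
{
	if (w->queued == 0 || w->on_wire >= w->max_on_wire)
		return false;
	w->queued--;
	w->on_wire++;
	return true;
}

/* A response (or a timeout) completes one request and frees a slot. */
void window_complete(struct smp_window *w)
{
	assert(w->on_wire > 0);
	w->on_wire--;
}

/* The sweep can only move on once nothing is queued or on the wire. */
bool window_drained(const struct smp_window *w)
{
	return w->queued == 0 && w->on_wire == 0;
}
```

In this model, a request that is never completed leaves on_wire stuck
at 1 and window_drained() never becomes true, which resembles the
symptom described. A window of 1 serializes the requests, so discovery
should take longer on a large fabric. As far as I understand it, the
setting bounds only management SMPs and not data traffic.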
> 
> I hope this explains the problem well enough. It may be a timeout
> problem, but I'd like to understand why the problem is occurring.
> Thank you very much,
> 
> Hector Abrach
> 
> From:	Hal Rosenstock <hal at dev.mellanox.co.il>
> To:	Hector Abrach <HAbrach at TMRIUSA.COM>
> Cc:	ewg at lists.openfabrics.org
> Date:	12/14/2011 08:03 AM
> Subject:	Re: [ewg] OpenSM 1.5.4 Boot Problem
> 
> 
> 
> ------------------------------------------------------------------------
> 
> 
> 
> Hi,
> 
> On 12/13/2011 2:35 PM, Hector Abrach wrote:
>> Hello,
>>
>> I have a boot problem with OpenSM
> 
> Are you saying the switch is booted rather than OpenSM ?
> 
> What is the OpenSM running on and in what environment ?
> 
>> the problem occurs rarely and
>> started to occur when we started using a new Mellanox MT1118X03342 switch.
>> The problem occurs during the discovery phase within
>> state_mgr_sweep_hop_1.
>>
>> However, I discovered that the actual cause is that
>> qp0_mads_outstanding occasionally stalls at 1.
> 
> Is it stuck or after timeout/retry does this get updated properly ?
> 
>> Within file osm_vl15intf.c, function vl15_poller checks the rfifo,
>> and if the qlist still has items it calls vl15_send_mad, which
>> later on triggers the signal.
>> With the current default setting of 4 for OSM_DEFAULT_SMP_MAX_ON_WIRE I
>> noticed that the rfifo becomes empty (cl_qlist_end is reached) before
>> stats->qp0_mads_outstanding reaches zero. This causes a stall in
>> cl_event_wait_on. The rfifo always reaches 0 when there are 4
>> qp0_mads_outstanding; however, when it fails, it always fails when
>> there is 1 qp0_mads_outstanding.
> 
> Is some (request) SMP that OpenSM sent timing out (not being responded to) ?
> 
>> Have you seen this failure? By the way, I see this failure once every 15
>> reboots approximately.
>>
>> I discovered that changing OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 fixes the
>> problem.
> 
> What do you mean exactly by fixes the problem ? I'm not sure I
> understand what the problem is yet.
> 
> -- Hal
> 
>> My guess is that there is a race condition when the switch sends 4 SMPs
>> in parallel. Also, this failure only appears to occur at reboot. Another
>> workaround, which is not acceptable, is to add a delay in the process;
>> the failure then goes away. It is as if the switch needed more time to
>> do something.
>>
>> I would really appreciate your help and insight.
>> Thank you
>>
>> Hector Abrach
>> ______________________________________________________________________
>> This email has been scanned by the Symantec Email Security.cloud service.
>> For more information please visit _http://www.symanteccloud.com_
> <http://www.symanteccloud.com/>
>> ______________________________________________________________________
>>
>>
>> _______________________________________________
>> ewg mailing list
>> ewg at lists.openfabrics.org
>> _http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg_
> 
> 
> [attachment "2011-12-13_10-18-25_182.jpg" deleted by Hector Abrach/Software/TMRU]