[ewg] OpenSM 1.5.4 Boot Problem

Thu Dec 15 09:49:17 PST 2011

Hal,

Thank you for the response. To address your questions:

> So the switch stays up and the servers (including the one OpenSM is on)
> is rebooted, right ?

Right.

> Do the servers run QNX rather than Linux ? Are you saying all OpenSM
> code is the same as stock OpenSM 3.3.12 (OFED 1.5.4-rc3) ?

Yes, all 7 servers run QNX. The OpenSM code is 99% the same; the only
changes I had to make were to some #define libraries.
The big changes were made for the driver, not so much OpenSM. I'm using
IBNet 1.3. OpenSM always runs on the same server; the others don't run
it.

> Is the topology the 7 servers and the 1 switch and if you use other
> switches you don't see this issue ?

That's correct, the topology is 7 servers and 1 switch. We typically use
fewer servers (4) for our application, but the problem is more easily
reproducible with more servers, so we have a 7 server setup with 1 switch.
We don't have a great selection of switches but I know our previous switch 
did not cause this problem. Our intention is to go to production with this 
new switch but we can't release until we find an acceptable solution.

> I can see the responses but not the requests. What verbosity level did
> you use ?

I ran OpenSM with level -D 0x06 (info, verbose). I don't want to use
-D 0xFF because I know that setting makes the problem go away.
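
For reference, -D takes a bit mask rather than a single level. Below is a
minimal sketch that decodes such a mask into level names. It assumes the
bit assignments listed for -D in the opensm man page (0x01 ERROR through
0x40 ROUTING); describe_log_flags() is a hypothetical helper, not part of
OpenSM. The FRAMES bit is the one the vl15_poller code quoted later in this
thread tests before calling osm_dump_dr_smp(), so it is the likely
candidate for seeing the requests being sent.

```c
#include <assert.h>
#include <stdio.h>

/* Bit values as listed for -D in the opensm man page (assumed here). */
struct log_flag {
    unsigned bit;
    const char *name;
};

static const struct log_flag log_flags[] = {
    { 0x01, "ERROR"   },
    { 0x02, "INFO"    },
    { 0x04, "VERBOSE" },
    { 0x08, "DEBUG"   },
    { 0x10, "FUNCS"   },
    { 0x20, "FRAMES"  },
    { 0x40, "ROUTING" },
};

/*
 * Hypothetical helper, not part of OpenSM.
 * Writes the space-separated names of the bits set in 'flags' into 'buf'
 * and returns how many names were written in full.
 */
static int describe_log_flags(unsigned flags, char *buf, size_t len)
{
    size_t used = 0;
    int count = 0;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof(log_flags) / sizeof(log_flags[0]); i++) {
        if (!(flags & log_flags[i].bit))
            continue;
        int n = snprintf(buf + used, len - used, "%s%s",
                         count ? " " : "", log_flags[i].name);
        if (n < 0 || (size_t)n >= len - used)
            break;  /* buffer full; the last name may be cut short */
        used += (size_t)n;
        count++;
    }
    return count;
}
```

Under these assignments, describe_log_flags(0x06, ...) produces
"INFO VERBOSE", and adding 0x20 to the mask (for example -D 0x26) would
add FRAMES without enabling the much noisier DEBUG and FUNCS levels.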

-------------------------

In summary:
1.      The system gets stuck in osm_vendor_ibumad.c ->
umad_receiver() -> "for(;;)" but keeps running properly in
main.c -> osm_manager_loop().
2.      If I use -D 0xFF the problem is completely fixed.
3.      If I use an OSM_DEFAULT_SMP_MAX_ON_WIRE of 1 instead of any other
value the problem is completely fixed.
4.      The failure always occurs with qp0_mads_outstanding of 1 remaining.

What do you think could be wrong?
Do you think the driver could be the problem?
What debug command should I use to see the sent requests?
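
The interaction between the VL15 FIFO, qp0_mads_outstanding and
OSM_DEFAULT_SMP_MAX_ON_WIRE described in this thread can be pictured with
a toy model. This is a single-threaded sketch, not OpenSM code: struct
vl15_model, poller_step(), on_response() and run_model() are hypothetical
names, and the lost response is an assumption (one of the possibilities
raised in the thread), not a confirmed cause.

```c
#include <assert.h>

/* Simplified stand-in for the state discussed in this thread. */
struct vl15_model {
    int fifo_len;     /* requests still queued in the FIFO */
    int outstanding;  /* sent but unanswered (cf. qp0_mads_outstanding) */
    int max_on_wire;  /* cf. OSM_DEFAULT_SMP_MAX_ON_WIRE */
};

/* Send queued requests until the FIFO is empty or the limit is reached. */
static int poller_step(struct vl15_model *m)
{
    int sent = 0;

    while (m->fifo_len > 0 && m->outstanding < m->max_on_wire) {
        m->fifo_len--;
        m->outstanding++;
        sent++;
    }
    return sent;
}

/* A response arrived for one outstanding request. */
static void on_response(struct vl15_model *m)
{
    assert(m->outstanding > 0);
    m->outstanding--;
}

/*
 * Queue 'requests' SMPs and let the first 'lost' of them go unanswered.
 * Returns the number of requests still outstanding at the end.
 */
static int run_model(int max_on_wire, int requests, int lost)
{
    struct vl15_model m = { requests, 0, max_on_wire };
    int to_lose = lost;

    while (m.fifo_len > 0) {
        int sent = poller_step(&m);

        if (sent == 0)
            break;  /* window is full of unanswered requests: stalled */
        for (int i = 0; i < sent; i++) {
            if (to_lose > 0)
                to_lose--;        /* this response never arrives */
            else
                on_response(&m);
        }
    }
    return m.outstanding;
}
```

With run_model(4, 10, 1) the FIFO drains but one request stays
outstanding, which matches the symptom of an empty FIFO with
qp0_mads_outstanding stuck at 1 while the poller waits with no timeout.
The model does not explain why a limit of 1 avoids the failure on the
real system.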

Thank you

Hector Abrach

From: Hal Rosenstock <hal at dev.mellanox.co.il>
To: Hector Abrach <HAbrach at TMRIUSA.COM>
Cc: ewg at lists.openfabrics.org
Date: 12/14/2011 08:23 PM
Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem

Hector,

On 12/14/2011 1:41 PM, Hector Abrach wrote:
> Hal,
> 
> Sorry for the multiple emails, but I was thinking that it may be a
> "freeze/stall" rather than a timeout. One reason is that it doesn't
> send an error message; it is as if the log completely dies.

So nothing interesting in the log...

> However, in
> file osm_vendor_ibumad.c under function umad_receiver there is an
> infinite loop "for(;;)" which seems to die when I get to that previously
> discussed vl15_poller. I checked to see if it breaks out of the loop but
> it doesn't seem to. 

It never breaks out of that loop except when OpenSM is shutting down.
That's the basic receive loop.

-- Hal

> I'm not sure if this may be an additional hint.
> Thank you
> 
> Hector Abrach
> 
> 
> From:                  Hector Abrach <HAbrach at TMRIUSA.COM>
> To:            Hal Rosenstock <hal at dev.mellanox.co.il>
> Cc:            ewg at lists.openfabrics.org
> Date:                  12/14/2011 11:15 AM
> Subject:               Re: [ewg] OpenSM 1.5.4 Boot Problem
> Sent by:               ewg-bounces at lists.openfabrics.org
> 
> 
> ------------------------------------------------------------------------
> 
> 
> 
> Hal,
> 
> Thank you very much for the support, I am the same person from the gmail
> account so I will respond through here.
> 
> Attached is a picture of the switch serial number:
> 
> 
> 
> I am indeed using OFED 1.5.4-rc3. My experiment consists of a 7 server
> system which I reboot via a script over and over again. Technically
> speaking the switch is not being powered off or physically rebooted. My
> server system is what is being rebooted. I am running OpenSM on one of
> the 7 servers. This means I'm constantly shutting down and rebooting
> OpenSM. I am running OpenSM on QNX but we have not had this problem
> until we decided to upgrade to this switch.
> 
> The problem is that in roughly 1 out of every 15 of these remote
> reboots, OpenSM stalls or times out because
> stats->qp0_mads_outstanding did not reach zero. Please excuse my
> ignorance, as I'm relatively new at this, but how do I verify whether
> it is a timeout problem or a stall?
> 
> You also mentioned that you'd like to see the verbose output of OpenSM;
> however, when I run in verbose mode I don't see the problem. It appears
> as if the verbose output delays things long enough to give the switch
> time to do whatever it needs to do, so the problem does not occur. But
> this is the last output I see when the problem occurs:
> 
> 
> 
> -------------------------------------------------
> OpenSM 3.3.12
> Command Line Arguments:
> Log file max size is 5 MBytes
> Log File: /tmp/opensm.log
> -------------------------------------------------
> OpenSM 3.3.12
> 
> Entering DISCOVERING state
> 
> Using default GUID 0x2c9020023277d
> 
> 
> 
> The problem occurs in function osm_vl15intf.c -> vl15_poller in the else
> statement.
> 
> if (p_madw != (osm_madw_t *) cl_qlist_end(p_fifo)) {
>         OSM_LOG(p_vl->p_log, OSM_LOG_DEBUG,
>                 "Servicing p_madw = %p\n", p_madw);
>         if (osm_log_is_active(p_vl->p_log, OSM_LOG_FRAMES))
>                 osm_dump_dr_smp(p_vl->p_log,
>                                 osm_madw_get_smp_ptr(p_madw),
>                                 OSM_LOG_FRAMES);
>
>         vl15_send_mad(p_vl, p_madw);
> } else
>         /*
>          * The VL15 FIFO is empty, so we have nothing left to do.
>          */
>         status = cl_event_wait_on(&p_vl->signal,
>                                   EVENT_NO_TIMEOUT, TRUE);
> 
> It won't move forward from the cl_event_wait_on in this line of code.
> There are other locations it won't move forward from either, such as
> wait_for_pending_transactions in the do_sweep function, but I believe
> those to be a side effect of the problem I'm describing.
> 
> When you ask what my timeout is, I'm guessing you refer to
> max_smps_timeout, which is used in the second while loop within
> vl15_poller? For this setting I am using the default, which is defined
> in osm_subnet.c as:
> 
> p_opt->transaction_timeout = OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC;
> p_opt->transaction_retries = OSM_DEFAULT_RETRY_COUNT;
> p_opt->max_smps_timeout = 1000 * p_opt->transaction_timeout *
>         p_opt->transaction_retries;
> 
> Would you explain the advantages or disadvantages of changing
> OSM_DEFAULT_SMP_MAX_ON_WIRE? Does this parameter change my bandwidth
> performance at all?
> 
> I noticed that when using the default setting of 4 I get into the else
> of the above if statement when there are 4 qp0_mads_outstanding. I
> noticed that if I change OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 I don't get
> the failure I'm mentioning at all. Partly (I think) because I don't
> enter the else in the if statement until there is 1
> qp0_mads_outstanding.
> 
> I hope this explains the problem well enough and it may be a time out
> problem but I'd like to understand why the problem is occurring.
> Thank you very much,
> 
> Hector Abrach
> 
> From:          Hal Rosenstock <hal at dev.mellanox.co.il>
> To:            Hector Abrach <HAbrach at TMRIUSA.COM>
> Cc:            ewg at lists.openfabrics.org
> Date:          12/14/2011 08:03 AM
> Subject:               Re: [ewg] OpenSM 1.5.4 Boot Problem
> 
> 
> 
> ------------------------------------------------------------------------
> 
> 
> 
> Hi,
> 
> On 12/13/2011 2:35 PM, Hector Abrach wrote:
>> Hello,
>>
>> I have a boot problem with OpenSM
> 
> Are you saying the switch is booted rather than OpenSM ?
> 
> What is the OpenSM running on and in what environment ?
> 
>> the problem occurs seldom and
>> started to occur when we started using a new Mellanox MT1118X03342
>> switch.
>> The problem occurs during the discovery phase within
>> state_mgr_sweep_hop_1.
>>
>> However, I discovered that the actual cause is that
>> qp0_mads_outstanding occasionally stalls at 1.
> 
> Is it stuck or after timeout/retry does this get updated properly ?
> 
>> Within file osm_vl15intf.c, function vl15_poller checks the rfifo,
>> and if the qlist still has items it calls vl15_send_mad, which later
>> on triggers the signal.
>> With the current default setting of 4 for OSM_DEFAULT_SMP_MAX_ON_WIRE I
>> noticed that the rfifo empties (cl_qlist_end is reached) before
>> stats->qp0_mads_outstanding reaches zero. This causes a stall in
>> cl_event_wait_on. The rfifo always reaches 0 when there are 4
>> qp0_mads_outstanding; however, when it fails it always fails when
>> there is 1 qp0_mads_outstanding.
> 
> Is some (request) SMP that OpenSM sent timing out (not being responded
> to) ?
> 
>> Have you seen this failure? By the way, I see this failure once every
>> 15 reboots approximately.
>>
>> I discovered that changing OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 fixes the
>> problem.
> 
> What do you mean exactly by fixes the problem ? I'm not sure I
> understand what the problem is yet.
> 
> -- Hal
> 
>> My guess is that there is a race condition when the switch sends 4 SMPs
>> in parallel. Also, this failure only appears to occur at reboot.
>> Another workaround, which is not acceptable, is adding a delay in the
>> process; with the delay the failure goes away. It is as if the switch
>> needed more time to do something.
>>
>> I would really appreciate your help and insight.
>> Thank you
>>
>> Hector Abrach
>> ______________________________________________________________________
>> This email has been scanned by the Symantec Email Security.cloud service.
>> For more information please visit http://www.symanteccloud.com/
>> ______________________________________________________________________
>>
>>
>> _______________________________________________
>> ewg mailing list
>> ewg at lists.openfabrics.org
>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
> 
> 
> [attachment "2011-12-13_10-18-25_182.jpg" deleted by Hector Abrach/Software/TMRU]