[ewg] OpenSM 1.5.4 Boot Problem

Hal Rosenstock hal at dev.mellanox.co.il
Thu Dec 15 11:05:31 PST 2011


Hector,

On 12/15/2011 12:49 PM, Hector Abrach wrote:
> Hal,
> 
> Thank you for the response. To address your questions:
> 
>> So the switch stays up and the servers (including the one OpenSM is on)
>> are rebooted, right ?
> 
> Right.
> 
>> Do the servers run QNX rather than Linux ? Are you saying all OpenSM
>> code is the same as stock OpenSM 3.3.12 (OFED 1.5.4-rc3) ?
> 
> Yes, all 7 servers run QNX. The OpenSM code is 99% the same, the only
> changes I had to make were made to some #define libraries.
> The big changes were made for the driver, not so much OpenSM. 

I would think there are also changes for porting of complib to QNX. Do
you use osm_vendor_ibumad.c as the OpenSM vendor layer ?

> I'm using IBNet 1.3. 

What's IBNet 1.3 ? I'm not familiar with that.

> OpenSM always runs on the same one server, the others don't
> run it.

Understood.

>> Is the topology the 7 servers and the 1 switch and if you use other
>> switches you don't see this issue ?
> 
> That's correct, the topology is 7 servers and 1 switch. We typically use
> fewer servers (4) for our application but the problem is more easily
> reproducible with more servers so we have a 7 server setup with 1
> switch. We don't have a great selection of switches but I know our
> previous switch did not cause this problem. Our intention is to go to
> production with this new switch but we can't release until we find an
> acceptable solution.
> 
>> I can see the responses but not the requests. What verbosity level did
>> you use ?
> 
> I ran OpenSM with level -D 0x06 (error, info, verbose). I don't want to
> do -D 0xFF because I know this fixes the problem for sure.

I think -D 0x23 (error, info, frames) would do the trick...

> -------------------------
> 
> In summary:
> 1. The system gets stuck in osm_vendor_ibumad.c -> umad_receiver() ->
>    "for(;;)" but keeps running properly in main.c -> osm_manager_loop().
> 2. If I use -D 0xFF the problem is completely fixed.
> 3. If I use an OSM_DEFAULT_SMP_MAX_ON_WIRE of 1 instead of any other
>    value the problem is completely fixed.
> 4. The failure always occurs with 1 qp0_mads_outstanding remaining.
>
> What do you think could be wrong?
> Do you think the driver could be the problem?

Yes. A likely suspect is the timeout/retry code for MAD transactions
(built into the kernel MAD layer in Linux), which may be missing from your
port. When the timeout/retries are exhausted, that code triggers a send
error (callback). Is that implemented ?

However, I don't have a good explanation for why you see this now and
not before with your other switches but maybe that's not important.

> What debug command should I use to see the sent requests?

See above.

-- Hal

> Thank you
> 
> Hector Abrach
> 
> 
> 
> 
> From: 	Hal Rosenstock <hal at dev.mellanox.co.il>
> To: 	Hector Abrach <HAbrach at TMRIUSA.COM>
> Cc: 	ewg at lists.openfabrics.org
> Date: 	12/14/2011 08:23 PM
> Subject: 	Re: [ewg] OpenSM 1.5.4 Boot Problem
> 
> 
> ------------------------------------------------------------------------
> 
> 
> 
> Hector,
> 
> On 12/14/2011 1:41 PM, Hector Abrach wrote:
>> Hal,
>>
>> Sorry for the multiple emails, but I was thinking that it may be a
>> "freeze/stall" rather than a timeout. One reason is that it doesn't
>> send an error message; it is as if the log completely dies.
> 
> So nothing interesting in the log...
> 
>> However, in
>> file osm_vendor_ibumad.c under function umad_receiver there is an
>> infinite loop "for(;;)" which seems to die when I get to that previously
>> discussed vl15_poller. I checked to see if it breaks out of the loop but
>> it doesn't seem to.
> 
> It never breaks out of that loop except when OpenSM is shutting down.
> That's the basic receive loop.
> 
> -- Hal
> 
>> I'm not sure if this may be an additional hint.
>> Thank you
>>
>> Hector Abrach
>>
>>
>> From:                  Hector Abrach <HAbrach at TMRIUSA.COM>
>> To:                  Hal Rosenstock <hal at dev.mellanox.co.il>
>> Cc:                  ewg at lists.openfabrics.org
>> Date:                  12/14/2011 11:15 AM
>> Subject:                  Re: [ewg] OpenSM 1.5.4 Boot Problem
>> Sent by:                  ewg-bounces at lists.openfabrics.org
>>
>>
>> ------------------------------------------------------------------------
>>
>>
>>
>> Hal,
>>
>> Thank you very much for the support, I am the same person from the gmail
>> account so I will respond through here.
>>
>> Attached is a picture of the switch serial number:
>>
>>
>>
>> I am indeed using OFED 1.5.4-rc3. My experiment consists of a 7 server
>> system which I reboot via a script over and over again. Technically
>> speaking the switch is not being powered off or physically rebooted. My
>> server system is what is being rebooted. I am running OpenSM on one of
>> the 7 servers. This means I'm constantly shutting down and rebooting
>> OpenSM. I am running OpenSM on QNX but we have not had this problem
>> until we decided to upgrade to this switch.
>>
>> The problem is that in roughly 1 out of every 15 of these remote reboots
>> OpenSM stalls or times out because stats->qp0_mads_outstanding did not
>> reach zero. Please excuse my ignorance as I'm relatively new at this, but
>> how do I verify whether it is a timeout problem vs. a stall?
>>
>> You also mentioned that you'd like to see the verbose output of OpenSM;
>> however, when I run in verbose mode I don't see the problem. It appears
>> as if the verbose output delays things long enough to give the switch
>> time to do whatever it needs to do, so the problem does not occur. But
>> this is the last output I see when the problem occurs:
>>
>>
>>
>> -------------------------------------------------
>> OpenSM 3.3.12
>> Command Line Arguments:
>> Log file max size is 5 MBytes
>> Log File: /tmp/opensm.log
>> -------------------------------------------------
>> OpenSM 3.3.12
>>
>> Entering DISCOVERING state
>>
>> Using default GUID 0x2c9020023277d
>>
>>
>>
>> The problem occurs in function osm_vl15intf.c -> vl15_poller in the else
>> statement.
>>
>> if (p_madw != (osm_madw_t *) cl_qlist_end(p_fifo)) {
>>        OSM_LOG(p_vl->p_log, OSM_LOG_DEBUG,
>>                "Servicing p_madw = %p\n", p_madw);
>>        if (osm_log_is_active(p_vl->p_log, OSM_LOG_FRAMES))
>>                osm_dump_dr_smp(p_vl->p_log,
>>                                osm_madw_get_smp_ptr(p_madw),
>>                                OSM_LOG_FRAMES);
>>
>>        vl15_send_mad(p_vl, p_madw);
>> } else
>>        /*
>>         * The VL15 FIFO is empty, so we have nothing left to do.
>>         */
>>        status = cl_event_wait_on(&p_vl->signal,
>>                                  EVENT_NO_TIMEOUT, TRUE);
>>
>> It won't move forward from the cl_event_wait_on in this line of code.
>> There are other locations, such as wait_for_pending_transactions in the
>> do_sweep function, that it won't move forward from either, but I believe
>> this to be a side effect of the problem I'm mentioning.
>>
>> When you ask what my timeout is, I'm guessing you refer to
>> max_smps_timeout which is used in the second while loop within
>> vl15_poller? For this setting I am using the default which is defined in
>> osm_subnet.c as:
>>
>> p_opt->transaction_timeout = OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC;
>> p_opt->transaction_retries = OSM_DEFAULT_RETRY_COUNT;
>> p_opt->max_smps_timeout = 1000 * p_opt->transaction_timeout *
>>                           p_opt->transaction_retries;
>>
>> Would you explain to me what are the advantages or disadvantages of
>> OSM_DEFAULT_SMP_MAX_ON_WIRE? Does this parameter change my bandwidth
>> performance at all?
>>
>> I noticed that when using the default setting of 4 I get into the else
>> of the above if statement when there are 4 qp0_mads_outstanding. I
>> noticed that if I change OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 I don't get
>> the failure I'm mentioning at all. Partly (I think) because I don't
>> enter the else in the if statement until there is 1 qp0_mads_outstanding.
>>
>> I hope this explains the problem well enough and it may be a time out
>> problem but I'd like to understand why the problem is occurring.
>> Thank you very much,
>>
>> Hector Abrach
>>
>> From:                 Hal Rosenstock <hal at dev.mellanox.co.il>
>> To:                 Hector Abrach <HAbrach at TMRIUSA.COM>
>> Cc:                 ewg at lists.openfabrics.org
>> Date:                 12/14/2011 08:03 AM
>> Subject:                 Re: [ewg] OpenSM 1.5.4 Boot Problem
>>
>>
>>
>> ------------------------------------------------------------------------
>>
>>
>>
>> Hi,
>>
>> On 12/13/2011 2:35 PM, Hector Abrach wrote:
>>> Hello,
>>>
>>> I have a boot problem with OpenSM
>>
>> Are you saying the switch is booted rather than OpenSM ?
>>
>> What is the OpenSM running on and in what environment ?
>>
>>> the problem occurs seldom and
>>> started to occur when we started using a new Mellanox MT1118X03342 switch.
>>> The problem occurs during the discovery phase within
>>> state_mgr_sweep_hop_1.
>>>
>>> However, I discovered that the actual cause is that
>>> qp0_mads_outstanding occasionally stalls at 1.
>>
>> Is it stuck or after timeout/retry does this get updated properly ?
>>
>>> Within file osm_vl15intf.c in function vl15_poller it checks the
>>> rfifo and if the qlist still has items it calls function vl15_send_mad
>>> which later on triggers the signal.
>>> With the current default setting of 4 for OSM_DEFAULT_SMP_MAX_ON_WIRE I
>>> noticed that cl_qlist_end reaches zero before
>>> stats->qp0_mads_outstanding does. This causes a stall in
>>> cl_event_wait_on. The rfifo always reaches 0 when there are 4
>>> qp0_mads_outstanding; however, when it fails it always fails when there
>>> is 1 qp0_mads_outstanding.
>>
>> Is some (request) SMP that OpenSM sent timing out (not being responded
>> to) ?
>>
>>> Have you seen this failure? By the way, I see this failure once every 15
>>> reboots approximately.
>>>
>>> I discovered that changing OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 fixes the
>>> problem.
>>
>> What do you mean exactly by fixes the problem ? I'm not sure I
>> understand what the problem is yet.
>>
>> -- Hal
>>
>>> My guess is that there is a race condition when the switch sends 4 SMPs
>>> in parallel. Also, this failure only appears to occur at reboot. Another
>>> workaround, which is not acceptable, is adding a delay in the process:
>>> the failure then goes away. It is as if the switch needed more time to
>>> do something.
>>>
>>> I would really appreciate your help and insight.
>>> Thank you
>>>
>>> Hector Abrach
>>> ______________________________________________________________________
>>> This email has been scanned by the Symantec Email Security.cloud service.
>>> For more information please visit _http://www.symanteccloud.com_
>> <http://www.symanteccloud.com/>
>>> ______________________________________________________________________
>>>
>>>
>>> _______________________________________________
>>> ewg mailing list
>>> ewg at lists.openfabrics.org
>>> _http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg_
>>
>>
>> [attachment "2011-12-13_10-18-25_182.jpg" deleted by Hector Abrach/Software/TMRU]