[ewg] OpenSM 1.5.4 Boot Problem

Hal Rosenstock hal at dev.mellanox.co.il
Fri Dec 16 09:11:48 PST 2011


Hector,

On 12/16/2011 11:59 AM, Hector Abrach wrote:
> Hal,
> 
>> Is timeout/retry/send error support implemented in your QNX
>> implementation ? That would explain why the SM appears to stop...
> 
> Based on the inherent nature of the QNX kernel, I don't believe we have
> timeout/retry/send error support on it. This may be the reason I see the
> bootup freeze. If it is, I may have to implement this somehow.

I think that's the OpenSM side of the failure: a transaction that should
time out never does, so the MAD accounting is wrong, etc. It breaks the
fundamental assumption that every request ends in a response or a timeout.

There may also be some issue with the SMA implementation on your QNX
nodes which is the root cause. Of course, SMPs are unreliable so
timeout/retries can be needed...
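
To make the accounting point concrete, here is a minimal Python sketch. It
is a toy model, not OpenSM code: the class SmpAccounting, its fields, and
the function run are hypothetical names chosen for illustration. It assumes
four requests are sent, the response to one of them is lost, and the only
way to retire the lost request is a timeout reported by the transport.

```python
from dataclasses import dataclass, field


@dataclass
class SmpAccounting:
    """Toy model of an SM's outstanding-SMP counter (not OpenSM code)."""
    timeout_supported: bool
    outstanding: int = 0
    pending: dict = field(default_factory=dict)  # transaction id -> send time

    def send(self, tid, now):
        # Every request put on the wire bumps the outstanding counter.
        self.pending[tid] = now
        self.outstanding += 1

    def on_response(self, tid):
        # A matching response retires the request.
        if self.pending.pop(tid, None) is not None:
            self.outstanding -= 1

    def expire(self, now, timeout):
        # A timeout also retires the request, but only if the transport
        # layer actually reports timeouts (the assumption under discussion).
        if not self.timeout_supported:
            return []
        expired = [t for t, sent in self.pending.items() if now - sent >= timeout]
        for t in expired:
            del self.pending[t]
            self.outstanding -= 1
        return expired


def run(timeout_supported):
    acct = SmpAccounting(timeout_supported)
    for tid in range(4):          # four requests on the wire
        acct.send(tid, now=0)
    for tid in (0, 1, 3):         # the response to tid 2 is lost
        acct.on_response(tid)
    acct.expire(now=10, timeout=5)
    return acct.outstanding
```

In this model run(True) ends with no outstanding requests, while run(False)
leaves the counter stuck at 1, so anything waiting for it to reach zero
would wait forever.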

> However, for the time being at least, I believe that setting
> OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 will be an acceptable solution as it
> works reliably. But it would be nice to know why it freezes anyway;
> maybe it is because of the above.
> 
> Thus far I've been unable to reproduce the failure with debug flag
> -D 0x23, but I'll keep trying.

That slows things down enough to make it work, as does allowing only 1
outstanding SMP. It appears that when SMPs are pipelined, some get dropped...
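
As a rough illustration of why limiting the SM to 1 outstanding SMP could
hide such drops, here is a toy Python model. Everything in it is an
assumption made for illustration: the function deliver, the parameter
responder_slots, and the idea that the responder can absorb only a fixed
number of requests per burst. Nothing in the logs establishes that this is
what actually happens on the QNX nodes or the switch.

```python
def deliver(requests, max_on_wire, responder_slots):
    """Toy burst-loss model (hypothetical, not OpenSM or SMA code).

    Requests are sent in bursts of at most `max_on_wire` (must be >= 1).
    The responder is assumed to answer only the first `responder_slots`
    requests of each burst and to silently drop the rest.
    Real pipelining uses a sliding window; fixed bursts keep the model small.
    """
    answered, dropped = [], []
    for start in range(0, len(requests), max_on_wire):
        burst = requests[start:start + max_on_wire]
        answered.extend(burst[:responder_slots])
        dropped.extend(burst[responder_slots:])
    return answered, dropped


# With 4 on the wire and a responder that keeps up with only 3 per burst,
# one request in every full burst is lost.
pipelined = deliver(list(range(8)), max_on_wire=4, responder_slots=3)

# With 1 on the wire the same responder never sees more than it can handle.
serialized = deliver(list(range(8)), max_on_wire=1, responder_slots=3)
```

Under these assumptions the pipelined run loses requests 3 and 7, and the
serialized run loses none. Combined with missing timeout handling, each
lost request would leave the outstanding count permanently above zero.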

-- Hal

> Thank you
> 
> Hector Abrach
> 
> 
> From: 	Hal Rosenstock <hal at dev.mellanox.co.il>
> To: 	Hector Abrach <HAbrach at TMRIUSA.COM>
> Date: 	12/15/2011 01:21 PM
> Subject: 	Re: [ewg] OpenSM 1.5.4 Boot Problem
> 
> 
> ------------------------------------------------------------------------
> 
> 
> 
> On 12/15/2011 1:57 PM, Hector Abrach wrote:
>> Hal,
>>
>> I managed to get it to fail with debug flag -D 0x08. Attached is
>> the log file.
>> I'll dig deeper; it seems it may be pkey related...
> 
> Yes, I saw signs of that last night from the log you sent where it
> stopped on the pkey tables on the CAs but I wasn't 100% sure whether it
> was that or not. I didn't check how many pairs of the pkey tables you
> got back here to validate whether every port responded with the proper
> number of pkey table blocks.
> 
> Is timeout/retry/send error support implemented in your QNX
> implementation ? That would explain why the SM appears to stop...
> 
> -- Hal
> 
>> Once again thank you for your support.
>>
>>
>>
>> Hector Abrach
>>
>>
>> From:                  Hal Rosenstock <hal at dev.mellanox.co.il>
>> To:                  Hector Abrach <HAbrach at TMRIUSA.COM>
>> Date:                  12/14/2011 08:29 PM
>> Subject:                  Re: [ewg] OpenSM 1.5.4 Boot Problem
>>
>>
>> ------------------------------------------------------------------------
>>
>>
>>
>> Hector,
>>
>> On 12/14/2011 5:49 PM, Hector Abrach wrote:
>>> Hal,
>>>
>>> I got the system to fail with verbose enabled after 25 reboots. Please
>>> find attached the log file.
>>>
>>
>> I can see the responses but not the requests. What verbosity level did
>> you use ?
>>
>>> I was reading that OSM_DEFAULT_SMP_MAX_ON_WIRE is used to pipeline the
>>> boot process in multi-switch systems and make the boot process faster,
>>> correct?
>>
>> It's multinode not just multiswitch and this configuration is 8 nodes (1
>> switch + 7 CAs). It's not boot process but discovery/initialization
>> which is pipelined.
>>
>>> Since my system is a single-switch system, I do not need
>>> 4 but 1 for OSM_DEFAULT_SMP_MAX_ON_WIRE.
>>
>> You can run with 1 if that suits your needs. It's just not the default.
>>
>>> Maybe the pipelined SMPs are confusing the switch somehow.
>>
>> Even if it did, there's nothing that should "stop" the SM from
>> working/proceeding. From the log, it looks like the SM does get stuck.
>>
>> -- Hal
>>
>>> Thanks again for your help.
>>>
>>> Hector Abrach
>>>
>>>
>>> From:                  Hal Rosenstock <hal at dev.mellanox.co.il>
>>> To:                  Hector Abrach <HAbrach at TMRIUSA.COM>
>>> Cc:                  ewg at lists.openfabrics.org
>>> Date:                  12/14/2011 08:03 AM
>>> Subject:                  Re: [ewg] OpenSM 1.5.4 Boot Problem
>>>
>>>
>>> ------------------------------------------------------------------------
>>>
>>>
>>>
>>> Hi,
>>>
>>> On 12/13/2011 2:35 PM, Hector Abrach wrote:
>>>> Hello,
>>>>
>>>> I have a boot problem with OpenSM
>>>
>>> Are you saying the switch is booted rather than OpenSM ?
>>>
>>> What is the OpenSM running on and in what environment ?
>>>
>>>> the problem occurs seldom and
>>>> started to occur when we started using a new Mellanox MT1118X03342
>>>> switch.
>>>> The problem occurs during the discovery phase within
>>>> state_mgr_sweep_hop_1.
>>>>
>>>> However, I discovered that the actual cause is that
>>>> qp0_mads_outstanding occasionally stalls at 1.
>>>
>>> Is it stuck or after timeout/retry does this get updated properly ?
>>>
>>>> In file osm_vl15intf.c, function vl15_poller checks the
>>>> rfifo and, if the qlist still has items, calls vl15_send_mad,
>>>> which later on triggers the signal.
>>>> With the current default setting of 4 for OSM_DEFAULT_SMP_MAX_ON_WIRE I
>>>> noticed that cl_qlist_end reaches zero before
>>>> stats->qp0_mads_outstanding does. This causes a stall in
>>>> cl_event_wait_on. The rfifo always reaches 0 when there are 4
>>>> qp0_mads_outstanding; however, when it fails, it always fails when
>>>> qp0_mads_outstanding is 1.
>>>
>>> Is some (request) SMP that OpenSM sent timing out (not being responded
>>> to) ?
>>>
>>>> Have you seen this failure? By the way, I see this failure once every 15
>>>> reboots approximately.
>>>>
>>>> I discovered that changing OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 fixes the
>>>> problem.
>>>
>>> What do you mean exactly by fixes the problem ? I'm not sure I
>>> understand what the problem is yet.
>>>
>>> -- Hal
>>>
>>>> My guess is that there is a race condition when the switch sends 4 SMPs
>>>> in parallel. Also, this failure only appears to occur at reboot. Another
>>>> workaround, which is not acceptable, is adding a delay in the process;
>>>> the failure then goes away. It is as if the switch needed more time to
>>>> do something.
>>>>
>>>> I would really appreciate your help and insight.
>>>> Thank you
>>>>
>>>> Hector Abrach
>>>> ______________________________________________________________________
>>>> This email has been scanned by the Symantec Email Security.cloud service.
>>>> For more information please visit http://www.symanteccloud.com
>>>> ______________________________________________________________________
>>>>
>>>>
>>>> _______________________________________________
>>>> ewg mailing list
>>>> ewg at lists.openfabrics.org
>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
