[ewg] OpenSM 1.5.4 Boot Problem

Thu Dec 22 05:24:00 PST 2011

Hector,

On 12/21/2011 2:16 PM, Hector Abrach wrote:
> Hal,
> 
> When an SMP times out how does the Linux kernel know it timed out?

The kernel MAD module maintains a timer list for outstanding
transactions and if no response is received before the timer expires, it
knows that transaction timed out. If the matching response is received,
it removes that transaction from that list. See timeout_sends in
drivers/infiniband/core/mad.c
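
To illustrate the bookkeeping described above, here is a minimal user-space sketch in C. It is not the kernel code: the names (pending_txn, txn_sent, txn_response, txn_expire) and the fixed-size array are assumptions made up for illustration, and the real MAD layer uses its own lists and delayed work.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_OUTSTANDING 8

/* One request that has been sent and is awaiting a response
 * (hypothetical model, not a kernel structure). */
struct pending_txn {
    uint64_t tid;      /* transaction ID used to match the response */
    uint64_t deadline; /* absolute time at which the request times out */
    int in_use;
};

struct txn_list {
    struct pending_txn slot[MAX_OUTSTANDING];
};

/* Record a sent request. Returns 0 on success, -1 if the list is full. */
static int txn_sent(struct txn_list *l, uint64_t tid,
                    uint64_t now, uint64_t timeout)
{
    assert(l != NULL);
    for (size_t i = 0; i < MAX_OUTSTANDING; i++) {
        if (!l->slot[i].in_use) {
            l->slot[i].tid = tid;
            l->slot[i].deadline = now + timeout;
            l->slot[i].in_use = 1;
            return 0;
        }
    }
    return -1;
}

/* A response arrived: remove the matching entry.
 * Returns 1 if a pending request matched, 0 otherwise. */
static int txn_response(struct txn_list *l, uint64_t tid)
{
    assert(l != NULL);
    for (size_t i = 0; i < MAX_OUTSTANDING; i++) {
        if (l->slot[i].in_use && l->slot[i].tid == tid) {
            l->slot[i].in_use = 0;
            return 1;
        }
    }
    return 0;
}

/* Timer sweep: drop every entry whose deadline has passed.
 * Returns the number of transactions that timed out. */
static int txn_expire(struct txn_list *l, uint64_t now)
{
    int expired = 0;

    assert(l != NULL);
    for (size_t i = 0; i < MAX_OUTSTANDING; i++) {
        if (l->slot[i].in_use && l->slot[i].deadline <= now) {
            l->slot[i].in_use = 0;
            expired++;
        }
    }
    return expired;
}
```

In this model a response that shows up after its entry has expired no longer matches anything, which is why the sender has to treat the timeout itself as the completion of that transaction.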

> When the Linux kernel determines it timed out, how does it signal the
> timeout/retry/send error to OpenSM? Which function calls does this
> signal go through?
> 
> I noticed that cl_event_wait_on in vl15_poller() has a parameter
> passed as EVENT_NO_TIMEOUT. Should this be a timeout, or should the
> timeout occur somewhere else? This is where it "stalls."

Yes, I've already responded about this several times. I'm reasonably
sure that this is due to erroneous QP0/VL15 accounting caused by the
lack of timeouts.
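
A rough way to see why missing timeouts break the accounting is the simplified model below. It is a sketch under assumptions, not OpenSM source: smp_acct, acct_pump, acct_complete and acct_idle are hypothetical names, and max_on_wire only stands in for the OSM_DEFAULT_SMP_MAX_ON_WIRE limit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model of SMP accounting; not OpenSM source. */
struct smp_acct {
    int queued;      /* requests waiting to be sent */
    int outstanding; /* requests sent, no response or timeout seen yet */
    int max_on_wire; /* pipelining limit, cf. OSM_DEFAULT_SMP_MAX_ON_WIRE */
};

/* Send as many queued requests as the limit allows.
 * Returns the number of requests sent. */
static int acct_pump(struct smp_acct *a)
{
    int sent = 0;

    assert(a->max_on_wire > 0);
    while (a->queued > 0 && a->outstanding < a->max_on_wire) {
        a->queued--;
        a->outstanding++;
        sent++;
    }
    return sent;
}

/* A response and a timeout both complete exactly one transaction. */
static void acct_complete(struct smp_acct *a)
{
    assert(a->outstanding > 0);
    a->outstanding--;
}

/* Work can move on only when nothing is queued and nothing is in flight. */
static bool acct_idle(const struct smp_acct *a)
{
    return a->queued == 0 && a->outstanding == 0;
}
```

With five requests and a limit of 4, acct_pump() sends four, and one more after the first response comes back. If one response is then lost and no timeout ever calls acct_complete() for it, outstanding stays at 1 and acct_idle() never becomes true, so anything waiting on that condition without a timeout of its own waits forever.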

> Do you know somewhere I could read a little bit more about the Linux
> Kernel timeout and how it interacts with OpenSM?

In terms of the kernel, look at:
linux/Documentation/infiniband/user_mad.txt
and
include/rdma/ib_mad.h and ib_user_mad.h

OpenSM uses osm_vendor_ibumad.c which is layered on top of libibumad. In
osm_vendor_ibumad.c, the send error callback is invoked for transaction
timeout in umad_receiver. For libibumad, see umad_status and umad_send
man pages.

-- Hal

> Thank you for the help and your insight.
> 
> Hector Abrach
> 
> 
> From: 	Hal Rosenstock <hal at dev.mellanox.co.il>
> To: 	Hector Abrach <HAbrach at TMRIUSA.COM>
> Cc: 	ewg at lists.openfabrics.org
> Date: 	12/16/2011 11:11 AM
> Subject: 	Re: [ewg] OpenSM 1.5.4 Boot Problem
> 
> 
> ------------------------------------------------------------------------
> 
> 
> 
> Hector,
> 
> On 12/16/2011 11:59 AM, Hector Abrach wrote:
>> Hal,
>>
>>> Is timeout/retry/send error support implemented in your QNX
>>> implementation ? That would explain why the SM appears to stop...
>>
>> Based on the inherent nature of the QNX kernel, I don't believe we have
>> timeout/retry/send error support on it. This may be the reason I see the
>> bootup freeze. If it is, I may have to implement this somehow.
> 
> I think that's the OpenSM side of the failure: a transaction that should
> time out never does, so the MAD accounting is wrong, etc. It breaks that
> fundamental assumption.
> 
> There may also be some issue with the SMA implementation on your QNX
> nodes which is the root cause. Of course, SMPs are unreliable so
> timeout/retries can be needed...
> 
>> However, for the time being at least, I believe that setting
>> OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 will be an acceptable solution as it
>> works reliably. But it would be nice to know why it freezes anyway,
>> maybe because of the above.
>>
>> Thus far I've been unsuccessful in failing with debug property -D 0x23
>> but I'll keep trying.
> 
> That slows things down enough to make it work as does 1 SMP outstanding.
> It appears when SMPs are pipelined, some get dropped...
> 
> -- Hal
> 
>> Thank you
>>
>> Hector Abrach
>>
>>
>> From:                  Hal Rosenstock <hal at dev.mellanox.co.il>
>> To:                  Hector Abrach <HAbrach at TMRIUSA.COM>
>> Date:                  12/15/2011 01:21 PM
>> Subject:                  Re: [ewg] OpenSM 1.5.4 Boot Problem
>>
>>
>> ------------------------------------------------------------------------
>>
>>
>>
>> On 12/15/2011 1:57 PM, Hector Abrach wrote:
>>> Hal,
>>>
>>> I managed to get it to fail with debug information -D 0x08. Attached is
>>> the log file.
>>> I'll dig deeper; it seems it may be pkey related...
>>
>> Yes, I saw signs of that last night from the log you sent where it
>> stopped on the pkey tables on the CAs but I wasn't 100% sure whether it
>> was that or not. I didn't check how many pairs of the pkey tables you
>> got back here to validate whether every port responded with the proper
>> number of pkey table blocks.
>>
>> Is timeout/retry/send error support implemented in your QNX
>> implementation ? That would explain why the SM appears to stop...
>>
>> -- Hal
>>
>>> Once again thank you for your support.
>>>
>>>
>>>
>>> Hector Abrach
>>>
>>>
>>> From:                  Hal Rosenstock <hal at dev.mellanox.co.il>
>>> To:                  Hector Abrach <HAbrach at TMRIUSA.COM>
>>> Date:                  12/14/2011 08:29 PM
>>> Subject:                  Re: [ewg] OpenSM 1.5.4 Boot Problem
>>>
>>>
>>> ------------------------------------------------------------------------
>>>
>>>
>>>
>>> Hector,
>>>
>>> On 12/14/2011 5:49 PM, Hector Abrach wrote:
>>>> Hal,
>>>>
>>>> I got the system to fail with verbose enabled after 25 reboots. Please
>>>> find attached the log file.
>>>>
>>>
>>> I can see the responses but not the requests. What verbosity level did
>>> you use ?
>>>
>>>> I was reading that OSM_DEFAULT_SMP_MAX_ON_WIRE is used to pipeline the
>>>> boot process in multi-switch systems and make the boot process faster,
>>>> correct?
>>>
>>> It's multinode not just multiswitch and this configuration is 8 nodes (1
>>> switch + 7 CAs). It's not boot process but discovery/initialization
>>> which is pipelined.
>>>
>>>> Since my system is a single switch system I do not need to have
>>>> 4 but 1 for OSM_DEFAULT_SMP_MAX_ON_WIRE.
>>>
>>> You can run with 1 if that suits your needs. It's just not the default.
>>>
>>>> Maybe the pipelined SMPs are confusing the switch somehow.
>>>
>>> Even if it did, there's nothing that should "stop" the SM from
>>> working/proceeding. From the log, it looks like the SM does get stuck.
>>>
>>> -- Hal
>>>
>>>> Thanks again for your help.
>>>>
>>>> Hector Abrach
>>>>
>>>>
>>>> From:                  Hal Rosenstock <hal at dev.mellanox.co.il>
>>>> To:                  Hector Abrach <HAbrach at TMRIUSA.COM>
>>>> Cc:                  ewg at lists.openfabrics.org
>>>> Date:                  12/14/2011 08:03 AM
>>>> Subject:                  Re: [ewg] OpenSM 1.5.4 Boot Problem
>>>>
>>>>
>>>> ------------------------------------------------------------------------
>>>>
>>>>
>>>>
>>>> Hi,
>>>>
>>>> On 12/13/2011 2:35 PM, Hector Abrach wrote:
>>>>> Hello,
>>>>>
>>>>> I have a boot problem with OpenSM
>>>>
>>>> Are you saying the switch is booted rather than OpenSM ?
>>>>
>>>> What is the OpenSM running on and in what environment ?
>>>>
>>>>> the problem occurs seldom and
>>>>> started to occur when we started using a new Mellanox MT1118X03342
>>>>> switch.
>>>>> The problem occurs during the discovery phase within
>>>>> state_mgr_sweep_hop_1.
>>>>>
>>>>> However, I discovered that the actual cause is that
>>>>> qp0_mads_outstanding occasionally stalls at 1.
>>>>
>>>> Is it stuck or after timeout/retry does this get updated properly ?
>>>>
>>>>> Within file osm_vl15intf.c, function vl15_poller checks the
>>>>> rfifo, and if the qlist still has items it calls vl15_send_mad,
>>>>> which later on triggers the signal.
>>>>> With the current default setting of 4 for OSM_DEFAULT_SMP_MAX_ON_WIRE I
>>>>> noticed that cl_qlist_end reaches zero before
>>>>> stats->qp0_mads_outstanding does. This causes a stall in
>>>>> cl_event_wait_on. The rfifo always reaches 0 when there are 4
>>>>> qp0_mads_outstanding; however, when it fails it always fails when
>>>>> there is 1 qp0_mads_outstanding.
>>>>
>>>> Is some (request) SMP that OpenSM sent timing out (not being responded
>>>> to) ?
>>>>
>>>>> Have you seen this failure? By the way, I see this failure once
>>>>> every 15 reboots approximately.
>>>>>
>>>>> I discovered that changing OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 fixes the
>>>>> problem.
>>>>
>>>> What do you mean exactly by fixes the problem ? I'm not sure I
>>>> understand what the problem is yet.
>>>>
>>>> -- Hal
>>>>
>>>>> My guess is that there is a race condition when the switch sends 4 SMPs
>>>>> in parallel. Also, this failure only appears to occur at reboot.
>>>>> Another workaround, which is not acceptable, is adding a delay in the
>>>>> process; the failure then goes away. It is as if the switch needed
>>>>> more time to do something.
>>>>>
>>>>> I would really appreciate your help and insight.
>>>>> Thank you
>>>>>
>>>>> Hector Abrach
>>>>> ______________________________________________________________________
>>>>> This email has been scanned by the Symantec Email Security.cloud service.
>>>>> For more information please visit http://www.symanteccloud.com
>>>>> ______________________________________________________________________
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> ewg mailing list
>>>>> ewg at lists.openfabrics.org
>>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
>>>>
>>>>