[ofw] SA retry count and timeout settings?
Jeremy Enos
jenos at ncsa.uiuc.edu
Tue Oct 30 12:25:09 PDT 2007
Thank you Tzachi and Fab for the excellent explanations and tools. Will
implement the value change to 30 as you suggest.
Jeremy
Tzachi Dar wrote:
> Hi Jeremy,
>
> I'm not sure why the ipoib adapters didn't get a notification that the
> SM was restarting, but it seems that changing the SA Query Timeout to a
> bigger number should eliminate this problem.
>
> As for doing this on more than one machine:
> Well actually there is nothing to it more than editing the registry in a
> specific location. In any case, I'm attaching to this mail a Perl script
> that does exactly that. (this script changes it for a hard coded machine
> named sw020, but if you know Perl you can easily change that.)
>
> Please also note that if you need to disable/enable a device
> remotely, the devman utility can do that easily.
>
> For example:
> devman -m:\\sw020 disable IBA\IPOIB
> devman -m:\\sw020 enable IBA\IPOIB
>
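For more than one machine, the two devman commands above can be generated per host. A minimal sketch in Python, assuming the `devman -m:\\host <action> IBA\IPOIB` syntax shown above is what your devman build accepts. The function name and the host sw021 are made up for illustration, and the script only builds the command lines, it does not run them.

```python
def devman_commands(hosts, device=r"IBA\IPOIB"):
    """Return the disable/enable command pair for each host.

    The command syntax is copied from the devman example in this thread;
    nothing here executes devman.
    """
    commands = []
    for host in hosts:
        for action in ("disable", "enable"):
            commands.append(rf"devman -m:\\{host} {action} {device}")
    return commands


if __name__ == "__main__":
    # sw021 is a hypothetical host name used only for illustration.
    for line in devman_commands(["sw020", "sw021"]):
        print(line)
```

The output can be pasted into a batch file or fed to whatever remote-execution tool is already in use.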
>
>
> As for the checksum handling:
> Let me start by saying that not all Mellanox cards support checksum
> offloading.
> Even worse, the current driver doesn't allow working with checksum
> offloading even if the card supports that.
> So actually disabling checksum on send only means that packets will be
> sent on the wire and will be dropped on the remote machine. Disabling
> checksum on all machines both on send and receive will mean that packets
> will travel the network with bad checksum and will be accepted on the
> remote machine.
> So, is this a good idea? Well since IB has its own CRC (which is
> actually better than checksum) some believe that if all hosts are
> Windows machines then things should work just fine without IP checksum.
> Performance without checksum is also 20-30% better. But this is really
> up to you.
>
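To make the send-only case concrete: the checksum used in IP, TCP and UDP headers is the 16-bit one's-complement sum described in RFC 1071. The sketch below is a generic illustration in Python, not the driver's code. It shows why a receiver that still validates will reject a packet whose sender left the field unset. The function names are my own, and the sample bytes are the worked example from RFC 1071.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum as described in RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries above 16 bits back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def receiver_accepts(data: bytes, checksum_field: int) -> bool:
    """A validating receiver recomputes the checksum and compares."""
    return internet_checksum(data) == checksum_field


if __name__ == "__main__":
    payload = bytes.fromhex("0001f203f4f5f6f7")
    good = internet_checksum(payload)
    print(hex(good))                        # checksum a sender would fill in
    print(receiver_accepts(payload, good))  # sender computed the checksum
    print(receiver_accepts(payload, 0))     # sender skipped the checksum
```

With validation disabled on the receive side as well, the comparison is simply never made, which is the "all machines, both send and receive" case described above.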
> Thanks
> Tzachi
>
>
>> -----Original Message-----
>> From: ofw-bounces at lists.openfabrics.org
>> [mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of Jeremy Enos
>> Sent: Tuesday, October 30, 2007 3:31 AM
>> To: Fab Tillier
>> Cc: ofw at lists.openfabrics.org
>> Subject: Re: [ofw] SA retry count and timeout settings?
>>
>> Hi Fab-
>> I neglected to describe my full environment. Currently, only 30 of the
>> 1200 nodes are windows hosts. The SM used is the Cisco host based SM
>> (version 1.1, which still requires a topspin stack on those hosts).
>> OpenSM kept blowing up when we tried it I think.
>>
>> Anyway... IocPollInterval is indeed set to zero already.
>> Here's the full snapshot:
>> [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ibbus\Parameters]
>> "DebugFlags"=dword:80000000
>> "ReportPortNIC"=dword:00000001
>> "IbalDebugLevel"=dword:00000002
>> "IbalDebugFlags"=dword:00ffffff
>> "SmiPollInterval"=dword:00004e20
>> "IocQueryTimeout"=dword:000000fa
>> "IocQueryRetries"=dword:00000004
>> "IocPollInterval"=dword:00000000
>>
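The dword values in the snapshot above are hexadecimal, which makes them easy to misread. A small Python sketch that decodes them to decimal; the function name is my own, and the units of each parameter are not stated in this thread, so none are assumed here.

```python
import re

# Values copied from the ibbus\Parameters snapshot in this thread.
SNAPSHOT = '''
"DebugFlags"=dword:80000000
"ReportPortNIC"=dword:00000001
"IbalDebugLevel"=dword:00000002
"IbalDebugFlags"=dword:00ffffff
"SmiPollInterval"=dword:00004e20
"IocQueryTimeout"=dword:000000fa
"IocQueryRetries"=dword:00000004
"IocPollInterval"=dword:00000000
'''


def parse_dwords(text):
    """Map each "Name"=dword:xxxxxxxx line to its integer value."""
    pattern = re.compile(r'^"([^"]+)"=dword:([0-9a-fA-F]{8})$', re.MULTILINE)
    return {name: int(value, 16) for name, value in pattern.findall(text)}


if __name__ == "__main__":
    for name, value in parse_dwords(SNAPSHOT).items():
        print(f"{name} = {value}")
```

Decoded, SmiPollInterval is 20000, IocQueryTimeout is 250, IocQueryRetries is 4, and IocPollInterval is 0.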
>> In the IPoIB adapter properties:
>> SA Query Retry Count = 10
>> SA Query Timeout (ms) = 1000
>>
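A rough back-of-the-envelope check on those two IPoIB settings against the roughly 5 minute SM outage reported at the start of the thread. This is a sketch under a loud assumption: that the adapter makes one initial query plus "Retry Count" retries, waits one full timeout per attempt, and does not back off. The actual driver behaviour may differ, and the function names are hypothetical.

```python
def retry_window_ms(retry_count, timeout_ms):
    """Upper bound on how long SA queries keep being attempted.

    ASSUMPTION: one initial attempt plus retry_count retries, each
    waiting a full timeout, with no backoff between attempts.
    """
    return (retry_count + 1) * timeout_ms


def min_timeout_ms(retry_count, outage_ms):
    """Smallest timeout whose retry window covers the given outage."""
    attempts = retry_count + 1
    return -(-outage_ms // attempts)  # integer ceiling division


if __name__ == "__main__":
    outage_ms = 5 * 60 * 1000  # "probably took 5 minutes or so"
    print(retry_window_ms(10, 1000))      # window with the current settings
    print(min_timeout_ms(10, outage_ms))  # timeout needed at 10 retries
```

Under that assumption the current settings give up after about 11 seconds, far short of a 5 minute SM restart.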
>> Resetting any of those in a non-batch method just isn't
>> feasible... even at 30 nodes, we aren't designing in any
>> non-scalable solutions. ;-)
>>
>> I also noticed in there that Send Checksum Offload is
>> disabled by default... should that be enabled? Or does it
>> have some drawback?
>>
>> Back to the original problem though... given the setting
>> above, the fact that we're using Cisco's host based SM v 1.1
>> running on a topspin stack, is there any explanation as to
>> why these things didn't reconnect when the SM came back
>> online? I guess all the Linux hosts returned just fine.
>> What should I change?
>> thx-
>>
>> Jeremy
>>
>> p.s. We'll be moving to Cisco's 1.2 SM soon so that it can
>> run on OFED based hosts and we can ditch any TS stack
>> requirement. Just informing you of that in case you think
>> this is an SM related issue.
>>
>> Fab Tillier wrote:
>>
>>> When the SM comes back up, it's supposed to set the 'Client
>>> Reregister' bit for the ports it configures that were already in the
>>> active state. This bit triggers an event on the recipient that will
>>> get to IPoIB and cause it to retry its 'login' logic with the SM.
>>> Client reregister should be supported in OpenSM, though maybe it
>>> can't handle all hosts trying to log in at once.
>>
>>> Have you tried disabling the I/O Controller scanning (under the ibbus
>>> service parameters in the registry, set IocPollInterval to zero - it
>>> defaults to 30000)? This will get reset every time you update
>>> drivers, and is really only needed if you have IB attached storage or
>>> networking bridges (which I don't think you do, do you?)
>>>
>>> Having a non-zero interval here causes the SA to really get pounded
>>> on, and leads to timeouts.
>>
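Since IocPollInterval is reset on every driver update, it may be worth scripting the change. A minimal Python sketch that builds one `reg add` command line per host for the ibbus Parameters key quoted in this thread. It assumes the standard Windows `reg add` remote syntax (`\\host\HKLM\...`), which needs the Remote Registry service and admin rights on the target; the function name is my own, and nothing here is executed.

```python
# Key path taken from the registry snapshot quoted in this thread.
IBBUS_KEY = r"HKLM\SYSTEM\CurrentControlSet\Services\ibbus\Parameters"


def reg_add_commands(hosts, name="IocPollInterval", value=0):
    """Build a 'reg add' command per host that sets a REG_DWORD value."""
    return [
        rf'reg add "\\{host}\{IBBUS_KEY}" /v {name} /t REG_DWORD /d {value} /f'
        for host in hosts
    ]


if __name__ == "__main__":
    # sw020 is the host name used elsewhere in this thread.
    for line in reg_add_commands(["sw020"]):
        print(line)
```

Whether the driver picks up the new value without a device restart is not covered in this thread, so treat that as something to verify.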
>>> If you were to change parameters for IPoIB, I'd change the retry
>>> count. Changing it is a real pain, as far as I know you do it through
>>> the device manager UI. I don't think you want to do this on 1200
>>> nodes. :)
>>>
>>> -Fab
>>>
>>> -----Original Message-----
>>> From: ofw-bounces at lists.openfabrics.org
>>> [mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of Jeremy Enos
>>> Sent: Monday, October 29, 2007 3:55 PM
>>> To: ofw at lists.openfabrics.org
>>> Subject: [ofw] SA retry count and timeout settings?
>>>
>>> Hi-
>>> I found all my windows hosts in a state where their IPoIB adapter was
>>> disconnected, and I had to either disable/enable the device or reboot
>>> the host to get it back.
>>>
>>> I discovered that an SM restart occurred around the time the network
>>> dropped off, and probably took 5 minutes or so to come back up (1200
>>> HCAs to map in).
>>>
>>> The Event Log confirmed this correlation:
>>> OpenIB IPoIB Adapter #5: Subnet Administrator query for port
>>> information timed out. Make sure the SA is functioning properly.
>>> Increasing the number of retries and retry timeout adapter parameters
>>> may solve the issue.
>>
>>> What I want to know is- where I can find these parameters to change,
>>> and what I should change them to. Any reason it shouldn't simply
>>> retry infinitely? I'm using driver version 614.
>>>
>>> thx-
>>>
>>> Jeremy Enos
>>>
>>> _______________________________________________
>>> ofw mailing list
>>> ofw at lists.openfabrics.org
>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw
>>>
>>>