[ofw] SA retry count and timeout settings?

Jeremy Enos jenos at ncsa.uiuc.edu
Tue Oct 30 12:25:09 PDT 2007


Thank you Tzachi and Fab for the excellent explanations and tools.  Will 
implement the value change to 30 as you suggest.

    Jeremy

Tzachi Dar wrote:
> Hi Jeremy,
>
> I'm not sure why the ipoib adapters didn't get a notification that the
> SM was restarting, but it seems that changing the SA Query Timeout to a
> bigger number should eliminate this problem.
>
> As for doing this on more than one machine:
> Well actually there is nothing to it more than editing the registry in a
> specific location. In any case, I'm attaching to this mail a Perl script
> that does exactly that. (this script changes it for a hard coded machine
> named sw020, but if you know Perl you can easily change that.)
>
> Please also note that if there is a need to disable/enable a device
> remotely, then the devman utility can do that easily.
>
> For example:
> devman -m:\\sw020 disable IBA\IPOIB
> devman -m:\\sw020 enable IBA\IPOIB
>
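For more than one machine, the devman pair above can be generated per host. This is a minimal Python sketch, assuming the devman syntax shown above and a host list you supply yourself (sw021 is a hypothetical name; only sw020 appears in this thread). It only builds the command strings and does not run them.

```python
def devman_restart_commands(hosts, device=r"IBA\IPOIB"):
    """Build the devman disable/enable command pair for each host."""
    commands = []
    for host in hosts:
        # Remote-machine switch in the form used above, e.g. -m:\\sw020
        target = "-m:\\\\" + host
        commands.append(f"devman {target} disable {device}")
        commands.append(f"devman {target} enable {device}")
    return commands


if __name__ == "__main__":
    # Hypothetical host list; replace with the real node names.
    for line in devman_restart_commands(["sw020", "sw021"]):
        print(line)
```

The printed lines can be pasted into a batch file or handed to whatever remote execution tool is already in use.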
>
>
> As for the checksum handling:
> Let me start by saying that not all Mellanox cards support checksum
> offloading.
> Even worse, the current driver doesn't allow working with checksum
> offloading even if the card supports that.
> So actually disabling checksum on send only means that packets will be
> sent on the wire and will be dropped on the remote machine. Disabling
> checksum on all machines both on send and receive will mean that packets
> will travel the network with bad checksum and will be accepted on the
> remote machine.
> So, is this a good idea? Well since IB has its own CRC (which is
> actually better than checksum) some believe that if all hosts are
> Windows machines then things should work just fine without IP checksum.
> Performance without checksum is also 20-30% better. But this is really
> up to you.
>
> Thanks
> Tzachi
>
>   
>> -----Original Message-----
>> From: ofw-bounces at lists.openfabrics.org 
>> [mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of Jeremy Enos
>> Sent: Tuesday, October 30, 2007 3:31 AM
>> To: Fab Tillier
>> Cc: ofw at lists.openfabrics.org
>> Subject: Re: [ofw] SA retry count and timeout settings?
>>
>> Hi Fab-
>> I neglected to describe my full environment.  Currently, only 30 of the
>> 1200 nodes are Windows hosts.  The SM used is the Cisco host-based SM
>> (version 1.1, which still requires a topspin stack on those hosts).
>> I think OpenSM kept blowing up when we tried it.
>>
>> Anyway... IocPollInterval is indeed set to zero already.  Here's the
>> full snapshot:
>> [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ibbus\Parameters]
>> "DebugFlags"=dword:80000000
>> "ReportPortNIC"=dword:00000001
>> "IbalDebugLevel"=dword:00000002
>> "IbalDebugFlags"=dword:00ffffff
>> "SmiPollInterval"=dword:00004e20
>> "IocQueryTimeout"=dword:000000fa
>> "IocQueryRetries"=dword:00000004
>> "IocPollInterval"=dword:00000000
>>
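The dword values in the snapshot above are hexadecimal, which makes them awkward to compare with decimal figures such as the 30000 default Fab mentions for IocPollInterval. This is a minimal Python sketch that decodes them, assuming the lines keep the "name"=dword:hex form shown above. The thread does not state the units of these values, so the sketch does not assign any.

```python
import re

# The ibbus Parameters values exactly as quoted in the snapshot above.
SNAPSHOT = '''
"DebugFlags"=dword:80000000
"ReportPortNIC"=dword:00000001
"IbalDebugLevel"=dword:00000002
"IbalDebugFlags"=dword:00ffffff
"SmiPollInterval"=dword:00004e20
"IocQueryTimeout"=dword:000000fa
"IocQueryRetries"=dword:00000004
"IocPollInterval"=dword:00000000
'''

DWORD_LINE = re.compile(r'"([^"]+)"=dword:([0-9a-fA-F]{1,8})')


def parse_dwords(text):
    """Return {value name: integer} for every dword line in the text."""
    values = {}
    for line in text.splitlines():
        match = DWORD_LINE.fullmatch(line.strip())
        if match:
            values[match.group(1)] = int(match.group(2), 16)
    return values


if __name__ == "__main__":
    for name, value in parse_dwords(SNAPSHOT).items():
        print(f"{name} = {value}")
```

For example, SmiPollInterval (00004e20) decodes to 20000 and IocQueryTimeout (000000fa) to 250.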
>> In the IPoIB adapter properties:
>> SA Query Retry Count = 10
>> SA Query Timeout (ms) = 1000
>>
>> Resetting any of those in a non-batch method just isn't feasible...
>> even at 30 nodes, we aren't designing in any non-scalable solutions.  ;-)
>>
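A rough way to compare the adapter settings above with the SM outage is to multiply them out. This is a minimal Python sketch under a loud assumption that the thread does not confirm: that the driver waits one full timeout per attempt, with no backoff, and gives up once the retry count is exhausted. The 5 minute figure is the approximate SM restart time from the original report.

```python
import math


def retry_window_ms(retry_count, timeout_ms):
    """Total time spent retrying, assuming one full timeout per attempt."""
    return retry_count * timeout_ms


def retries_needed(outage_ms, timeout_ms):
    """Smallest retry count whose window covers the outage (same assumption)."""
    return math.ceil(outage_ms / timeout_ms)


if __name__ == "__main__":
    outage_ms = 5 * 60 * 1000  # "5 minutes or so", from the original report
    window = retry_window_ms(10, 1000)  # the adapter settings listed above
    print(f"retry window: {window} ms, outage: {outage_ms} ms")
    print("retries needed at 1000 ms:", retries_needed(outage_ms, 1000))
```

Under that assumption the settings above give up after about 10 seconds, far short of a 5 minute restart, so either value (or both) would need to grow substantially to ride out the outage without a reregister event.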
>> I also noticed in there that Send Checksum Offload is disabled by
>> default... should that be enabled?  Or does it have some drawback?
>>
>> Back to the original problem though... given the settings above, and
>> the fact that we're using Cisco's host-based SM v1.1 running on a
>> topspin stack, is there any explanation as to why these things didn't
>> reconnect when the SM came back online?  I guess all the Linux hosts
>> returned just fine.  What should I change?
>> thx-
>>
>>     Jeremy
>>
>> p.s.  We'll be moving to Cisco's 1.2 SM soon so that it can run on
>> OFED based hosts and we can ditch any TS stack requirement.  Just
>> informing you of that in case you think this is an SM related issue.
>>
>> Fab Tillier wrote:
>>> When the SM comes back up, it's supposed to set the 'Client
>>> Reregister' bit for the ports it configures that were already in the
>>> active state.  This bit triggers an event on the recipient that will
>>> get to IPoIB and cause it to retry its 'login' logic with the SM.
>>> Client reregister should be supported in OpenSM, though maybe it
>>> can't handle all hosts trying to log in at once.
>>>
>>> Have you tried disabling the I/O Controller scanning?  Under the ibbus
>>> service parameters in the registry, set IocPollInterval to zero (it
>>> defaults to 30000).  This will get reset every time you update
>>> drivers, and is really only needed if you have IB attached storage or
>>> networking bridges (which I don't think you do, do you?)
>>>
>>> Having a non-zero interval here causes the SA to really get pounded
>>> on, and leads to timeouts.
>>>
>>> If you were to change parameters for IPoIB, I'd change the retry
>>> count.  Changing it is a real pain; as far as I know you do it through
>>> the device manager UI.  I don't think you want to do this on 1200
>>> nodes. :)
>>>
>>> -Fab
>>>
>>> -----Original Message-----
>>> From: ofw-bounces at lists.openfabrics.org 
>>> [mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of Jeremy Enos
>>> Sent: Monday, October 29, 2007 3:55 PM
>>> To: ofw at lists.openfabrics.org
>>> Subject: [ofw] SA retry count and timeout settings?
>>>
>>> Hi-
>>> I found all my Windows hosts in a state where their IPoIB adapter was
>>> disconnected, and I had to either disable/enable the device or reboot
>>> the host to get it back.
>>>
>>> I discovered that an SM restart occurred around the time the network
>>> dropped off, and probably took 5 minutes or so to come back up (1200
>>> HCAs to map in).
>>>
>>> The Event Log confirmed this correlation:
>>> OpenIB IPoIB Adapter #5: Subnet Administrator query for port
>>> information timed out.  Make sure the SA is functioning properly.
>>> Increasing the number of retries and retry timeout adapter parameters
>>> may solve the issue.
>>>
>>> What I want to know is where I can find these parameters to change,
>>> and what I should change them to.  Any reason it shouldn't simply
>>> retry infinitely?  I'm using driver version 614.
>>>
>>> thx-
>>>
>>>     Jeremy Enos
>>>
>>> _______________________________________________
>>> ofw mailing list
>>> ofw at lists.openfabrics.org
>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw


