[ofw] SA retry count and timeout settings?

Jeremy Enos jenos at ncsa.uiuc.edu
Mon Oct 29 18:30:38 PDT 2007


Hi Fab-
I neglected to describe my full environment.  Currently, only 30 of the 
1200 nodes are Windows hosts.  The SM used is the Cisco host-based SM 
(version 1.1, which still requires a Topspin stack on those hosts).  
I think OpenSM kept blowing up when we tried it.

Anyway... IocPollInterval is indeed set to zero already.  Here's the 
full snapshot:
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ibbus\Parameters]
"DebugFlags"=dword:80000000
"ReportPortNIC"=dword:00000001
"IbalDebugLevel"=dword:00000002
"IbalDebugFlags"=dword:00ffffff
"SmiPollInterval"=dword:00004e20
"IocQueryTimeout"=dword:000000fa
"IocQueryRetries"=dword:00000004
"IocPollInterval"=dword:00000000
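
For easier reading, the polling and query dwords above can be decoded to decimal. This is a minimal Python sketch; the helper name is made up for illustration, and treating the interval and timeout values as milliseconds is an assumption (suggested by the 30000 default quoted below), not something confirmed by driver documentation.

```python
# Hex REG_DWORD values copied from the ibbus Parameters snapshot above.
ibbus_params = {
    "SmiPollInterval": "00004e20",
    "IocQueryTimeout": "000000fa",
    "IocQueryRetries": "00000004",
    "IocPollInterval": "00000000",
}

def decode_dwords(params):
    """Convert hex dword strings (as exported by regedit) to integers."""
    return {name: int(value, 16) for name, value in params.items()}

decoded = decode_dwords(ibbus_params)
for name, value in decoded.items():
    # Units are assumed to be ms for the *Interval and *Timeout entries.
    print(f"{name} = {value}")
```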

In the IPoIB adapter properties:
SA Query Retry Count = 10
SA Query Timeout (ms) = 1000
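
A rough back-of-the-envelope check of what those two values cover, as a Python sketch. It assumes each attempt waits the full fixed timeout with no backoff, which may not match how the driver actually schedules retries, and it takes the roughly 5 minute SM outage from the original report as the target.

```python
import math

def retry_window_ms(retry_count, timeout_ms):
    """Total wait before giving up, assuming a fixed timeout per attempt."""
    return retry_count * timeout_ms

def retries_needed(outage_ms, timeout_ms):
    """Attempts needed to keep retrying through an outage of the given length."""
    return math.ceil(outage_ms / timeout_ms)

window_ms = retry_window_ms(retry_count=10, timeout_ms=1000)
outage_ms = 5 * 60 * 1000  # approximate SM downtime from the original report
needed = retries_needed(outage_ms, timeout_ms=1000)

print(f"current window: {window_ms} ms")
print(f"retries needed at 1000 ms each: {needed}")
```

Under those assumptions the current settings give up after about 10 seconds, far short of the SM restart time.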

Resetting any of those in a non-batch method just isn't feasible... even 
at 30 nodes, we aren't designing in any non-scalable solutions.  ;-)
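
One possible batch approach for the ibbus values, sketched in Python: generate a .reg file once and import it on each node with whatever remote-execution tooling is already in place (for example via `reg import`). The function and file contents here are illustrative only; this covers just the ibbus key shown above, since the registry location of the IPoIB adapter properties does not appear in this thread.

```python
IBBUS_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ibbus\Parameters"

def render_reg(key, dwords):
    """Render a regedit-style .reg file that sets REG_DWORD values under one key."""
    lines = ["Windows Registry Editor Version 5.00", "", f"[{key}]"]
    for name, value in dwords.items():
        if not 0 <= value <= 0xFFFFFFFF:
            raise ValueError(f"{name} does not fit in a dword: {value}")
        lines.append(f'"{name}"=dword:{value:08x}')
    return "\r\n".join(lines) + "\r\n"

# Re-apply the setting that a driver update resets to its default.
reg_text = render_reg(IBBUS_KEY, {"IocPollInterval": 0})
print(reg_text)
```

Saving `reg_text` to a file and importing it after each driver update would restore IocPollInterval without touching the UI.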

I also noticed in there that Send Checksum Offload is disabled by 
default... should that be enabled?  Or does it have some drawback?

Back to the original problem though... given the settings above, and the 
fact that we're using Cisco's host-based SM v1.1 running on a Topspin 
stack, is there any explanation as to why these hosts didn't reconnect 
when the SM came back online?  As far as I can tell, all the Linux hosts 
returned just fine.  What should I change?
thx-

    Jeremy

p.s.  We'll be moving to Cisco's 1.2 SM soon so that it can run on OFED 
based hosts and we can ditch any TS stack requirement.  Just informing 
you of that in case you think this is an SM related issue.

Fab Tillier wrote:
> When the SM comes back up, it's supposed to set the 'Client Reregister' bit for the ports it configures that were already in the active state.  This bit triggers an event on the recipient that will get to IPoIB and cause it to retry its 'login' logic with the SM.  Client reregister should be supported in OpenSM, though maybe it can't handle all hosts trying to log in at once.
>
> Have you tried disabling the I/O Controller scanning?  Under the ibbus service parameters in the registry, set IocPollInterval to zero (it defaults to 30000).  This setting gets reset every time you update drivers, and scanning is really only needed if you have IB-attached storage or networking bridges (which I don't think you do, do you?).
>
> Having a non-zero interval here causes the SA to really get pounded on, and leads to timeouts.
>
> If you were to change parameters for IPoIB, I'd change the retry count.  Changing it is a real pain; as far as I know, you do it through the Device Manager UI.  I don't think you want to do this on 1200 nodes. :)
>
> -Fab
>
> -----Original Message-----
> From: ofw-bounces at lists.openfabrics.org [mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of Jeremy Enos
> Sent: Monday, October 29, 2007 3:55 PM
> To: ofw at lists.openfabrics.org
> Subject: [ofw] SA retry count and timeout settings?
>
> Hi-
> I found all my Windows hosts in a state where their IPoIB adapter was
> disconnected, and I had to either disable/enable the device or reboot
> the host to get it back.
>
> I discovered that an SM restart occurred around the time the network
> dropped off, and probably took 5 minutes or so to come back up (1200
> HCAs to map in).
>
> The Event Log confirmed this correlation:
> OpenIB IPoIB Adapter #5: Subnet Administrator query for port information
> timed out.  Make sure the SA is functioning properly.  Increasing the
> number of retries and retry timeout adapter parameters may solve the issue.
>
> What I want to know is where I can find these parameters to change, and
> what I should change them to.  Any reason it shouldn't simply retry
> infinitely?  I'm using driver version 614.
>
> thx-
>
>     Jeremy Enos