[ofw] SA retry count and timeout settings?

Tzachi Dar tzachid at mellanox.co.il
Tue Oct 30 02:39:20 PDT 2007


Hi Jeremy,

I'm not sure why the IPoIB adapters didn't get a notification that the
SM was restarting, but increasing the SA Query Timeout should eliminate
this problem.
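
As a rough sanity check on how large the timeout needs to be, here is a small sketch using the numbers quoted further down in this thread (10 retries, 1000 ms timeout, an SM restart of about 5 minutes). It assumes the adapter gives up after roughly retry count x timeout with no backoff; nothing in this thread confirms the exact retry algorithm, so treat the result as an estimate only.

```python
import math


def retry_window_ms(retry_count: int, timeout_ms: int) -> int:
    """Approximate time the adapter keeps querying the SA before giving up.

    Assumption (not confirmed in this thread): every attempt waits the
    full timeout and there is no backoff between attempts.
    """
    return retry_count * timeout_ms


def min_timeout_ms(outage_ms: int, retry_count: int) -> int:
    """Smallest per-query timeout whose retry window covers the outage."""
    return math.ceil(outage_ms / retry_count)


current_window = retry_window_ms(10, 1000)  # settings quoted in this thread
outage = 5 * 60 * 1000                      # "5 minutes or so" SM restart

print("current retry window (ms):", current_window)
print("SM outage (ms):", outage)
print("timeout needed with 10 retries (ms):", min_timeout_ms(outage, 10))
```

Under that assumption the current settings cover about 10 seconds, far less than the outage, which fits the timeout message in the Event Log.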

As for doing this on more than one machine:
There is nothing more to it than editing the registry in a specific
location. In any case, I'm attaching a Perl script to this mail that
does exactly that. (The script changes the value on a hard-coded machine
named sw020, but if you know Perl you can easily change that.)
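
For readers who cannot get the scrubbed attachment, here is a minimal sketch of the same idea in Python. It is not the attached Perl script. It only builds standard reg.exe "reg add" command lines for a list of machines and prints them unless dry_run is turned off. The registry location of the IPoIB SA query values is not given in this thread, so the key path and value name are caller-supplied parameters; the example call uses the ibbus key and IocPollInterval value quoted below, and the host name sw021 is hypothetical.

```python
import subprocess


def reg_add_dword(machine, key_path, value_name, value):
    """Build a reg.exe command that sets a REG_DWORD value on a remote machine."""
    target = "\\\\" + machine + "\\" + key_path  # \\machine\HKLM\...
    return [
        "reg", "add", target,
        "/v", value_name,
        "/t", "REG_DWORD",
        "/d", str(value),
        "/f",  # overwrite an existing value without prompting
    ]


def apply_to_machines(machines, key_path, value_name, value, dry_run=True):
    """Print (dry run) or execute the command for every machine in the list."""
    commands = [reg_add_dword(m, key_path, value_name, value) for m in machines]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # needs Windows and admin rights
    return commands


# Key and value taken from the registry snapshot quoted in this thread.
IBBUS_KEY = "HKLM\\SYSTEM\\CurrentControlSet\\Services\\ibbus\\Parameters"
apply_to_machines(["sw020", "sw021"], IBBUS_KEY, "IocPollInterval", 0)
```

Keeping the dry run as the default lets you review the generated commands before touching any host.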

Please also note that if you need to disable/enable a device remotely,
the devman utility can do that easily.

For example:
devman -m:\\sw020 disable IBA\IPOIB
devman -m:\\sw020 enable IBA\IPOIB
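
To run that disable/enable pair across several hosts, something like the following sketch could work. It uses only the devman syntax shown above; the host list is hypothetical (only sw020 appears in this mail), and the commands are printed rather than executed unless dry_run is turned off.

```python
import subprocess


def devman_cycle(machine, device_id="IBA\\IPOIB"):
    """Return the disable and enable commands for one remote machine."""
    target = "-m:\\\\" + machine  # e.g. -m:\\sw020
    return [
        ["devman", target, "disable", device_id],
        ["devman", target, "enable", device_id],
    ]


def cycle_all(machines, dry_run=True):
    """Disable and then re-enable the IPoIB device on each machine in turn."""
    for machine in machines:
        for cmd in devman_cycle(machine):
            if dry_run:
                print(" ".join(cmd))
            else:
                subprocess.run(cmd, check=True)  # needs devman on the PATH


cycle_all(["sw020", "sw021"])  # sw021 is a made-up second host
```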



As for the checksum handling:
Let me start by saying that not all Mellanox cards support checksum
offloading.
Even worse, the current driver doesn't allow working with checksum
offloading even if the card supports it.
So disabling checksum on send only means that packets will be sent on
the wire and will be dropped on the remote machine. Disabling checksum
on all machines, on both send and receive, means that packets will
travel the network with a bad checksum and will be accepted on the
remote machine.
So, is this a good idea? Well, since IB has its own CRC (which is
actually better than the checksum), some believe that if all hosts are
Windows machines then things should work just fine without the IP
checksum. Performance without checksum is also 20-30% better. But this
is really up to you.
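
To make the send/receive cases above concrete, here is a toy model. It is a sketch only: it uses the standard 16-bit Internet checksum (RFC 1071 style) over a made-up payload, and it assumes that "disabling checksum on send" means the packet leaves without a valid checksum. It does not model the actual driver.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum as used by IP, TCP and UDP."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def receiver_accepts(payload: bytes, checksum: int, verify: bool) -> bool:
    """A receiver that verifies drops bad packets; one that doesn't accepts all."""
    if not verify:
        return True
    return internet_checksum(payload) == checksum


payload = b"hello ipoib"               # made-up example data
good = internet_checksum(payload)
bad = (good + 1) & 0xFFFF              # stands in for "checksum not computed"

print("valid checksum, receiver verifies:", receiver_accepts(payload, good, True))
print("bad checksum, receiver verifies:  ", receiver_accepts(payload, bad, True))
print("bad checksum, receiver skips it:  ", receiver_accepts(payload, bad, False))
```

The second case is the "send only" situation (dropped on the remote machine); the third is checksum disabled on both sides.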

Thanks
Tzachi

> -----Original Message-----
> From: ofw-bounces at lists.openfabrics.org 
> [mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of Jeremy Enos
> Sent: Tuesday, October 30, 2007 3:31 AM
> To: Fab Tillier
> Cc: ofw at lists.openfabrics.org
> Subject: Re: [ofw] SA retry count and timeout settings?
> 
> Hi Fab-
> I neglected to describe my full environment.  Currently, only 30 of the
> 1200 nodes are Windows hosts.  The SM used is the Cisco host-based SM
> (version 1.1, which still requires a Topspin stack on those hosts).
> OpenSM kept blowing up when we tried it, I think.
> 
> Anyway... IocPollInterval is indeed set to zero already.  
> Here's the full snapshot:
> [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ibbus\Parameters]
> "DebugFlags"=dword:80000000
> "ReportPortNIC"=dword:00000001
> "IbalDebugLevel"=dword:00000002
> "IbalDebugFlags"=dword:00ffffff
> "SmiPollInterval"=dword:00004e20
> "IocQueryTimeout"=dword:000000fa
> "IocQueryRetries"=dword:00000004
> "IocPollInterval"=dword:00000000
> 
> In the IPoIB adapter properties:
> SA Query Retry Count = 10
> SA Query Timeout (ms) = 1000
> 
> Resetting any of those in a non-batch method just isn't 
> feasible... even at 30 nodes, we aren't designing in any 
> non-scalable solutions.  ;-)
> 
> I also noticed in there that Send Checksum Offload is 
> disabled by default... should that be enabled?  Or does it 
> have some drawback?
> 
> Back to the original problem though... given the setting 
> above, the fact that we're using Cisco's host based SM v 1.1 
> running on a topspin stack, is there any explanation as to 
> why these things didn't reconnect when the SM came back 
> online?  I guess all the Linux hosts returned just fine.  
> What should I change?
> thx-
> 
>     Jeremy
> 
> p.s.  We'll be moving to Cisco's 1.2 SM soon so that it can 
> run on OFED based hosts and we can ditch any TS stack 
> requirement.  Just informing you of that in case you think 
> this is an SM related issue.
> 
> Fab Tillier wrote:
> > When the SM comes back up, it's supposed to set the 'Client
> > Reregister' bit for the ports it configures that were already in the
> > active state.  This bit triggers an event on the recipient that will
> > get to IPoIB and cause it to retry its 'login' logic with the SM.
> > Client reregister should be supported in OpenSM, though maybe it can't
> > handle all hosts trying to log in at once.
> >
> > Have you tried disabling the I/O Controller scanning (under the ibbus
> > service parameters in the registry, set IocPollInterval to zero - it
> > defaults to 30000)?  This will get reset every time you update
> > drivers, and is really only needed if you have IB attached storage or
> > networking bridges (which I don't think you do, do you?)
> >
> > Having a non-zero interval here causes the SA to really get pounded
> > on, and leads to timeouts.
> >
> > If you were to change parameters for IPoIB, I'd change the retry
> > count.  Changing it is a real pain, as far as I know you do it through
> > the device manager UI.  I don't think you want to do this on 1200
> > nodes. :)
> >
> > -Fab
> >
> > -----Original Message-----
> > From: ofw-bounces at lists.openfabrics.org 
> > [mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of Jeremy Enos
> > Sent: Monday, October 29, 2007 3:55 PM
> > To: ofw at lists.openfabrics.org
> > Subject: [ofw] SA retry count and timeout settings?
> >
> > Hi-
> > I found all my windows hosts in a state where their IPoIB adapter was
> > disconnected, and I had to either disable/enable the device or reboot
> > the host to get it back.
> >
> > I discovered that an SM restart occurred around the time the network
> > dropped off, and probably took 5 minutes or so to come back up (1200
> > HCAs to map in).
> >
> > The Event Log confirmed this correlation:
> > OpenIB IPoIB Adapter #5: Subnet Administrator query for port
> > information timed out.  Make sure the SA is functioning properly.
> > Increasing the number of retries and retry timeout adapter
> > parameters may solve the issue.
> >
> > What I want to know is- where I can find these parameters to change,
> > and what I should change them to.  Any reason it shouldn't simply
> > retry infinitely?  I'm using driver version 614.
> >
> > thx-
> >
> >     Jeremy Enos
> >
> > _______________________________________________
> > ofw mailing list
> > ofw at lists.openfabrics.org
> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw
> >   
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ChangeSaCount.pl_
Type: application/octet-stream
Size: 805 bytes
Desc: ChangeSaCount.pl_
URL: <http://lists.openfabrics.org/pipermail/ofw/attachments/20071030/33bdce70/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: devman.ex_
Type: application/octet-stream
Size: 64000 bytes
Desc: devman.ex_
URL: <http://lists.openfabrics.org/pipermail/ofw/attachments/20071030/33bdce70/attachment-0001.obj>

