[ofw] SA retry count and timeout settings?

Fab Tillier ftillier at windows.microsoft.com
Tue Oct 30 14:14:33 PDT 2007


Why doesn't the client reregister functionality work properly?  The whole point of client reregister is to eliminate the need to retry indefinitely - the SM will notify the nodes when they should reregister.

-Fab

-----Original Message-----
From: ofw-bounces at lists.openfabrics.org [mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of Tzachi Dar
Sent: Tuesday, October 30, 2007 2:16 PM
To: Jeremy Enos
Cc: ofw at lists.openfabrics.org; Fab Tillier
Subject: RE: [ofw] SA retry count and timeout settings?

Only one small note:

The value of 30 was an example rather than a real suggestion.
If you want IPoIB to keep sending queries for a long time, then I guess
that bigger values should be used.
For example, if you want to be ready for an SM going down for 24 hours,
then I would set
SA Query Retry Count = 60*60*24/10 = 8640 (86400 seconds / 10 seconds per query)
SA Query Timeout (ms) = 10000 (10 seconds)

This will make 8640 queries, and it will take 10 seconds between each
query.
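
As a rough sketch of the arithmetic above (an illustration only, assuming
the total time IPoIB keeps trying is roughly the retry count multiplied by
the timeout; the function names are made up for this example and are not
part of any driver or tool):

```python
def retry_window_seconds(retry_count: int, timeout_ms: int) -> float:
    """Approximate total time spent querying, assuming one timeout per query."""
    return retry_count * timeout_ms / 1000


def retries_needed(outage_seconds: int, timeout_ms: int) -> int:
    """Smallest retry count whose window covers the outage (ceiling division)."""
    outage_ms = outage_seconds * 1000
    return -(-outage_ms // timeout_ms)


if __name__ == "__main__":
    # 24 hour outage with a 10 second timeout, as in the example above.
    print(retries_needed(24 * 60 * 60, 10000))
    # Window covered by 10 retries at 1000 ms each.
    print(retry_window_seconds(10, 1000))
```

With the settings currently on the adapters (10 retries, 1000 ms) the same
calculation gives a window of about 10 seconds, which is far shorter than
the 5 minutes or so that the SM took to come back.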

By the way, are you using the latest FW for the cards?

Thanks
Tzachi

> -----Original Message-----
> From: Jeremy Enos [mailto:jenos at ncsa.uiuc.edu]
> Sent: Tuesday, October 30, 2007 9:25 PM
> To: Tzachi Dar
> Cc: Fab Tillier; ofw at lists.openfabrics.org
> Subject: Re: [ofw] SA retry count and timeout settings?
>
> Thank you Tzachi and Fab for the excellent explanations and
> tools.  Will implement the value change to 30 as you suggest.
>
>     Jeremy
>
> Tzachi Dar wrote:
> > Hi Jeremy,
> >
> > I'm not sure why the ipoib adapters didn't get a notification that the
> > SM was restarting, but it seems that changing the SA Query Timeout to
> > a bigger number should eliminate this problem.
> >
> > As for doing this on more than one machine:
> > Actually, there is nothing more to it than editing the registry in
> > a specific location. In any case, I'm attaching to this mail a Perl
> > script that does exactly that. (This script changes it for a hard
> > coded machine named sw020, but if you know Perl you can easily change
> > that.)
> >
> > Please also note that if you need to disable/enable a device
> > remotely, the devman utility can do that easily.
> >
> > For example:
> > devman -m:\\sw020 disable IBA\IPOIB
> > devman -m:\\sw020 enable IBA\IPOIB
> >
> >
> >
> > As for the checksum handling:
> > Let me start by saying that not all Mellanox cards support checksum
> > offloading.
> > Even worse, the current driver doesn't allow working with checksum
> > offloading even if the card supports it.
> > So disabling checksum on send only means that packets will be
> > sent on the wire and will be dropped on the remote machine. Disabling
> > checksum on all machines, both on send and receive, will mean that
> > packets will travel the network with a bad checksum and will be accepted
> > on the remote machine.
> > So, is this a good idea? Well, since IB has its own CRC (which is
> > actually better than a checksum), some believe that if all hosts are
> > Windows machines then things should work just fine without IP checksum.
> > Performance without checksum is also 20-30% better. But this is really
> > up to you.
> >
> > Thanks
> > Tzachi
> >
> >
> >> -----Original Message-----
> >> From: ofw-bounces at lists.openfabrics.org
> >> [mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of Jeremy Enos
> >> Sent: Tuesday, October 30, 2007 3:31 AM
> >> To: Fab Tillier
> >> Cc: ofw at lists.openfabrics.org
> >> Subject: Re: [ofw] SA retry count and timeout settings?
> >>
> >> Hi Fab-
> >> I neglected to describe my full environment.  Currently, only 30 of the
> >> 1200 nodes are Windows hosts.  The SM used is the Cisco host based SM
> >> (version 1.1, which still requires a topspin stack on those hosts).
> >> OpenSM kept blowing up when we tried it, I think.
> >>
> >> Anyway... IocPollInterval is indeed set to zero already.
> >> Here's the full snapshot:
> >> [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ibbus\Parameters]
> >> "DebugFlags"=dword:80000000
> >> "ReportPortNIC"=dword:00000001
> >> "IbalDebugLevel"=dword:00000002
> >> "IbalDebugFlags"=dword:00ffffff
> >> "SmiPollInterval"=dword:00004e20
> >> "IocQueryTimeout"=dword:000000fa
> >> "IocQueryRetries"=dword:00000004
> >> "IocPollInterval"=dword:00000000
> >>
> >> In the IPoIB adapter properties:
> >> SA Query Retry Count = 10
> >> SA Query Timeout (ms) = 1000
> >>
> >> Resetting any of those in a non-batch method just isn't feasible...
> >> even at 30 nodes, we aren't designing in any non-scalable solutions.
> >> ;-)
> >>
> >> I also noticed in there that Send Checksum Offload is disabled by
> >> default... should that be enabled?  Or does it have some drawback?
> >>
> >> Back to the original problem though... given the settings above and the
> >> fact that we're using Cisco's host based SM v 1.1 running on a
> >> topspin stack, is there any explanation as to why these things didn't
> >> reconnect when the SM came back online?  I guess all the Linux hosts
> >> returned just fine.
> >> What should I change?
> >> thx-
> >>
> >>     Jeremy
> >>
> >> p.s.  We'll be moving to Cisco's 1.2 SM soon so that it can run on
> >> OFED based hosts and we can ditch any TS stack requirement.  Just
> >> informing you of that in case you think this is an SM related issue.
> >>
> >> Fab Tillier wrote:
> >>
> >>> When the SM comes back up, it's supposed to set the 'Client
> >>> Reregister' bit for the ports it configures that were already in the
> >>> active state.  This bit triggers an event on the recipient that will
> >>> get to IPoIB and cause it to retry its 'login' logic with the SM.
> >>> Client reregister should be supported in OpenSM, though maybe it
> >>> can't handle all hosts trying to log in at once.
> >>>
> >>> Have you tried disabling the I/O Controller scanning?  (Under the ibbus
> >>> service parameters in the registry, set IocPollInterval to zero - it
> >>> defaults to 30000.)  This setting will get reset every time you update
> >>> drivers, and scanning is really only needed if you have IB attached
> >>> storage or networking bridges (which I don't think you do, do you?)
> >>>
> >>> Having a non-zero interval here causes the SA to really get
> >>> pounded on, and leads to timeouts.
> >>>
> >>> If you were to change parameters for IPoIB, I'd change the retry
> >>> count.  Changing it is a real pain; as far as I know you do it through
> >>> the device manager UI.  I don't think you want to do this on 1200
> >>> nodes. :)
> >>>
> >>> -Fab
> >>>
> >>> -----Original Message-----
> >>> From: ofw-bounces at lists.openfabrics.org
> >>> [mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of Jeremy Enos
> >>> Sent: Monday, October 29, 2007 3:55 PM
> >>> To: ofw at lists.openfabrics.org
> >>> Subject: [ofw] SA retry count and timeout settings?
> >>>
> >>> Hi-
> >>> I found all my windows hosts in a state where their IPoIB adapter was
> >>> disconnected, and I had to either disable/enable the device or reboot
> >>> the host to get it back.
> >>>
> >>> I discovered that an SM restart occurred around the time the network
> >>> dropped off, and probably took 5 minutes or so to come back up (1200
> >>> HCAs to map in).
> >>>
> >>> The Event Log confirmed this correlation:
> >>> OpenIB IPoIB Adapter #5: Subnet Administrator query for port
> >>> information timed out.  Make sure the SA is functioning properly.
> >>> Increasing the number of retries and retry timeout adapter
> >>> parameters may solve the issue.
> >>>
> >>> What I want to know is- where I can find these parameters to change,
> >>> and what I should change them to.  Any reason it shouldn't simply
> >>> retry infinitely?  I'm using driver version 614.
> >>>
> >>> thx-
> >>>
> >>>     Jeremy Enos
> >>>
> >>> _______________________________________________
> >>> ofw mailing list
> >>> ofw at lists.openfabrics.org
> >>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw
> >>>
> >>>
>