[openib-general] [PATCH] IB_CM: Limit the MRA timeout

Wed Oct 4 14:08:13 PDT 2006

> From: Michael S. Tsirkin
> Sent: Wednesday, October 04, 2006 4:37 PM
> To: Sean Hefty
> Cc: Ishai Rabinovitz; openib-general at openib.org
> Subject: Re: [openib-general] [PATCH] IB_CM: Limit the MRA timeout
> 
> Quoting r. Sean Hefty <mshefty at ichips.intel.com>:
> > Subject: Re: [PATCH] IB_CM: Limit the MRA timeout
> >
> > Michael S. Tsirkin wrote:
> > >>There are several timeout values transferred and used by the cm, most
> > >>notably the remote cm response timeout and packet life time.  Does it
> > >>make more sense to have a single, generic timeout maximum instead?
> > >
> > > Hmm. I'm not sure - we are working around an actual broken
> > > implementation here - what do you think?
> >
> > I wasn't sure either.  The MRA timeout is a combination of the packet
> > life time + service timeout, which made me bring this up.  The patch
> > only handles the service timeout portion, so we end up in the same
> > situation if a large packet life time is ever used.
> 
> But that comes from the SA, does it not?
> 
> > >>Would it make more sense to
> > >>enable the maximum(s) by default, since we're dependent upon values
> > >>received over the network?
> > >
> > > I think it would.
> >
> > So do I.
> >
> > The CM has checks to bring out of range values into range, but at the
> > maximum, we get a timeout of about 2.5 hours.  Multiply that by 15
> > retries, and the cm can literally spend all day retrying a request.
> >
> > I was considering dropping the default maximum down to around 4-8
> > seconds, which with retries still gives us about a minute to time out a
> > request.  The default maximum would apply to local and remote cm
> > timeouts, packet life time, and service timeout, but could be overridden
> > by the user.  (Basically, with Ishai's patch: rename mra_timeout_limit
> > to timeout_limit, set to a default of 20, and replace occurrences of
> > '31' in the code with timeout_limit.)
> 
> For remote cm timeout and service timeout this makes sense - they seem
> currently mostly taken out of the blue on implementations I've seen.
> 
> But since the packet lifetime comes from the SM, it actually has a chance
> to reflect some knowledge about the network topology.
> And since we haven't seen any practical issues with packet life time yet -
> maybe a different parameter for that, with a higher limit?
> 
> --
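
For reference, the numbers in the quoted discussion follow from the IB
timeout encoding, in which a 5-bit field value n stands for
4.096 us * 2^n.  A minimal Python sketch of that arithmetic is below; the
function names are made up for illustration and are not taken from the cm
code:

```python
# IB timeout fields carry a 5-bit exponent n meaning 4.096 us * 2^n.
IB_TIME_UNIT_US = 4.096


def ib_timeout_seconds(n):
    """Decode a 5-bit IB timeout exponent into seconds."""
    if not 0 <= n <= 31:
        raise ValueError("IB timeout exponent must be in 0..31")
    return IB_TIME_UNIT_US * (1 << n) / 1e6


def worst_case_seconds(n, retries):
    """Total time spent if each of `retries` attempts runs to timeout."""
    return ib_timeout_seconds(n) * retries


if __name__ == "__main__":
    # Compare the proposed default limit (20) with the field maximum (31).
    for n in (20, 21, 31):
        print(n, ib_timeout_seconds(n), worst_case_seconds(n, 15))
```

An exponent of 20 decodes to roughly 4.3 seconds (about a minute over 15
retries), while 31 decodes to roughly 2.4 hours per attempt.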

I recommend sticking with the IB spec for the various timeouts.  In our
products we carefully implemented the timeouts and computations as
defined by the spec. The SM controls the pkt lifetime and should base it
on a knowledge of the fabric topology and configuration.  Many of the CA
specific base timers are specific to the HCA/TCA itself (hence we
provided this information as part of queries to the CA verbs driver).
We permitted configuration in the individual verbs drivers to override
the "reasonable estimates" which we provided as defaults for each HCA
model we support.

It's a little tricky to work out the details defined in the spec (a
summary section on timers would have made it easier), however I worked
through it a few years ago, and a summary of all the HCA/TCA related
IB timers follows.  Notice many of these must be "uncomputed" from
information in the CM REQ and REP to get the base level values (such as
pkt lifetime which is not directly specified in CM REQ):

3.1	Base Timers
CA Ack Delay - time from Receipt of IB transport packet to sending of
ACK.  Hardware and VlArb dependent.

CA inbound processing time - time from receipt of IB transport packet to
delivery and processing in CA's transport state machine.  Hardware
dependent.

CA outbound processing time - time from entry of packet to QP until
transmission of packet on wire.  Hardware and VlArb dependent.

Class turnaround time(class) - processing time from delivery of request
on QP to posting of response on QP.

3.2	Derived Timers
Ack Timeout - timeout for QP ACK/NAK before QP resends up to RetryCount
= 2*(PktLifeTime)+Remote CA Ack Delay + local CA inbound processing Time

RNR NAK Delay - Appl protocol must be prepared to replenish Recv Q of QP
within RNR NAK Delay + 2*(PktLifeTime).  Can set this to the low bound,
and RNRNakDelay*RNRRetryLimit must be > upper bound.

PortInfo:SubnetTimeout = max(PktLifeTime for all PathRecords within
subnet)

PortInfo:RespTimeout - SMA max time between receipt and response within
Node, includes CA delays in receive and send.
= ClassTurnaroundTime(SMA) + CA inbound (QP0) + CA outbound (QP0)

ClassPortInfo:RespTimeout - GSA class max time between receipt and
response within Node, includes CA delays in receive and send.
= ClassTurnaroundTime(class) + CA inbound (QP1) + CA outbound (QP1)

PathRecord:PacketLifeTime - reasonable estimate of worst case time
through path for packet to traverse fabric in 1 direction.  0 if
loopback path from port to itself (CA inbound/outbound and/or ACK delay
values should cover)

LocalAckTimeout - QP/CM - 2*PathRecord:PktLifeTime + local CA Ack Delay

QP:AckTimeout - use 2*PathRecord:PktLifeTime + remote CA Ack Delay
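
Since both inputs to the ack timeout formulas are carried as exponents,
the sum has to be formed in real time units and then re-encoded.  The
following is a rough Python sketch, assuming the 4.096 us * 2^n encoding
and rounding up so the encoded timeout is never shorter than the computed
one.  The helper names are hypothetical.  The second function shows the
"uncomputing" direction; because of the rounding it only yields an upper
estimate of the original packet lifetime exponent.

```python
def local_ack_timeout(pkt_life_exp, ca_ack_delay_exp):
    """Encode 2*PktLifeTime + CA Ack Delay as an IB timeout exponent.

    All arithmetic is done in units of 4.096 us, so decoding an exponent
    n is simply 2^n units.
    """
    total_units = 2 * (1 << pkt_life_exp) + (1 << ca_ack_delay_exp)
    # (x - 1).bit_length() is ceil(log2(x)) for x >= 1; cap at 5-bit max.
    return min(31, (total_units - 1).bit_length())


def pkt_life_from_ack_timeout(ack_timeout_exp, ca_ack_delay_exp):
    """Estimate the packet lifetime exponent behind an ack timeout.

    Inverts the formula above.  The result is an upper estimate, since
    the encoding step rounded up.
    """
    remaining_units = (1 << ack_timeout_exp) - (1 << ca_ack_delay_exp)
    if remaining_units <= 1:
        return 0
    # bit_length() - 1 is floor(log2(x)) for x >= 1.
    return (remaining_units // 2).bit_length() - 1


if __name__ == "__main__":
    ack = local_ack_timeout(14, 12)
    print(ack, pkt_life_from_ack_timeout(ack, 12))
```

When the CA Ack Delay is small relative to the packet lifetime, the
round trip recovers the original exponent; when the Ack Delay dominates,
the recovered value can be well above the true packet lifetime.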

Remote CM Resp Timeout - CM - CM server REQ response time (should be
based on Get(ClassPortInfo) for CM against remote CM)

Local CM Resp Timeout - 2*PathRecord:PktLifetime + client REP response
time 

CM MRA Service Timeout - anticipated maximum time before sender of the
MRA will send the actual CM response message (REP, RTU, APR or REJ).
Recipient of MRA should wait Service Timeout + packet lifetime before
timing out.  Note this value is subjective in nature and may depend on
load on the server, performance of the application, etc.  In our stack
we heuristically computed a pseudo average (weighted toward longer
timeouts) with configurable min/max.  We also permitted the application
to adjust the min/max for a given CEP.  It was important that the MRA
be sent at a low level, since if the application is too busy to
respond to the REQ, REP, etc., it's probably also too busy to compose an
MRA.  We found that proper implementation of MRA was critical for high
stress CM situations, such as startup of a large MPI run or Oracle's
uDAPL based stress tests which made thousands of simultaneous
connections.
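
To make the MRA rule above concrete, here is a small Python sketch of how
a recipient might turn a received Service Timeout into a wait time, with
an optional cap on the service timeout exponent along the lines discussed
in this thread.  The names mra_wait_seconds and service_timeout_limit are
illustrative only and do not correspond to any existing stack's API.
Whether packet lifetime should also be capped is left open, so the sketch
passes it through unchanged.

```python
# IB timeout fields carry a 5-bit exponent n meaning 4.096 us * 2^n.
IB_TIME_UNIT_US = 4.096


def decode_seconds(n):
    """Decode an IB timeout exponent into seconds."""
    return IB_TIME_UNIT_US * (1 << n) / 1e6


def mra_wait_seconds(service_timeout_exp, pkt_life_exp,
                     service_timeout_limit=31):
    """Time to wait after receiving an MRA.

    Implements Service Timeout + packet lifetime.  Only the service
    timeout (a value received from the remote side) is clamped here;
    the packet lifetime is assumed to come from the SA and is used as is.
    """
    clamped = min(service_timeout_exp, service_timeout_limit)
    return decode_seconds(clamped) + decode_seconds(pkt_life_exp)


if __name__ == "__main__":
    print(mra_wait_seconds(31, 14))                            # no limit
    print(mra_wait_seconds(31, 14, service_timeout_limit=20))  # capped
```

With a limit of 20, an out-of-range Service Timeout of 31 is cut from
about 2.4 hours to a few seconds, while a large packet lifetime would
still pass through untouched.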

Subnet Timeout = max(Path Record Packet Lifetime)

Todd Rimmer