[openib-general] CM and REP handling

Rimmer, Todd trimmer at silverstorm.com
Fri Jun 30 12:46:40 PDT 2006


> From: Sean Hefty [mailto:mshefty at ichips.intel.com]
> Sent: Friday, June 30, 2006 2:28 PM
> 
> Rimmer, Todd wrote:
> > Shouldn't the cm_dup_req_handler in this case also resend the REP per
> > the IBTA passive side state machine "REP Sent" state?
> 
> The REP will already be retried based on a timeout.  It could be resent
> immediately in response to a duplicate REQ as well, but that shouldn't be
> necessary, and actually makes things more complex, since coordination
> must be done between sending based on a timeout, versus receiving a
> duplicate REQ.

I would recommend implementing the state machine as defined in the spec
for the following reasons:

1. It will be necessary in order to pass any future IBTA CIWG compliance
tests for the CM.

2. I would need to think about it, but the lost REP case may not be the
only situation where a duplicate REQ can be received.

3. Depending on the RTU timeout on the passive side as the only means of
resending the REP reduces the number of retries attempted in a "lossy"
fabric when both REPs and RTUs are lost. E.g. if you have 8 RTU timeout
retries on the passive side, and many REPs are lost followed by many RTUs,
you get a total of 8 lost REPs+RTUs before you give up. Managing the
counters separately will tend to allow for more retries.
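
To make point 3 concrete, here is a minimal sketch of a passive side that
keeps the two resend paths on separate counters. It is an illustration
only, not the openib CM code and not our stack: the struct, the state
names, the function names, and the idea of capping resends triggered by
duplicate REQs are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical passive-side states; names are illustrative only. */
enum cm_state { CM_REQ_RCVD, CM_REP_SENT, CM_ESTABLISHED, CM_TIMED_OUT };

struct passive_conn {
    enum cm_state state;
    int timeout_retries_left; /* REP resends driven by the RTU timer */
    int dup_req_resends_left; /* REP resends driven by duplicate REQs (assumed cap) */
    int reps_sent;            /* total REPs put on the wire */
};

/* Stand-in for actually sending the REP MAD. */
static void send_rep(struct passive_conn *c)
{
    c->reps_sent++;
}

/* Send the first REP and enter the "REP Sent" state. */
void conn_send_first_rep(struct passive_conn *c, int max_timeout_retries,
                         int max_dup_resends)
{
    assert(c != NULL);
    c->state = CM_REP_SENT;
    c->timeout_retries_left = max_timeout_retries;
    c->dup_req_resends_left = max_dup_resends;
    c->reps_sent = 0;
    send_rep(c);
}

/* Duplicate REQ while in "REP Sent": resend the REP without touching
 * the timeout retry budget. Returns true if a REP was resent. */
bool on_dup_req(struct passive_conn *c)
{
    if (c->state != CM_REP_SENT || c->dup_req_resends_left == 0)
        return false;
    c->dup_req_resends_left--;
    send_rep(c);
    return true;
}

/* RTU timer expired while in "REP Sent": resend the REP, or give up
 * once the timeout budget is spent. Returns true if a REP was resent. */
bool on_rtu_timeout(struct passive_conn *c)
{
    if (c->state != CM_REP_SENT)
        return false;
    if (c->timeout_retries_left == 0) {
        c->state = CM_TIMED_OUT;
        return false;
    }
    c->timeout_retries_left--;
    send_rep(c);
    return true;
}

/* RTU arrived: the connection is established. */
void on_rtu(struct passive_conn *c)
{
    if (c->state == CM_REP_SENT)
        c->state = CM_ESTABLISHED;
}
```

With this layout, a duplicate REQ (which suggests the REP was lost) does
not consume one of the 8 timeout retries, so those remain available for
the case where the REP got through but the RTU was lost.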

In our proprietary stack we implemented the defined state machine and
have stressed it with 1000s of concurrent connections. This includes
various Chariot SDP connect/disconnect stress tests, Oracle uDAPL stress
tests, and our use of the CM to establish connections when running MPI
on 1000s of nodes. Across various real-world and contrived situations of
packet loss and slow responsiveness, the defined state machine has
worked very well.

Todd Rimmer



