[openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own

Hal Rosenstock halr at voltaire.com
Tue Jun 14 05:41:43 PDT 2005


On Mon, 2005-06-13 at 18:33, James Lentini wrote:
> On Mon, 13 Jun 2005, Hal Rosenstock wrote:
> 
> halr> On Wed, 2005-06-08 at 17:53, James Lentini wrote: 
> halr> > On Wed, 8 Jun 2005, Hal Rosenstock wrote:
> halr> > 
> halr> > halr> On Wed, 2005-06-08 at 11:44, James Lentini wrote:
> halr> > halr> > We interpreted the above to mean "give the connection protocol as 
> halr> > halr> > much time as it needs to establish a connection, but don't mask 
> halr> > halr> > errors (no path to the remote node, etc.)". For that reason we changed 
> halr> > halr> > the variable name to DAT_TIMEOUT_MAX.
> halr> > halr> 
> halr> > halr> But if the REQ is lost, the timeout is really really long (longer than
> halr> > halr> most will wait for an error). 
> halr> > 
> halr> > If a user doesn't want to wait DAT_TIMEOUT_MAX time, it can pass a 
> halr> > smaller amount of time to dat_ep_connect. Does this satisfy your 
> halr> > requirements?
> halr> 
> halr> Is it intended that the only way out is via user intervention (e.g.
> halr> ctl-C)? If one connection attempt (REQ) is made and it is lost, then
> halr> there is no chance of it completing and the user needs to intervene. 
> 
> Why does the user need to intervene? Did I misunderstand the CM 
> API? 
> 
> When dapl_ep_connect() is called with a timeout value of 
> DAT_TIMEOUT_MAX, DAPL passes ib_send_cm_req the value 0x1F in the 
> ib_cm_req_param structure's remote_cm_response_timeout value. My 
> understanding was that this is the maximum timeout and that once it 
> expires the CM will inform the user that the REQ timed out.

Yes, but it is a long time: 4.096 usec * 2^31 ~ 8796 sec ~ 146.6 min
(if my calcs are correct). This is longer than most users would wait.
They would usually hit ctl-C before this timeout is reached.
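
To make the arithmetic easy to check, here is a minimal sketch of the
conversion, assuming the encoded value is used directly as the exponent
in 4.096 usec * 2^n. The function name cm_timeout_seconds is
hypothetical and only for illustration; it is not part of the CM API.

```python
CM_TIMEOUT_UNIT_USEC = 4.096  # base unit of the encoded timeout, in usec


def cm_timeout_seconds(exponent):
    """Seconds represented by a 5-bit encoded timeout: 4.096 usec * 2**exponent."""
    if not 0 <= exponent <= 0x1F:
        raise ValueError("exponent must fit in 5 bits (0..31)")
    return CM_TIMEOUT_UNIT_USEC * 2 ** exponent / 1e6


seconds = cm_timeout_seconds(0x1F)
print(f"{seconds:.0f} sec, {seconds / 60:.1f} min")  # 8796 sec, 146.6 min
```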

> halr> If that is the intended behavior, we are there. (A lost REQ 
> halr> can occur even when the timeout is not infinite.)
> 
> We didn't intend for the active side to wait forever if a REQ was 
> lost.

The active side has no way of knowing that the REQ was lost (other than
timeout/retry), and when the timeout is this long, it is effectively
waiting forever.

> halr> An alternative (as Sean suggested) is to continually retry (at a
> halr> periodicity below the supplied timeout) until the time period specified
> halr> expires. That seems to be better (at least to me and Sean) in terms of
> halr> handling the lost REQ case. As retries are not part of the API for
> halr> connect, I would presume the implementor is free to do what they want
> halr> under the covers of dapl_ib_connect.
> 
> You're correct.

The current implementation is:
1. address resolution phase for some amount of time 
followed by:
2. dapl_ib_connect timeout * 5 (since there are 4 retries)

A better algorithm would be to divide the requested timeout across the
retries, so that each attempt waits only a fraction of the total, with
the number of retries varying based on the total timeout requested.
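
One possible way to split the timeout is sketched below. This is an
illustration under stated assumptions, not the kDAPL implementation:
plan_connect, preferred_exponent and the choice of 2^20 (about 4.3 sec)
as the starting per-attempt timeout are all hypothetical, and the limit
of 15 retries assumes a 4-bit retry count field.

```python
CM_TIMEOUT_UNIT_USEC = 4.096  # encoded timeouts are 4.096 usec * 2**exponent
MAX_EXPONENT = 31             # largest value of a 5-bit timeout field
MAX_RETRIES = 15              # assumption: retry count is a 4-bit field


def plan_connect(total_usec, preferred_exponent=20):
    """Return (exponent, retries) so that retries + 1 attempts fit in total_usec."""
    exponent = min(preferred_exponent, MAX_EXPONENT)
    # Shrink the per-attempt timeout until at least one attempt fits.
    while exponent > 0 and CM_TIMEOUT_UNIT_USEC * 2 ** exponent > total_usec:
        exponent -= 1
    # Grow it when more attempts would be needed than the retry field allows.
    while (exponent < MAX_EXPONENT and
           total_usec / (CM_TIMEOUT_UNIT_USEC * 2 ** exponent) > MAX_RETRIES + 1):
        exponent += 1
    per_attempt = CM_TIMEOUT_UNIT_USEC * 2 ** exponent
    attempts = max(1, min(MAX_RETRIES + 1, int(total_usec // per_attempt)))
    return exponent, attempts - 1


print(plan_connect(30_000_000))          # 30 sec total
print(plan_connect(4.096 * 2 ** 31))     # the 0x1F maximum discussed above
```

With these assumptions, a 30 sec request becomes 6 attempts of about
4.3 sec each, and the maximum timeout becomes 16 attempts of about
550 sec each, so a single lost REQ costs one attempt rather than the
whole period.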

-- Hal



