[openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own

Tue Jun 14 09:08:51 PDT 2005

On Tue, 14 Jun 2005, Hal Rosenstock wrote:

> On Mon, 2005-06-13 at 18:33, James Lentini wrote:
>> On Mon, 13 Jun 2005, Hal Rosenstock wrote:
>>
>> halr> On Wed, 2005-06-08 at 17:53, James Lentini wrote:
>> halr> > On Wed, 8 Jun 2005, Hal Rosenstock wrote:
>> halr> >
>> halr> > halr> On Wed, 2005-06-08 at 11:44, James Lentini wrote:
>> halr> > halr> > We interpreted the above to mean "give the connection protocol as
>> halr> > halr> > much time as it needs to establish a connection, but don't mask
>> halr> > halr> > errors (no path to the remote node, etc.)". For that reason we changed
>> halr> > halr> > the variable name to DAT_TIMEOUT_MAX.
>> halr> > halr>
>> halr> > halr> But if the REQ is lost, the timeout is really really long (longer than
>> halr> > halr> most will wait for an error).
>> halr> >
>> halr> > If a user doesn't want to wait DAT_TIMEOUT_MAX time, it can pass a
>> halr> > smaller amount of time to dat_ep_connect. Does this satisfy your
>> halr> > requirements?
>> halr>
>> halr> Is it intended that the only way out is via user intervention (e.g.
>> halr> ctl-C) ? If one connection attempt (REQ) is made and it is lost, then
>> halr> there is no chance of it completing and the user needs to intervene.
>>
>> Why does the user need to intervene? Did I misunderstand the CM
>> API?
>>
>> When dapl_ep_connect() is called with a timeout value of
>> DAT_TIMEOUT_MAX, DAPL passes ib_send_cm_req the value 0x1F in the
>> ib_cm_req_param structure's remote_cm_response_timeout field. My
>> understanding was that this is the maximum timeout and that once it
>> expires the CM will inform the user that the REQ timed out.
>
> Yes but it is a long time (4.096 * 2 ^ 31 usec ~ 8796 sec ~ 146.60 min
> (if my calcs are correct)). This is longer than (most) users would wait.
> They would usually hit ctl-C before this timeout is reached.

Understood. As long as it is not infinite we've made a step in the 
right direction. I like your ideas below on how to improve this 
further.
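
For reference, the arithmetic above can be checked with a small helper. 
This is only a sketch: cm_timeout_ns() and CM_TIMEOUT_EXP_MAX are 
hypothetical names, not part of the CM API, and the exponent of 31 is 
the one used in the calculation quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Largest value of the 5-bit timeout exponent (0x1F). Hypothetical name. */
#define CM_TIMEOUT_EXP_MAX 0x1F

/*
 * Timeout in nanoseconds for a given exponent, using the
 * 4.096 usec * 2^exponent encoding (4.096 usec == 4096 nsec).
 */
static uint64_t cm_timeout_ns(unsigned int exponent)
{
    assert(exponent <= CM_TIMEOUT_EXP_MAX);
    return (uint64_t)4096 << exponent;
}
```

With an exponent of 31, cm_timeout_ns() returns 8796093022208 nsec, 
which is roughly 8796 sec, or about 146.6 min.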

>> halr> If that is the intended behavior, we are there. (This (lost REQ)
>> halr> can even occur when the timeout is non infinite too).
>>
>> We didn't intend for the active side to wait forever if a REQ was
>> lost.
>
> The active side has no way of knowing that the REQ was lost (other than
> timeout/retry) and when the timeout is long, this is effectively the
> case.

This behavior is ok. The DAT consumer should choose a timeout value 
that makes sense. It doesn't need to use DAT_TIMEOUT_MAX (and probably 
shouldn't in most cases). We should update our dapltest program to use 
a smaller value (like 1 min).

>> halr> An alternative (as Sean suggested) is to continually retry (at a
>> halr> periodicity below the supplied timeout) until the time period specified
>> halr> expires. That seems to be better (at least to me and Sean) in terms of
>> halr> handling the lost REQ case. As retries is not part of the API for
>> halr> connect, I would presume the implementor is free to do what they want under
>> halr> the covers of dapl_ib_connect.
>>
>> You're correct.
>
> The current implementation is:
> 1. address resolution phase for some amount of time
> followed by:
> 2. dapl_ib_connect timeout * 5 (since there are 4 retries)

Sounds like I need to understand the difference between the 
ib_cm_req_param's retry_count and max_cm_retries fields. We set the 
former to 0 and the latter to 4.

> A better algorithm would be to divide down the timeout by some number of
> retries (which would vary based on the timeout requested) and have the
> number of retries vary based on the total timeout requested.

I agree that would be better. As you point out, we should also account 
for the address resolution time. I know that no one is working on 
this. Are you interested?
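
To make the idea concrete, here is a rough sketch of the divide-down 
step. It assumes the 4.096 usec * 2^n timeout encoding discussed 
earlier in the thread. The function names are hypothetical, the number 
of retries is simply passed in rather than derived from the total, and 
address resolution time is not accounted for. The exponent is rounded 
down so that all attempts together stay within the requested timeout.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Largest exponent n (0..31) such that 4096 nsec << n <= timeout_ns.
 * Returns 0 if timeout_ns is smaller than the minimum encodable value.
 */
static unsigned int cm_timeout_exp_floor(uint64_t timeout_ns)
{
    unsigned int n = 0;

    while (n < 31 && ((uint64_t)4096 << (n + 1)) <= timeout_ns)
        n++;
    return n;
}

/*
 * Split the total timeout evenly across the initial REQ plus
 * 'retries' retransmissions and encode the per-attempt share.
 */
static unsigned int per_attempt_exponent(uint64_t total_ns,
                                         unsigned int retries)
{
    uint64_t attempts = (uint64_t)retries + 1;

    assert(total_ns > 0);
    return cm_timeout_exp_floor(total_ns / attempts);
}
```

For example, a 1 min total with 4 retries gives 12 sec per attempt, 
which rounds down to an exponent of 21 (about 8.6 sec per attempt, or 
about 43 sec for all five). Because the encoding is a power of two, 
rounding down can leave a sizable part of the requested time unused. 
Varying the retry count with the total, as suggested above, would be 
one way to get closer to the requested value.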

>
> -- Hal
>