[Openib-windows] RE: Connection rate of WSD

Fabian Tillier ftillier at silverstorm.com
Sat Jun 3 21:49:00 PDT 2006


Hi Tzachi,

On 6/3/06, Tzachi Dar <tzachid at mellanox.co.il> wrote:
>
> I'll try this fix, but I think that it will not help.
>
> The main reason for this is that according to the specs (if I remember
> correctly) the RNR timeout can be set to somewhere between x and 7x (7 is
> hard coded in the spec).

The retry count can be between 0 and 7, with 7 meaning infinite.  The
timeout is a 5-bit field, with encoded times from 0.01 to 655.36
milliseconds.

With a retry count of 7, the HCA will retry forever, which allows the
timeout to be fairly aggressive.  I selected a timeout of 40ms, which
may be too long for good connection establishment rates - we can
experiment to find an appropriate value.
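
For illustration, here is a rough sketch of where these two knobs
live.  I'm showing it with the Linux libibverbs API since it's
compact - the Windows stack uses IBAL, but the underlying QP
attributes are the same.  All the non-RNR attribute values below are
placeholders that would normally come from the CM exchange:

#include <infiniband/verbs.h>
#include <string.h>

static int set_rnr_attrs( struct ibv_qp *qp, uint32_t dest_qpn,
    uint16_t dlid )
{
    struct ibv_qp_attr attr;

    /* INIT -> RTR: the responder advertises its RNR NAK timer here.
     * Encoding 24 is defined as 40.96ms - the "40ms" above. */
    memset( &attr, 0, sizeof attr );
    attr.qp_state           = IBV_QPS_RTR;
    attr.path_mtu           = IBV_MTU_1024;
    attr.dest_qp_num        = dest_qpn;
    attr.rq_psn             = 0;
    attr.max_dest_rd_atomic = 1;
    attr.min_rnr_timer      = 24;      /* 5-bit encoding: 40.96ms */
    attr.ah_attr.dlid       = dlid;
    attr.ah_attr.port_num   = 1;
    if( ibv_modify_qp( qp, &attr,
        IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN |
        IBV_QP_RQ_PSN | IBV_QP_MAX_DEST_RD_ATOMIC |
        IBV_QP_MIN_RNR_TIMER ) )
        return -1;

    /* RTR -> RTS: the requester's RNR retry count.
     * 7 = retry forever. */
    memset( &attr, 0, sizeof attr );
    attr.qp_state      = IBV_QPS_RTS;
    attr.timeout       = 14;
    attr.retry_cnt     = 7;
    attr.rnr_retry     = 7;            /* 3-bit field: infinite */
    attr.sq_psn        = 0;
    attr.max_rd_atomic = 1;
    return ibv_modify_qp( qp, &attr,
        IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
        IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC );
}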

> What we are looking for right now is the correct value of x. Since the
> receives are posted by a user-mode thread from the switch, we need to try
> to understand how long a user-mode thread can be delayed. The correct
> answer is probably that on a very busy system this time is almost forever
> (at least there is no way to put a maximum on this time).

Right, hence the infinite RNR retry.

> Assuming that the system is never that busy, we should give it at least
> the time needed for a thread context switch, which is 60ms. (In the worst
> case we might want to allow other values, such as 2, 3, or n times this
> period.)

The scheduler interval is 10ms, so I don't see why a thread context
switch would take 60ms.  The worst-case thread context switch time
should be 10ms (the case where the other thread is signalled to wake
up at the very beginning of another thread's time quantum).

> But
> even being very conservative and assuming that we wish to allow only twice
> this time, this brings us to 120ms at maximum. 120/7 = 17, and therefore
> the RNR retry time should be 17ms. 1000/17 ≈ 60, and this is an upper
> limit on our connection rate, which is not that nice. (Actually it is
> somewhat bigger, as I saw that only one connection out of 5 reaches RNR on
> a non-busy system.)

The RNR retry does not stall connection establishment.  Connection
establishment will complete and the active side will post the hello
message, which will sit in RNR retry until the passive side posts its
first receive.  The RNR timeout may add one RNR timeout's worth of
time to how quickly the socket becomes operable, since the socket
can't be used until the hello ack message is received.
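
To put numbers on that worst case, here is a tiny standalone sketch
of the 5-bit RNR NAK timer encoding table as the IBA spec defines it.
Note that encoding 0 is the 655.36ms maximum, not the minimum:

#include <stdio.h>

/* IBA RNR NAK timer encodings, in milliseconds.  Encoding 0 is the
 * maximum (655.36ms); encoding 24 is the 40.96ms value used above. */
static const double rnr_timer_ms[32] = {
    655.36,   0.01,   0.02,   0.03,   0.04,   0.06,   0.08,   0.12,
      0.16,   0.24,   0.32,   0.48,   0.64,   0.96,   1.28,   1.92,
      2.56,   3.84,   5.12,   7.68,  10.24,  15.36,  20.48,  30.72,
     40.96,  61.44,  81.92, 122.88, 163.84, 245.76, 327.68, 491.52
};

int main( void )
{
    /* Worst case added to socket readiness: one timer period. */
    printf( "encoding 24: hello may stall up to %.2fms\n",
        rnr_timer_ms[24] );
    printf( "encoding  0: retries every %.2fms\n", rnr_timer_ms[0] );
    return 0;
}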

How close to the RNR retry timeout do Mellanox HCAs perform the retry?
The IB spec does not prevent an implementation from waiting longer
than the requested RNR timeout before retrying.  You were previously
seeing delays of roughly 2 seconds - can you take a trace to see how
many RNR retries happened during this time?  They should have occurred
roughly every 655ms (about three retries in 2 seconds) - if not, the
HCA is stretching that interval.

> That said, I'll give your code a chance and see how this works.
>
> As for the patch that I have sent that tries to solve the same problem
> (delay CM until there is a recv posted). Considerations there are similar
> but we have more freedom in choosing the constants so I believe that it
> should work better.

Delaying the REP will certainly affect the connection rate, as
connection establishment itself will stall until the receive is
posted.  You also risk timing out the CM protocol, which can result
in unnecessary connection timeouts.  I don't see why delaying the REP
gives any more control over the timeout value - if anything it will
have to cover the worst case, while the RNR timeout can be tuned for
the common case, with the worst case simply resulting in more retries.

> As for the latencies that will be introduced by your patch: It seems that
> the current code already takes the lock at the beginning of the recv, so
> another check for whether this is the first buffer shouldn't take more
> than a single if statement, which is less than 10ns to my understanding.

There is no locking between WSPRecv and the completion processing.  To
properly add buffering, we would need to introduce such locking.  The
cases we would need to handle are:

1. buffered receive completes before the first WSPRecv call
2. buffered receive completes after the first WSPRecv call
3. buffered receive completes during the first WSPRecv call

Case 3 requires serialization between the WSPRecv call and the
completion callback.  It is this serialization that would introduce
the extra latency, not the individual if statements.
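
A rough sketch of the shape that serialization might take - the
structure and names here are hypothetical, not the actual WSD
provider code, and the lock is assumed to be initialized at socket
creation:

#include <windows.h>

/* Hypothetical per-socket state. */
typedef struct _wsd_socket
{
    CRITICAL_SECTION  lock;        /* WSPRecv vs. completion      */
    BOOLEAN           hello_done;  /* buffered receive completed  */
    BOOLEAN           recv_seen;   /* first WSPRecv has been made */
} wsd_socket_t;

/* Completion path for the pre-posted (buffered) receive. */
static void buffered_recv_complete( wsd_socket_t *s )
{
    EnterCriticalSection( &s->lock );
    s->hello_done = TRUE;
    if( s->recv_seen )
    {
        /* Cases 2 and 3: a WSPRecv is already outstanding or in
         * progress - complete it from the buffered data here. */
    }
    LeaveCriticalSection( &s->lock );
}

/* First WSPRecv on the socket. */
static int first_wsp_recv( wsd_socket_t *s )
{
    EnterCriticalSection( &s->lock );
    s->recv_seen = TRUE;
    if( s->hello_done )
    {
        /* Case 1: the buffered receive already completed -
         * satisfy this WSPRecv from the buffered copy. */
    }
    LeaveCriticalSection( &s->lock );
    return 0;
}

Note that it's the EnterCriticalSection in the WSPRecv path that
costs, not the if checks.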

> In any case I'll test your fix and see how things are working.

Thanks, let me know if you find a better value for the retry timeout.

- Fab


