[Openib-windows] RE: Connection rate of WSD

Tzachi Dar tzachid at mellanox.co.il
Mon Jun 5 09:17:20 PDT 2006


Hi Fab, 
1) Please see my answers below. I still don't see how playing with the
RNR timeout will solve the problem.
2) The way we see it there are two possible answers. A - play with the
CM. This will slow connection establishment, but gives us more freedom
(the CM is in software). Please also note that, as far as timeouts go,
the CM has another message, MRA (more processing required), which
gives us exactly the freedom we want: we answer that we received the
request and are still deciding what to do. So this is a timeout-free
solution.
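
For illustration, a minimal sketch of that MRA path, assuming the
IBAL-style ib_cm_mra() entry point (the ib_cm_mra_t member names below
are assumptions, not checked against the headers):

    /* Sketch: on REQ arrival, answer with an MRA so the active side
     * extends its CM timeout, then send the REP only once the first
     * receive has been posted.  No RNR timeout is involved. */
    static void on_cm_req( ib_cm_handle_t h_cm )
    {
        ib_cm_mra_t mra;

        memset( &mra, 0, sizeof(mra) );
        /* CM service timeout is a 5-bit exponent: 4.096us * 2^n.
         * n = 21 gives roughly 8.6 seconds (assumed encoding). */
        mra.svc_timeout = 21;

        ib_cm_mra( h_cm, &mra );  /* "received, still processing" */
    }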
As for the other solution, posting the first receive: this has the
advantage that we follow the WSD spec. As for the latency introduced, I
believe we can add another variable that tells whether the first buffer
was already received correctly. On the buffer-complete side, the first
action will be to check whether the first receive was already handled;
only if not will it take the lock and do the complex work. As a result,
I believe the latency introduced on most buffers will only be that of
an if statement, which is quite small. A rough sketch follows.
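
Sketch of that fast path (socket_info_t, first_recv_done,
handle_first_buffer and complete_recv are hypothetical names; the
cl_spinlock_* calls are the complib locking used elsewhere in the
stack):

    typedef struct _socket_info
    {
        volatile LONG  first_recv_done; /* 0 until hello buffer handled */
        cl_spinlock_t  lock;
        /* ... */
    } socket_info_t;

    static void recv_comp_cb( socket_info_t *p_socket )
    {
        if( !p_socket->first_recv_done )
        {
            /* Slow path, taken at most once per socket: serialize
             * with WSPRecv while the buffered hello is handed over. */
            cl_spinlock_acquire( &p_socket->lock );
            if( !p_socket->first_recv_done )
            {
                handle_first_buffer( p_socket );
                InterlockedExchange( &p_socket->first_recv_done, 1 );
                cl_spinlock_release( &p_socket->lock );
                return;
            }
            cl_spinlock_release( &p_socket->lock );
        }

        /* Fast path for every later buffer: one branch, no lock. */
        complete_recv( p_socket );
    }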

What do you think?

Thanks
Tzachi


> -----Original Message-----
> From: ftillier.sst at gmail.com [mailto:ftillier.sst at gmail.com] 
> On Behalf Of Fabian Tillier
> Sent: Sunday, June 04, 2006 7:49 AM
> To: Tzachi Dar
> Cc: Leonid Keller; openib-windows at openib.org
> Subject: Re: [Openib-windows] RE: Connection rate of WSD
> 
> Hi Tzachi,
> 
> On 6/3/06, Tzachi Dar <tzachid at mellanox.co.il> wrote:
> >
> > I'll try this fix, but I think that it will not help.
> >
> > The main reason for this is that according to the specs (if I
> > remember correctly) the RNR timeout can be set to somewhere between
> > x and 7x (7 is hard coded in the spec).
> 
> The retry count can be between 0 and 7, with 7 meaning 
> infinite.  The timeout is 5 bits, with encoded times of 0.01 
> to 655 milliseconds.
> 
> With a retry count of 7, the HCA will retry forever, which 
> allows the timeout to be fairly aggressive. I selected a 
> timeout of 40ms, which may be too long for good connection 
> establishment rates - we can play around with what an 
> appropriate value is.
From what I saw, the connection is not completed until we complete the
send, which means that the connection will be stalled for 40 ms. This
is too much (no more than 25 connections per second per thread).
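
(For reference, the knobs in question, shown here via the Linux
libibverbs API purely for illustration -- the Windows stack sets the
equivalent fields on the CM REQ.  In a real verbs program
min_rnr_timer belongs to the INIT->RTR transition and rnr_retry to
RTR->RTS; the encoded values follow the IB spec's RNR NAK timer table
and should be double-checked against it:)

    struct ibv_qp_attr attr;

    memset( &attr, 0, sizeof(attr) );
    attr.min_rnr_timer = 24;  /* 5-bit encoded timer; 24 ~= 40.96 ms */
    ibv_modify_qp( qp, &attr, IBV_QP_MIN_RNR_TIMER /* | RTR attrs */ );

    attr.rnr_retry = 7;       /* 3-bit retry count; 7 == retry forever */
    ibv_modify_qp( qp, &attr, IBV_QP_RNR_RETRY /* | RTS attrs */ );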

> > What we are looking at right now is the correct value of x. Since
> > the receives are posted by a user mode thread from the switch, we
> > need to try and understand the amount of time that a user mode
> > thread can get delayed. The correct answer is probably that on a
> > very busy system this time is almost forever (at least there is no
> > way to give a maximum to this time).
> 
> Right, hence the infinite RNR retry.

An infinite RNR retry is not good, as the other side might really be
lost; we would never get an error then.


> > Assuming that the system is never that busy, we should give it at
> > least the time needed for a thread context switch, which is 60ms.
> > (In the worst case we might want to give other times, such as 2, 3,
> > or n times this period.)
> 
> The scheduler interval is 10ms, so I don't see why a thread 
> context switch would take 60ms.  The worst case thread 
> context switch time should be 10ms (the case where the other 
> thread is signalled to wake up in the very beginning of 
> another thread's time quantum).
As far as I remember, that time for a server is 60 ms. In any case, the
real delay comes from the fact that there can be more than one thread
ahead of you. In those cases, at least during peaks, we will not get
scheduled for a second or so. We have to live with these peaks.



> > But even being very conservative and assuming that we wish to
> > allow only twice this time, this brings us to 120ms at maximum.
> > 120/7 = 17, and therefore the RNR retry time should be 17ms.
> > 1000/17 =~ 60, and this is an upper limit on our connection rate,
> > which is not that nice. (Actually it is somewhat bigger, as I saw
> > that only one connection out of 5 reaches RNR on a non-busy
> > system.)
> 
> The RNR retry does not stall connection establishment.  
> Connection establishment will complete and the active side 
> will post the hello message, which will sit in RNR retry 
> until the passive side posts its first receive.  The RNR 
> timeout may add one RNR timeout's worth of time to how 
> quickly the socket will become operable, since the socket 
> can't be used until the hello ack message is received.

From what I saw, the connecting side doesn't get connected until the
first message is sent. This is the reason for the low connection rate.


> How close to the RNR retry timeout do Mellanox HCAs perform the retry?
>  The IB spec does not prevent an implementation from waiting 
> to retry longer than the requested RNR timeout.  You were 
> previously seeing delays of roughly 2 seconds - can you take 
> a trace to see how many RNR retries happened during this 
> time?  They should have been every 655ms
> - if not, the HCA is increasing that time.
> 
> > That said, I'll give your code a chance and see how this works.
> >
> > As for the patch that I have sent that tries to solve the same
> > problem (delay the CM until there is a recv posted): the
> > considerations there are similar, but we have more freedom in
> > choosing the constants, so I believe that it should work better.
> 
> Delaying the REP will certainly affect the connection rate, 
> as the connections will not finish being established.  You 
> also risk timing out the CM protocol, which can result in 
> unnecessary connection timeouts.  I don't see why delaying 
> the REP gives any more control over the timeout value - if 
> anything it will have to be the worst case, while the RNR 
> timeout could be the common case, with the worst case 
> resulting in more retries.
> 
> > As for the latencies that will be introduced by your patch: it
> > seems that the current code already takes the lock at the beginning
> > of the recv, so another check whether this is the first buffer
> > shouldn't take more than a single if statement, which is less than
> > 10ns to my understanding.
> 
> There is no locking between WSPRecv and the completion 
> processing.  To properly add buffering, we would need to 
> introduce such locking.  The cases we would need to handle are:
> 
> 1. buffered receive completes before the first WSPRecv call
> 2. buffered receive completes after the first WSPRecv call
> 3. buffered receive completes during the first WSPRecv call
> 
> Case number 3 requires the serialization between the WSPRecv 
> call and the completion callback.  This serialization would 
> introduce extra latency, not the individual if statements.
> 
> > In any case I'll test your fix and see how things are working.
> 
> Thanks, let me know if you find a better value for the retry timeout.
> 
> - Fab
> 


