[ewg] [PATCH] Handling busy responses from the SA

Fri Jun 4 15:57:39 PDT 2010

On Fri, Jun 04, 2010 at 02:05:10PM -0700, Hefty, Sean wrote:

> Maybe we should re-think that guideline and allow users to simply
> indicate that the MAD layer should use reasonable defaults.  This
> would enable the ib_mad module to adjust the timeout values for all
> consumers based on actual destination response times.  It could also
> back off retrying multiple requests that were initiated around the
> same time, instead only retrying the first request, while simply
> increasing the timeout values for the others.  This is more complex,
> but we should be able to start with something fairly simple.

A common method for handling this sort of thing is to randomize
the retry timeout. It would be a good idea to randomize all timeouts,
but the BUSY replies should probably randomize over a longer time
period.

Randomization prevents nodes in the cluster from self-synchronizing
and making the load on the SA worse.
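
To make the randomization idea concrete, here is a minimal sketch. The
constants and the name retry_delay_ms are made up for illustration and
are not taken from ib_mad; the only point is that every retry delay gets
a random component, and that a BUSY reply draws from a wider window than
an ordinary timeout.

```python
import random

# Hypothetical values in milliseconds, chosen only for illustration.
BASE_TIMEOUT_MS = 1000     # nominal retry timeout
NORMAL_SPREAD_MS = 500     # jitter window after an ordinary timeout
BUSY_SPREAD_MS = 8000      # wider jitter window after a BUSY reply


def retry_delay_ms(got_busy, rng=random):
    """Return a randomized delay (ms) to wait before the next retry."""
    spread = BUSY_SPREAD_MS if got_busy else NORMAL_SPREAD_MS
    # A uniform offset keeps nodes that sent at the same moment from
    # retrying at the same moment.
    return BASE_TIMEOUT_MS + rng.uniform(0, spread)
```

Two nodes whose requests time out together will almost never pick the
same delay, so their retries reach the SA spread out over the window
instead of in a burst.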

But I also agree with Roland: having the SA return busy when it is
under load seems insane :) If you really want to do this, though, I
think a different, larger timeout should be used than the standard
MAD timeout.

Also, I guess it would be a good API choice if the caller could say
'get me a reply for this MAD or an error within 60s' rather than specify
details like retry counts, etc. The timeout values should be globally
set and derived from the usual SA-provided data for network transits...
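
A rough sketch of what such a deadline-style call could look like. All
names here are hypothetical: send_once stands in for a single MAD
transmit-and-wait, and per_try_timeout_s stands in for the globally set
per-attempt timeout. The caller only supplies the overall deadline.

```python
import time


def send_with_deadline(send_once, deadline_s, per_try_timeout_s,
                       clock=time.monotonic):
    """Retry until a reply arrives or deadline_s seconds have elapsed.

    send_once(timeout_s) is a hypothetical callable that returns the
    reply, or None if nothing arrived within timeout_s.
    """
    end = clock() + deadline_s
    while True:
        remaining = end - clock()
        if remaining <= 0:
            raise TimeoutError("no reply before deadline")
        # Never wait past the caller's deadline on any single attempt.
        reply = send_once(min(per_try_timeout_s, remaining))
        if reply is not None:
            return reply
```

The caller never sees a retry count; how many attempts fit inside the
deadline falls out of the per-attempt timeout.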

Jason