[openib-general] scaling issues, was: uDAPL cma: add support for address and route retries, call disconnect when recving dreq
Or Gerlitz
ogerlitz at voltaire.com
Thu Nov 9 07:20:47 PST 2006
Sean Hefty wrote:
> Or Gerlitz wrote:
>> It would be very nice if you shared with the community the IB stack
>> issues revealed under scale-out testing... basically, what was the testbed?
> We have a 256-node (512-processor) cluster that we can test with on the
> second Tuesday following the first Monday of any month with two full
> moons. We're only now getting some time on the cluster, and our test
> capabilities are limited.
> The main issue that we saw was that the SA simply doesn't scale.
I see. Thanks for the detailed response, and sorry for the lack of a
reply on my side so far; I was too busy...
Your email describes the problem under the all-to-all connection model.
My thinking is that this design is the first one that should be
revisited. I understand that Open MPI opens connections on demand (and
at this point in time it does not use the IB stack's connection
management services either). Even in the all-to-all connection model, a
question to ask is whether the connecting is done in N phases, or
whether each rank i simply calls, in a single loop,
for(j=i+1; j<n; j++)
dat_ep_connect(ep[j], ip-address of peer j)
and then
while(there are more non established connections)
dat_evd_wait(...)
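
For concreteness, here is a minimal sketch of that single-phase variant
against the uDAT API (error handling is omitted, and the endpoint array,
peer address array, connection qualifier, and connect EVD are assumed to
have been created elsewhere):

    #include <dat/udat.h>

    /* Sketch only: fire off all connects at once, then reap
     * DAT_CONNECTION_EVENT_ESTABLISHED events from the connect EVD.
     * ep[], peer_addr[], conn_qual and conn_evd are assumed to exist;
     * connections from lower-ranked peers arrive passively elsewhere. */
    void connect_all(DAT_EP_HANDLE *ep, DAT_IA_ADDRESS_PTR *peer_addr,
                     DAT_CONN_QUAL conn_qual, DAT_EVD_HANDLE conn_evd,
                     int i, int n)
    {
            DAT_EVENT event;
            DAT_COUNT nmore;
            int j, pending = 0;

            for (j = i + 1; j < n; j++) {
                    dat_ep_connect(ep[j], peer_addr[j], conn_qual,
                                   DAT_TIMEOUT_INFINITE, 0, NULL,
                                   DAT_QOS_BEST_EFFORT,
                                   DAT_CONNECT_DEFAULT_FLAG);
                    pending++;
            }

            while (pending > 0) {
                    dat_evd_wait(conn_evd, DAT_TIMEOUT_INFINITE, 1,
                                 &event, &nmore);
                    if (event.event_number ==
                        DAT_CONNECTION_EVENT_ESTABLISHED)
                            pending--;
                    /* real code must also handle rejected/timed-out
                     * connections and decide whether to retry */
            }
    }

The point is that in this single-phase variant every rank's N-1 address
and route resolutions hit the SA at essentially the same moment, which
is exactly the burst you describe below.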
> At 5000 queries per second, it will take the SA nearly 30 seconds to
> respond to the first set of requests, most of which will have timed
> out. By the time it reached the end of the first 130,000 requests, it
> had hundreds of thousands of queued retries, most of which had also
> already timed out. (E.g. even with an exponential backoff, you'd have
> retries at 4 seconds, 12 seconds, and 28 seconds before the SA can
> finish processing the first set of requests.)
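(Working the arithmetic, on the assumption that the initial MAD timeout
is 4 seconds and doubles on each retry: the first retry fires at 4 s,
the second at 4 + 8 = 12 s, and the third at 12 + 16 = 28 s. And the
~130,000 figure is presumably the full all-to-all sweep, on the order of
256 * 255 * 2 = 130,560 queries, which at 5,000 queries/second takes
roughly 26 s to drain.)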
> To further complicate the issue, retried requests are given new
> transaction IDs by the ib_sa module, which makes it impossible for the
> SA to distinguish retries from original requests. It sees all requests as
> new. On our largest run, we were never able to complete route resolution.
OK, I recall a patch or RFC you posted which lets a response to the
original request match a "pending retry"; basically it means that all
the retries use the TID of the original request, correct? Or am I
dreaming, and is this indeed somewhere in the pipe to the kernel?
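
If I understand it correctly, the gist would be something like the
sketch below (the names and the helper are mine for illustration, not
the actual ib_sa code): the TID is assigned once per logical query and
reused for every retransmission, so a late response to any earlier send
still matches, and the SA can recognize a retry as a duplicate rather
than a new request:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical stand-in for the real MAD send path. */
    static void send_mad_with_tid(uint64_t tid)
    {
            printf("(re)send MAD, tid=0x%llx\n",
                   (unsigned long long) tid);
    }

    /* Sketch: one TID per logical query, fixed across retries. */
    struct sa_query {
            uint64_t tid;           /* assigned exactly once */
            int retries_left;
    };

    static void sa_query_send(struct sa_query *q, uint64_t tid,
                              int max_retries)
    {
            q->tid = tid;
            q->retries_left = max_retries;
            send_mad_with_tid(q->tid);
    }

    static void sa_query_retry(struct sa_query *q)
    {
            if (q->retries_left-- > 0)
                    send_mad_with_tid(q->tid); /* same TID, not new */
    }

    /* A response matches whether it answers the first send or any
     * retransmission, since the TID never changed. */
    static int sa_response_matches(const struct sa_query *q,
                                   uint64_t response_tid)
    {
            return response_tid == q->tid;
    }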
> We're still exploring possibilities in this area.
Or.