[openib-general] scaling issues, was: uDAPL cma: add support for address and route retries, call disconnect when recving dreq
Or Gerlitz
ogerlitz at voltaire.com
Thu Nov 9 07:20:47 PST 2006
Sean Hefty wrote:
> Or Gerlitz wrote:
>> It would be very nice if you shared with the community the IB stack
>> issues revealed under scale-out testing... basically, what was the testbed?
> We have a 256-node (512-processor) cluster that we can test with on the
> second Tuesday following the first Monday of any month with two full
> moons. We're only now getting some time on the cluster, and our test
> capabilities are limited.
> The main issue that we saw was that the SA simply doesn't scale.
I see. Thanks for the detailed response, and sorry for the lack of a
reply on my side so far; I was too busy...
Your email describes the problem under the all-to-all connection model.
My thinking is that this design is the first one that should be
revisited. I understand that Open MPI opens connections on demand (and
at this point in time it does not use the IB stack's connection
management services either). Even in the all-to-all connection model, a
question to ask is whether the connecting is done in N phases, or
whether each rank i simply calls, in a single loop,
for(j=i+1; j<n; j++)
dat_ep_connect(ep[j], ip-address of peer j)
and then
while(there are more non established connections)
dat_evd_wait(...)
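
For concreteness, here is a minimal sketch of that single-phase variant
against the uDAT API (error handling is omitted, and the endpoint array,
peer address array, connection qualifier, and connect EVD are assumed to
have been created elsewhere):

    #include <dat/udat.h>

    /* Sketch only: fire off all connects at once, then reap
     * DAT_CONNECTION_EVENT_ESTABLISHED events from the connect EVD.
     * ep[], peer_addr[], conn_qual and conn_evd are assumed to exist;
     * connections from lower-ranked peers arrive passively elsewhere. */
    void connect_all(DAT_EP_HANDLE *ep, DAT_IA_ADDRESS_PTR *peer_addr,
                     DAT_CONN_QUAL conn_qual, DAT_EVD_HANDLE conn_evd,
                     int i, int n)
    {
            DAT_EVENT event;
            DAT_COUNT nmore;
            int j, pending = 0;

            for (j = i + 1; j < n; j++) {
                    dat_ep_connect(ep[j], peer_addr[j], conn_qual,
                                   DAT_TIMEOUT_INFINITE, 0, NULL,
                                   DAT_QOS_BEST_EFFORT,
                                   DAT_CONNECT_DEFAULT_FLAG);
                    pending++;
            }

            while (pending > 0) {
                    dat_evd_wait(conn_evd, DAT_TIMEOUT_INFINITE, 1,
                                 &event, &nmore);
                    if (event.event_number ==
                        DAT_CONNECTION_EVENT_ESTABLISHED)
                            pending--;
                    /* real code must also handle rejected/timed-out
                     * connections and decide whether to retry */
            }
    }

The point is that in this single-phase variant every rank's N-1 address
and route resolutions hit the SA at essentially the same moment, which
is exactly the burst you describe below.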
> At 5000 queries per second, it will take the SA nearly 30 seconds to
> respond to the first set of requests, most of which will have timed
> out. By the time it reached the end of the first 130,000 requests, it
> had hundreds of thousands of queued retries, most of which had also
> already timed out. (E.g. even with an exponential backoff, you'd have
> retries at 4 seconds, 12 seconds, and 28 seconds before the SA can
> finish processing the first set of requests.)
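(Working the arithmetic, on the assumption that the initial MAD timeout
is 4 seconds and doubles on each retry: the first retry fires at 4 s,
the second at 4 + 8 = 12 s, and the third at 12 + 16 = 28 s. And the
~130,000 figure is presumably the full all-to-all sweep, on the order of
256 * 255 * 2 = 130,560 queries, which at 5,000 queries/second takes
roughly 26 s to drain.)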
> To further complicate the issue, retried requests are given new
> transaction IDs by the ib_sa module, which makes it impossible for the
> SA to distinguish retries from original requests. It sees all requests as
> new. On our largest run, we were never able to complete route resolution.
OK, I recall a patch or RFC you posted which lets a response to the
original request match a "pending retry"; basically it means that all
the retries use the TID of the original request, correct? Or am I
dreaming, and is this indeed somewhere in the pipe to the kernel?
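
If I understand it correctly, the gist would be something like the
sketch below (the names and the helper are mine for illustration, not
the actual ib_sa code): the TID is assigned once per logical query and
reused for every retransmission, so a late response to any earlier send
still matches, and the SA can recognize a retry as a duplicate rather
than a new request:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical stand-in for the real MAD send path. */
    static void send_mad_with_tid(uint64_t tid)
    {
            printf("(re)send MAD, tid=0x%llx\n",
                   (unsigned long long) tid);
    }

    /* Sketch: one TID per logical query, fixed across retries. */
    struct sa_query {
            uint64_t tid;           /* assigned exactly once */
            int retries_left;
    };

    static void sa_query_send(struct sa_query *q, uint64_t tid,
                              int max_retries)
    {
            q->tid = tid;
            q->retries_left = max_retries;
            send_mad_with_tid(q->tid);
    }

    static void sa_query_retry(struct sa_query *q)
    {
            if (q->retries_left-- > 0)
                    send_mad_with_tid(q->tid); /* same TID, not new */
    }

    /* A response matches whether it answers the first send or any
     * retransmission, since the TID never changed. */
    static int sa_response_matches(const struct sa_query *q,
                                   uint64_t response_tid)
    {
            return response_tid == q->tid;
    }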
> We're still exploring possibilities in this area.
Or.