[openib-general] scaling issues, was: uDAPL cma: add support for address and route retries, call disconnect when recving dreq

Sean Hefty mshefty at ichips.intel.com
Thu Nov 2 10:09:58 PST 2006


Or Gerlitz wrote:
> It would be very nice if you shared with the community the IB stack 
> issues revealed under scale-out testing... basically, what was the testbed?

We have a 256-node (512-processor) cluster that we can test with on the second 
Tuesday following the first Monday of any month with two full moons.  We're 
only now getting some time on the cluster, and our test capabilities are limited.

The main issue that we saw was that the SA simply doesn't scale.

> From what the patch does, I understand you attempt to handle timeouts on 
> address and route resolution, and the long disconnect delay.

Correct.

> Was the issue with address resolution that ARP request or reply 
> messages were getting lost?

This appears to be the case.  During test startup, we try to form all-to-all 
connections.  As we scaled up, the number of address resolutions that timed out 
also increased.  We suspect that this is a result of the ipoib broadcast channel 
getting hit with 100,000+ requests at once.
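
For reference, the retry handling amounts to something like the following.  
This is a minimal user-space sketch against librdmacm, not the actual patch; 
the retry count and 2000 ms timeout are illustrative, and it assumes the id 
is the only one on its event channel:

#include <rdma/rdma_cma.h>

/* Retry rdma_resolve_addr() when the ARP exchange times out.
 * Sketch only: error handling trimmed, constants illustrative. */
static int resolve_addr_with_retries(struct rdma_cm_id *id,
				     struct sockaddr *src,
				     struct sockaddr *dst,
				     struct rdma_event_channel *channel)
{
	struct rdma_cm_event *event;
	int retries = 5;			/* illustrative */

	while (retries--) {
		if (rdma_resolve_addr(id, src, dst, 2000 /* ms */))
			return -1;
		if (rdma_get_cm_event(channel, &event))
			return -1;
		if (event->event == RDMA_CM_EVENT_ADDR_RESOLVED) {
			rdma_ack_cm_event(event);
			return 0;	/* address resolved */
		}
		/* RDMA_CM_EVENT_ADDR_ERROR: the broadcast ARP was
		 * likely lost under load - ack and try again. */
		rdma_ack_cm_event(event);
	}
	return -1;
}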

> Was the issue with route resolution that SA path queries were timing out?

Yes - but the issues are more complex than that.

The SA was able to respond to 4000-6000 queries per second.  With an all-to-all 
connection model, it receives about 130,000 requests.  Even assuming that none 
of these are lost, a 4 second timeout means the SA can answer only a fraction 
of the original requests in time - roughly the first 4 seconds' worth, or about 
20,000.  The next 100,000+ requests that it responds to have already timed out 
before it can send the response.

At 5000 queries per second, it takes the SA nearly 30 seconds to respond to 
the first set of requests, most of which will have timed out by then.  By the 
time it reached the end of the first 130,000 requests, it had hundreds of 
thousands of queued retries, most of which had also already timed out.  (E.g. 
even with an exponential backoff, you'd see retries at 4 seconds, 12 seconds, 
and 28 seconds before the SA can finish processing the first set of requests.)
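
To make the arithmetic concrete, here is a back-of-envelope model of the 
pile-up (standalone C; the 5000/sec rate, 130,000 queries, and 4 second 
timeout are just the numbers above):

#include <stdio.h>

int main(void)
{
	const double rate = 5000.0;		/* SA responses per second */
	const double queries = 130000.0;	/* all-to-all path queries */
	const double timeout = 4.0;		/* client timeout, seconds */

	/* Time to drain the initial burst, ignoring retries. */
	printf("drain time: %.0f sec\n", queries / rate);	/* 26 sec */

	/* Only queries answered inside the timeout window survive. */
	printf("answered in time: %.0f of %.0f\n",
	       rate * timeout, queries);	/* 20,000 of 130,000 */

	/* With exponential backoff, retries land at t = 4, 12, and 28
	 * seconds, while the SA is still draining the original burst,
	 * so the queue only grows. */
	return 0;
}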

To further complicate the issue, retried requests are given new transaction IDs 
by the ib_sa module, which makes it impossible for the SA to distinguish retries 
from original requests.  It sees every request as new.  On our largest run, we 
were never able to complete route resolution.
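
One direction we're considering is keeping the transaction ID stable across 
retries, so the SA can recognize and discard duplicates.  A hypothetical 
sketch of the idea - this is not the current ib_sa code, and next_tid and 
send_query() are made up for illustration:

#include <stdint.h>
#include <stdio.h>

static uint64_t next_tid;

struct sa_query {
	uint64_t tid;		/* assigned once, at creation */
};

static void send_query(struct sa_query *q)
{
	/* Reuse q->tid on every (re)send.  The behavior described
	 * above is equivalent to doing q->tid = ++next_tid here,
	 * which makes every retry look like a brand-new request. */
	printf("sending query, tid=%llu\n", (unsigned long long)q->tid);
}

int main(void)
{
	struct sa_query q = { .tid = ++next_tid };

	send_query(&q);		/* original request */
	send_query(&q);		/* retry: same TID, SA can dedup */
	return 0;
}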

We're still exploring possibilities in this area.

> Was the issue with the disconnect delay that peer A called 
> dat_ep_disconnect() (i.e. sending a DREQ) and the DREP was sent only when 
> peer B got the disconnect event and called dat_ep_disconnect()?  So now 
> the DREP is sent from within the provider code when it gets the DREQ?

The disconnect delay occurred because remote nodes were slow to respond to 
disconnect requests.  We're still investigating this issue.
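
For what it's worth, the uDAPL change mentioned in the subject boils down to 
acking the DREQ as soon as the disconnect event arrives, rather than waiting 
for the application.  A sketch of the event-loop fragment, assuming the usual 
librdmacm event handling:

#include <rdma/rdma_cma.h>

/* When the remote DREQ arrives as a DISCONNECTED event, call
 * rdma_disconnect() right away so the DREP goes out from provider
 * code instead of waiting for the consumer to disconnect. */
static void process_event(struct rdma_cm_event *event)
{
	switch (event->event) {
	case RDMA_CM_EVENT_DISCONNECTED:
		rdma_disconnect(event->id);	/* sends the DREP now */
		/* ... then surface the disconnect to the consumer ... */
		break;
	default:
		break;
	}
	rdma_ack_cm_event(event);
}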

- Sean



