[openib-general] scaling issues, was: uDAPL cma: add support for address and route retries, call disconnect when recving dreq
Sean Hefty
mshefty at ichips.intel.com
Fri Nov 3 11:41:50 PST 2006
We were able to get some more test time on the cluster. Our latest findings are
below.
> The main issue that we saw was that the SA simply doesn't scale.
From what we could see, it didn't appear that _any_ path record queries were
ever lost, even when scaling up to 500,000+ requests. As long as the query
timeouts were large enough (dependent on process count), our tests would finish
within a reasonable time, and without retrying queries. If the timeout values
were too small, the SA would form a backlog of timed-out requests.
With 1024 processes trying to establish all-to-all connections, it would take
about 30 seconds for all nodes to complete path record queries. The SA was able
to sustain about 17,000 queries per second.
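As a rough sanity check on the figures above, the sketch below estimates how long the SA would need to drain the queries for an all-to-all run. It assumes one path record query per process pair, which is an assumption and not something stated in this thread, though it is consistent with the 500,000+ requests and roughly 30 seconds reported. The function names are hypothetical helpers for illustration only.

```python
# Back-of-the-envelope estimate, not a model of the SA itself.
# Assumption: each unordered pair of processes issues exactly one
# path record query (one side of each connection does the lookup).

def pair_count(n):
    """Number of unordered process pairs in an all-to-all pattern."""
    return n * (n - 1) // 2

def drain_seconds(queries, rate_per_sec):
    """Time to service `queries` at a sustained rate, ignoring retries."""
    return queries / rate_per_sec

procs = 1024        # process count from the test run
sa_rate = 17_000    # sustained queries per second reported above

queries = pair_count(procs)
print(queries)                                    # 523776
print(round(drain_seconds(queries, sa_rate), 1))  # 30.8
```

Under the same assumption, a query issued at the start of the run may wait close to the full drain time before it is answered, which would be one reason the usable query timeout depends on process count.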
>>Was the issue with address resolution being ARP request or reply
>>messages getting lost?
We only just started looking into this when we were bumped off the cluster. In
our initial peek at this, it looked like either the ARP requests or replies were
being discarded on transmit. Simply increasing the ARP cache timeout fixed most
of the problems for us.
> The disconnect delay occurred because of remote nodes being slow to respond to
> disconnect requests. We're still investigating this issue.
This was a DAPL issue.
- Sean