[openib-general] scaling issues, was: uDAPL cma: add support for address and route retries, call disconnect when recving dreq

Fri Nov 3 11:41:50 PST 2006

We were able to get some more test time on the cluster.  Our latest findings are 
below.

> The main issue that we saw was that the SA simply doesn't scale.

From what we could see, it didn't appear that _any_ path record queries were
ever lost, even when scaling up to 500,000+ requests.  As long as the query
timeouts were large enough (dependent on process count), our tests would finish
within a reasonable time, and without retrying queries.  If the timeout values
were too small, the SA would build up a backlog of timed-out requests.

With 1024 processes trying to establish all-to-all connections, it took
about 30 seconds for all nodes to complete path record queries.  The SA was able
to sustain about 17,000 queries per second.
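
As a rough sanity check on these figures, the sketch below works through the
arithmetic.  It assumes one path record query per unordered pair of processes
and that all queries are outstanding at once; both are assumptions made for
illustration, not something we measured, and the function names are
hypothetical.  Under those assumptions the query count and the time to drain
the queue line up with the 500,000+ requests and roughly 30 seconds above.

```python
# Back-of-the-envelope check of the numbers quoted in this mail.
# Assumption (not measured): one path record query per unordered
# pair of processes, all issued at roughly the same time.

def all_to_all_queries(processes: int) -> int:
    """Number of unordered process pairs, i.e. assumed query count."""
    return processes * (processes - 1) // 2


def drain_seconds(processes: int, sa_rate: float) -> float:
    """Time for the SA to answer every query at a sustained rate."""
    return all_to_all_queries(processes) / sa_rate


if __name__ == "__main__":
    n = 1024            # process count from the test run
    rate = 17_000       # sustained SA queries per second, as observed
    print(all_to_all_queries(n))          # 523776
    print(round(drain_seconds(n, rate), 1))  # 30.8
```

Under this simple model the last query answered waits for about the full drain
time, which would be one way to explain why the timeout has to grow with the
process count.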

>>Was the issue with address resolution being ARP request or reply 
>>messages getting lost?

We only just started looking into this when we were bumped off the cluster.  In 
our initial peek at this, it looked like either the ARP requests or replies were 
being discarded on transmit.  Simply increasing the ARP cache timeout fixed most 
of the problems for us.

> The disconnect delay occurred because of remote nodes being slow to respond to 
> disconnect requests.  We're still investigating this issue.

This was a DAPL issue.

- Sean