[openib-general] cmpost test failures

Sean Hefty mshefty at ichips.intel.com
Mon Apr 24 11:05:37 PDT 2006


Ali Ayoub wrote:
> 1. If I change the local and the remote timeout for ib_cm_req_param to 
> 40 (instead of 20, the default value) it causes kernel oops.

The timeout is calculated as: 4.096 usec x 2 ^ timeout.  In highly technical 
terms, going from 20 to 40 increases the timeout by a factor of a lot (2^20, 
which takes it from seconds to weeks).
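
For a rough sense of scale, the arithmetic can be sketched as below.  This is 
only an illustration: it assumes the 4.096 figure is in microseconds, and the 
function name is made up for the example rather than taken from ib_cm.

```python
def cm_timeout_seconds(exponent: int) -> float:
    """Timeout in seconds for a given exponent: 4.096 usec * 2^exponent.

    Hypothetical helper for illustration; assumes the base unit is usec.
    """
    return 4.096e-6 * (2 ** exponent)


SECONDS_PER_WEEK = 7 * 24 * 3600

t20 = cm_timeout_seconds(20)
t40 = cm_timeout_seconds(40)

print(f"timeout=20: {t20:.2f} s")                          # about 4.29 s
print(f"timeout=40: {t40 / SECONDS_PER_WEEK:.1f} weeks")   # about 7.4 weeks
print(f"ratio: {t40 / t20:.0f}")                           # 2^20 = 1048576
```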

Since the oops occurred in cmpost, I'm not overly concerned with trying to debug 
this at the moment.  (I will happily take a patch that fixes the issue, or will 
look at it more if it definitely looks like an ib_cm bug.  Cmpost just isn't 
meant to be a robust test program.)

> 2. With the following parameters:
> 
>             connections = 3000
> 
>             message_size = 200
> 
>             message_count = 10
> 
>             qp_type = RC
> 
> The test fails inconsistently; in some cases it causes a kernel oops,

This setup will result in allocating a fair amount of memory, which could 
explain the failures.  The oops may be related, but I can't tell just from the 
backtrace.  I've never run into this myself though.  Can you reproduce this 
issue using a smaller number of connections?

Note that when simultaneously establishing a large number of connections, you 
will end up overrunning QP 1 on the remote side.  This will result in a lot of 
dropped MADs, timeouts, and retries, which can make the results of the test 
unpredictable.

> 3. In other cases the server fails because it receives some 
> IB_CM_DREQ_ERROR when the client receives all the IB_CM_DREQ_RECEIVED.

This can occur, and is easier to reproduce for a large number of connections.  A 
DREQ is retried until a DREP is received.  However, since a DREP is not acked, 
once it has been sent, the disconnect is done from the client's perspective.  If 
the DREP is lost, the server will see a DREQ timeout.

There is code in the ib_cm to resend a DREP in response to a repeated DREQ, but 
the state needed to generate the DREP is only maintained while the old 
connection is in timewait.
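
The sequence above can be sketched as a toy model.  This is not ib_cm code: 
the function, its parameters, and the result strings are invented for 
illustration, and it assumes that only DREPs get lost (DREQs always arrive) 
and that timewait can be expressed as "how many repeated DREQs can still be 
answered".

```python
def run_disconnect(lost_dreps, timewait_answers, max_retries):
    """Toy model of a server-initiated disconnect.

    lost_dreps: set of 0-based DREQ attempt numbers whose DREP is dropped.
    timewait_answers: how many repeated DREQs the client can still answer
        before its timewait state (needed to rebuild a DREP) is gone.
    max_retries: number of DREQ resends after the first attempt.
    Returns (server_result, client_done).
    """
    client_done = False
    for attempt in range(max_retries + 1):
        if not client_done:
            # First DREQ: client sends a DREP and, since the DREP is not
            # acked, considers the disconnect finished.
            client_done = True
            can_reply = True
        else:
            # Repeated DREQ: a DREP can be resent only while in timewait.
            can_reply = timewait_answers > 0
            if can_reply:
                timewait_answers -= 1
        if can_reply and attempt not in lost_dreps:
            return "DREP received", client_done
    # Every DREP was lost or could no longer be generated.
    return "DREQ timeout", client_done


# No loss: clean disconnect on both sides.
print(run_disconnect(set(), 0, 2))
# First DREP lost, client still in timewait: the retry succeeds.
print(run_disconnect({0}, 1, 2))
# First DREP lost, timewait already over: server times out, client is done.
print(run_disconnect({0}, 0, 2))
```

The last case is the one described in item 3: the client has seen its DREQ and 
finished, while the server ends up with a DREQ timeout.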

- Sean


