[openib-general] [PATCH for-2.6.18] Re: [PATCH] IB/cma: add rdma_establish

Michael S. Tsirkin mst at mellanox.co.il
Wed Sep 13 05:01:54 PDT 2006


Quoting r. Sean Hefty <mshefty at ichips.intel.com>:
> Subject: Re: [PATCH] IB/cma: add rdma_establish
> 
> Michael S. Tsirkin wrote:
> >>>As a side note, reasons for frequent loss of RTU must be investigated.
> >>
> >>A lost RTU shouldn't be any more likely than a lost REQ or REP.  Is the RTU 
> >>never showing up?
> > 
> > 
> > Seems like that. I know for sure I do accept after REP but remote side never
> > gets ESTABLISHED.
> 
> I looked at the code, then ran some tests.  The REP is retried until an RTU is 
> received, or its number of retries is exhausted.  By modifying the IB CM, I was 
> able to force RTU drops.  Using madeye, I could see that the REP would be 
> retried, resulting in the RTU being resent.  After 4 drops, I had the code 
> receive the RTU, which allowed the test to proceed.
> 
> A couple things to look at in OFED would be the setting of max cm retries and 
> the cm timeout.

What I think we need for 2.6.18 is the following. Please comment.


IB/cma: increase the CM retry count in the CMA from 3 to 15, the maximum.
3 seems low: we see connections failing under stress, and in any case it looks
like an arbitrary number. 15 is the maximum value allowed by the spec.

Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index d6f99d5..5d625a8 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -49,7 +49,7 @@ MODULE_DESCRIPTION("Generic RDMA CM Agen
 MODULE_LICENSE("Dual BSD/GPL");
 
 #define CMA_CM_RESPONSE_TIMEOUT 20
-#define CMA_MAX_CM_RETRIES 3
+#define CMA_MAX_CM_RETRIES 15
 
 static void cma_add_one(struct ib_device *device);
 static void cma_remove_one(struct ib_device *device);
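
For a rough sense of what the two constants in this hunk mean in wall-clock
terms, here is a minimal sketch. It assumes the usual IB timeout encoding
(4.096 us * 2^value) and, as a simplification, that each CM message is sent
once plus up to the retry count, with every attempt waiting one full response
timeout. The helper names ib_timeout_sec() and worst_case_wait_sec() are
hypothetical and are not part of cma.c or the IB CM.

```c
#include <assert.h>

/* Values taken from the patch above. */
#define CMA_CM_RESPONSE_TIMEOUT 20
#define CMA_MAX_CM_RETRIES_OLD 3	/* before the patch */
#define CMA_MAX_CM_RETRIES_NEW 15	/* after the patch */

/* IB timeout encoding: 4.096 us * 2^exponent; result in seconds. */
static double ib_timeout_sec(unsigned int exponent)
{
	assert(exponent < 32);	/* keep the shift well defined */
	return 4.096e-6 * (double)(1ULL << exponent);
}

/*
 * Hypothetical helper, not a kernel function. Assumes one initial send
 * plus `retries` resends, each waiting one full timeout before giving up.
 */
static double worst_case_wait_sec(unsigned int exponent, unsigned int retries)
{
	return ib_timeout_sec(exponent) * (double)(retries + 1);
}
```

Under these assumptions a timeout value of 20 is about 4.3 seconds per
attempt, so the worst case before giving up grows from roughly 17 seconds
(3 retries) to roughly 69 seconds (15 retries). The real figure depends on how
the CM derives the per-message timeout, so treat this as an estimate only.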

-- 
MST



