[ofa-general] [Bug 465] IPoIB CM HA fails after several hours of failures

Michael S. Tsirkin mst at dev.mellanox.co.il
Tue Mar 27 03:02:56 PDT 2007


I am copying the general list on this bug report so that
we can start discussion by mail.

Please "reply all", keeping the bugmail address and keeping [Bug 465]
in the subject, so that this thread gets tracked in bugzilla.


> I've been trying IPoIB CM HA for a few weeks, and can't get it to run
> overnight.

...

> Here is the dmesg output.
> A copy is here:
> https://bugs.openfabrics.org/attachment.cgi?id=106&action=view

...

> ib1: failed cm send event (status=12, wrid=28 vend_err 81)

Status 12 (IB_WC_RETRY_EXC_ERR) means the remote side is not sending acks,
or has destroyed the QP.

> ib0: failed cm send event (status=13, wrid=2 vend_err 87)

Status 13 (IB_WC_RNR_RETRY_EXC_ERR) means the remote side is not posting
receive WRs.

To debug the above 2 errors, we need the log from the remote side.
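
For reference, the two status values above can be decoded by hand with a
small userspace table. This is only a sketch: the table is assumed to mirror
the order of enum ib_wc_status in include/rdma/ib_verbs.h, so check it against
your kernel tree, and wc_status_name() is a hypothetical helper name, not an
existing kernel function.

```c
#include <stddef.h>

/*
 * Assumption: same order as enum ib_wc_status in
 * include/rdma/ib_verbs.h. Verify against the tree you are running.
 */
static const char *const wc_status_names[] = {
	"IB_WC_SUCCESS",            /*  0 */
	"IB_WC_LOC_LEN_ERR",        /*  1 */
	"IB_WC_LOC_QP_OP_ERR",      /*  2 */
	"IB_WC_LOC_EEC_OP_ERR",     /*  3 */
	"IB_WC_LOC_PROT_ERR",       /*  4 */
	"IB_WC_WR_FLUSH_ERR",       /*  5 */
	"IB_WC_MW_BIND_ERR",        /*  6 */
	"IB_WC_BAD_RESP_ERR",       /*  7 */
	"IB_WC_LOC_ACCESS_ERR",     /*  8 */
	"IB_WC_REM_INV_REQ_ERR",    /*  9 */
	"IB_WC_REM_ACCESS_ERR",     /* 10 */
	"IB_WC_REM_OP_ERR",         /* 11 */
	"IB_WC_RETRY_EXC_ERR",      /* 12: no acks, or QP destroyed */
	"IB_WC_RNR_RETRY_EXC_ERR",  /* 13: no receive WRs posted */
	"IB_WC_LOC_RDD_VIOL_ERR",   /* 14 */
	"IB_WC_REM_INV_RD_REQ_ERR", /* 15 */
	"IB_WC_REM_ABORT_ERR",      /* 16 */
	"IB_WC_INV_EECN_ERR",       /* 17 */
	"IB_WC_INV_EEC_STATE_ERR",  /* 18 */
	"IB_WC_FATAL_ERR",          /* 19 */
	"IB_WC_RESP_TIMEOUT_ERR",   /* 20 */
	"IB_WC_GENERAL_ERR",        /* 21 */
};

/*
 * Hypothetical helper: return the symbolic name for a numeric
 * completion status, or "unknown" if the value is out of range.
 */
const char *wc_status_name(int status)
{
	size_t n = sizeof(wc_status_names) / sizeof(wc_status_names[0]);

	if (status < 0 || (size_t)status >= n)
		return "unknown";
	return wc_status_names[status];
}
```

Calling wc_status_name() with the status= value from a "failed cm send event"
line gives the symbolic name to search for in the verbs headers.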

...

> ib0: Request connection 0x8e05bd for gid fe80:0000:0000:0000:0005:ad00:0020:084a qpn 0x405
> ib0: CM error 0.

This is a local error. 0 corresponds to IB_CM_REQ_ERROR.
Is it possible that you were removing the device or module
when this happened?

If yes, then it's not a problem.  If no, I attach a patch that will print out the
actual error that triggered this event. It cannot fix anything, but if you run
with this patch and reproduce the CM error above, we will get more information.

> 
> Other times netperf hangs or fails.
> 
> Restarting netperf as is never works.  Sometimes I can restart netperf with
> default socket buffer sizes.

If you can reproduce the hang/failure, that might be educational as well.


---

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 842cd0b..3b74ec6 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -2941,6 +2941,8 @@ static void cm_process_send_error(struct ib_mad_send_buf *msg,
 	switch (state) {
 	case IB_CM_REQ_SENT:
 	case IB_CM_MRA_REQ_RCVD:
+		printk("cm_process_send_error state %d wc_status %d\n",
+		       state, wc_status);
 		cm_reset_to_idle(cm_id_priv);
 		cm_event.event = IB_CM_REQ_ERROR;
 		break;

-- 
MST


