[ofa-general] IPoIB-UD post_send failures (OFED 1.3)

Thu May 15 15:34:18 PDT 2008

Last night we were able to reproduce the bug that I reported at the 
beginning of this thread. 

In brief, the bug is that we stop getting completions on the IPoIB-UD 
send queue. The queue fills, and we get an endless stream of "post_send 
failed".

This is with 2.6.16.46-0.12 (SLES 10 SP1), and OFED 1.3, running 
on a moderately large (512 CPU) cluster. IB HCA is MT25204, f/w 
1.2.0.

The workload is an MPI job of some sort, and the failure seems 
to happen very soon (within minutes) after the job starts. We're 
using CM.

I've added some debug output to the driver. 

The debug driver prints the "tx_outstanding" value when we get 
post_send failures in ipoib_send():

ib0: tx_outstanding 0x82 (ipoib_sendq_size 0x80)
ib0: tx_outstanding 0x83 (ipoib_sendq_size 0x80)
ib0: tx_outstanding 0x83 (ipoib_sendq_size 0x80)
ib0: tx_outstanding 0x84 (ipoib_sendq_size 0x80)
....

(We never call netif_stop_queue().)

I also instrumented calls to mthca_arbel_post_send() and 
mthca_poll_one() - we keep a circular buffer of the last 
0x80 sends, and completions. 

One curious thing is that the mthca_poll_one() and 
mthca_arbel_post_send() routines are usually called at a 
very regular rate, e.g., here are the last few calls 
to mthca_poll_one():

# delta_t    head      tail
# [jiffies]
....
   125      0x5c       0x5b
   125      0x5d       0x5c
   125      0x5e       0x5d
   125      0x5f       0x5e
   125      0x60       0x5f
   125      0x61       0x60
   125      0x62       0x61
   125      0x63       0x62
   125      0x64       0x63
   125      0x65       0x64
   125      0x66       0x65
   125      0x67       0x66
   125      0x68       0x67
   125      0x69       0x68
   125      0x69       0x69

After a short time, we just stop logging any more calls 
to mthca_poll_one().

Then it took a few minutes to fill the queue, and start 
making the "post_send failed" messages. The last few 
succeeding calls to mthca_arbel_post_send() were:

# delta_t    head      tail
# [jiffies]
....
   125      0xdc       0x69
   125      0xdd       0x69
   125      0xde       0x69
   125      0xdf       0x69
   125      0xe0       0x69
   125      0xe1       0x69
   125      0xe2       0x69
   125      0xe3       0x69
   125      0xe4       0x69
   125      0xe5       0x69
   125      0xe6       0x69
   125      0xe7       0x69
   2860     0xe8       0x69
   250      0xe9       0x69

HZ = 250, so we're calling these routines twice per second. 

Looks like we must be doing:

static void ipoib_ib_tx_timer_func(unsigned long dev_ptr)
{

	if (post_zlen_send_wr(priv, wrid)) {
		...
	....
	poll_tx(priv);
	...
	mod_timer(&priv->poll_timer, jiffies + HZ / 2);
}

Apparently when we do:

static inline int mthca_poll_one(

	...
        cqe = next_cqe_sw(cq);
        if (!cqe)
                return -EAGAIN;

we find that the next CQE is owned by h/w (if we hadn't returned 
early, the debug code would've logged the poll.)

-- 
Arthur