[ofa-general] IPoIB-UD post_send failures (OFED 1.3)
akepner at sgi.com
akepner at sgi.com
Thu May 15 15:34:18 PDT 2008
Last night we were able to reproduce the bug that I reported at the
beginning of this thread.
In brief, the bug is that we stop getting completions on the IPoIB-UD
send queue. The queue fills, and we get an endless stream of "post_send
failed".
This is with 2.6.16.46-0.12 (SLES 10 SP1), and OFED 1.3, running
on a moderately large (512 CPU) cluster. IB HCA is MT25204, f/w
1.2.0.
The workload is an MPI job of some sort, and the failure seems
to happen very soon (within minutes) after the job starts. We're
using CM.
I've added some debug output to the driver.
The debug driver prints the "tx_outstanding" value when we get
post_send failures in ipoib_send():
ib0: tx_outstanding 0x82 (ipoib_sendq_size 0x80)
ib0: tx_outstanding 0x83 (ipoib_sendq_size 0x80)
ib0: tx_outstanding 0x83 (ipoib_sendq_size 0x80)
ib0: tx_outstanding 0x84 (ipoib_sendq_size 0x80)
....
(We never call netif_stop_queue().)
I also instrumented calls to mthca_arbel_post_send() and
mthca_poll_one() - we keep a circular buffer of the last
0x80 sends, and completions.
One curious thing is that the mthca_poll_one() and
mthca_arbel_post_send() routines are usually called at a
very regular rate, e.g., here are the last few calls
to mthca_poll_one():
# delta_t head tail
# [jiffies]
....
125 0x5c 0x5b
125 0x5d 0x5c
125 0x5e 0x5d
125 0x5f 0x5e
125 0x60 0x5f
125 0x61 0x60
125 0x62 0x61
125 0x63 0x62
125 0x64 0x63
125 0x65 0x64
125 0x66 0x65
125 0x67 0x66
125 0x68 0x67
125 0x69 0x68
125 0x69 0x69
After a short time, we just stop logging any more calls
to mthca_poll_one().
Then it took a few minutes to fill the queue, and start
making the "post_send failed" messages. The last few
succeeding calls to mthca_arbel_post_send() were:
# delta_t head tail
# [jiffies]
....
125 0xdc 0x69
125 0xdd 0x69
125 0xde 0x69
125 0xdf 0x69
125 0xe0 0x69
125 0xe1 0x69
125 0xe2 0x69
125 0xe3 0x69
125 0xe4 0x69
125 0xe5 0x69
125 0xe6 0x69
125 0xe7 0x69
2860 0xe8 0x69
250 0xe9 0x69
HZ = 250, so we're calling these routines twice per second.
Looks like we must be doing:
static void ipoib_ib_tx_timer_func(unsigned long dev_ptr)
{
if (post_zlen_send_wr(priv, wrid)) {
...
....
poll_tx(priv);
...
mod_timer(&priv->poll_timer, jiffies + HZ / 2);
}
Apparently when we do:
static inline int mthca_poll_one(
...
cqe = next_cqe_sw(cq);
if (!cqe)
return -EAGAIN;
we find that the next CQE is owned by h/w (if we hadn't returned
early, the debug code would've logged the poll.)
--
Arthur
More information about the general
mailing list