[ofa-general] IPoIB-UD post_send failures (OFED 1.3)

akepner at sgi.com
Thu May 8 10:19:36 PDT 2008


In an earlier email I mentioned that, with certain workloads, we 
are seeing an endless loop of timeouts on the IPoIB-UD send queue.
Messages like "NETDEV WATCHDOG: ib0: transmit timed out" appear 
once a second until the driver is unloaded. That was with OFED 1.2. 

Using OFED 1.3, we see what I believe is the same problem, but it
looks a little different. We don't get "NETDEV WATCHDOG", but 
we get an endless string of "post_send failed".

(I suspect, but haven't verified, that the difference is due to 
the sharing of ipoib_dev_priv's tx_outstanding member between 
the UD and CM IPoIB QPs. The value of tx_outstanding is used 
to determine when to call netif_stop_queue().)

The h/w is MT25204, with f/w version 1.2.0, on an x86_64.

I instrumented the mthca driver to maintain a circular buffer 
recording the state of the IPoIB-UD send queue on each call to the 
"post_send" (mthca_arbel_post_send) and "poll_cq" (mthca_poll_one) 
routines, and also to dump the QP and CQ context when a full 
queue is detected.
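
For illustration, here is a minimal user-space sketch of that kind of 
circular trace buffer. It is not the actual instrumentation patch: the 
names (sq_trace, sq_trace_record, sq_trace_recent) and the ring size 
are hypothetical, and the fields simply mirror the columns of the logs 
in this message. A version living inside the driver would also need 
locking appropriate to the post_send and poll_cq paths.

```c
#include <stddef.h>
#include <stdint.h>

#define TRACE_ENTRIES 64u   /* hypothetical ring size */

/* One snapshot of send queue state; fields mirror the log columns. */
struct sq_trace_entry {
    uint64_t jiffies;
    uint32_t qpn;
    uint32_t last_comp;
    uint32_t head;
    uint32_t tail;
};

struct sq_trace {
    struct sq_trace_entry ring[TRACE_ENTRIES];
    uint64_t count;         /* total snapshots ever recorded */
};

/* Record a snapshot, overwriting the oldest one once the ring is full. */
static void sq_trace_record(struct sq_trace *t, uint64_t jiffies,
                            uint32_t qpn, uint32_t last_comp,
                            uint32_t head, uint32_t tail)
{
    struct sq_trace_entry *e = &t->ring[t->count % TRACE_ENTRIES];

    e->jiffies = jiffies;
    e->qpn = qpn;
    e->last_comp = last_comp;
    e->head = head;
    e->tail = tail;
    t->count++;
}

/* Return the i-th most recent snapshot (0 = newest), or NULL if it
 * was never recorded or has already been overwritten. */
static const struct sq_trace_entry *
sq_trace_recent(const struct sq_trace *t, uint64_t i)
{
    uint64_t kept = t->count < TRACE_ENTRIES ? t->count : TRACE_ENTRIES;

    if (i >= kept)
        return NULL;
    return &t->ring[(t->count - 1 - i) % TRACE_ENTRIES];
}
```

Dumping the ring from newest to oldest when the full queue is first 
detected gives the "last entries" view used in the logs in this message.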

At some point, we just stop getting completions on the send queue. 
Here are the last entries from the "poll_cq" log:

# jiffies   qpn   last head  tail
#                 comp 
.....
0x100032cdc 0x404 0x49 0x24b 0x24a
0x100032cdc 0x404 0x4a 0x24b 0x24b
0x100033eed 0x404 0x4c 0x24e 0x24d
0x100033eed 0x404 0x4d 0x24e 0x24e
0x10003b594 0x404 0x4f 0x251 0x250
0x10003b594 0x404 0x50 0x251 0x251
0x10003c999 0x404 0x52 0x254 0x253
0x10003ca16 0x404 0x53 0x255 0x254
0x10003ca93 0x404 0x54 0x256 0x255
0x10003ca93 0x404 0x55 0x256 0x256

We keep calling the send routine (apparently via the periodic 
ipoib_ib_tx_timer_func()) and keep getting a "queue full" condition - 
the send queue length is 128. Here are some entries after the queue 
has filled (they keep going "forever"):


# jiffies   qpn   last head  tail
#                 comp 
.....
0x1000760dd 0x404 0x55 0x2d6 0x256
0x1000761c6 0x404 0x55 0x2d6 0x256
0x1000761d7 0x404 0x55 0x2d6 0x256
0x1000762c0 0x404 0x55 0x2d6 0x256
0x1000762d1 0x404 0x55 0x2d6 0x256
0x1000763ba 0x404 0x55 0x2d6 0x256
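
As a sanity check on the numbers above: head - tail = 0x2d6 - 0x256 = 
0x80 = 128, which is exactly the send queue length, so the "queue full" 
result is consistent with the log. The sketch below spells out that 
arithmetic. It assumes the logged head and tail are free-running 
unsigned counters (which the values above suggest, since they exceed 
the queue length); the helper names are hypothetical and this is not 
the driver's own overflow check.

```c
#include <stdbool.h>
#include <stdint.h>

#define SQ_DEPTH 128u   /* send queue length stated in this message */

/* Work requests posted but not yet completed. With free-running
 * counters, unsigned subtraction stays correct across wraparound. */
static inline uint32_t sq_outstanding(uint32_t head, uint32_t tail)
{
    return head - tail;
}

/* True when no further work request fits in the queue. */
static inline bool sq_full(uint32_t head, uint32_t tail)
{
    return sq_outstanding(head, tail) >= SQ_DEPTH;
}
```

With head = 0x2d6 and tail = 0x256, sq_outstanding() gives 128 and 
sq_full() is true. With the last poll_cq entry (head = tail = 0x256) 
the queue was empty, so all 128 slots were filled after that point 
without a single completion being seen.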


And here's the QP and CQ context immediately after the first 
post_send failure:

QP context (including the two 32-bit words at the beginning, 
"opt_param_mask" and a reserved field):
[00] 0x00000000 0x00000000 0x30031900 0xef3e3f16
[10] 0x8b423b00 0x00000002 0x00000404 0x00000000
[20] 0x00000000 0x00000000 0x01000000 0x60000000
[30] 0x00000000 0x00000000 0x00000000 0x00000000
[40] 0x00000000 0x00000000 0x00000000 0x00000000
[50] 0x00000000 0x00000000 0x00000000 0x00000000
[60] 0x00000000 0x00000000 0x00000000 0x00000006
[70] 0x00000000 0x00002600 0xaf004000 0x00800088
[80] 0x00000256 0x00000082 0x00004000 0x00000005
[90] 0x00ffffff 0x00000257 0x00000008 0x003a277f
[a0] 0x25020200 0x00000081 0x00000000 0x00007ff9
[b0] 0x00000b1b 0x00000000 0x000003f8 0x03f80256
[c0] 0x00000000 0x00000000 0x00000000 0x00000000
[d0] 0x00000000 0x00000000 0x00000000 0x00000000
[e0] 0x00000000 0x00000000 0x00000000 0x00000000
[f0] 0x00000000 0x00000000 0x00000000 0x00000000
CQ context:
[00] 0x00000a00 0x00000000 0x00000000 0x08000002
[10] 0x00000000 0x00000001 0x00000004 0x00002500
[20] 0x000001fd 0x000001fd 0x00000000 0x00000238
[30] 0x00000082 0x00007ffa 0x00000004 0x00000000


I don't see anything obviously wrong here - anyone at Mellanox? 
Any idea why the card would stop generating TX completions?

-- 
Arthur



