[ofa-general] IPoIB CQ overrun

Mon Dec 3 18:01:04 PST 2007

On Mon, Nov 19, 2007 at 08:29:36PM -0800, Roland Dreier wrote:
> 
> OFED 1.2 uses a separate CQ for send completions in connected mode.
> (I'm assuming you're using the OFED default of connected mode for
> IPoIB).  I guess it would be useful to know which CQ is overrunning,
> ie whether it is the main IPoIB CQ or one of the CM send CQs.  One way
> to check this would be to add a print to mthca to dump the CQN when a
> CQ is created, and also add prints to IPoIB just before each call to
> ib_create_cq() so that the CQNs can be correlated.
> 
> Another thing you could try would be a 2.6.24-rc kernel (or an OFED
> 1.3 prerelease I guess), which has a change that moves all completions
> into one CQ in IPoIB.  This may fix the bug by accident.
> 

Yes, we're using CM.

I dumped the CQNs as the CQs were created. Generally, the first 
non-reserved CQs are created by ipoib_transport_dev_init() when IPoIB 
is brought up on each port: CQN 0x80 is used by port 0 and 0x81 by 
port 1. 

The other CQs used by IPoIB are the ones made by ipoib_cm_tx_init(). 

We see overruns on both types of CQ. 

Here's an overrun on the main IPoIB CQ (CQN 0x80):

Dec  2 10:18:08 r6i1n8 kernel: ib0: Send unicast ARP to 0165
Dec  2 10:18:13 r6i1n8 kernel: ib1: Send unicast ARP to 016d
Dec  2 10:18:28 r6i1n8 kernel: ib0: Send unicast ARP to 0165
Dec  2 10:18:39 r6i1n8 kernel: ib0: Send unicast ARP to 010a
Dec  2 10:18:48 r6i1n8 kernel: ib0: Send unicast ARP to 0165
Dec  2 10:19:08 r6i1n8 kernel: ib0: Send unicast ARP to 0165
Dec  2 10:19:13 r6i1n8 kernel: ib1: Send unicast ARP to 016d
Dec  2 10:19:23 r6i1n8 kernel: ib0: Send unicast ARP to 016a
Dec  2 10:19:23 r6i1n8 kernel: ib_mthca 0000:06:00.0: CQ overrun on CQN 000080
Dec  2 10:19:23 r6i1n8 kernel: ib_mad: Fatal error (1) on MAD QP (1)
Dec  2 10:19:23 r6i1n8 kernel: cq_context = 0xffff8101b0ec1000
Dec  2 10:19:23 r6i1n8 kernel: flags = 0x90000900
Dec  2 10:19:23 r6i1n8 kernel: start_hi = 0x0
Dec  2 10:19:23 r6i1n8 kernel: start_lo = 0x0
Dec  2 10:19:23 r6i1n8 kernel: logsize_usrpage = 0xb000002
Dec  2 10:19:23 r6i1n8 kernel: comp_eqn = 0x1
Dec  2 10:19:23 r6i1n8 kernel: pd = 0x4
Dec  2 10:19:23 r6i1n8 kernel: lkey = 0x1300
Dec  2 10:19:23 r6i1n8 kernel: last_notified_index = 0x6972
Dec  2 10:19:23 r6i1n8 kernel: solicit_producer_index = 0x6173
Dec  2 10:19:23 r6i1n8 kernel: consumer_index = 0x0
Dec  2 10:19:23 r6i1n8 kernel: producer_index = 0x6973
Dec  2 10:19:23 r6i1n8 kernel: cqn = 0x80
Dec  2 10:19:23 r6i1n8 kernel: ci_db = 0x7fff
Dec  2 10:19:23 r6i1n8 kernel: state_db = 0x0
Dec  2 10:19:28 r6i1n8 kernel: ib0: Send unicast ARP to 0165
Dec  2 10:19:48 r6i1n8 kernel: ib0: Send unicast ARP to 0165
Dec  2 10:19:57 r6i1n8 kernel: ib_mad: Fatal error (1) on MAD QP (1)

(The CQ context table was dumped for debugging.)
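
For anyone reading the dump above, here is a minimal Python sketch that 
pulls the "name = value" pairs out of the log lines and decodes one of 
them. It assumes the debug print shows host-endian values and that, as 
in mthca's CQ setup code, the top 8 bits of logsize_usrpage hold log2 
of the CQ size and the low 24 bits hold the UAR index. The function 
names are made up for illustration; this is not part of any driver or 
tool.

```python
import re

# Matches dump lines such as "... kernel: producer_index = 0x6973".
FIELD_RE = re.compile(r"kernel:\s+(\w+)\s+=\s+(0x[0-9a-fA-F]+)\s*$")

def parse_cq_context(lines):
    """Collect 'name = 0x...' fields from kernel log lines into a dict."""
    fields = {}
    for line in lines:
        m = FIELD_RE.search(line)
        if m:
            fields[m.group(1)] = int(m.group(2), 16)
    return fields

def decode_logsize_usrpage(value):
    """Assumed layout: bits 31..24 = log2(CQ entries), bits 23..0 = UAR index."""
    log_size = (value >> 24) & 0xFF
    uar_index = value & 0xFFFFFF
    return 1 << log_size, uar_index

# A few lines copied from the dump in this mail.
dump = [
    "Dec  2 10:19:23 r6i1n8 kernel: logsize_usrpage = 0xb000002",
    "Dec  2 10:19:23 r6i1n8 kernel: consumer_index = 0x0",
    "Dec  2 10:19:23 r6i1n8 kernel: producer_index = 0x6973",
    "Dec  2 10:19:23 r6i1n8 kernel: cqn = 0x80",
]

ctx = parse_cq_context(dump)
entries, uar = decode_logsize_usrpage(ctx["logsize_usrpage"])
print(f"CQN {ctx['cqn']:#x}: {entries} entries, UAR index {uar}")
```

Under those assumptions the overrun CQ would have 2048 entries.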

And there was an example of a CM send CQ overrun in the mail I just 
sent to Eli (and ofa-general).

> Another thing you could try would be a 2.6.24-rc kernel (or an OFED
> 1.3 prerelease I guess), which has a change that moves all completions
> into one CQ in IPoIB.  This may fix the bug by accident.

The system was upgraded to OFED 1.3-alpha2, and now the CQ overrun is 
much harder to trigger. (There are still some overruns in the log 
files, but I haven't figured out how to reproduce them; with OFED 1.2 
on this system they were much easier to hit.)
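
To see which CQs the remaining overruns land on, something like the 
following Python sketch could tally them per CQN from the log files. 
It relies only on the "CQ overrun on CQN" message format shown earlier 
in this mail; the function name is hypothetical.

```python
import re
from collections import Counter

# Matches e.g. "ib_mthca 0000:06:00.0: CQ overrun on CQN 000080".
OVERRUN_RE = re.compile(r"ib_mthca \S+: CQ overrun on CQN ([0-9a-fA-F]+)")

def count_overruns(lines):
    """Count CQ overrun messages per CQN (the CQN is printed in hex)."""
    counts = Counter()
    for line in lines:
        m = OVERRUN_RE.search(line)
        if m:
            counts[int(m.group(1), 16)] += 1
    return counts

# Lines copied from the log excerpt in this mail.
sample = [
    "Dec  2 10:19:23 r6i1n8 kernel: ib0: Send unicast ARP to 016a",
    "Dec  2 10:19:23 r6i1n8 kernel: ib_mthca 0000:06:00.0: CQ overrun on CQN 000080",
    "Dec  2 10:19:23 r6i1n8 kernel: ib_mad: Fatal error (1) on MAD QP (1)",
]

for cqn, n in sorted(count_overruns(sample).items()):
    print(f"CQN {cqn:#x}: {n} overrun(s)")
```

Any iterable of lines works as input, so an open log file can be 
passed in directly. Matching the counts against the CQNs printed at 
creation time would show whether the main IPoIB CQ or a CM send CQ is 
affected.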

-- 
Arthur