[ofa-general] IPoIB CQ overrun

akepner at sgi.com
Mon Dec 3 17:40:21 PST 2007


On Sun, Nov 18, 2007 at 10:27:23AM +0200, Eli Cohen wrote:

> Can you tell how IPOIB is configured - connected mode or datagram mode?
> Also can you send more context from /var/log/messages? Especially can
> you rerun with debug enabled and send the output?
> Enabling debug can be done by:
> echo 1 > /sys/module/ib_ipoib/parameters/debug_level

Yes, it's connected mode.

Here is another log of an overrun with "debug_level=1". I added code to dump
the CQ context table (it just does a QUERY_CQ and logs the result).


15:50:39 r3i1n2 kernel: ib0: Send unicast ARP to 0165
15:50:42 r3i1n2 kernel: ib0: neigh_destructor for 000404 fe80:0000:0000:0000:0008:f104:0398:8595
15:50:42 r3i1n2 kernel: ib0: Reap connection for gid fe80:0000:0000:0000:0008:f104:0398:8595
15:50:42 r3i1n2 kernel: ib0: Destroy active connection 0xf048d head 0x2 tail 0x2
15:50:53 r3i1n2 in.rshd[7056]: connect from 10.148.0.9 (10.148.0.9)
15:50:53 r3i1n2 kernel: ib_mthca 0000:06:00.0: CQ overrun on CQN 240082
15:50:53 r3i1n2 kernel: cq_context = 0xffff8101eee9c000
15:50:53 r3i1n2 kernel: flags = 0x90000900
15:50:53 r3i1n2 kernel: start_hi = 0x0
15:50:53 r3i1n2 kernel: start_lo = 0x0
15:50:53 r3i1n2 kernel: logsize_usrpage = 0x7000002
15:50:53 r3i1n2 kernel: comp_eqn = 0x1
15:50:53 r3i1n2 kernel: pd = 0x4
15:50:53 r3i1n2 kernel: lkey = 0xd0108900
15:50:53 r3i1n2 kernel: last_notified_index = 0x217
15:50:53 r3i1n2 kernel: solicit_producer_index = 0x9c18
15:50:53 r3i1n2 kernel: consumer_index = 0x0
15:50:53 r3i1n2 kernel: producer_index = 0x218
15:50:53 r3i1n2 kernel: cqn = 0x240082
15:50:53 r3i1n2 kernel: ci_db = 0x7ffd
15:50:53 r3i1n2 kernel: state_db = 0x1
15:50:58 r3i1n2 kernel: ib1: Send unicast ARP to 016d
15:50:58 r3i1n2 kernel: ib0: Send unicast ARP to 0165
15:51:11 r3i1n2 in.rshd[7057]: connect from 10.148.0.9 (10.148.0.9)
15:51:27 r3i1n2 kernel: ib0: REQ arrived
15:51:31 r3i1n2 kernel: ib0: Send unicast ARP to 0165
15:51:32 r3i1n2 kernel: ib1: Send unicast ARP to 016d
15:51:32 r3i1n2 kernel: ib0: Send unicast ARP to 00ac
15:51:42 r3i1n2 in.rshd[7058]: connect from 10.148.0.9 (10.148.0.9)
15:52:12 r3i1n2 in.rlogind[7059]: connect from 10.148.0.9 (10.148.0.9)
15:52:17 r3i1n2 kernel: ib1: Send unicast ARP to 016d
15:52:22 r3i1n2 kernel: ib0: Send unicast ARP to 0165
15:52:54 r3i1n2 in.rlogind[7060]: connect from 192.168.159.1 (192.168.159.1)
15:52:54 r3i1n2 rlogind[7060]: pam_rhosts_auth(rlogin:auth): allowed to root at r3lead as root
15:52:59 r3i1n2 kernel: ib1: Send unicast ARP to 016d
15:53:11 r3i1n2 kernel: ib0: Send unicast ARP to 0165
15:53:32 r3i1n2 kernel: ib0: Send unicast ARP to 00ac
15:54:14 r3i1n2 kernel: ib0: Send unicast ARP to 0165
15:54:19 r3i1n2 kernel: ib1: Send unicast ARP to 016d
15:54:26 r3i1n2 kernel: ib_mthca 0000:06:00.0: mthca_create_cq: cq = 0xffff81015a3ee7c0 cqn = 0x350090
15:54:26 r3i1n2 kernel: ib0: ipoib_cm_tx_init: ib_create_cq returns 0xffff81022523b1c0
15:54:26 r3i1n2 kernel: ib0: Request connection 0x13048f for gid fe80:0000:0000:0000:0008:f104:0398:8595 qpn 0x404
15:54:26 r3i1n2 kernel: ib0: REP received.
15:54:43 r3i1n2 in.rshd[7061]: connect from 192.168.159.1 (192.168.159.1)
15:54:43 r3i1n2 rshd[7061]: pam_rhosts_auth(rsh:auth): allowed to root at r3lead as root
15:54:48 r3i1n2 kernel: ib0: Send unicast ARP to 0165
15:55:03 r3i1n2 login[4750]: resmgr: unable to connect to resmgrd: No such file or directory
15:55:03 r3i1n2 login[4750]: resmgr login failed
15:55:23 r3i1n2 kernel: ib0: Send unicast ARP to 0165
15:55:28 r3i1n2 kernel: ib1: Send unicast ARP to 016d
15:55:30 r3i1n2 kernel: ib0: TX ring 0xf00405 full, stopping kernel net queue
15:55:32 r3i1n2 kernel: NETDEV WATCHDOG: ib0: transmit timed out
15:55:32 r3i1n2 kernel: ib0: transmit timeout: latency 1688 msecs
15:55:32 r3i1n2 kernel: ib0: queue stopped 1, tx_head 13657, tx_tail 13657
15:55:33 r3i1n2 kernel: NETDEV WATCHDOG: ib0: transmit timed out
 

Looking at the contents of the CQ context table (right after the 
overrun at 15:50:53), do the producer and consumer indices look 
reasonable? I expected to find that producer_index + 1 == consumer_index.
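
For reference, the check described above can be sketched as follows. This is
only an illustration under stated assumptions, not the driver's or the
firmware's actual logic: it assumes the top byte of logsize_usrpage holds
log2 of the number of CQ entries, that the indices wrap modulo that size, and
that "full" means one slot is left unused. The helper names are hypothetical.

```c
#include <stdint.h>

/* Assumption: bits 31..24 of logsize_usrpage hold log2(number of CQEs). */
static unsigned cq_log_size(uint32_t logsize_usrpage)
{
    return logsize_usrpage >> 24;
}

/* Entries between consumer and producer, modulo the ring size. */
static uint32_t cq_outstanding(uint32_t producer, uint32_t consumer,
                               unsigned log_size)
{
    uint32_t mask = (1u << log_size) - 1;

    return (producer - consumer) & mask;
}

/*
 * Assumption: the ring counts as full when advancing the producer by one
 * would land on the consumer, i.e. producer + 1 == consumer (mod size).
 */
static int cq_looks_full(uint32_t producer, uint32_t consumer,
                         unsigned log_size)
{
    uint32_t mask = (1u << log_size) - 1;

    return ((producer + 1) & mask) == (consumer & mask);
}
```

With the values logged at 15:50:53 (logsize_usrpage = 0x7000002,
producer_index = 0x218, consumer_index = 0x0), these helpers give a log size
of 7 (128 entries), 24 outstanding entries, and "not full". Whether the
consumer_index field in the context is the value the hardware actually
compares against is a separate question that this sketch does not answer.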

-- 
Arthur


