[openib-general] OFED 1.1 IPoIB did not recover after a mthca catas recovery.

Ira Weiny weiny2 at llnl.gov
Thu Nov 9 16:45:12 PST 2006


We just had an "internal parity error" on a Mellanox HCA.  The HCA recovered.  However, IPoIB did not fare as well.  We are not sure of the details.  What I have on the console is:

2006-11-09 15:20:05 ib_mthca 0000:07:00.0: Catastrophic error detected: internal parity error
2006-11-09 15:20:05 ib_mthca 0000:07:00.0:   buf[00]: 05000014
2006-11-09 15:20:05 ib_mthca 0000:07:00.0:   buf[01]: 00000000
2006-11-09 15:20:05 ib_mthca 0000:07:00.0:   buf[02]: 00196240
2006-11-09 15:20:05 ib_mthca 0000:07:00.0:   buf[03]: 00126618
2006-11-09 15:20:05 ib_mthca 0000:07:00.0:   buf[04]: 00206128
2006-11-09 15:20:05 ib_mthca 0000:07:00.0:   buf[05]: 001d6ff8
2006-11-09 15:20:05 ib_mthca 0000:07:00.0:   buf[06]: ffffffff
2006-11-09 15:20:05 ib_mthca 0000:07:00.0:   buf[07]: 00000000
2006-11-09 15:20:05 ib_mthca 0000:07:00.0:   buf[08]: 00000000
2006-11-09 15:20:05 ib_mthca 0000:07:00.0:   buf[09]: 00000000
2006-11-09 15:20:05 ib_mthca 0000:07:00.0:   buf[0a]: 00000000
2006-11-09 15:20:05 ib_mthca 0000:07:00.0:   buf[0b]: 00000000
2006-11-09 15:20:05 ib_mthca 0000:07:00.0:   buf[0c]: 00000000
2006-11-09 15:20:05 ib_mthca 0000:07:00.0:   buf[0d]: 00000000
2006-11-09 15:20:05 ib_mthca 0000:07:00.0:   buf[0e]: 00000000
2006-11-09 15:20:05 ib_mthca 0000:07:00.0:   buf[0f]: 00000000
2006-11-09 15:20:05 divert: no divert_blk to free, ib0 not ethernet
2006-11-09 15:20:05 divert: no divert_blk to free, ib1 not ethernet


ifconfig showed ib0 as "gone" (as in not listed).  We tried to ifup ib0 and got:

# zeus64 /root > ifup ib0
ib_ipoib
ib_ipoib device ib0 does not seem to be present, delaying initialization.


I then tried to unload the ib_ipoib module, and the unload has been hung for the last 15 minutes.

I have run ibv_rc_pingpong and ib_rdma_bw through the node fine.  ibstat and ibstatus and the switch show the link to be up.  So it appears as though the card recovered fine.
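For anyone who wants to check for the same mismatch (IB ports up, but no ibN network interface), here is a minimal sketch in Python. It assumes the usual sysfs layout, /sys/class/net for network interfaces and /sys/class/infiniband/<device>/ports/<port>/state for port state; the function names are hypothetical and only for illustration, and this is not something that was run on the affected node.

```python
import os
import re

def list_ipoib_netdevs(sys_net="/sys/class/net"):
    """Return sorted interface names that look like ib0, ib1, ..."""
    try:
        names = os.listdir(sys_net)
    except OSError:
        # Directory missing or unreadable: report no interfaces.
        return []
    return sorted(n for n in names if re.fullmatch(r"ib\d+", n))

def list_ib_port_states(sys_ib="/sys/class/infiniband"):
    """Map 'device/port' to the text of that port's sysfs 'state' file."""
    states = {}
    if not os.path.isdir(sys_ib):
        return states
    for dev in sorted(os.listdir(sys_ib)):
        ports_dir = os.path.join(sys_ib, dev, "ports")
        if not os.path.isdir(ports_dir):
            continue
        for port in sorted(os.listdir(ports_dir)):
            key = dev + "/" + port
            try:
                with open(os.path.join(ports_dir, port, "state")) as f:
                    states[key] = f.read().strip()
            except OSError:
                states[key] = "unreadable"
    return states

if __name__ == "__main__":
    netdevs = list_ipoib_netdevs()
    ports = list_ib_port_states()
    print("IPoIB interfaces:", ", ".join(netdevs) if netdevs else "none")
    for key, state in ports.items():
        print("port", key, "state:", state)
    if ports and not netdevs:
        print("IB ports are present but no ibN interface is registered")
```

In the situation described above, one would expect the ports to be listed with an active state while the list of ibN interfaces comes back empty.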

What can we do?

:-/

Thanks,
Ira



