[openib-general] OFED 1.1 IPoIB did not recover after a mthca catas recovery.
Ira Weiny
weiny2 at llnl.gov
Thu Nov 9 16:45:12 PST 2006
We just had an "internal parity error" on a Mellanox HCA. The HCA recovered. However, IPoIB did not fare as well. We are not sure of the details. What I have on the console is:
2006-11-09 15:20:05 ib_mthca 0000:07:00.0: Catastrophic error detected: internal parity error
2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[00]: 05000014
2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[01]: 00000000
2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[02]: 00196240
2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[03]: 00126618
2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[04]: 00206128
2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[05]: 001d6ff8
2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[06]: ffffffff
2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[07]: 00000000
2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[08]: 00000000
2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[09]: 00000000
2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[0a]: 00000000
2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[0b]: 00000000
2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[0c]: 00000000
2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[0d]: 00000000
2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[0e]: 00000000
2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[0f]: 00000000
2006-11-09 15:20:05 divert: no divert_blk to free, ib0 not ethernet
2006-11-09 15:20:05 divert: no divert_blk to free, ib1 not ethernet
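For anyone who wants to compare this dump against another one, here is a minimal sketch that pulls the buf[..] words out of console text in the format shown above. The function name parse_catas_buf is made up for illustration, and the sketch makes no attempt to interpret what the individual words mean.

```python
import re

# Matches console lines such as:
#   2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[00]: 05000014
BUF_LINE = re.compile(
    r"ib_mthca\s+(\S+):\s+buf\[([0-9a-fA-F]{2})\]:\s+([0-9a-fA-F]{8})\s*$"
)

def parse_catas_buf(log_text):
    """Return {pci_device: {index: word}} for every buf[..] line found.

    Hypothetical helper: it only collects the raw 32-bit words and does
    not decode them.
    """
    result = {}
    for line in log_text.splitlines():
        match = BUF_LINE.search(line)
        if match is None:
            continue  # not a buf[..] line, e.g. the "divert:" messages
        device = match.group(1)
        index = int(match.group(2), 16)
        word = int(match.group(3), 16)
        result.setdefault(device, {})[index] = word
    return result
```

Calling parse_catas_buf() on the saved console output gives one dictionary per PCI device, which makes it easy to diff two dumps word by word.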
ifconfig showed ib0 as "gone" (as in not listed). We tried to ifup ib0 and got:
# zeus64 /root > ifup ib0
ib_ipoib
ib_ipoib device ib0 does not seem to be present, delaying initialization.
I then tried to unload the ib_ipoib module, and that has been hung for the last 15 minutes.
I have run ibv_rc_pingpong and ib_rdma_bw through the node fine. ibstat and ibstatus and the switch show the link to be up. So it appears as though the card recovered fine.
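The mismatch described above (port up, net device missing) can also be checked from sysfs. This is only a rough sketch: it assumes the usual /sys/class/net/<ifname> and /sys/class/infiniband/<hca>/ports/<n>/state layout, and the function name ipoib_state and its sysfs_root parameter are made up for illustration.

```python
import os

def ipoib_state(ifname="ib0", sysfs_root="/sys"):
    """Return (netdev_present, {(hca, port): state_string}).

    Hypothetical helper. netdev_present is True when the interface is
    registered under class/net. The port states are the raw contents of
    each port's "state" file.
    """
    netdev_path = os.path.join(sysfs_root, "class", "net", ifname)
    netdev_present = os.path.exists(netdev_path)

    ports = {}
    ib_dir = os.path.join(sysfs_root, "class", "infiniband")
    if os.path.isdir(ib_dir):
        for hca in sorted(os.listdir(ib_dir)):
            ports_dir = os.path.join(ib_dir, hca, "ports")
            if not os.path.isdir(ports_dir):
                continue
            for port in sorted(os.listdir(ports_dir)):
                state_file = os.path.join(ports_dir, port, "state")
                try:
                    with open(state_file) as handle:
                        ports[(hca, port)] = handle.read().strip()
                except OSError:
                    # Assumption: treat an unreadable file as unknown state.
                    ports[(hca, port)] = "unreadable"
    return netdev_present, ports
```

A result where the first value is False while the ports report an active state would match what we are seeing here: the HCA is back, but the IPoIB interface never re-registered.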
What can we do?
:-/
Thanks,
Ira