[openib-general] OFED 1.1 IPoIB did not recover after a mthca catas recovery.

Boris Shpolyansky boris at mellanox.com
Thu Nov 9 17:27:41 PST 2006


Ira,

I think our general recommendation is to reboot the machine once the HCA
has reported a catastrophic error, since the device is then in a fatal
state and will not respond to any command from the host.
However, the gen-2 driver, i.e. ib_mthca, resets the HCA when it starts,
so restarting the driver may serve you just fine (unless you have a
persistent HW failure).

From what you reported, IPoIB doesn't seem to survive this, so it looks
like you still have to reboot your machine.

Regards,
Boris Shpolyansky
Application Engineer
Mellanox Technologies Inc.
2900 Stender Way
Santa Clara, CA 95054
Tel.: (408) 916 0014
Fax: (408) 970 3403
Cell: (408) 834 9365
www.mellanox.com


-----Original Message-----
From: openib-general-bounces at openib.org
[mailto:openib-general-bounces at openib.org] On Behalf Of Ira Weiny
Sent: Thursday, November 09, 2006 4:45 PM
To: openib-general at openib.org
Cc: Roland Dreier; Trent D'Hooge
Subject: [openib-general] OFED 1.1 IPoIB did not recover after a mthca
catas recovery.

We just had an "internal parity error" on a Mellanox HCA.  The HCA
recovered.  However, IPoIB did not fare as well.  We are not sure of the
details.  What I have on the console is:

2006-11-09 15:20:05 ib_mthca 0000:07:00.0: Catastrophic error detected: internal parity error
2006-11-09 15:20:05 ib_mthca 0000:07:00.0:   buf[00]: 05000014
2006-11-09 15:20:05 ib_mthca 0000:07:00.0:   buf[01]: 00000000
2006-11-09 15:20:05 ib_mthca 0000:07:00.0:   buf[02]: 00196240
2006-11-09 15:20:05 ib_mthca 0000:07:00.0:   buf[03]: 00126618
2006-11-09 15:20:05 ib_mthca 0000:07:00.0:   buf[04]: 00206128
2006-11-09 15:20:05 ib_mthca 0000:07:00.0:   buf[05]: 001d6ff8
2006-11-09 15:20:05 ib_mthca 0000:07:00.0:   buf[06]: ffffffff
2006-11-09 15:20:05 ib_mthca 0000:07:00.0:   buf[07]: 00000000
2006-11-09 15:20:05 ib_mthca 0000:07:00.0:   buf[08]: 00000000
2006-11-09 15:20:05 ib_mthca 0000:07:00.0:   buf[09]: 00000000
2006-11-09 15:20:05 ib_mthca 0000:07:00.0:   buf[0a]: 00000000
2006-11-09 15:20:05 ib_mthca 0000:07:00.0:   buf[0b]: 00000000
2006-11-09 15:20:05 ib_mthca 0000:07:00.0:   buf[0c]: 00000000
2006-11-09 15:20:05 ib_mthca 0000:07:00.0:   buf[0d]: 00000000
2006-11-09 15:20:05 ib_mthca 0000:07:00.0:   buf[0e]: 00000000
2006-11-09 15:20:05 ib_mthca 0000:07:00.0:   buf[0f]: 00000000
2006-11-09 15:20:05 divert: no divert_blk to free, ib0 not ethernet
2006-11-09 15:20:05 divert: no divert_blk to free, ib1 not ethernet
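
For anyone who wants to catch this condition from a saved console log, here
is a minimal sketch, assuming the line layout shown in the output above
(timestamp, "ib_mthca", PCI address, message).  The function name
parse_catas and the shape of the returned dictionaries are my own invention
for illustration, not anything the driver or OFED provides, and the layout
may differ on other kernels or syslog configurations.

```python
import re

# Assumed layout, taken from the console output above:
#   "<date> <time> ib_mthca <pci-address>: <message>"
LINE_RE = re.compile(r"^(\S+ \S+) ib_mthca (\S+): (.*)$")
BUF_RE = re.compile(r"^\s*buf\[([0-9a-f]{2})\]: ([0-9a-f]{8})$")
HEADER = "Catastrophic error detected:"


def parse_catas(log_text):
    """Return one dict per catastrophic-error report found in log_text.

    Hypothetical helper: each dict holds the timestamp, the PCI address,
    the error type string, and the buf[] words keyed by index.
    """
    events = []
    pending = False  # True right after a header whose type was wrapped away
    for raw in log_text.splitlines():
        line = raw.strip()
        m = LINE_RE.match(line)
        if not m:
            # Mail wrapping can push the error type onto its own line.
            if pending and line:
                events[-1]["type"] = line
            pending = False
            continue
        pending = False
        stamp, pci, msg = m.groups()
        if msg.startswith(HEADER):
            kind = msg[len(HEADER):].strip()
            events.append({"time": stamp, "device": pci,
                           "type": kind, "buf": {}})
            pending = not kind
            continue
        b = BUF_RE.match(msg)
        if b and events and events[-1]["device"] == pci:
            # Index and value are both printed in hex by the driver.
            events[-1]["buf"][int(b.group(1), 16)] = int(b.group(2), 16)
    return events
```

Feeding it the text of a saved console log returns a list that is empty when
no catastrophic error was reported, so it can be used as a simple check in a
monitoring script.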


ifconfig showed ib0 as "gone" (as in not listed).  We tried to ifup ib0
and got:

# zeus64 /root > ifup ib0
ib_ipoib
ib_ipoib device ib0 does not seem to be present, delaying initialization.


I then tried to unload the ib_ipoib module and that has hung for the
last 15 min.

I have run ibv_rc_pingpong and ib_rdma_bw through the node fine.  ibstat
and ibstatus and the switch show the link to be up.  So it appears as
though the card recovered fine.
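
The state we are in (HCA visible, IPoIB interface missing) could be detected
with something like the sketch below.  It assumes the usual Linux sysfs
layout, where HCAs appear under /sys/class/infiniband and IPoIB netdevs
report link type 32 (ARPHRD_INFINIBAND) in /sys/class/net/<name>/type.  The
function name ib_state and the sysfs_root parameter are hypothetical and
only there to make the example testable.

```python
import os

ARPHRD_INFINIBAND = 32  # link type that IPoIB netdevs report in sysfs


def ib_state(sysfs_root="/sys"):
    """Return (hca_names, ipoib_interface_names) as seen in sysfs."""

    def entries(path):
        try:
            return sorted(os.listdir(path))
        except FileNotFoundError:
            return []

    hcas = entries(os.path.join(sysfs_root, "class", "infiniband"))
    net_dir = os.path.join(sysfs_root, "class", "net")
    ipoib = []
    for name in entries(net_dir):
        try:
            with open(os.path.join(net_dir, name, "type")) as f:
                if int(f.read().strip()) == ARPHRD_INFINIBAND:
                    ipoib.append(name)
        except (OSError, ValueError):
            # Not a netdev directory, or no readable type file: skip it.
            continue
    return hcas, ipoib
```

On this node I would expect the first list to contain the mthca device and
the second list to be empty, which matches what ifconfig and ibstat show.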

What can we do?

:-/

Thanks,
Ira

_______________________________________________
openib-general mailing list
openib-general at openib.org
http://openib.org/mailman/listinfo/openib-general
