[openib-general] Hardware error reporting

Daniel Ahlin dah at pdc.kth.se
Mon Nov 27 03:23:21 PST 2006


Hi

I'm currently evaluating a move from the IBGold stack to openib and
have a question about how hardware errors are reported/handled in the
openib kernel modules.

I've noticed that the openib modules refuse to use a card that already
has memory errors when the driver loads, e.g.:

ib_mthca: Mellanox InfiniBand HCA driver v0.08 (February 14, 2006)
ib_mthca: Initializing 0000:08:00.0
ACPI: PCI Interrupt 0000:08:00.0[A] -> GSI 16 (level, low) -> IRQ 169
PCI: Setting latency timer of device 0000:08:00.0 to 64
ib_mthca 0000:08:00.0: SYS_EN DDR error: syn=4, sock=0, sladdr=0, SPD source=DIMM
ib_mthca 0000:08:00.0: SYS_EN returned status 0x07, aborting.

which is good enough for me. Now to some questions:

1. Can I expect the same kind of verbosity when a memory error occurs
   after the modules have been loaded?

2. Ditto for non-critical errors? (*)

3. I understand that answers to (1) and (2) may depend on hardware
   used, but are there any plans to have reasonably unified error
   reporting?

Grateful for any answers, regards
Daniel Ahlin
PDC

(*) I have a card for which the IBGold stack regularly reports:

    THH(1): handle_ecc_event: Got ECC_DETECT event

which I guess is a correctable ECC error. When using this card with
the openib stack I get no warnings. This may be an example of (2) not
holding, whether by design or because of a bug.
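
For anyone who wants to watch for these messages in the meantime, here is
a rough sketch of a log filter. It is not part of either stack. The two
patterns are derived only from the sample lines quoted in this mail, so
other drivers or driver versions may well word their errors differently,
and the names scan_log, MTHCA_ERR and IBGOLD_ECC are made up for
illustration. I also don't know what the number in "THH(1)" denotes, so
the sketch just passes it through.

```python
import re
import sys

# Assumption: patterns are modelled on the two message shapes quoted in
# this mail and nothing else.
MTHCA_ERR = re.compile(
    r"ib_mthca (?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]): "
    r"(?P<msg>SYS_EN .*)"
)
# The meaning of the number in parentheses is not known here; it is
# captured as an opaque identifier.
IBGOLD_ECC = re.compile(r"THH\((?P<num>\d+)\): handle_ecc_event: (?P<msg>.*)")


def scan_log(lines):
    """Return (source, identifier, message) tuples for matching log lines."""
    hits = []
    for line in lines:
        m = MTHCA_ERR.search(line)
        if m:
            hits.append(("ib_mthca", m.group("dev"), m.group("msg")))
            continue
        m = IBGOLD_ECC.search(line)
        if m:
            hits.append(("THH", m.group("num"), m.group("msg")))
    return hits


if __name__ == "__main__":
    # Reads log text from standard input and prints one line per hit.
    for source, ident, msg in scan_log(sys.stdin):
        print(f"{source} {ident}: {msg}")
```

Saved under a name of your choosing (say scan.py), it could be fed the
kernel log with "dmesg | python scan.py". It only finds errors the driver
chooses to print, so it does not answer questions (1) to (3).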



