[openib-general] Re: Causes of interrupt problems?
Michael S. Tsirkin
mst at mellanox.co.il
Sat Mar 19 14:42:12 PST 2005
Quoting r. Roland Dreier <roland at topspin.com>:
> Subject: Re: Causes of interrupt problems?
>
> > What would cause the following?
>
> > ib_mthca: Mellanox InfiniBand HCA driver v0.06-pre (November 8, 2004)
> > ib_mthca: Initializing Mellanox Technology MT23108 InfiniHost (0000:04:00.0)
> > ib_mthca 0000:04:00.0: NOP command failed to generate interrupt, aborting.
> > ib_mthca 0000:04:00.0: BIOS or ACPI interrupt routing problem?
>
> > I've seen this on two Opteron systems, one Tyan board, one Rioworks
> > HDAMA. Is there some BIOS setting I should look for? Things are working
> > fine on another Rioworks HDAMA board.
>
> It seems that the fact that the HCA appears as a PCI device with a
> huge BAR behind a PCI bridge confuses some BIOS/ACPI implementations.
>
> Looking at that error message I realize it might be nice to be able to
> see what IRQ the driver is trying. If you change the line in
> mthca_main.c that prints the error to something like
>
>     mthca_err(dev, "NOP command failed to generate interrupt (IRQ %d), aborting.\n",
>               dev->mthca_flags & MTHCA_FLAG_MSI_X ?
>               dev->eq_table.eq[MTHCA_EQ_CMD].msi_x_vector :
>               dev->pdev->irq);
>
> then you can see what IRQ the HCA driver is trying. Then you can put
> another device like an ethernet in the same PCI slot and (assuming
> that the device works) compare the IRQ it is using with the one that
> mthca saw. If they're different then most likely you have a BIOS/ACPI
> problem. Unfortunately I'm not much good at fixing that sort of
> thing. The only thing I know to try is looking for a newer BIOS version.
>
> Other things to check: do the two HDAMA boards have the same BIOS
> revision? Is the HCA in the same slot in both boards?
>
> - R.
Another sort of problem one sometimes sees is hardware-related spurious
interrupt assertion; as a result, the kernel eventually disables the IRQ.
Once you have the IRQ number, please check /var/log/messages to see
whether the kernel disabled this interrupt. The telltale messages look
like "irq N: nobody cared", usually followed by "Disabling IRQ #N".
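
If you would rather not eyeball the log, here is a minimal sketch of a
scanner for those messages. The function names are made up for this
example, and the two message patterns are an assumption about what your
kernel prints, so adjust them if your log reads differently.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if `line` looks like a kernel report that IRQ `irq` went
 * unhandled or was disabled, 0 otherwise. The message texts matched
 * here are assumptions; check them against your own kernel's output. */
int line_mentions_disabled_irq(const char *line, int irq)
{
    char needle[64];
    const char *hit;

    assert(line != NULL);

    /* Pattern 1: "irq N: nobody cared". The colon right after the
     * number keeps IRQ 1 from matching a line about IRQ 11. */
    snprintf(needle, sizeof(needle), "irq %d: nobody cared", irq);
    if (strstr(line, needle))
        return 1;

    /* Pattern 2: "Disabling IRQ #N". Nothing delimits the number here,
     * so reject a match that is followed by another digit. */
    snprintf(needle, sizeof(needle), "Disabling IRQ #%d", irq);
    for (hit = strstr(line, needle); hit; hit = strstr(hit + 1, needle)) {
        char next = hit[strlen(needle)];
        if (next < '0' || next > '9')
            return 1;
    }
    return 0;
}

/* Print every matching line from `log` and return how many were found.
 * Lines longer than the buffer are examined in pieces. */
int count_disabled_irq_reports(FILE *log, int irq)
{
    char line[1024];
    int hits = 0;

    while (fgets(line, sizeof(line), log)) {
        if (line_mentions_disabled_irq(line, irq)) {
            fputs(line, stdout);
            hits++;
        }
    }
    return hits;
}
```

To use it, fopen() /var/log/messages (you will probably need root to read
it) and pass the stream along with the IRQ number from the modified
mthca_err() message to count_disabled_irq_reports(). A nonzero return
means the kernel most likely gave up on that interrupt line.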
--
MST - Michael S. Tsirkin