[openfabrics-ewg] NOP problem in ib_mthca on OFED RC4
Don.Albert at Bull.com
Mon May 8 17:25:41 PDT 2006
Back in March I had a problem with initializing the ib_mthca driver on an
EM64T system. The module loading would give an error of "ib_mthca
0000:03:00.0: NOP command failed to generate interrupt (IRQ 169),
aborting." This appeared to be corrected when I updated the firmware on
the Mellanox MT25208 HCA card.
The problem has reappeared with the OFED release, on the same system but
with different software and a different HCA card.
I have a small testbed with two EM64T machines connected back-to-back with
two Mellanox MT25204 single port DDR cards. I was successfully running
the backported 2.6.9-34 kernel on RHEL4 Update 3, with a recent version of
the OpenIB tree. Both systems would come up and the cards successfully
initialized.
Over the weekend I moved to the 2.6.16 stock kernel, and then built and
installed the OFED-1.0-rc4 release. One of the systems appears to come
up OK, but the port stays in the "down" state. I assumed this was
because the other end of the link (the other machine) was not up.
The second machine boots, but I see the following in dmesg:
ib_mthca: Mellanox InfiniBand HCA driver v0.08 (February 14, 2006)
ib_mthca: Initializing 0000:03:00.0
ACPI: PCI Interrupt 0000:03:00.0[A] -> GSI 16 (level, low) -> IRQ 169
PCI: Setting latency timer of device 0000:03:00.0 to 64
ib_mthca 0000:03:00.0: NOP command failed to generate interrupt (IRQ 169), aborting.
ib_mthca 0000:03:00.0: BIOS or ACPI interrupt routing problem?
When I had the problem previously, Roland Dreier suggested loading the
ib_mthca module with "fw_cmd_doorbell=0", which avoided the error then
and also avoids it now. But the question is why?
Updating the firmware on the old board seemed to have solved the problem
before, but now it has occurred again on a fairly new card with recent
firmware. Has anyone else seen this problem?
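In case it helps anyone check their own logs for the same symptom, here is a minimal Python sketch that scans dmesg text for the failure message quoted above. The function name find_nop_failures and the SAMPLE text are only illustrative (they are not part of any driver or OFED tool), and the pattern is based solely on the message as it appears in this post, so it may need adjusting for other driver versions.

```python
import re

# Pattern for the ib_mthca failure line as quoted in this post.
# Assumption: the driver prints the PCI address followed by the IRQ number.
NOP_FAILURE = re.compile(
    r"ib_mthca (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"NOP command failed to generate interrupt \(IRQ\s+(?P<irq>\d+)\)"
)


def find_nop_failures(dmesg_text):
    """Return (pci_address, irq) pairs for each NOP failure in the log text."""
    # Collapse all whitespace so a message wrapped across lines still matches.
    flat = " ".join(dmesg_text.split())
    return [(m.group("pci"), int(m.group("irq")))
            for m in NOP_FAILURE.finditer(flat)]


# Sample input copied from the dmesg output in this post (wrapped as mailed).
SAMPLE = """\
ib_mthca: Initializing 0000:03:00.0
ACPI: PCI Interrupt 0000:03:00.0[A] -> GSI 16 (level, low) -> IRQ 169
ib_mthca 0000:03:00.0: NOP command failed to generate interrupt (IRQ
169), aborting.
"""

if __name__ == "__main__":
    for pci, irq in find_nop_failures(SAMPLE):
        print(f"{pci}: NOP interrupt test failed on IRQ {irq}; "
              "consider loading ib_mthca with fw_cmd_doorbell=0")
```

On a live system the same function could be fed the output of dmesg instead of SAMPLE. It only detects the symptom; it does not explain why fw_cmd_doorbell=0 avoids it.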
One thing that may have a bearing on this is that the "/sbin/lspci"
command has also started issuing an error message relating to the PCI slot
that the HCA is in. Here is the message:
pcilib: Resource 2 in /sys/bus/pci/devices/0000:03:00.0/resource has a 64-bit address, ignoring
....
03:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev 20)
Do I need a new version of pcilib? I currently have
pciutils-2.1.99.test8-3.1.
-Don Albert-