[openfabrics-ewg] NOP problem in ib_mthca on OFED RC4

Don.Albert at Bull.com
Mon May 8 17:25:41 PDT 2006


Back in March I had a problem initializing the ib_mthca driver on an
EM64T system. Loading the module failed with the error "ib_mthca
0000:03:00.0: NOP command failed to generate interrupt (IRQ 169),
aborting." The problem appeared to be corrected when I updated the
firmware on the Mellanox MT25208 HCA card.

The problem has reappeared with the OFED release, on the same system but
with different software and a different HCA card.

I have a small testbed with two EM64T machines connected back-to-back with 
two Mellanox MT25204 single port DDR cards.   I was successfully running 
the backported 2.6.9-34 kernel on RHEL4 Update 3, with a recent version of 
the OpenIB tree.   Both systems would come up and the cards successfully 
initialized.

Over the weekend I moved to the stock 2.6.16 kernel, then built and
installed the OFED-1.0-rc4 release. One of the systems appears to come
up OK, but the port stays in the "down" state. I assumed this was
because the other end of the link (the other machine) was not up.

The second machine boots, but I see the following in dmesg:

    ib_mthca: Mellanox InfiniBand HCA driver v0.08 (February 14, 2006)
    ib_mthca: Initializing 0000:03:00.0
    ACPI: PCI Interrupt 0000:03:00.0[A] -> GSI 16 (level, low) -> IRQ 169
    PCI: Setting latency timer of device 0000:03:00.0 to 64
    ib_mthca 0000:03:00.0: NOP command failed to generate interrupt (IRQ 169), aborting.
    ib_mthca 0000:03:00.0: BIOS or ACPI interrupt routing problem?
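
A minimal sketch, assuming the dmesg output and the contents of
/proc/interrupts have already been captured as text, of how the failing
IRQ could be extracted from the log and its interrupt count checked. The
function names are hypothetical and are not part of any OFED tool; the
pattern matches only the message wording quoted above.

```python
import re

# Matches the driver's failure line as quoted above. Whitespace is matched
# loosely because the line may be wrapped in mail or terminal output.
NOP_FAILURE = re.compile(
    r"ib_mthca\s+(?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):\s+"
    r"NOP\s+command\s+failed\s+to\s+generate\s+interrupt\s+"
    r"\(IRQ\s+(?P<irq>\d+)\)",
    re.IGNORECASE,
)


def find_nop_failures(dmesg_text):
    """Return (pci_address, irq) pairs for each NOP failure in the log text."""
    return [(m.group("pci"), int(m.group("irq")))
            for m in NOP_FAILURE.finditer(dmesg_text)]


def irq_total(proc_interrupts_text, irq):
    """Sum the per-CPU counts on one IRQ line of /proc/interrupts text.

    Returns None if the IRQ number is not listed.
    """
    for line in proc_interrupts_text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep or label.strip() != str(irq):
            continue
        total = 0
        for token in rest.split():
            if not token.isdigit():
                break  # reached the controller/device name columns
            total += int(token)
        return total
    return None
```

If the total for the reported IRQ stays at zero after the failed load,
that would be consistent with the interrupt never reaching the driver,
though it does not by itself say where the interrupt was lost.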

When I had the problem previously, Roland Dreier suggested loading the
ib_mthca module with "fw_cmd_doorbell=0", which avoided the error then
and also avoids it now. But the question is why? Updating the firmware
on the old board seemed to have solved the problem before, but now it
has occurred again on a fairly new card with recent firmware. Has anyone
else seen this problem?

One thing that may have a bearing on this is that the "/sbin/lspci" 
command has also started issuing an error message relating to the PCI slot 
that the HCA is in.  Here is the message:

   pcilib: Resource 2 in /sys/bus/pci/devices/0000:03:00.0/resource has a 64-bit address, ignoring
   ....
   03:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev 20)

Do I need a new version of pcilib?  I currently have 
pciutils-2.1.99.test8-3.1.
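
To see which resource the message refers to, the sysfs file can be read
directly. The following is a rough sketch, assuming each line of the
resource file holds three hexadecimal fields (start, end, flags) and
assuming that "64-bit address" means an address above 4 GiB; the function
name is hypothetical and this does not reproduce pcilib's actual check.

```python
def wide_resources(resource_text, limit=0xFFFFFFFF):
    """Return (index, start, end) for resources with addresses above `limit`.

    `resource_text` is the content of a sysfs PCI "resource" file, assumed
    to contain one "start end flags" triple of hex values per line.
    """
    found = []
    for index, line in enumerate(resource_text.splitlines()):
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        start = int(fields[0], 16)
        end = int(fields[1], 16)
        if start > limit or end > limit:
            found.append((index, start, end))
    return found


if __name__ == "__main__":
    # Path taken from the pcilib message above.
    path = "/sys/bus/pci/devices/0000:03:00.0/resource"
    with open(path) as handle:
        for index, start, end in wide_resources(handle.read()):
            print(f"resource {index}: {start:#x}-{end:#x}")
```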

        -Don Albert-