[ofa-general] Problem with ConnectX HBA

Roland Fehrenbacher rf at q-leap.de
Mon Jan 28 14:09:01 PST 2008


>>>>> "Tziporet" == Tziporet Koren <tziporet at dev.mellanox.co.il> writes:

    Tziporet> Roland Fehrenbacher wrote:
    >> Hi,
    >> 
    >> when running MPI codes, we have the following error messages
    >> coming from some of our servers running 2.6.22.16 with kernel
    >> modules from ofa_kernel-1.2.5.4:
    >> 
    >> mlx4_core 0000:08:00.0: SW2HW_MPT failed (-16)
    >> 
    >> The communication on the corresponding machines is completely
    >> blocked, and ibstat is just hanging.
    >> 
    >> Any idea what could be wrong? Just for additional info: when
    >> running the kernel with the original 2.6.22 drivers, I got
    >> this kind of error message at a much higher rate.
    >> 
    >> 
    >> 
    Tziporet> What is the FW version you use?

# ibstat
CA 'mlx4_0'
        CA type: MT25418
        Number of ports: 2
        Firmware version: 2.3.0
        Hardware version: 0
        Node GUID: 0x0002c9020025a69c
        System image GUID: 0x0002c9020025a69f
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 20
                Base lid: 199
                LMC: 0
                SM lid: 1
                Capability mask: 0x02510868
                Port GUID: 0x0002c9020025a69d
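
To compare firmware versions and port states across many nodes, output
like the above can be collected and parsed. The following is only a
minimal sketch: the function name parse_ibstat and the assumption that
every node prints the same "key: value" layout as shown here are mine,
not part of the ibstat tool.

```python
import re

# Sample text copied from the ibstat output above (shortened).
SAMPLE = """CA 'mlx4_0'
        CA type: MT25418
        Number of ports: 2
        Firmware version: 2.3.0
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 20
                Base lid: 199
"""


def parse_ibstat(text):
    """Parse ibstat-style text into {ca_name: {"fields": {}, "ports": {}}}.

    Hypothetical helper: assumes the indented "key: value" layout above.
    """
    cas = {}
    ca = None    # dict for the CA currently being parsed
    port = None  # dict for the port currently being parsed, if any
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        m = re.match(r"CA '(.+)'$", line)
        if m:
            ca = cas.setdefault(m.group(1), {"fields": {}, "ports": {}})
            port = None
            continue
        m = re.match(r"Port (\d+):$", line)
        if m and ca is not None:
            port = ca["ports"].setdefault(int(m.group(1)), {})
            continue
        if ":" in line and ca is not None:
            key, value = line.split(":", 1)
            # Lines after a "Port N:" header belong to that port.
            target = port if port is not None else ca["fields"]
            target[key.strip()] = value.strip()
    return cas


info = parse_ibstat(SAMPLE)
print(info["mlx4_0"]["fields"]["Firmware version"])
print(info["mlx4_0"]["ports"][1]["State"])
```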



    Tziporet>   What is the type of machine used?

It is a dual Xeon (Quad core) on a 5000P chipset board.

    Tziporet> Can you send us a description of how to reproduce it?

I started a 100-node x 8-core = 800-process mvapich job
(linpack). The issue occurred after about 1 hour of runtime. A 50-node
x 8-core = 400-process mvapich job ran fine several times for more
than 36 hours (including the node on which this issue has now occurred).
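
To find affected nodes in a job of this size, the kernel logs can be
scanned for the failure line. The sketch below assumes the driver
follows the usual kernel convention of reporting a negative errno, in
which case -16 corresponds to EBUSY on Linux. The regular expression
and the function name decode_mlx4_failure are my own, modelled only on
the single message quoted above, and the decoded name says nothing
about why the command failed.

```python
import errno
import re

# Pattern modelled on the one message seen so far:
#   mlx4_core 0000:08:00.0: SW2HW_MPT failed (-16)
PATTERN = re.compile(
    r"mlx4_core (?P<pci>[0-9a-f:.]+): (?P<cmd>\w+) failed \((?P<code>-?\d+)\)"
)


def decode_mlx4_failure(line):
    """Return (pci_address, command, errno_name), or None if no match.

    Hypothetical helper: assumes the number in parentheses is a
    negative errno value.
    """
    m = PATTERN.search(line)
    if m is None:
        return None
    code = abs(int(m.group("code")))
    name = errno.errorcode.get(code, "unknown")
    return m.group("pci"), m.group("cmd"), name


print(decode_mlx4_failure("mlx4_core 0000:08:00.0: SW2HW_MPT failed (-16)"))
```

Feeding each line of dmesg output from every node through this function
gives a quick list of which adapters reported a failed command.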

Roland


