Re: [ofa-general] Problem with ConnectX HBA

Aleksandr Levchuk alevchuk at gmail.com
Wed Aug 20 17:55:14 PDT 2008


>> >>>>> "Tziporet" == Tziporet Koren <tziporet at dev.mellanox.co.il> writes:
>>
>>     Tziporet> Roland Fehrenbacher wrote:
>>     >> Hi,
>>     >>
>>     >> when running MPI codes, we have the following error messages
>>     >> coming from some of our servers running 2.6.22.16 with kernel
>>     >> modules from ofa_kernel-1.2.5.4:
>>     >>
>>     >> mlx4_core 0000:08:00.0: SW2HW_MPT failed (-16)
>>     >>
>>     >> The communication on the corresponding machines is completely
>>     >> blocked, and ibstat is just hanging.
>>     >>
>>     >> Any idea what could be wrong? Just for additional info: When
>>     >> running the kernel with the original 2.6.22 drivers, I had
>>     >> this kind of error message at a much higher rate.
>>     >>
>>     >>
>>     >>
>>     Tziporet> What is the FW version you use?
>>
>> # ibstat
>> CA 'mlx4_0'
>>         CA type: MT25418
>>         Number of ports: 2
>>         Firmware version: 2.3.0
>>         Hardware version: 0
>>         Node GUID: 0x0002c9020025a69c
>>         System image GUID: 0x0002c9020025a69f
>>         Port 1:
>>                 State: Active
>>                 Physical state: LinkUp
>>                 Rate: 20
>>                 Base lid: 199
>>                 LMC: 0
>>                 SM lid: 1
>>                 Capability mask: 0x02510868
>>                 Port GUID: 0x0002c9020025a69d
>>
>>
>>
>>     Tziporet>   What is the type of machine used?
>>
>> It is a dual Xeon (Quad core) on a 5000P chipset board.
>>
>>     Tziporet> Can you send us description how to reproduce?
>>
>> I started a 100-node / 8-core = 800-process mvapich job
>> (Linpack). The issue occurred after about 1 hour of runtime. A 50-node
>> / 8-core = 400-process mvapich job ran fine several times for more
>> than 36 hours (including the node on which this issue has now occurred).
>>
>> Roland


Hi Roland,

I am having the same problem.

After running an MPI job over mvapich2-1.0.3 on 8 nodes (64 CPU cores
total), my application crashes with the following error:
  [0] Abort: [] Got completion with error 12, vendor code=81, dest rank=61
   at line 546 in file ibv_channel_manager.c
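
For anyone decoding that message: the "error 12" appears to be the
ibv_wc_status value from the failed work completion (that mapping is my
assumption about what mvapich reports), and in the libibverbs enum
status 12 is IBV_WC_RETRY_EXC_ERR (transport retry counter exceeded),
which fits a peer HCA that has stopped responding. A minimal sketch of
printing readable status strings while draining a completion queue (the
cq is assumed to have been created elsewhere):

  /* Poll a completion queue and print a readable status string for each
   * failed work completion; status 12 maps to IBV_WC_RETRY_EXC_ERR. */
  #include <stdio.h>
  #include <infiniband/verbs.h>

  void drain_cq(struct ibv_cq *cq)
  {
      struct ibv_wc wc;

      while (ibv_poll_cq(cq, 1, &wc) > 0) {
          if (wc.status != IBV_WC_SUCCESS)
              fprintf(stderr, "wr_id %llu: %s (status %d, vendor_err 0x%x)\n",
                      (unsigned long long) wc.wr_id,
                      ibv_wc_status_str(wc.status),
                      (int) wc.status, wc.vendor_err);
      }
  }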

After this, the HCA on one of the nodes goes down and the node behaves
just as you described: it prints a stream of "mlx4_core ... SW2HW_MPT
failed" messages to /var/log/kern.log.
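
As a side note, the -16 looks like a negated kernel errno, and errno 16
is EBUSY, so the firmware's SW2HW_MPT command seems to be failing with
"resource busy" (my reading of the number, not a confirmed diagnosis).
A quick way to check the number-to-name mapping:

  /* errno 16 on Linux is EBUSY; this prints "Device or resource busy". */
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      printf("errno 16 = %s\n", strerror(16));
      return 0;
  }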

I have the 2.6.24-etchnhalf.1-amd64 kernel, libibverbs 1.1.2, and
librdmacm 1.0.7.

My InfiniBand HCAs are the same as yours (the hardware was put together by Verari):

CA 'mlx4_0'
        CA type: MT25418
        Number of ports: 2
        Firmware version: 2.3.0
        Hardware version: 0
        Node GUID: 0x0002c9030000a910
        System image GUID: 0x0002c9030000a913
        Port 1:
                State: Down
                Physical state: Polling
                Rate: 10
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x02510868
                Port GUID: 0x0002c9030000a911
        Port 2:
                State: Active
                Physical state: LinkUp
                Rate: 20
                Base lid: 17
                LMC: 0
                SM lid: 16
                Capability mask: 0x0251086a
                Port GUID: 0x0002c9030000a912


I am currently working on the approach described here:
http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/Intel64Cluster/Doc/Compile.html#Compilers
But hacking out all of the system(), fork(), and popen() calls in my
application (NAMD2 with Charm++) is very hard.
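
Before ripping them all out, I may first try libibverbs' fork support:
as I understand it, calling ibv_fork_init() before any other verbs call
(or exporting RDMAV_FORK_SAFE=1 so the library does it on its own)
marks registered memory MADV_DONTFORK, which should make fork(),
system(), and popen() survivable at some registration-time cost. A
sketch of the explicit call, assuming it runs before anything touches
the HCA:

  /* Enable fork-safe memory registration. This must run before any
   * other libibverbs call, i.e. before MPI_Init() in an MPI program,
   * which is why exporting RDMAV_FORK_SAFE=1 in the job environment is
   * often more practical than a code change. */
  #include <stdio.h>
  #include <string.h>
  #include <infiniband/verbs.h>

  int main(void)
  {
      int ret = ibv_fork_init();  /* returns 0, or an errno value on failure */
      if (ret) {
          fprintf(stderr, "ibv_fork_init: %s\n", strerror(ret));
          return 1;
      }
      /* ... ibv_get_device_list(), MPI_Init(), rest of the application ... */
      return 0;
  }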

Another approach I might attempt is to get my parallel application to
run over RDMA directly, bypassing MPI. Judging by the application's
compilation options, that also seems possible.
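
In case it helps anyone else, the libibverbs entry point for going
direct is small; everything else (PD, CQ, QP, memory registration, and
the out-of-band connection setup) builds on an opened device context.
A minimal sketch, not a complete RDMA program:

  /* Enumerate RDMA devices and open the first one. */
  #include <stdio.h>
  #include <infiniband/verbs.h>

  int main(void)
  {
      int num = 0;
      struct ibv_device **list = ibv_get_device_list(&num);
      if (!list || num == 0) {
          fprintf(stderr, "no RDMA devices found\n");
          return 1;
      }

      struct ibv_context *ctx = ibv_open_device(list[0]);
      if (!ctx) {
          fprintf(stderr, "failed to open %s\n",
                  ibv_get_device_name(list[0]));
          ibv_free_device_list(list);
          return 1;
      }

      printf("opened %s\n", ibv_get_device_name(ctx->device));
      ibv_close_device(ctx);
      ibv_free_device_list(list);
      return 0;
  }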


Were you able to solve this problem?

Alex


-- 
--------------------------------------------
Aleksandr Levchuk
University of California, Riverside
1-951-368-0004
--------------------------------------------


