[ofa-general] Problem with ConnectX HBA
Roland Fehrenbacher
rf at q-leap.de
Mon Jan 28 14:09:01 PST 2008
>>>>> "Tziporet" == Tziporet Koren <tziporet at dev.mellanox.co.il> writes:
Tziporet> Roland Fehrenbacher wrote:
>> Hi,
>>
>> when running MPI codes, we have the following error messages
>> coming from some of our servers running 2.6.22.16 with kernel
>> modules from ofa_kernel-1.2.5.4:
>>
>> mlx4_core 0000:08:00.0: SW2HW_MPT failed (-16)
>>
>> The communication on the corresponding machines is completely
>> blocked, and ibstat is just hanging.
>>
>> Any idea what could be wrong? Just for additional info: When
>> running the kernel with the original 2.6.22 drivers, I saw
>> this kind of error message at a much higher rate.
>>
>>
>>
Tziporet> What is the FW version you use?
# ibstat
CA 'mlx4_0'
    CA type: MT25418
    Number of ports: 2
    Firmware version: 2.3.0
    Hardware version: 0
    Node GUID: 0x0002c9020025a69c
    System image GUID: 0x0002c9020025a69f
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 20
        Base lid: 199
        LMC: 0
        SM lid: 1
        Capability mask: 0x02510868
        Port GUID: 0x0002c9020025a69d
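
When comparing firmware versions across many nodes, it can help to turn the ibstat output above into a structure that is easy to check. The following is only a minimal sketch: the function name parse_ibstat is hypothetical, and it assumes the plain "key: value" layout shown above, with one "CA '<name>'" line per adapter and one "Port <n>:" line per port.

```python
def parse_ibstat(text):
    """Parse ibstat-style output into {ca_name: {field: value, "ports": {n: {...}}}}."""
    cas = {}
    ca = None       # adapter currently being filled in
    target = None   # dict that receives the next "key: value" line
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the shell prompt line
        if line.startswith("CA '") and line.endswith("'"):
            ca = {"ports": {}}
            cas[line[4:-1]] = ca
            target = ca
        elif ca is not None and line.startswith("Port ") and line.endswith(":"):
            port = {}
            ca["ports"][int(line[5:-1])] = port
            target = port
        elif target is not None and ": " in line:
            key, value = line.split(": ", 1)
            target[key] = value
    return cas


SAMPLE = """\
# ibstat
CA 'mlx4_0'
CA type: MT25418
Number of ports: 2
Firmware version: 2.3.0
Port 1:
State: Active
Physical state: LinkUp
Port GUID: 0x0002c9020025a69d
"""

info = parse_ibstat(SAMPLE)
print(info["mlx4_0"]["Firmware version"])    # 2.3.0
print(info["mlx4_0"]["ports"][1]["State"])   # Active
```

The parser does not rely on indentation, so it works on output pasted into a mail as well as on the raw command output.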
Tziporet> What is the type of machine used?
It is a dual Xeon (Quad core) on a 5000P chipset board.
Tziporet> Can you send us a description of how to reproduce it?
I started an mvapich job (linpack) on 100 nodes with 8 cores each,
i.e. 800 processes. The issue occurred after about 1 hour of runtime.
A job on 50 nodes with 8 cores each (400 processes) ran fine several
times for more than 36 hours (including the node on which the issue
has now occurred).
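
To find affected nodes in a job of this size, the kernel logs can be scanned for the "SW2HW_MPT failed (-16)" message quoted above. The sketch below assumes that the number in parentheses is a negated errno value, as is the usual kernel convention, and that the log line has exactly the form quoted in this thread. The names FAILURE_RE and decode_failure are hypothetical.

```python
import errno
import re

# Matches lines such as: mlx4_core 0000:08:00.0: SW2HW_MPT failed (-16)
FAILURE_RE = re.compile(
    r"mlx4_core (?P<pci>\S+): (?P<cmd>\w+) failed \((?P<code>-?\d+)\)"
)


def decode_failure(line):
    """Return (pci_address, command, errno_name), or None if the line does not match."""
    m = FAILURE_RE.search(line)
    if m is None:
        return None
    code = abs(int(m.group("code")))  # assumption: the driver reports -errno
    return m.group("pci"), m.group("cmd"), errno.errorcode.get(code, "unknown")


print(decode_failure("mlx4_core 0000:08:00.0: SW2HW_MPT failed (-16)"))
# on Linux: ('0000:08:00.0', 'SW2HW_MPT', 'EBUSY')
```

Under that assumption, -16 corresponds to EBUSY on Linux. This only names the error code; it does not explain why the command failed.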
Roland