[ofa-general] ib_mthca catastrophic error detected

Scott A. Friedman friedman at ucla.edu
Thu Nov 6 10:34:57 PST 2008


Hi

We have been working on this recently with Matthew Finlay 
<Matt at mellanox.com>, so you or we might pull all of this together. We 
can trigger a catastrophic error on any of our SDR cards, but we have 
not been able to do the same with our DDR cards. Matt has suggested 
that there may be a firmware fix.

Anyway, to answer your questions:

The hosts are Sun X2200M, but we have swapped a few around with some 
hosts we have from Aspen Systems and the problem remains. I suppose the 
similarity is that they are all nForce-based.

The MPI used was the latest OpenMPI - I will find the version, but I do 
not think it matters whether we are using OpenMPI or MVAPICH.

The job itself does not seem to matter either. The problem is that after 
a node comes up, it takes a very long time for the card to become ACTIVE. 
The port seems to oscillate between ACTIVE and INIT. We have sometimes 
waited several minutes, but we can never be sure when it will settle 
down. The queue certainly doesn't know, and a job submitted to such a 
node will die when the card has a catastrophic error.
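
As a stopgap for the queue problem, something like the following could 
hold a node back until its port has stayed ACTIVE for a while. This is 
only a sketch under assumptions: it reads the standard sysfs state file, 
takes the device name mthca0 from the ibv_devinfo output quoted below 
and assumes port 1, and the hold and poll counts are arbitrary values, 
not anything Mellanox recommended. It does not address the underlying 
error.

```python
import time
from pathlib import Path

# Assumed location: first port of the mthca0 device (hypothetical choice).
STATE_FILE = Path("/sys/class/infiniband/mthca0/ports/1/state")


def parse_state(text):
    """Extract the state name from a sysfs line such as '4: ACTIVE'."""
    return text.strip().split(":", 1)[-1].strip()


def read_port_state(path=STATE_FILE):
    """Read the current port state name from sysfs."""
    return parse_state(Path(path).read_text())


def wait_until_stable(read_state=read_port_state, hold_polls=60,
                      max_polls=600, interval=1.0, sleep=time.sleep):
    """Poll the port state up to max_polls times.

    Return True once the port has reported ACTIVE on hold_polls
    consecutive polls, or False if that never happens. Any other
    state (for example INIT) resets the count, so a port that keeps
    flapping is never reported as stable.
    """
    streak = 0
    for _ in range(max_polls):
        if read_state() == "ACTIVE":
            streak += 1
        else:
            streak = 0
        if streak >= hold_polls:
            return True
        sleep(interval)
    return False
```

A node startup or scheduler prologue script could call 
wait_until_stable() and leave the node offline when it returns False.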

Scott


 > Console output from the following linux commands:
 >   cat /etc/*rel*


Not a good idea...maybe this

# cat /etc/redhat-release
CentOS release 5 (Final)

 >   cat /etc/lilo.conf , or:  cat /boot/grub/menu.lst (if you are using grub)

# grub.conf generated by anaconda
#
# Note that you do not have to rerun grub after making changes to this file
# NOTICE:  You have a /boot partition.  This means that
#          all kernel and initrd paths are relative to /boot/, eg.
#          root (hd0,0)
#          kernel /vmlinuz-version ro root=/dev/hda3
#          initrd /initrd-version.img
#boot=/dev/hda
default=0
timeout=5
splashimage=(hd0,0)/grub/splash.xpm.gz
hiddenmenu
title CentOS (2.6.18-92.1.6.el5)
  root (hd0,0)
  kernel /vmlinuz-2.6.18-92.1.6.el5 ro root=LABEL=/ rhgb quiet
  initrd /initrd-2.6.18-92.1.6.el5.img


 >   uname -a

Linux n141 2.6.18-92.1.6.el5 #1 SMP Wed Jun 25 13:45:47 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux


 >   cat /proc/cpuinfo
 >   cat /proc/meminfo

processor : 0
vendor_id : AuthenticAMD
cpu family   : 16
model  : 2
model name   : Quad-Core AMD Opteron(tm) Processor 2354
stepping : 3
cpu MHz  : 2200.000
cache size   : 512 KB
physical id  : 0
siblings : 4
core id  : 0
cpu cores : 4
fpu  : yes
fpu_exception : yes
cpuid level  : 5
wp  : yes
flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
3dnowprefetch osvw
bogomips : 4424.75
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

processor : 1
vendor_id : AuthenticAMD
cpu family   : 16
model  : 2
model name   : Quad-Core AMD Opteron(tm) Processor 2354
stepping : 3
cpu MHz  : 2200.000
cache size   : 512 KB
physical id  : 0
siblings : 4
core id  : 1
cpu cores : 4
fpu  : yes
fpu_exception : yes
cpuid level  : 5
wp  : yes
flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
3dnowprefetch osvw
bogomips : 4426.22
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

processor : 2
vendor_id : AuthenticAMD
cpu family   : 16
model  : 2
model name   : Quad-Core AMD Opteron(tm) Processor 2354
stepping : 3
cpu MHz  : 2200.000
cache size   : 512 KB
physical id  : 0
siblings : 4
core id  : 2
cpu cores : 4
fpu  : yes
fpu_exception : yes
cpuid level  : 5
wp  : yes
flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
3dnowprefetch osvw
bogomips : 4421.37
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

processor : 3
vendor_id : AuthenticAMD
cpu family   : 16
model  : 2
model name   : Quad-Core AMD Opteron(tm) Processor 2354
stepping : 3
cpu MHz  : 2200.000
cache size   : 512 KB
physical id  : 0
siblings : 4
core id  : 3
cpu cores : 4
fpu  : yes
fpu_exception : yes
cpuid level  : 5
wp  : yes
flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
3dnowprefetch osvw
bogomips : 4421.65
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

processor : 4
vendor_id : AuthenticAMD
cpu family   : 16
model  : 2
model name   : Quad-Core AMD Opteron(tm) Processor 2354
stepping : 3
cpu MHz  : 2200.000
cache size   : 512 KB
physical id  : 1
siblings : 4
core id  : 0
cpu cores : 4
fpu  : yes
fpu_exception : yes
cpuid level  : 5
wp  : yes
flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
3dnowprefetch osvw
bogomips : 4422.36
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

processor : 5
vendor_id : AuthenticAMD
cpu family   : 16
model  : 2
model name   : Quad-Core AMD Opteron(tm) Processor 2354
stepping : 3
cpu MHz  : 2200.000
cache size   : 512 KB
physical id  : 1
siblings : 4
core id  : 1
cpu cores : 4
fpu  : yes
fpu_exception : yes
cpuid level  : 5
wp  : yes
flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
3dnowprefetch osvw
bogomips : 4422.71
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

processor : 6
vendor_id : AuthenticAMD
cpu family   : 16
model  : 2
model name   : Quad-Core AMD Opteron(tm) Processor 2354
stepping : 3
cpu MHz  : 2200.000
cache size   : 512 KB
physical id  : 1
siblings : 4
core id  : 2
cpu cores : 4
fpu  : yes
fpu_exception : yes
cpuid level  : 5
wp  : yes
flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
3dnowprefetch osvw
bogomips : 4422.17
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

processor : 7
vendor_id : AuthenticAMD
cpu family   : 16
model  : 2
model name   : Quad-Core AMD Opteron(tm) Processor 2354
stepping : 3
cpu MHz  : 2200.000
cache size   : 512 KB
physical id  : 1
siblings : 4
core id  : 3
cpu cores : 4
fpu  : yes
fpu_exception : yes
cpuid level  : 5
wp  : yes
flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
3dnowprefetch osvw
bogomips : 4422.17
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]




MemTotal:      8182568 kB
MemFree:       4535892 kB
Buffers:        318232 kB
Cached:        1583772 kB
SwapCached:          0 kB
Active:        2714400 kB
Inactive:       730260 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:      8182568 kB
LowFree:       4535892 kB
SwapTotal:     8289532 kB
SwapFree:      8289380 kB
Dirty:             340 kB
Writeback:           0 kB
AnonPages:     1542636 kB
Mapped:          14588 kB
Slab:           139788 kB
PageTables:       7208 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:  12380816 kB
Committed_AS:  1679420 kB
VmallocTotal: 34359738367 kB
VmallocUsed:      4600 kB
VmallocChunk: 34359733707 kB
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
Hugepagesize:     2048 kB



Jack Morgenstein wrote:
> On Tuesday 28 October 2008 21:11, Scott A. Friedman wrote:
>> Hi
>>
>> This cluster has OFED 1.2.5.4 running on it. The ib_mthca kernel module 
>> reports the following on startup:
>>
>> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (February 28, 2008)
>>
>> The cards in all (22) of the nodes we have seen this error on are as 
>> follows:
>>
>> hca_id: mthca0
>>          fw_ver:                         1.2.0
>>          vendor_id:                      0x02c9
>>          vendor_part_id:                 25204
>>          hw_ver:                         0xA0
>>          board_id:                       MT_03B0140001
>>          phys_port_cnt:                  1
>>
>> It appears that when this happens the driver restarts (loads?) itself 
>> however the job running at the time of the error is, of course, killed.
>>
>> Scott
> 
> Scott,
> We are trying to reproduce this here.  It would help if you could supply
> the following info:
> 
> Host model for hosts which are experiencing the failure:
>  
> Console output from the following linux commands:
>   cat /etc/*rel*
>   cat /etc/lilo.conf , or:  cat /boot/grub/menu.lst (if you are using grub)
>   uname -a
>   cat /proc/cpuinfo
>   cat /proc/meminfo
> 
> Also, what sort of job was running when the failure occurred:
> -- which MPI are you using?
> -- do you have a test example which we can run here to reproduce the problem?
> 
> Thanks in advance for your help!
> 
> Jack Morgenstein
> Senior Software Development Engineer
> Mellanox


