[ofa-general] ib_mthca catastrophic error detected

Boris Shpolyansky boris at mellanox.com
Mon Nov 10 11:24:36 PST 2008


Scott,

Do you use any form of Boot-over-IB in this cluster?
If so - what version/flavor of it?

Thanks,
Boris Shpolyansky
Sr. Member of Technical Staff
Applications
Mellanox Technologies Inc.
2900 Stender Way
Santa Clara, CA 95054
Tel.: (408) 916 0014
Fax: (408) 970 3403
Cell: (408) 834 9365
www.mellanox.com

-----Original Message-----
From: general-bounces at lists.openfabrics.org
[mailto:general-bounces at lists.openfabrics.org] On Behalf Of Scott A. Friedman
Sent: Thursday, November 06, 2008 10:35 AM
To: Jack Morgenstein
Cc: Matthew Finlay; general at lists.openfabrics.org
Subject: Re: [ofa-general] ib_mthca catastrophic error detected

Hi

We have been working with Matthew Finlay <Matt at mellanox.com> on this
recently, so you may want to pull all of this together. We are able to make
any of our SDR cards hit a catastrophic error, but are unable to do
the same with our DDR cards. Matt has suggested that there may be a
firmware fix.

Anyway, to answer your questions:

The hosts are Sun X2200M, but we have swapped a few around with some
hosts we have from Aspen Systems and the problem remains. I suppose the
similarity is that they are all nForce-based.

The MPI used was the latest OpenMPI - I will find the version, but I do 
not think it matters whether we are using OpenMPI or MVAPICH.

The job itself does not seem to matter either. After a node comes up, it
takes a very long time for the card to become ACTIVE. It seems to
oscillate between ACTIVE and INIT. We have sometimes waited several
minutes, but can never be sure when it will settle down. The queue
certainly doesn't know, and a job submitted to such a node will die
because the cards will have a catastrophic error.
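
One way to keep the queue from handing jobs to a node whose port is still
bouncing between INIT and ACTIVE is to hold the node back until the port has
read ACTIVE for a while without interruption. The following is only a rough
sketch of that idea, not something taken from this thread: the sysfs path
(port 1 of mthca0), the function names, and the thresholds are all assumptions
to adjust for the actual site.

```python
import time
from pathlib import Path

# Assumption: port 1 of the first mthca device; adjust for the actual HCA/port.
STATE_FILE = Path("/sys/class/infiniband/mthca0/ports/1/state")


def parse_state(text):
    """Extract the state name from a sysfs line such as '4: ACTIVE'."""
    return text.strip().split(":", 1)[-1].strip()


def read_port_state(path=STATE_FILE):
    """Read the current port state name (e.g. 'INIT' or 'ACTIVE') from sysfs."""
    return parse_state(Path(path).read_text())


def wait_for_stable_active(read_state=read_port_state, needed=30, interval=1.0,
                           timeout=600.0, sleep=time.sleep,
                           clock=time.monotonic):
    """Return True once the port has read ACTIVE `needed` times in a row.

    Any other reading (for example INIT) resets the count, so a port that
    keeps flapping never qualifies. Returns False if `timeout` seconds pass.
    """
    deadline = clock() + timeout
    streak = 0
    while True:
        streak = streak + 1 if read_state() == "ACTIVE" else 0
        if streak >= needed:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A boot script or scheduler prolog could call wait_for_stable_active() and only
mark the node as available when it returns True. This does not address the
underlying catastrophic error; it would only avoid starting jobs while the
port is still unsettled.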

Scott


 > Console output from the following linux commands:
 >   cat /etc/*rel*


Not a good idea...maybe this

#cat /etc/redhat-release
CentOS release 5 (Final)

 >   cat /etc/lilo.conf , or:  cat /boot/grub/menu.lst (if you are using grub)

# grub.conf generated by anaconda
#
# Note that you do not have to rerun grub after making changes to this file
# NOTICE:  You have a /boot partition.  This means that
#          all kernel and initrd paths are relative to /boot/, eg.
#          root (hd0,0)
#          kernel /vmlinuz-version ro root=/dev/hda3
#          initrd /initrd-version.img
#boot=/dev/hda
default=0
timeout=5
splashimage=(hd0,0)/grub/splash.xpm.gz
hiddenmenu
title CentOS (2.6.18-92.1.6.el5)
  root (hd0,0)
  kernel /vmlinuz-2.6.18-92.1.6.el5 ro root=LABEL=/ rhgb quiet
  initrd /initrd-2.6.18-92.1.6.el5.img


 >   uname -a

Linux n141 2.6.18-92.1.6.el5 #1 SMP Wed Jun 25 13:45:47 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux


 >   cat /proc/cpuinfo
 >   cat /proc/meminfo

processor : 0
vendor_id : AuthenticAMD
cpu family   : 16
model  : 2
model name   : Quad-Core AMD Opteron(tm) Processor 2354
stepping : 3
cpu MHz  : 2200.000
cache size   : 512 KB
physical id  : 0
siblings : 4
core id  : 0
cpu cores : 4
fpu  : yes
fpu_exception : yes
cpuid level  : 5
wp  : yes
flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
3dnowprefetch osvw
bogomips : 4424.75
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

processor : 1
vendor_id : AuthenticAMD
cpu family   : 16
model  : 2
model name   : Quad-Core AMD Opteron(tm) Processor 2354
stepping : 3
cpu MHz  : 2200.000
cache size   : 512 KB
physical id  : 0
siblings : 4
core id  : 1
cpu cores : 4
fpu  : yes
fpu_exception : yes
cpuid level  : 5
wp  : yes
flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
3dnowprefetch osvw
bogomips : 4426.22
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

processor : 2
vendor_id : AuthenticAMD
cpu family   : 16
model  : 2
model name   : Quad-Core AMD Opteron(tm) Processor 2354
stepping : 3
cpu MHz  : 2200.000
cache size   : 512 KB
physical id  : 0
siblings : 4
core id  : 2
cpu cores : 4
fpu  : yes
fpu_exception : yes
cpuid level  : 5
wp  : yes
flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
3dnowprefetch osvw
bogomips : 4421.37
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

processor : 3
vendor_id : AuthenticAMD
cpu family   : 16
model  : 2
model name   : Quad-Core AMD Opteron(tm) Processor 2354
stepping : 3
cpu MHz  : 2200.000
cache size   : 512 KB
physical id  : 0
siblings : 4
core id  : 3
cpu cores : 4
fpu  : yes
fpu_exception : yes
cpuid level  : 5
wp  : yes
flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
3dnowprefetch osvw
bogomips : 4421.65
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

processor : 4
vendor_id : AuthenticAMD
cpu family   : 16
model  : 2
model name   : Quad-Core AMD Opteron(tm) Processor 2354
stepping : 3
cpu MHz  : 2200.000
cache size   : 512 KB
physical id  : 1
siblings : 4
core id  : 0
cpu cores : 4
fpu  : yes
fpu_exception : yes
cpuid level  : 5
wp  : yes
flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
3dnowprefetch osvw
bogomips : 4422.36
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

processor : 5
vendor_id : AuthenticAMD
cpu family   : 16
model  : 2
model name   : Quad-Core AMD Opteron(tm) Processor 2354
stepping : 3
cpu MHz  : 2200.000
cache size   : 512 KB
physical id  : 1
siblings : 4
core id  : 1
cpu cores : 4
fpu  : yes
fpu_exception : yes
cpuid level  : 5
wp  : yes
flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
3dnowprefetch osvw
bogomips : 4422.71
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

processor : 6
vendor_id : AuthenticAMD
cpu family   : 16
model  : 2
model name   : Quad-Core AMD Opteron(tm) Processor 2354
stepping : 3
cpu MHz  : 2200.000
cache size   : 512 KB
physical id  : 1
siblings : 4
core id  : 2
cpu cores : 4
fpu  : yes
fpu_exception : yes
cpuid level  : 5
wp  : yes
flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
3dnowprefetch osvw
bogomips : 4422.17
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

processor : 7
vendor_id : AuthenticAMD
cpu family   : 16
model  : 2
model name   : Quad-Core AMD Opteron(tm) Processor 2354
stepping : 3
cpu MHz  : 2200.000
cache size   : 512 KB
physical id  : 1
siblings : 4
core id  : 3
cpu cores : 4
fpu  : yes
fpu_exception : yes
cpuid level  : 5
wp  : yes
flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
3dnowprefetch osvw
bogomips : 4422.17
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]




MemTotal:      8182568 kB
MemFree:       4535892 kB
Buffers:        318232 kB
Cached:        1583772 kB
SwapCached:          0 kB
Active:        2714400 kB
Inactive:       730260 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:      8182568 kB
LowFree:       4535892 kB
SwapTotal:     8289532 kB
SwapFree:      8289380 kB
Dirty:             340 kB
Writeback:           0 kB
AnonPages:     1542636 kB
Mapped:          14588 kB
Slab:           139788 kB
PageTables:       7208 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:  12380816 kB
Committed_AS:  1679420 kB
VmallocTotal: 34359738367 kB
VmallocUsed:      4600 kB
VmallocChunk: 34359733707 kB
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
Hugepagesize:     2048 kB



Jack Morgenstein wrote:
> On Tuesday 28 October 2008 21:11, Scott A. Friedman wrote:
>> Hi
>>
>> This cluster has OFED 1.2.5.4 running on it. The ib_mthca kernel module
>> reports the following on startup:
>>
>> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (February 28, 2008)
>>
>> The cards in all (22) of the nodes we have seen this error on are as 
>> follows:
>>
>> hca_id: mthca0
>>          fw_ver:                         1.2.0
>>          vendor_id:                      0x02c9
>>          vendor_part_id:                 25204
>>          hw_ver:                         0xA0
>>          board_id:                       MT_03B0140001
>>          phys_port_cnt:                  1
>>
>> It appears that when this happens the driver restarts (loads?) itself
>> however the job running at the time of the error is, of course, killed.
>>
>> Scott
> 
> Scott,
> We are trying to reproduce this here.  It would help if you could supply
> the following info:
> 
> Host model for hosts which are experiencing the failure:
>  
> Console output from the following linux commands:
>   cat /etc/*rel*
>   cat /etc/lilo.conf , or:  cat /boot/grub/menu.lst (if you are using grub)
>   uname -a
>   cat /proc/cpuinfo
>   cat /proc/meminfo
> 
> Also, what sort of job was running when the failure occurred:
> -- which MPI are you using?
> -- do you have a test example which we can run here to reproduce the problem?
> 
> Thanks in advance for your help!
> 
> Jack Morgenstein
> Senior Software Development Engineer
> Mellanox
_______________________________________________
general mailing list
general at lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit
http://openib.org/mailman/listinfo/openib-general


