[ofa-general] ib_mthca catastrophic error detected
Boris Shpolyansky
boris at mellanox.com
Mon Nov 10 12:50:13 PST 2008
OK, great!
Please update us as soon as you have the entire cluster upgraded to the
new FW and have run more tests on it.
Thanks,
Boris Shpolyansky
Sr. Member of Technical Staff
Applications
Mellanox Technologies Inc.
2900 Stender Way
Santa Clara, CA 95054
Tel.: (408) 916 0014
Fax: (408) 970 3403
Cell: (408) 834 9365
www.mellanox.com
-----Original Message-----
From: Scott A. Friedman [mailto:friedman at ucla.edu]
Sent: Monday, November 10, 2008 11:45 AM
To: Boris Shpolyansky
Cc: Jack Morgenstein; Matthew Finlay; general at lists.openfabrics.org
Subject: Re: [ofa-general] ib_mthca catastrophic error detected
Hi
No, no boot over IB - in fact there is no IPoIB configured on this
cluster at all.
The firmware Matt sent seems to have fixed the problem: we have been
unable to reproduce it since we flashed some test nodes. We are in the
process of flashing the remaining 100 or so nodes that have SDR cards as
jobs finish.
Scott
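
For tracking which nodes still need flashing, here is a minimal sketch that parses `ibv_devinfo`-style "key: value" output (the hca_id / fw_ver / vendor_part_id fields quoted further down in this thread) and flags cards whose firmware is older than a target version. The function names and the target version are hypothetical; the thread does not state the version of the fixed firmware, so the value below is only a placeholder. The part ID 25204 is the one reported for the failing cards.

```python
# Sample text copied from the device listing quoted in this thread.
SAMPLE = """
hca_id: mthca0
fw_ver: 1.2.0
vendor_id: 0x02c9
vendor_part_id: 25204
hw_ver: 0xA0
board_id: MT_03B0140001
phys_port_cnt: 1
"""

# Placeholder only: the fixed firmware version is not given in the thread.
HYPOTHETICAL_TARGET_FW = "1.2.1"


def parse_devinfo(text):
    """Split 'key: value' lines into one dict per HCA (a new dict starts at each hca_id)."""
    devices = []
    current = None
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "hca_id":
            current = {"hca_id": value}
            devices.append(current)
        elif current is not None:
            current[key] = value
    return devices


def version_tuple(fw_ver):
    """Turn '1.2.0' into (1, 2, 0) so versions compare numerically."""
    return tuple(int(part) for part in fw_ver.split("."))


def needs_flash(device, part_id, target_fw):
    """True if the card has the given part ID and firmware older than target_fw."""
    return (device.get("vendor_part_id") == part_id
            and version_tuple(device["fw_ver"]) < version_tuple(target_fw))


if __name__ == "__main__":
    for dev in parse_devinfo(SAMPLE):
        if needs_flash(dev, "25204", HYPOTHETICAL_TARGET_FW):
            print(dev["hca_id"], "fw", dev["fw_ver"], "still needs flashing")
```

Feeding each node's output through `parse_devinfo` and collecting the hosts where `needs_flash` is true would give the list of nodes still to flash.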
Boris Shpolyansky wrote:
> Scott,
>
> Do you use any form of Boot-over-IB in this cluster?
> If so - what version/flavor of it?
>
> Thanks,
> Boris Shpolyansky
>
> -----Original Message-----
> From: general-bounces at lists.openfabrics.org
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Scott A.
> Friedman
> Sent: Thursday, November 06, 2008 10:35 AM
> To: Jack Morgenstein
> Cc: Matthew Finlay; general at lists.openfabrics.org
> Subject: Re: [ofa-general] ib_mthca catastrophic error detected
>
> Hi
>
> We have been working with Matthew Finlay <Matt at mellanox.com> on this
> recently - you/we might pull all of this together. We are able to make
> any of our SDR cards have a catastrophic error - and are unable to do
> the same with our DDR cards. Matt has suggested that there is possibly
> a firmware fix.
>
> Anyway, to answer your questions:
>
> The hosts are Sun X2200M, but we have swapped a few around with some
> hosts we have from Aspen systems and the problem remains. I suppose the
> similarity is that they are all nForce based.
>
> The MPI used was the latest OpenMPI - I will find the version, but I do
> not think it matters whether we are using OpenMPI or MVAPICH.
>
> The job itself does not seem to matter either. The situation is that
> after a node comes up, it takes a very long time for the card to become
> ACTIVE. It seems to oscillate between ACTIVE and INIT. We have waited
> several minutes sometimes but can never be sure when it will settle
> down. The queue certainly doesn't know, and a job submitted to such a
> node will die, as the cards will have a catastrophic error.
>
> Scott
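
One way to keep the queue from using a node whose port is still flapping between INIT and ACTIVE, as described above, is to require a run of consecutive ACTIVE readings before marking the node usable. The sketch below is an illustration only, not something used in this thread: the function names, the number of required readings, and the timeout are assumptions, and it has not been tested against mthca hardware. It reads the standard Linux InfiniBand sysfs file `/sys/class/infiniband/<device>/ports/<port>/state`, whose content looks like `4: ACTIVE`.

```python
import time


def parse_state(raw):
    """Return the state name from a sysfs string such as '4: ACTIVE'."""
    return raw.strip().split(":", 1)[-1].strip()


def read_sysfs_state(dev="mthca0", port=1):
    """Read the current port state name from sysfs (device and port are assumptions)."""
    path = f"/sys/class/infiniband/{dev}/ports/{port}/state"
    with open(path) as f:
        return parse_state(f.read())


def wait_for_stable_active(read_state, required=30, interval=1.0,
                           timeout=600.0, sleep=time.sleep):
    """Poll read_state(); return True after `required` consecutive ACTIVE
    readings, or False once `timeout` seconds of waiting have passed."""
    streak = 0
    waited = 0.0
    while True:
        if read_state() == "ACTIVE":
            streak += 1
            if streak >= required:
                return True
        else:
            # Any non-ACTIVE reading (e.g. INIT) resets the count.
            streak = 0
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval


if __name__ == "__main__":
    ok = wait_for_stable_active(read_sysfs_state)
    print("port stable" if ok else "port still flapping")
```

A node health check could run this at boot and only enable the node in the queue when it returns True. This works around the symptom only; it does not address the catastrophic error itself.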
>
>
> > Console output from the following linux commands:
> > cat /etc/*rel*
>
>
> Not a good idea...maybe this
>
> #cat /etc/redhat-release
> CentOS release 5 (Final)
>
> > cat /etc/lilo.conf , or: cat /boot/grub/menu.lst (if you are using grub)
>
> # grub.conf generated by anaconda
> #
> # Note that you do not have to rerun grub after making changes to this file
> # NOTICE: You have a /boot partition. This means that
> # all kernel and initrd paths are relative to /boot/, eg.
> # root (hd0,0)
> # kernel /vmlinuz-version ro root=/dev/hda3
> # initrd /initrd-version.img
> #boot=/dev/hda
> default=0
> timeout=5
> splashimage=(hd0,0)/grub/splash.xpm.gz
> hiddenmenu
> title CentOS (2.6.18-92.1.6.el5)
> root (hd0,0)
> kernel /vmlinuz-2.6.18-92.1.6.el5 ro root=LABEL=/ rhgb quiet
> initrd /initrd-2.6.18-92.1.6.el5.img
>
>
> > uname -a
>
> Linux n141 2.6.18-92.1.6.el5 #1 SMP Wed Jun 25 13:45:47 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
>
>
> > cat /proc/cpuinfo
> > cat /proc/meminfo
>
> processor : 0
> vendor_id : AuthenticAMD
> cpu family : 16
> model : 2
> model name : Quad-Core AMD Opteron(tm) Processor 2354
> stepping : 3
> cpu MHz : 2200.000
> cache size : 512 KB
> physical id : 0
> siblings : 4
> core id : 0
> cpu cores : 4
> fpu : yes
> fpu_exception : yes
> cpuid level : 5
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
> pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt
> pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm
> cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse
> 3dnowprefetch osvw
> bogomips : 4424.75
> TLB size : 1024 4K pages
> clflush size : 64
> cache_alignment : 64
> address sizes : 48 bits physical, 48 bits virtual
> power management: ts ttp tm stc 100mhzsteps hwpstate [8]
>
> processor : 1
> vendor_id : AuthenticAMD
> cpu family : 16
> model : 2
> model name : Quad-Core AMD Opteron(tm) Processor 2354
> stepping : 3
> cpu MHz : 2200.000
> cache size : 512 KB
> physical id : 0
> siblings : 4
> core id : 1
> cpu cores : 4
> fpu : yes
> fpu_exception : yes
> cpuid level : 5
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
> pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt
> pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm
> cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse
> 3dnowprefetch osvw
> bogomips : 4426.22
> TLB size : 1024 4K pages
> clflush size : 64
> cache_alignment : 64
> address sizes : 48 bits physical, 48 bits virtual
> power management: ts ttp tm stc 100mhzsteps hwpstate [8]
>
> processor : 2
> vendor_id : AuthenticAMD
> cpu family : 16
> model : 2
> model name : Quad-Core AMD Opteron(tm) Processor 2354
> stepping : 3
> cpu MHz : 2200.000
> cache size : 512 KB
> physical id : 0
> siblings : 4
> core id : 2
> cpu cores : 4
> fpu : yes
> fpu_exception : yes
> cpuid level : 5
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
> pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt
> pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm
> cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse
> 3dnowprefetch osvw
> bogomips : 4421.37
> TLB size : 1024 4K pages
> clflush size : 64
> cache_alignment : 64
> address sizes : 48 bits physical, 48 bits virtual
> power management: ts ttp tm stc 100mhzsteps hwpstate [8]
>
> processor : 3
> vendor_id : AuthenticAMD
> cpu family : 16
> model : 2
> model name : Quad-Core AMD Opteron(tm) Processor 2354
> stepping : 3
> cpu MHz : 2200.000
> cache size : 512 KB
> physical id : 0
> siblings : 4
> core id : 3
> cpu cores : 4
> fpu : yes
> fpu_exception : yes
> cpuid level : 5
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
> pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt
> pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm
> cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse
> 3dnowprefetch osvw
> bogomips : 4421.65
> TLB size : 1024 4K pages
> clflush size : 64
> cache_alignment : 64
> address sizes : 48 bits physical, 48 bits virtual
> power management: ts ttp tm stc 100mhzsteps hwpstate [8]
>
> processor : 4
> vendor_id : AuthenticAMD
> cpu family : 16
> model : 2
> model name : Quad-Core AMD Opteron(tm) Processor 2354
> stepping : 3
> cpu MHz : 2200.000
> cache size : 512 KB
> physical id : 1
> siblings : 4
> core id : 0
> cpu cores : 4
> fpu : yes
> fpu_exception : yes
> cpuid level : 5
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
> pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt
> pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm
> cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse
> 3dnowprefetch osvw
> bogomips : 4422.36
> TLB size : 1024 4K pages
> clflush size : 64
> cache_alignment : 64
> address sizes : 48 bits physical, 48 bits virtual
> power management: ts ttp tm stc 100mhzsteps hwpstate [8]
>
> processor : 5
> vendor_id : AuthenticAMD
> cpu family : 16
> model : 2
> model name : Quad-Core AMD Opteron(tm) Processor 2354
> stepping : 3
> cpu MHz : 2200.000
> cache size : 512 KB
> physical id : 1
> siblings : 4
> core id : 1
> cpu cores : 4
> fpu : yes
> fpu_exception : yes
> cpuid level : 5
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
> pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt
> pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm
> cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse
> 3dnowprefetch osvw
> bogomips : 4422.71
> TLB size : 1024 4K pages
> clflush size : 64
> cache_alignment : 64
> address sizes : 48 bits physical, 48 bits virtual
> power management: ts ttp tm stc 100mhzsteps hwpstate [8]
>
> processor : 6
> vendor_id : AuthenticAMD
> cpu family : 16
> model : 2
> model name : Quad-Core AMD Opteron(tm) Processor 2354
> stepping : 3
> cpu MHz : 2200.000
> cache size : 512 KB
> physical id : 1
> siblings : 4
> core id : 2
> cpu cores : 4
> fpu : yes
> fpu_exception : yes
> cpuid level : 5
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
> pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt
> pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm
> cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse
> 3dnowprefetch osvw
> bogomips : 4422.17
> TLB size : 1024 4K pages
> clflush size : 64
> cache_alignment : 64
> address sizes : 48 bits physical, 48 bits virtual
> power management: ts ttp tm stc 100mhzsteps hwpstate [8]
>
> processor : 7
> vendor_id : AuthenticAMD
> cpu family : 16
> model : 2
> model name : Quad-Core AMD Opteron(tm) Processor 2354
> stepping : 3
> cpu MHz : 2200.000
> cache size : 512 KB
> physical id : 1
> siblings : 4
> core id : 3
> cpu cores : 4
> fpu : yes
> fpu_exception : yes
> cpuid level : 5
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
> pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt
> pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm
> cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse
> 3dnowprefetch osvw
> bogomips : 4422.17
> TLB size : 1024 4K pages
> clflush size : 64
> cache_alignment : 64
> address sizes : 48 bits physical, 48 bits virtual
> power management: ts ttp tm stc 100mhzsteps hwpstate [8]
>
>
>
>
> MemTotal: 8182568 kB
> MemFree: 4535892 kB
> Buffers: 318232 kB
> Cached: 1583772 kB
> SwapCached: 0 kB
> Active: 2714400 kB
> Inactive: 730260 kB
> HighTotal: 0 kB
> HighFree: 0 kB
> LowTotal: 8182568 kB
> LowFree: 4535892 kB
> SwapTotal: 8289532 kB
> SwapFree: 8289380 kB
> Dirty: 340 kB
> Writeback: 0 kB
> AnonPages: 1542636 kB
> Mapped: 14588 kB
> Slab: 139788 kB
> PageTables: 7208 kB
> NFS_Unstable: 0 kB
> Bounce: 0 kB
> CommitLimit: 12380816 kB
> Committed_AS: 1679420 kB
> VmallocTotal: 34359738367 kB
> VmallocUsed: 4600 kB
> VmallocChunk: 34359733707 kB
> HugePages_Total: 0
> HugePages_Free: 0
> HugePages_Rsvd: 0
> Hugepagesize: 2048 kB
>
>
>
> Jack Morgenstein wrote:
>> On Tuesday 28 October 2008 21:11, Scott A. Friedman wrote:
>>> Hi
>>>
>>> This cluster has OFED 1.2.5.4 running on it. The ib_mthca kernel module
>>> reports the following on startup:
>>>
>>> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (February 28, 2008)
>>>
>>> The cards in all (22) of the nodes we have seen this error on are as
>>> follows:
>>>
>>> hca_id: mthca0
>>> fw_ver: 1.2.0
>>> vendor_id: 0x02c9
>>> vendor_part_id: 25204
>>> hw_ver: 0xA0
>>> board_id: MT_03B0140001
>>> phys_port_cnt: 1
>>>
>>> It appears that when this happens the driver restarts (loads?) itself
>>> however the job running at the time of the error is, of course, killed.
>>>
>>> Scott
>> Scott,
>> We are trying to reproduce this here. It would help if you could supply
>> the following info:
>>
>> Host model for hosts which are experiencing the failure:
>>
>> Console output from the following linux commands:
>> cat /etc/*rel*
>> cat /etc/lilo.conf , or: cat /boot/grub/menu.lst (if you are using grub)
>> uname -a
>> cat /proc/cpuinfo
>> cat /proc/meminfo
>>
>> Also, what sort of job was running when the failure occurred:
>> -- which MPI are you using?
>> -- do you have a test example which we can run here to reproduce the problem?
>> Thanks in advance for your help!
>>
>> Jack Morgenstein
>> Senior Software Development Engineer
>> Mellanox
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general