[ofa-general] ib_mthca catastrophic error detected

Jack Morgenstein jackm at dev.mellanox.co.il
Thu Nov 6 01:54:01 PST 2008


On Tuesday 28 October 2008 21:11, Scott A. Friedman wrote:
> Hi
> 
> This cluster has OFED 1.2.5.4 running on it. The ib_mthca kernel module 
> reports the following on startup:
> 
> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (February 28, 2008)
> 
> The cards in all (22) of the nodes we have seen this error on are as 
> follows:
> 
> hca_id: mthca0
>          fw_ver:                         1.2.0
>          vendor_id:                      0x02c9
>          vendor_part_id:                 25204
>          hw_ver:                         0xA0
>          board_id:                       MT_03B0140001
>          phys_port_cnt:                  1
> 
> It appears that when this happens the driver restarts (loads?) itself; 
> however, the job running at the time of the error is, of course, killed.
> 
> Scott

Scott,
We are trying to reproduce this here.  It would help if you could supply
the following info:

Host model for hosts which are experiencing the failure:
 
Console output from the following Linux commands:
  cat /etc/*rel*
  cat /etc/lilo.conf, or: cat /boot/grub/menu.lst (if you are using GRUB)
  uname -a
  cat /proc/cpuinfo
  cat /proc/meminfo
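
One possible way to gather the output of the commands above into a single
file per node is a small shell function along the following lines. This is
only a sketch: the function name collect_hca_report and the default output
path are made up for illustration, and a file that does not exist on a given
host (for example lilo.conf on a GRUB system) is simply noted in the report
rather than treated as an error.

```shell
#!/bin/sh
# Hypothetical helper: the function name and the default output path are
# illustrative assumptions, not part of OFED or any Mellanox tool.
collect_hca_report() {
    # Write to the file given as $1, or to a per-host file under /tmp.
    out="${1:-/tmp/mthca-report-$(uname -n).txt}"
    : > "$out" || return 1

    # Run each requested command and label its output. A command that
    # fails (e.g. a missing boot loader config) is noted, not fatal.
    for cmd in \
        'cat /etc/*rel*' \
        'cat /etc/lilo.conf' \
        'cat /boot/grub/menu.lst' \
        'uname -a' \
        'cat /proc/cpuinfo' \
        'cat /proc/meminfo'
    do
        printf '===== %s =====\n' "$cmd" >> "$out"
        sh -c "$cmd" >> "$out" 2>&1 \
            || echo "(command failed or file not present)" >> "$out"
    done

    # Print the report path so the caller knows what to attach.
    echo "$out"
}
```

After sourcing the file, running collect_hca_report with no argument on an
affected node writes the report under /tmp and prints its path; passing a
file name as the first argument writes the report there instead.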

Also, what sort of job was running when the failure occurred?
-- Which MPI are you using?
-- Do you have a test example that we can run here to reproduce the problem?

Thanks in advance for your help!

Jack Morgenstein
Senior Software Development Engineer
Mellanox


