[ofa-general] Problems using OFED 1.4 on largesmp nodes
Liang Zhen
Zhen.Liang at Sun.COM
Mon May 18 21:37:32 PDT 2009
Hi Ole,
Have you found a solution for this? I think we have exactly the same
problem on a 4600 with ofed-1.4.1-rc4:
lspci output:
03:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe 2.0 2.5GT/s] (rev a0)
and error messages from dmesg:
mlx4_core: Mellanox ConnectX core driver v1.0 (April 4, 2008)
mlx4_core: Initializing 0000:03:00.0
mlx4_core 0000:03:00.0: Requested number of MACs is too much for port 1, reducing to 1.
mlx4_core 0000:03:00.0: command 0x13 failed: fw status = 0x1
mlx4_core 0000:03:00.0: SW2HW_EQ failed (-5)
mlx4_core 0000:03:00.0: Failed to initialize event queue table, aborting.
mlx4_core: probe of 0000:03:00.0 failed with error -5
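
To check whether other nodes show the same failure signature, a rough
sketch like the one below could scan saved dmesg output. It only matches
the two message formats that appear in the log above; the function name
find_mlx4_eq_failures is made up for illustration and is not part of OFED
or the mlx4 driver.

```python
import re

# Patterns taken from the dmesg lines quoted in this thread.
# Assumption: other driver versions may word these messages differently.
EQ_FAIL = re.compile(r"mlx4_core (\S+): SW2HW_EQ failed \((-?\d+)\)")
PROBE_FAIL = re.compile(r"mlx4_core: probe of (\S+) failed with error (-?\d+)")


def find_mlx4_eq_failures(dmesg_text):
    """Return {pci_address: error_code} for devices whose probe failed
    after an SW2HW_EQ failure was logged for the same PCI address.

    Hypothetical helper, written only to illustrate the log pattern.
    """
    eq_failed = set()   # PCI addresses that logged "SW2HW_EQ failed"
    failures = {}
    for line in dmesg_text.splitlines():
        match = EQ_FAIL.search(line)
        if match:
            eq_failed.add(match.group(1))
            continue
        match = PROBE_FAIL.search(line)
        if match and match.group(1) in eq_failed:
            failures[match.group(1)] = int(match.group(2))
    return failures
```

Feeding it the text of a node's dmesg output (for example, saved to a
file and read back in) gives one entry per affected device; an empty
result means this particular signature was not found, not that the
driver loaded correctly.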
Thanks
Liang
Ole Widar Saastad wrote:
> I have problems using the OFED 1.4 software on the Sun x4600 nodes.
> Need help to get this to work. We plan to run GPFS over IB on these
> nodes in addition to MPI.
>
> Sun 4600 nodes with 8 quad core cpus,
> Quad-Core AMD Opteron(tm) Processor 8380
>
> OS is Rocks release 4.
> centos-release-4-4.2/x86_64/
>
> Linux compute-0-0.local 2.6.9-67.0.15.ELlargesmp #1 SMP Thu May 8
> 11:03:57 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
>
>
> Needless to say, our 300+ nodes (Sun x2200 with quad core) run fine with
> OFED 1.4 (and 1.3); they have almost the same kernel:
> Linux compute-4-0.local 2.6.9-67.0.15.ELsmp #1 SMP Thu May 8 10:50:20
> EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
> Same except ELsmp and not ELlargesmp.
>
> More information:
>
> dmesg prints out the following error message :
>
> Losing some ticks... checking if CPU frequency changed.
> modulecmd[17499]: segfault at 0000007fc0b01688 rip 000000000060aa38 rsp 0000007fbfffcfd8 error 6
> mlx4_core: Mellanox ConnectX core driver v1.0 (April 4, 2008)
> mlx4_core: Initializing 0000:02:00.0
> ACPI: PCI Interrupt 0000:02:00.0[A] -> GSI 19 (level, low) -> IRQ 193
> PCI: Setting latency timer of device 0000:02:00.0 to 64
> mlx4_core 0000:02:00.0: Requested number of MACs is too much for port 1, reducing to 1.
> MSI INIT SUCCESS
> mlx4_core 0000:02:00.0: command 0x13 failed: fw status = 0x1
> mlx4_core 0000:02:00.0: SW2HW_EQ failed (-5)
> mlx4_core 0000:02:00.0: Failed to initialize event queue table, aborting.
> mlx4_core: probe of 0000:02:00.0 failed with error -5
>
> The following software is installed:
>
> Select Option [1-5]:3
> kernel-ib
> libibverbs
> libibverbs-devel
> libibverbs-utils
> libmthca
> libmlx4
> libcxgb3
> libnes
> libipathverbs
> libibcommon
> libibcommon-devel
> libibumad
> libibumad-devel
> ofed-docs
> ofed-scripts
> ibvexdmtools
> qlgc_vnic_daemon
>
>
> Just to be sure the card is present :
> lspci returns :
> 02:00.0 InfiniBand: Mellanox Technologies: Unknown device 634a (rev a0)
>
>
>