[ofa-general] Bad interaction between ofed, NFS and Gaussian
Tziporet Koren
tziporet at dev.mellanox.co.il
Wed Sep 2 12:37:22 PDT 2009
BOYRIE Fabrice wrote:
> Hello
>
> I hope this is the right mailing list.
> I have a problem with OFED 1.4.2 on CentOS 5.3.
>
>
> We have a new cluster with QDR infiniband.
> I've installed ofed from source using the install.pl script with the
> default values.
> I've used default kernel from Centos (2.6.18-128.7.1.el5)
> When a node starts, openibd and opensmd services are launched.
>
>
> Infiniband is working
>
> ibv_devinfo
> hca_id: mlx4_0
> fw_ver: 2.6.000
> node_guid: 0002:c903:0004:3efc
> sys_image_guid: 0002:c903:0004:3eff
> vendor_id: 0x02c9
> vendor_part_id: 26428
> hw_ver: 0xA0
> board_id: MT_0C40110009
> phys_port_cnt: 1
> port: 1
> state: PORT_ACTIVE (4)
> max_mtu: 2048 (4)
> active_mtu: 2048 (4)
> sm_lid: 9
> port_lid: 17
> port_lmc: 0x00
>
> If I launch an MPI program, e.g. VASP, it uses the InfiniBand transport
> and the performance is good.
>
> So no problem until I want to launch a program not using infiniband:
> Gaussian.
>
> With some large calculations and with %ncpu=8, Gaussian aborts with the
> following message:
> ntrbks: Input/output error
> I launched Gaussian several times and it always aborted at the same
> point.
>
> If I run the same Gaussian with the same input file on our old
> cluster (same CentOS 5.3, same kernel, but without InfiniBand), it works.
>
> Searching the source code for ntrbks shows a call to fstatfs.
>
> So I've straced Gaussian on the two clusters. Here is the relevant
> part.
>
> New cluster:
>
> [pid 5715] execve("/opt/Gaussian/g03_e01-pgf//g03/l1002.exe",
> ["/opt/Gaussian/g03_e01-pgf//g03/l"..., "1258291200",
> "CpRh_H_Ph_EneTS1.chk", "1", "/tmp/CpRh_H_Ph_EneTS1/Gau-5715.i"...,"0",
> "/tmp/CpRh_H_Ph_EneTS1/Gau-5715.r"..., "0",
> "/tmp/CpRh_H_Ph_EneTS1/Gau-5715.d"..., "0",
> "/tmp/CpRh_H_Ph_EneTS1/Gau-5715.s"..., "0",
> "/tmp/CpRh_H_Ph_EneTS1/Gau-5714.i"..., "0", "junk.out", "0", ...],
> [/* 65 vars */] PANIC: attached pid 5816 exited with 0
> [pid 5715] open("CpRh_H_Ph_EneTS1.chk", O_RDWR) = 5
> [pid 5715] fstatfs(5, {f_type="NFS_SUPER_MAGIC", f_bsize=32768,
> f_blocks=13562292, f_bfree=12353558, f_bavail=12353558,
> f_files=434124416, f_ffree=434090957, f_fsid={0, 0}, f_namelen=255,
> f_frsize=32768}) = 0
> [pid 5715] read(5, "\10\0\0\0\0\0\0\0", 8) = 8
> [pid 5715] read(5,
> "\10\0\0\0\0\0\0\0\0\320\10\0\0\0\0\0\0\240\0\0\0\0\0\0\0\0\0\0\0\0
> \0\0"..., 320032) = 320032
> [pid 5715] fstatfs(5, 0x7fff553be6d0) = -1 EIO (Input/output error)
>
>
> Old cluster:
>
> [pid 8605] execve("/opt/Gaussian/g03_e01-pgf//g03/l1002.exe",
> ["/opt/Gaussian/g03_e01-pgf//g03/l"..., "1258291200",
> "CpRh_H_Ph_EneTS1.chk", "1", "/tmp/CpRh_H_Ph_EneTS1-8-pgf-2/Ga"..., "0",
> "/tmp/CpRh_H_Ph_EneTS1-8-pgf-2/Ga"..., "0",
> "/tmp/CpRh_H_Ph_EneTS1-8-pgf-2/Ga"..., "0",
> "/tmp/CpRh_H_Ph_EneTS1-8-pgf-2/Ga"..., "0",
> "/tmp/CpRh_H_Ph_EneTS1-8-pgf-2/Ga"..., "0", "junk.out", "0", ...],
> [/* 82 vars */]PANIC: attached pid 8701 exited with 0
> [pid 8605] open("CpRh_H_Ph_EneTS1.chk", O_RDWR) = 5
>
> [pid 8605] fstatfs(5, {f_type="NFS_SUPER_MAGIC", f_bsize=32768,
> f_blocks=9150944, f_bfree=686850, f_bavail=686850, f_files=88123232,
> f_ffree=87917195, f_fsid={0, 0}, f_namelen=255, f_frsize=32768}) = 0
> [pid 8605] read(5, "'\0\0\0\0\0\0\0", 8) = 8
> [pid 8605] read(5,
> "'\0\0\0\0\0\0\0\0\360$\0\0\0\0\0\0\240\0\0\0\0\0\0\0\0\0\0\0\0\0
> \0"..., 320032) = 320032
> [pid 8605] fstatfs(5, {f_type="NFS_SUPER_MAGIC", f_bsize=32768,
> f_blocks=9150944, f_bfree=683022, f_bavail=683022, f_files=87633232,
> f_ffree=87427156, f_fsid={0, 0}, f_namelen=255, f_frsize=32768}) = 0
> [pid 8605] fstatfs(5, {f_type="NFS_SUPER_MAGIC", f_bsize=32768,
> f_blocks=9150944, f_bfree=683022, f_bavail=683022, f_files=87633232,
> f_ffree=87427156, f_fsid={0, 0}, f_namelen=255, f_frsize=32768}) = 0
> [pid 8605] write(5,
> "'\0\0\0\0\0\0\0\0\360$\0\0\0\0\0\0\240\0\0\0\0\0\0\0\0\0\0\0\0\0
> \0"...,
> 320032) = 320032
> [pid 8605] close(5) = 0
>
>
>
>
> If I put the input file in a local directory instead of an NFS one,
> Gaussian works.
> There are no messages in dmesg or under /var/log, either on the node or
> on the NFS server.
>
> On the node, /home is mounted as
> 192.168.1.100:/home on /home type nfs
> (rw,nosuid,rsize=32768,proto=tcp,addr=192.168.1.100)
>
> 192.168.1.xxx is the Ethernet network (the NFS server has no
> InfiniBand card).
> On the node, it is enough to do
> «ifconfig ib0 down
> service opensmd stop
> » to have Gaussian working on the nfs directory.
>
> («ifconfig ib0 down» or «service opensmd stop» alone is not enough)
>
>
> So it seems there is an interaction between NFS access and OpenFabrics.
> But why? And how can it be solved?
>
>
>
>
This looks like an issue with the NFS/RDMA backports.
Can you install OFED without NFS/RDMA?
You can change the conf file to do this.
Jon/Steve/Jeff - are you familiar with this issue?
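
To check whether the failure can be reproduced outside Gaussian, the
syscall sequence from the strace (open, fstatfs, read, fstatfs) can be
replayed by hand. The following is only a minimal sketch in Python, not
something taken from Gaussian: probe() is a hypothetical helper name, the
read sizes are copied from the trace, and it assumes that os.fstatvfs ends
up issuing the same fstatfs call on Linux, which as far as I know is how
glibc implements it.

```python
import errno
import os
import sys


def probe(path, nbytes=320032):
    """Replay open / fstatfs / read / read / fstatfs on path.

    Returns ("ok", f_bsize) if the second stat call succeeds, or
    ("error", errno_name) if it fails, e.g. ("error", "EIO").
    """
    fd = os.open(path, os.O_RDWR)
    try:
        os.fstatvfs(fd)      # first stat call, succeeded on both clusters
        os.read(fd, 8)       # 8-byte header read seen in the trace
        os.read(fd, nbytes)  # large read seen in the trace
        try:
            second = os.fstatvfs(fd)  # the call that returned EIO
        except OSError as exc:
            name = errno.errorcode.get(exc.errno, str(exc.errno))
            return ("error", name)
        return ("ok", second.f_bsize)
    finally:
        os.close(fd)


if __name__ == "__main__" and len(sys.argv) > 1:
    print(probe(sys.argv[1]))
```

Running it against the .chk file on the NFS mount, once with ib0 up and
once after «ifconfig ib0 down» and «service opensmd stop», should show
whether the second call fails independently of Gaussian.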
Tziporet