[ofa-general] Bad interaction between ofed, NFS and Gaussian

BOYRIE Fabrice Fabrice.Boyrie at univ-montp2.fr
Wed Sep 2 09:39:14 PDT 2009


 Hello,

 I hope this is the right mailing list.
 I have a problem with OFED 1.4.2 on CentOS 5.3.


 We have a new cluster with QDR InfiniBand.
 I installed OFED from source using the install.pl script with the
 default values.
 I am using the default CentOS kernel (2.6.18-128.7.1.el5).
 When a node starts, the openibd and opensmd services are launched.


 InfiniBand is working:

 ibv_devinfo
 hca_id: mlx4_0
         fw_ver:                         2.6.000
         node_guid:                      0002:c903:0004:3efc
         sys_image_guid:                 0002:c903:0004:3eff
         vendor_id:                      0x02c9
         vendor_part_id:                 26428
         hw_ver:                         0xA0
         board_id:                       MT_0C40110009
         phys_port_cnt:                  1
                 port:   1
                         state:                  PORT_ACTIVE (4)
                         max_mtu:                2048 (4)
                         active_mtu:             2048 (4)
                         sm_lid:                 9
                         port_lid:               17
                         port_lmc:               0x00

 If I launch an MPI program, e.g. VASP, it works using the InfiniBand
 transport and the performance is good.

   So there is no problem until I launch a program that does not use
 InfiniBand: Gaussian.

 With some large calculations and with %ncpu=8, Gaussian aborts with the
 following message:
 ntrbks: Input/output error
   I launched Gaussian several times and it always aborted at the same
 point.

   If I launch the same Gaussian with the same input file on our old
 cluster (same CentOS 5.3, same kernel, but without InfiniBand), it works.

   Searching the source code for ntrbks shows a call to fstatfs.

 So I ran strace on Gaussian on both clusters. Here is the relevant
 part.

 New cluster:

 [pid  5715] execve("/opt/Gaussian/g03_e01-pgf//g03/l1002.exe",
 ["/opt/Gaussian/g03_e01-pgf//g03/l"..., "1258291200",
 "CpRh_H_Ph_EneTS1.chk", "1", "/tmp/CpRh_H_Ph_EneTS1/Gau-5715.i"...,"0",
 "/tmp/CpRh_H_Ph_EneTS1/Gau-5715.r"..., "0",
 "/tmp/CpRh_H_Ph_EneTS1/Gau-5715.d"..., "0",
 "/tmp/CpRh_H_Ph_EneTS1/Gau-5715.s"..., "0",
 "/tmp/CpRh_H_Ph_EneTS1/Gau-5714.i"..., "0", "junk.out", "0", ...],
 [/* 65 vars */] PANIC: attached pid 5816 exited with 0
 [pid  5715] open("CpRh_H_Ph_EneTS1.chk", O_RDWR) = 5
 [pid  5715] fstatfs(5, {f_type="NFS_SUPER_MAGIC", f_bsize=32768,
 f_blocks=13562292, f_bfree=12353558, f_bavail=12353558,
 f_files=434124416, f_ffree=434090957, f_fsid={0, 0}, f_namelen=255,
 f_frsize =32768}) = 0
 [pid  5715] read(5, "\10\0\0\0\0\0\0\0", 8) = 8
 [pid  5715] read(5,
 "\10\0\0\0\0\0\0\0\0\320\10\0\0\0\0\0\0\240\0\0\0\0\0\0\0\0\0\0\0\0
 \0\0"..., 320032) = 320032
 [pid  5715] fstatfs(5, 0x7fff553be6d0)  = -1 EIO (Input/output error)


 Old cluster:

 [pid  8605] execve("/opt/Gaussian/g03_e01-pgf//g03/l1002.exe",
 ["/opt/Gaussian/g03_e01-pgf//g03/l"..., "1258291200",
 "CpRh_H_Ph_EneTS1.chk", "1", "/tmp/CpRh_H_Ph_EneTS1-8-pgf-2/Ga"..., "0", 
 "/tmp/CpRh_H_Ph_EneTS1-8-pgf-2/Ga"..., "0",
 "/tmp/CpRh_H_Ph_EneTS1-8-pgf-2/Ga"..., "0",
 "/tmp/CpRh_H_Ph_EneTS1-8-pgf-2/Ga"..., "0",
 "/tmp/CpRh_H_Ph_EneTS1-8-pgf-2/Ga"..., "0", "junk.out", "0", ...],
 [/* 82 vars */] PANIC: attached pid 8701 exited with 0
 [pid  8605] open("CpRh_H_Ph_EneTS1.chk", O_RDWR) = 5

 [pid  8605] fstatfs(5, {f_type="NFS_SUPER_MAGIC", f_bsize=32768,
 f_blocks=9150944, f_bfree=686850, f_bavail=686850, f_files=88123232,
 f_ffree=87917195, f_fsid={0, 0}, f_namelen=255, f_frsize=32768}) = 0
 [pid  8605] read(5, "'\0\0\0\0\0\0\0", 8) = 8
 [pid  8605] read(5,
 "'\0\0\0\0\0\0\0\0\360$\0\0\0\0\0\0\240\0\0\0\0\0\0\0\0\0\0\0\0\0
 \0"..., 320032) = 320032
 [pid  8605] fstatfs(5, {f_type="NFS_SUPER_MAGIC", f_bsize=32768,
 f_blocks=9150944, f_bfree=683022, f_bavail=683022, f_files=87633232,
 f_ffree=87427156, f_fsid={0, 0}, f_namelen=255, f_frsize=32768}) = 0
 [pid  8605] fstatfs(5, {f_type="NFS_SUPER_MAGIC", f_bsize=32768,
 f_blocks=9150944, f_bfree=683022, f_bavail=683022, f_files=87633232,
 f_ffree=87427156, f_fsid={0, 0}, f_namelen=255, f_frsize=32768}) = 0
 [pid  8605] write(5,
 "'\0\0\0\0\0\0\0\0\360$\0\0\0\0\0\0\240\0\0\0\0\0\0\0\0\0\0\0\0\0
 \0"...,
 320032) = 320032
[pid  8605] close(5)                    = 0
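
 The syscall sequence in the traces above (open, fstatfs, an 8-byte read,
 a 320032-byte read, then fstatfs again) can be replayed outside Gaussian.
 The script below is only a minimal sketch of such a reproducer. The
 function name probe_fstatfs is hypothetical, and it uses Python's
 os.fstatvfs as a stand-in for fstatfs; on Linux with glibc, fstatvfs is
 normally built on top of fstatfs, but this has not been checked against
 what Gaussian itself does, so a clean run here does not rule the problem out.

```python
import errno
import os
import sys


def probe_fstatfs(path, read_size=320032):
    """Replay the traced pattern on `path` and report the second statfs result.

    Returns "ok" if the second os.fstatvfs call succeeds, otherwise the
    symbolic errno name (for example "EIO").
    """
    fd = os.open(path, os.O_RDWR)
    try:
        os.fstatvfs(fd)          # first statfs, succeeded in both traces
        os.read(fd, 8)           # short header read
        os.read(fd, read_size)   # large read, same size as in the traces
        try:
            os.fstatvfs(fd)      # second statfs, the call that returned EIO
        except OSError as exc:
            return errno.errorcode.get(exc.errno, str(exc.errno))
        return "ok"
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Usage: python probe.py FILE [FILE ...]
    for target in sys.argv[1:]:
        print(target, probe_fstatfs(target))
```

 Running it once on a file under the NFS-mounted /home and once on a file
 under /tmp would show whether the error depends on Gaussian at all.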




 If I put the input file in a local directory instead of an NFS one,
 Gaussian works.
 There are no messages in dmesg or in the /var/log directory, either on
 the node or on the NFS server.

 On the node, /home is mounted as
 192.168.1.100:/home on /home type nfs
 (rw,nosuid,rsize=32768,proto=tcp,addr=192.168.1.100)

 192.168.1.xxx is the Ethernet network (the NFS server has no
 InfiniBand card).
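
 To confirm on every node that the NFS server address really is on the
 Ethernet network, the mount line can be parsed mechanically. This is a
 rough sketch: parse_mount_line and server_on_network are hypothetical
 helper names, and the /24 prefix for 192.168.1.xxx is an assumption, since
 the netmask is not given above.

```python
import ipaddress
import re

# Matches the classic `mount` output format:
#   SOURCE on TARGET type FSTYPE (opt1,opt2=value,...)
MOUNT_RE = re.compile(
    r"^(?P<source>\S+) on (?P<target>\S+) type (?P<fstype>\S+) "
    r"\((?P<opts>[^)]*)\)\s*$"
)


def parse_mount_line(line):
    """Split one line of `mount` output into source, target, type and options."""
    match = MOUNT_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognised mount line: %r" % line)
    options = {}
    for item in match.group("opts").split(","):
        if not item:
            continue
        key, sep, value = item.partition("=")
        options[key] = value if sep else True  # bare flags such as "rw"
    return {
        "source": match.group("source"),
        "target": match.group("target"),
        "fstype": match.group("fstype"),
        "options": options,
    }


def server_on_network(mount, network):
    """True if the mount's addr= option lies inside `network` (CIDR string)."""
    addr = mount["options"].get("addr")
    if not isinstance(addr, str):
        return False
    return ipaddress.ip_address(addr) in ipaddress.ip_network(network)


if __name__ == "__main__":
    line = ("192.168.1.100:/home on /home type nfs "
            "(rw,nosuid,rsize=32768,proto=tcp,addr=192.168.1.100)")
    home = parse_mount_line(line)
    # 192.168.1.0/24 is an assumed netmask for the Ethernet network.
    print(home["fstype"], server_on_network(home, "192.168.1.0/24"))
```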
 On the node, it is enough to run
 «ifconfig ib0 down
  service opensmd stop
 » to get Gaussian working on the NFS directory.

 (Either «ifconfig ib0 down» or «service opensmd stop» alone is not enough.)


 So there seems to be an interaction between NFS access and OpenFabrics.
 But why? And how can I solve it?

Thanks in advance

Fabrice BOYRIE


