[ofa-general] Bad interaction between ofed, NFS and Gaussian

Tziporet Koren tziporet at dev.mellanox.co.il
Wed Sep 2 12:37:22 PDT 2009


BOYRIE Fabrice wrote:
>  Hello,
>
>  I hope this is the right mailing list.
>  I have a problem with OFED 1.4.2 on CentOS 5.3.
>
>
>  We have a new cluster with QDR InfiniBand.
>  I installed OFED from source using the install.pl script with the
>  default values.
>  I used the default CentOS kernel (2.6.18-128.7.1.el5).
>  When a node starts, the openibd and opensmd services are launched.
>
>
>  InfiniBand is working:
>
>  ibv_devinfo
>  hca_id: mlx4_0
>          fw_ver:                         2.6.000
>          node_guid:                      0002:c903:0004:3efc
>          sys_image_guid:                 0002:c903:0004:3eff
>          vendor_id:                      0x02c9
>          vendor_part_id:                 26428
>          hw_ver:                         0xA0
>          board_id:                       MT_0C40110009
>          phys_port_cnt:                  1
>                  port:   1
>                          state:                  PORT_ACTIVE (4)
>                          max_mtu:                2048 (4)
>                          active_mtu:             2048 (4)
>                          sm_lid:                 9
>                          port_lid:               17
>                          port_lmc:               0x00
>
>  If I launch an MPI program, e.g. VASP, it runs over the InfiniBand
>  transport and the performance is good.
>
>  So there is no problem until I launch a program that does not use
>  InfiniBand: Gaussian.
>
>  With some large calculations and with %ncpu=8, Gaussian aborts with the
>  following message:
>  ntrbks: Input/output error
>  I launched Gaussian several times and it always aborted at the same
>  point.
>
>  If I run the same Gaussian with the same input file on our old
>  cluster (same CentOS 5.3, same kernel, but without InfiniBand), it works.
>
>    Searching the source code for ntrbks shows a call to fstatfs.
>
>  So I ran strace on Gaussian on both clusters. Here is the relevant
>  part.
>
>  New cluster:
>
>  [pid  5715] execve("/opt/Gaussian/g03_e01-pgf//g03/l1002.exe",
>  ["/opt/Gaussian/g03_e01-pgf//g03/l"..., "1258291200",
>  "CpRh_H_Ph_EneTS1.chk", "1", "/tmp/CpRh_H_Ph_EneTS1/Gau-5715.i"...,"0",
>  "/tmp/CpRh_H_Ph_EneTS1/Gau-5715.r"..., "0",
>  "/tmp/CpRh_H_Ph_EneTS1/Gau-5715.d"..., "0",
>  "/tmp/CpRh_H_Ph_EneTS1/Gau-5715.s"..., "0",
>  "/tmp/CpRh_H_Ph_EneTS1/Gau-5714.i"..., "0", "junk.out", "0", ...],
>  [/* 65 vars */] PANIC: attached pid 5816 exited with 0
>  [pid  5715] open("CpRh_H_Ph_EneTS1.chk", O_RDWR) = 5
>  [pid  5715] fstatfs(5, {f_type="NFS_SUPER_MAGIC", f_bsize=32768,
>  f_blocks=13562292, f_bfree=12353558, f_bavail=12353558,
>  f_files=434124416, f_ffree=434090957, f_fsid={0, 0}, f_namelen=255,
>  f_frsize=32768}) = 0
>  [pid  5715] read(5, "\10\0\0\0\0\0\0\0", 8) = 8
>  [pid  5715] read(5,
>  "\10\0\0\0\0\0\0\0\0\320\10\0\0\0\0\0\0\240\0\0\0\0\0\0\0\0\0\0\0\0
>  \0\0"..., 320032) = 320032
>  [pid  5715] fstatfs(5, 0x7fff553be6d0)  = -1 EIO (Input/output error)
>
>
>  Old cluster:
>
>  [pid  8605] execve("/opt/Gaussian/g03_e01-pgf//g03/l1002.exe",
>  ["/opt/Gaussian/g03_e01-pgf//g03/l"..., "1258291200",
>  "CpRh_H_Ph_EneTS1.chk", "1", "/tmp/CpRh_H_Ph_EneTS1-8-pgf-2/Ga"..., "0", 
>  "/tmp/CpRh_H_Ph_EneTS1-8-pgf-2/Ga"..., "0",
>  "/tmp/CpRh_H_Ph_EneTS1-8-pgf-2/Ga"..., "0",
>  "/tmp/CpRh_H_Ph_EneTS1-8-pgf-2/Ga"..., "0",
>  "/tmp/CpRh_H_Ph_EneTS1-8-pgf-2/Ga"..., "0", "junk.out", "0", ...],
>  [/* 82 vars */]PANIC: attached pid 8701 exited with 0
>  [pid  8605] open("CpRh_H_Ph_EneTS1.chk", O_RDWR) = 5
>
>  [pid  8605] fstatfs(5, {f_type="NFS_SUPER_MAGIC", f_bsize=32768,
>  f_blocks=9150944, f_bfree=686850, f_bavail=686850, f_files=88123232,
>  f_ffree=87917195, f_fsid={0, 0}, f_namelen=255, f_frsize=32768}) = 0
>  [pid  8605] read(5, "'\0\0\0\0\0\0\0", 8) = 8
>  [pid  8605] read(5,
>  "'\0\0\0\0\0\0\0\0\360$\0\0\0\0\0\0\240\0\0\0\0\0\0\0\0\0\0\0\0\0
>  \0"..., 320032) = 320032
>  [pid  8605] fstatfs(5, {f_type="NFS_SUPER_MAGIC", f_bsize=32768,
>  f_blocks=9150944, f_bfree=683022, f_bavail=683022, f_files=87633232,
>  f_ffree=87427156, f_fsid={0, 0}, f_namelen=255, f_frsize=32768}) = 0
>  [pid  8605] fstatfs(5, {f_type="NFS_SUPER_MAGIC", f_bsize=32768,
>  f_blocks=9150944, f_bfree=683022, f_bavail=683022, f_files=87633232,
>  f_ffree=87427156, f_fsid={0, 0}, f_namelen=255, f_frsize=32768}) = 0
>  [pid  8605] write(5,
>  "'\0\0\0\0\0\0\0\0\360$\0\0\0\0\0\0\240\0\0\0\0\0\0\0\0\0\0\0\0\0
>  \0"...,
>  320032) = 320032
> [pid  8605] close(5)                    = 0
>
>
>
>
>  If I put the input file in a local directory instead of an NFS one,
>  Gaussian works.
>  There are no messages in dmesg or in the /var/log directory, either on
>  the node or on the NFS server.
>
>  On the node, /home is mounted as
>  192.168.1.100:/home on /home type nfs
>  (rw,nosuid,rsize=32768,proto=tcp,addr=192.168.1.100)
>
>  192.168.1.xxx is the Ethernet network (the NFS server has no
>  InfiniBand card).
>  On the node, running both of these commands is enough to make Gaussian
>  work on the NFS directory:
>
>    ifconfig ib0 down
>    service opensmd stop
>
>  (Either command alone is not enough.)
>
>
>  So there seems to be an interaction between NFS access and OpenFabrics.
>  But why? And how can it be solved?
>
>
>
>   
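The failing sequence in the strace output above (open the file, fstatfs, read, fstatfs again) can be checked outside Gaussian with a small standalone script. This is only a sketch under assumptions: it uses Python's os.fstatvfs, which on Linux/glibc is built on top of fstatfs, the function name probe_fstatfs is made up for illustration, and the read size is simply copied from the trace. It may or may not trigger the same EIO that Gaussian sees.

```python
import os
import sys


def probe_fstatfs(path, read_size=320032):
    """Open path, query filesystem stats, read some data, query again.

    Returns a list of (step, outcome) tuples, where outcome is "ok" or the
    OS error text, so the failing step (if any) is visible.
    """
    results = []
    # O_RDWR mirrors the open() call in the trace; nothing is written.
    fd = os.open(path, os.O_RDWR)
    try:
        for step in ("fstatvfs before read", "read", "fstatvfs after read"):
            try:
                if step == "read":
                    os.read(fd, read_size)
                else:
                    os.fstatvfs(fd)
                results.append((step, "ok"))
            except OSError as exc:
                results.append((step, f"{exc.strerror} (errno {exc.errno})"))
    finally:
        os.close(fd)
    return results


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass a file on the NFS mount, for example a copy of the .chk file.
    for step, outcome in probe_fstatfs(sys.argv[1]):
        print(f"{step}: {outcome}")
```

Running it once against a file under /home (NFS) and once against a file under /tmp (local) on the same node would show whether the second query fails only on the NFS mount.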
This looks like an issue with the NFS/RDMA backports.
Can you install OFED without NFS/RDMA?
You can change the conf file to do this.
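
Before reinstalling, it may help to see whether any NFS/RDMA modules are actually loaded on the node. A minimal sketch, assuming the modules carry the mainline kernel names xprtrdma and svcrdma (the OFED backports may name them differently); the helper names are made up for illustration:

```python
# Assumed module names (mainline kernel naming); OFED backports may differ.
NFS_RDMA_MODULES = ("xprtrdma", "svcrdma")


def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def nfs_rdma_loaded(proc_modules_text):
    """Return the assumed NFS/RDMA module names that appear as loaded."""
    names = loaded_modules(proc_modules_text)
    return [m for m in NFS_RDMA_MODULES if m in names]


if __name__ == "__main__":
    try:
        with open("/proc/modules") as fh:
            found = nfs_rdma_loaded(fh.read())
    except OSError:
        found = None  # /proc/modules not readable on this system

    if found is None:
        print("could not read /proc/modules")
    elif found:
        print("NFS/RDMA modules loaded:", ", ".join(found))
    else:
        print("no NFS/RDMA modules loaded (under the assumed names)")
```

An empty result does not rule out the backports, since they may also replace the sunrpc module itself.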

Jon/Steve/Jeff - are you familiar with this issue?

Tziporet


