[ofa-general] Bad interaction between ofed, NFS and Gaussian
Jeff Becker
Jeffrey.C.Becker at nasa.gov
Wed Sep 2 14:15:49 PDT 2009
Tziporet Koren wrote:
> BOYRIE Fabrice wrote:
>
>> Hello
>>
>> I hope this is the right mailing list.
>> I have a problem with OFED 1.4.2 on CentOS 5.3.
>>
Hi Fabrice!
Does it also happen with OFED 1.5 alpha? Thanks.
-jeff
>>
>> We have a new cluster with QDR InfiniBand.
>> I installed OFED from source using the install.pl script with the
>> default values.
>> I used the default CentOS kernel (2.6.18-128.7.1.el5).
>> When a node starts, the openibd and opensmd services are launched.
>>
>>
>> InfiniBand is working:
>>
>> ibv_devinfo
>> hca_id: mlx4_0
>> fw_ver: 2.6.000
>> node_guid: 0002:c903:0004:3efc
>> sys_image_guid: 0002:c903:0004:3eff
>> vendor_id: 0x02c9
>> vendor_part_id: 26428
>> hw_ver: 0xA0
>> board_id: MT_0C40110009
>> phys_port_cnt: 1
>> port: 1
>> state: PORT_ACTIVE (4)
>> max_mtu: 2048 (4)
>> active_mtu: 2048 (4)
>> sm_lid: 9
>> port_lid: 17
>> port_lmc: 0x00
>>
>> If I launch an MPI program, e.g. VASP, it runs over the InfiniBand
>> transport and the performance is good.
>>
>> So there is no problem until I launch a program that does not use
>> InfiniBand: Gaussian.
>>
>> With some large calculations and with %ncpu=8, Gaussian aborts with the
>> following message:
>> ntrbks: Input/output error
>> I ran Gaussian several times and it always aborted at the same
>> point.
>>
>> If I run the same Gaussian on the same input file on our old
>> cluster (same CentOS 5.3, same kernel, but without InfiniBand), it works.
>>
>> Searching the source code for ntrbks shows a call to fstatfs.
>>
>> So I ran strace on Gaussian on both clusters. Here is the relevant
>> part.
>>
>> New cluster:
>>
>> [pid 5715] execve("/opt/Gaussian/g03_e01-pgf//g03/l1002.exe",
>> ["/opt/Gaussian/g03_e01-pgf//g03/l"..., "1258291200",
>> "CpRh_H_Ph_EneTS1.chk", "1", "/tmp/CpRh_H_Ph_EneTS1/Gau-5715.i"...,"0",
>> "/tmp/CpRh_H_Ph_EneTS1/Gau-5715.r"..., "0",
>> "/tmp/CpRh_H_Ph_EneTS1/Gau-5715.d"..., "0",
>> "/tmp/CpRh_H_Ph_EneTS1/Gau-5715.s"..., "0",
>> "/tmp/CpRh_H_Ph_EneTS1/Gau-5714.i"..., "0", "junk.out", "0", ...],
>> [/* 65 vars */] PANIC: attached pid 5816 exited with 0
>> [pid 5715] open("CpRh_H_Ph_EneTS1.chk", O_RDWR) = 5
>> [pid 5715] fstatfs(5, {f_type="NFS_SUPER_MAGIC", f_bsize=32768,
>> f_blocks=13562292, f_bfree=12353558, f_bavail=12353558,
>> f_files=434124416, f_ffree=434090957, f_fsid={0, 0}, f_namelen=255,
>> f_frsize=32768}) = 0
>> [pid 5715] read(5, "\10\0\0\0\0\0\0\0", 8) = 8
>> [pid 5715] read(5,
>> "\10\0\0\0\0\0\0\0\0\320\10\0\0\0\0\0\0\240\0\0\0\0\0\0\0\0\0\0\0\0
>> \0\0"..., 320032) = 320032
>> [pid 5715] fstatfs(5, 0x7fff553be6d0) = -1 EIO (Input/output error)
>>
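For anyone who wants to check this outside Gaussian, here is a minimal sketch in Python of roughly the same call sequence as in the trace above (open, fstatfs, two reads, fstatfs again). It is not taken from Gaussian: probe_fstatfs and its default read sizes are made up for illustration, and it assumes that os.fstatvfs ends up in the same kernel statfs path as the fstatfs call in the trace.

```python
import os


def probe_fstatfs(path, first_read=8, second_read=320032):
    """Open path and interleave statvfs calls with two reads.

    Hypothetical helper: mirrors the order of calls seen in the strace
    excerpt, not what Gaussian does internally. Returns None when every
    call succeeds, or the OSError raised by the first failing call.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        os.fstatvfs(fd)           # first statfs call (succeeded in both traces)
        os.read(fd, first_read)   # short read, 8 bytes in the trace
        os.read(fd, second_read)  # large read, 320032 bytes in the trace
        os.fstatvfs(fd)           # second statfs call (EIO on the new cluster)
    except OSError as exc:
        return exc
    finally:
        os.close(fd)
    return None
```

Calling probe_fstatfs() on a copy of the .chk file in the NFS-mounted directory, once with ib0 up and once with it down, returns None if all calls succeed and the OSError (EIO, for instance) otherwise. It may well not reproduce the failure, since Gaussian does more between these calls than the excerpt shows.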
>>
>> Old cluster:
>>
>> [pid 8605] execve("/opt/Gaussian/g03_e01-pgf//g03/l1002.exe",
>> ["/opt/Gaussian/g03_e01-pgf//g03/l"..., "1258291200",
>> "CpRh_H_Ph_EneTS1.chk", "1", "/tmp/CpRh_H_Ph_EneTS1-8-pgf-2/Ga"..., "0",
>> "/tmp/CpRh_H_Ph_EneTS1-8-pgf-2/Ga"..., "0",
>> "/tmp/CpRh_H_Ph_EneTS1-8-pgf-2/Ga"..., "0",
>> "/tmp/CpRh_H_Ph_EneTS1-8-pgf-2/Ga"..., "0",
>> "/tmp/CpRh_H_Ph_EneTS1-8-pgf-2/Ga"..., "0", "junk.out", "0", ...],
>> [/* 82 vars */]PANIC: attached pid 8701 exited with 0
>> [pid 8605] open("CpRh_H_Ph_EneTS1.chk", O_RDWR) = 5
>>
>> [pid 8605] fstatfs(5, {f_type="NFS_SUPER_MAGIC", f_bsize=32768,
>> f_blocks=9150944, f_bfree=686850, f_bavail=686850, f_files=88123232,
>> f_ffree=87917195, f_fsid={0, 0}, f_namelen=255, f_frsize=32768}) = 0
>> [pid 8605] read(5, "'\0\0\0\0\0\0\0", 8) = 8
>> [pid 8605] read(5,
>> "'\0\0\0\0\0\0\0\0\360$\0\0\0\0\0\0\240\0\0\0\0\0\0\0\0\0\0\0\0\0
>> \0"..., 320032) = 320032
>> [pid 8605] fstatfs(5, {f_type="NFS_SUPER_MAGIC", f_bsize=32768,
>> f_blocks=9150944, f_bfree=683022, f_bavail=683022, f_files=87633232,
>> f_ffree=87427156, f_fsid={0, 0}, f_namelen=255, f_frsize=32768}) = 0
>> [pid 8605] fstatfs(5, {f_type="NFS_SUPER_MAGIC", f_bsize=32768,
>> f_blocks=9150944, f_bfree=683022, f_bavail=683022, f_files=87633232,
>> f_ffree=87427156, f_fsid={0, 0}, f_namelen=255, f_frsize=32768}) = 0
>> [pid 8605] write(5,
>> "'\0\0\0\0\0\0\0\0\360$\0\0\0\0\0\0\240\0\0\0\0\0\0\0\0\0\0\0\0\0
>> \0"...,
>> 320032) = 320032
>> [pid 8605] close(5) = 0
>>
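To compare two long traces, it can help to list only the calls that failed. A rough sketch, assuming the raw output of strace -f with one call per line (not the mail-wrapped form quoted here); failed_syscalls and the regular expression are illustrative names, not part of strace:

```python
import re

# Matches lines such as:
#   [pid 5715] fstatfs(5, 0x7fff553be6d0) = -1 EIO (Input/output error)
# The "[pid N]" prefix is optional so single-process traces also work.
FAILED_CALL = re.compile(
    r"^(?:\[pid\s+(?P<pid>\d+)\]\s+)?"
    r"(?P<syscall>\w+)\(.*\)\s+=\s+-1\s+"
    r"(?P<errno>E[A-Z0-9]+)\s+\((?P<message>[^)]*)\)\s*$"
)


def failed_syscalls(lines):
    """Yield (pid, syscall, errno_name, message) for each failed call."""
    for line in lines:
        match = FAILED_CALL.match(line.strip())
        if match:
            pid = match.group("pid")
            yield (
                int(pid) if pid else None,
                match.group("syscall"),
                match.group("errno"),
                match.group("message"),
            )
```

Run over the trace from each cluster, the two lists can then be compared side by side; for the excerpts quoted here, only the new cluster would show an fstatfs entry.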
>>
>>
>>
>> If I put the input file in a local directory instead of an NFS one,
>> Gaussian works.
>> There are no messages in dmesg or under /var/log, on the node or on
>> the NFS server.
>>
>> On the node, /home is mounted as
>> 192.168.1.100:/home on /home type nfs
>> (rw,nosuid,rsize=32768,proto=tcp,addr=192.168.1.100)
>>
>> 192.168.1.xxx is the Ethernet network (the NFS server has no
>> InfiniBand card).
>> On the node, it is enough to run
>> «ifconfig ib0 down
>> service opensmd stop
>> » to get Gaussian working on the NFS directory.
>>
>> («ifconfig ib0 down» or «service opensmd stop» alone is not enough.)
>>
>>
>> So there seems to be an interaction between NFS access and OpenFabrics.
>> But why? And how can it be solved?
>>
>>
>>
>>
>>
> This looks like an issue with the NFS/RDMA backports.
> Can you install OFED without NFS/RDMA?
> You can change the conf file to do this.
>
> Jon/Steve/Jeff - are you familiar with this issue?
>
> Tziporet