[ofa-general] Bad interaction between ofed, NFS and Gaussian

Jeff Becker Jeffrey.C.Becker at nasa.gov
Wed Sep 2 14:15:49 PDT 2009


Tziporet Koren wrote:
> BOYRIE Fabrice wrote:
>   
>>  Hello
>>
>>  I hope this is the right mailing list.
>>  I have a problem with OFED 1.4.2 on CentOS 5.3.
>>     

Hi Fabrice!

Does it also happen with OFED 1.5 alpha? Thanks.

-jeff

>>
>>  We have a new cluster with QDR InfiniBand.
>>  I installed OFED from source using the install.pl script with the
>>  default values.
>>  I used the default CentOS kernel (2.6.18-128.7.1.el5).
>>  When a node starts, the openibd and opensmd services are launched.
>>
>>
>>  InfiniBand is working:
>>
>>  ibv_devinfo
>>  hca_id: mlx4_0
>>          fw_ver:                         2.6.000
>>          node_guid:                      0002:c903:0004:3efc
>>          sys_image_guid:                 0002:c903:0004:3eff
>>          vendor_id:                      0x02c9
>>          vendor_part_id:                 26428
>>          hw_ver:                         0xA0
>>          board_id:                       MT_0C40110009
>>          phys_port_cnt:                  1
>>                  port:   1
>>                          state:                  PORT_ACTIVE (4)
>>                          max_mtu:                2048 (4)
>>                          active_mtu:             2048 (4)
>>                          sm_lid:                 9
>>                          port_lid:               17
>>                          port_lmc:               0x00
>>
>>  If I launch an MPI program, e.g. VASP, it runs over the InfiniBand
>>  transport and performance is good.
>>
>>  So there is no problem until I launch a program that does not use
>>  InfiniBand: Gaussian.
>>
>>  With some large calculations and with %ncpu=8, Gaussian aborts with the
>>  following message:
>>  ntrbks: Input/output error
>>  I ran Gaussian several times and it always aborted at the same
>>  point.
>>
>>  If I run the same Gaussian on the same input file on our old
>>  cluster (same CentOS 5.3, same kernel, but without InfiniBand), it works.
>>
>>    Searching the source code for ntrbks shows a call to fstatfs.
>>
>>  So I ran strace on Gaussian on both clusters. Here is the relevant
>>  part.
>>
>>  New cluster:
>>
>>  [pid  5715] execve("/opt/Gaussian/g03_e01-pgf//g03/l1002.exe",
>>  ["/opt/Gaussian/g03_e01-pgf//g03/l"..., "1258291200",
>>  "CpRh_H_Ph_EneTS1.chk", "1", "/tmp/CpRh_H_Ph_EneTS1/Gau-5715.i"...,"0",
>>  "/tmp/CpRh_H_Ph_EneTS1/Gau-5715.r"..., "0",
>>  "/tmp/CpRh_H_Ph_EneTS1/Gau-5715.d"..., "0",
>>  "/tmp/CpRh_H_Ph_EneTS1/Gau-5715.s"..., "0",
>>  "/tmp/CpRh_H_Ph_EneTS1/Gau-5714.i"..., "0", "junk.out", "0", ...],
>>  [/* 65 vars */] PANIC: attached pid 5816 exited with 0
>>  [pid  5715] open("CpRh_H_Ph_EneTS1.chk", O_RDWR) = 5
>>  [pid  5715] fstatfs(5, {f_type="NFS_SUPER_MAGIC", f_bsize=32768,
>>  f_blocks=13562292, f_bfree=12353558, f_bavail=12353558,
>>  f_files=434124416, f_ffree=434090957, f_fsid={0, 0}, f_namelen=255,
>>  f_frsize=32768}) = 0
>>  [pid  5715] read(5, "\10\0\0\0\0\0\0\0", 8) = 8
>>  [pid  5715] read(5,
>>  "\10\0\0\0\0\0\0\0\0\320\10\0\0\0\0\0\0\240\0\0\0\0\0\0\0\0\0\0\0\0
>>  \0\0"..., 320032) = 320032
>>  [pid  5715] fstatfs(5, 0x7fff553be6d0)  = -1 EIO (Input/output error)
>>
>>
>>  Old cluster:
>>
>>  [pid  8605] execve("/opt/Gaussian/g03_e01-pgf//g03/l1002.exe",
>>  ["/opt/Gaussian/g03_e01-pgf//g03/l"..., "1258291200",
>>  "CpRh_H_Ph_EneTS1.chk", "1", "/tmp/CpRh_H_Ph_EneTS1-8-pgf-2/Ga"..., "0",
>>  "/tmp/CpRh_H_Ph_EneTS1-8-pgf-2/Ga"..., "0",
>>  "/tmp/CpRh_H_Ph_EneTS1-8-pgf-2/Ga"..., "0",
>>  "/tmp/CpRh_H_Ph_EneTS1-8-pgf-2/Ga"..., "0",
>>  "/tmp/CpRh_H_Ph_EneTS1-8-pgf-2/Ga"..., "0", "junk.out", "0", ...],
>>  [/* 82 vars */] PANIC: attached pid 8701 exited with 0
>>  [pid  8605] open("CpRh_H_Ph_EneTS1.chk", O_RDWR) = 5
>>
>>  [pid  8605] fstatfs(5, {f_type="NFS_SUPER_MAGIC", f_bsize=32768,
>>  f_blocks=9150944, f_bfree=686850, f_bavail=686850, f_files=88123232,
>>  f_ffree=87917195, f_fsid={0, 0}, f_namelen=255, f_frsize=32768}) = 0
>>  [pid  8605] read(5, "'\0\0\0\0\0\0\0", 8) = 8
>>  [pid  8605] read(5,
>>  "'\0\0\0\0\0\0\0\0\360$\0\0\0\0\0\0\240\0\0\0\0\0\0\0\0\0\0\0\0\0
>>  \0"..., 320032) = 320032
>>  [pid  8605] fstatfs(5, {f_type="NFS_SUPER_MAGIC", f_bsize=32768,
>>  f_blocks=9150944, f_bfree=683022, f_bavail=683022, f_files=87633232,
>>  f_ffree=87427156, f_fsid={0, 0}, f_namelen=255, f_frsize=32768}) = 0
>>  [pid  8605] fstatfs(5, {f_type="NFS_SUPER_MAGIC", f_bsize=32768,
>>  f_blocks=9150944, f_bfree=683022, f_bavail=683022, f_files=87633232,
>>  f_ffree=87427156, f_fsid={0, 0}, f_namelen=255, f_frsize=32768}) = 0
>>  [pid  8605] write(5,
>>  "'\0\0\0\0\0\0\0\0\360$\0\0\0\0\0\0\240\0\0\0\0\0\0\0\0\0\0\0\0\0
>>  \0"...,
>>  320032) = 320032
>> [pid  8605] close(5)                    = 0
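
The pattern in both traces (open the checkpoint file read-write, fstatfs, two reads, fstatfs again) can be replayed outside Gaussian to see whether the second fstatfs fails on its own. Below is a minimal sketch in Python. It assumes that os.fstatvfs ends up issuing fstatfs underneath, which should be the case with glibc on Linux but is worth confirming with strace. The names replay_fstatfs and _try are only illustrative, and the default read sizes are the two read lengths seen in the trace.

```python
import errno
import os
import sys


def _try(step, func, *args):
    """Run one call and report "ok" or the symbolic errno (e.g. "EIO")."""
    try:
        func(*args)
        return (step, "ok")
    except OSError as exc:
        return (step, errno.errorcode.get(exc.errno, str(exc.errno)))


def replay_fstatfs(path, read_sizes=(8, 320032)):
    """Replay open(O_RDWR) / fstatfs / reads / fstatfs on `path`.

    Assumption: os.fstatvfs issues fstatfs(2) underneath.
    Returns a list of (step, outcome) tuples in call order.
    """
    results = []
    fd = os.open(path, os.O_RDWR)
    try:
        results.append(_try("fstatfs-before", os.fstatvfs, fd))
        for size in read_sizes:
            results.append(_try("read-%d" % size, os.read, fd, size))
        results.append(_try("fstatfs-after", os.fstatvfs, fd))
    finally:
        os.close(fd)
    return results


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python replay.py /home/user/some_file_on_nfs
    for step, outcome in replay_fstatfs(sys.argv[1]):
        print(step, outcome)
```

Running it once against a file on the NFS mount and once against a file under /tmp, with ib0 up and then down, would show whether the EIO depends on Gaussian at all.
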
>>
>>
>>
>>
>>  If I put the input file in a local directory instead of an NFS one,
>>  Gaussian works.
>>  There are no messages in dmesg or in the /var/log directory, either on
>>  the node or on the NFS server.
>>
>>  On the node, /home is mounted as
>>  192.168.1.100:/home on /home type nfs
>>  (rw,nosuid,rsize=32768,proto=tcp,addr=192.168.1.100)
>>
>>  192.168.1.xxx is the Ethernet network (the NFS server has no
>>  InfiniBand card).
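
The mount line above shows proto=tcp and an address on the Ethernet network, so this mount looks like plain NFS over TCP. To compare that across nodes, the fields can be pulled out of `mount` output with a few lines of Python. This is only a sketch: the helper name parse_mount_line is hypothetical, and the regular expression only targets the "device on mountpoint type fstype (options)" layout shown here (mount points containing spaces would not match).

```python
import re

MOUNT_RE = re.compile(
    r"^(?P<device>\S+) on (?P<mountpoint>\S+) type (?P<fstype>\S+) "
    r"\((?P<options>[^)]*)\)\s*$"
)


def parse_mount_line(line):
    """Split one line of `mount` output into its fields.

    Returns None if the line does not match the expected layout.
    Options of the form key=value go into a dict; bare flags map to True.
    """
    match = MOUNT_RE.match(line.strip())
    if match is None:
        return None
    options = {}
    for item in match.group("options").split(","):
        if not item:
            continue
        key, sep, value = item.partition("=")
        options[key] = value if sep else True
    info = match.groupdict()
    info["options"] = options
    return info


# The line quoted in the original report.
line = ("192.168.1.100:/home on /home type nfs "
        "(rw,nosuid,rsize=32768,proto=tcp,addr=192.168.1.100)")
info = parse_mount_line(line)
print(info["fstype"], info["options"]["proto"], info["options"]["addr"])
# prints: nfs tcp 192.168.1.100
```
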
>>  On the node, it is enough to run
>>    ifconfig ib0 down
>>    service opensmd stop
>>  to get Gaussian working on the NFS directory.
>>
>>  (Either «ifconfig ib0 down» or «service opensmd stop» alone is not enough.)
>>
>>
>>  So there seems to be an interaction between NFS access and OpenFabrics.
>>  But why? And how can it be solved?
>>
>>
>>
>>   
>>     
> This looks like an issue with the NFS/RDMA backports.
> Can you install OFED without NFS/RDMA?
> You can change the conf file to do this.
>
> Jon/Steve/Jeff - are you familiar with this issue?
>
> Tziporet