[ofa-general] infiniband problem, no NICs

Michael Oevermann michael.oevermann at tu-berlin.de
Thu Nov 20 11:41:44 PST 2008


Hi all,

I have "inherited" a small cluster with a head node and four compute
nodes which I have to administer.  The nodes are connected via infiniband (OFED), but the head is not. 
I am a complete novice to the infiniband stuff and here is my problem:

The InfiniBand configuration seems to be OK. The usual tests suggested in the
OFED install guide give the expected output, e.g.


ibv_devinfo on the nodes:


************************* oscar_cluster *************************
--------- n01---------
hca_id: mthca0
        fw_ver:             1.2.0
        node_guid:          0002:c902:0025:930c
        sys_image_guid:     0002:c902:0025:930f
        vendor_id:          0x02c9
        vendor_part_id:     25204
        hw_ver:             0xA0
        board_id:           MT_03B0140001
        phys_port_cnt:      1
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        2048 (4)
                        active_mtu:     2048 (4)
                        sm_lid:         2
                        port_lid:       1
                        port_lmc:       0x00

etc. for the other nodes.

sminfo on the nodes:

************************* oscar_cluster *************************
--------- n01---------
sminfo: sm lid 2 sm guid 0x2c90200259201, activity count 6881 priority 0 state 3 SMINFO_MASTER
--------- n02---------
sminfo: sm lid 2 sm guid 0x2c90200259201, activity count 6882 priority 0 state 3 SMINFO_MASTER
--------- n03---------
sminfo: sm lid 2 sm guid 0x2c90200259201, activity count 6883 priority 0 state 3 SMINFO_MASTER
--------- n04---------
sminfo: sm lid 2 sm guid 0x2c90200259201, activity count 6884 priority 0 state 3 SMINFO_MASTER
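
I do not know whether the uDAPL layer cares about the IPoIB interfaces at all,
but one further check I could run on each node would be something like the
following sketch (the interface name ib0 is a guess on my part):

for n in n01 n02 n03 n04; do
    echo "--- $n ---"
    ssh $n ip addr show ib0    # is the IPoIB interface configured and up?
done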



However, when I directly start an MPI job (without using a scheduler) via:

/usr/mpi/gcc4/openmpi-1.2.2-1/bin/mpirun -np 4 \
    -hostfile /home/sysgen/infiniband-mpi-test/machine \
    /usr/mpi/gcc4/openmpi-1.2.2-1/tests/IMB-2.3/IMB-MPI1

I get the error message:

--------------------------------------------------------------------------
[0,1,0]: uDAPL on host n01 was unable to find any NICs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
[0,1,2]: uDAPL on host n01 was unable to find any NICs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
[0,1,3]: uDAPL on host n02 was unable to find any NICs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
[0,1,1]: uDAPL on host n02 was unable to find any NICs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
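
Since the message mentions uDAPL specifically, I suspect the problem sits on
the DAT/uDAPL side rather than at the verbs level. As far as I can tell, uDAPL
finds its providers through the DAT registry; this is a sketch of what I would
look at next (the library path and the example entry below are assumptions on
my part, not taken from my cluster):

# the registry file uDAPL reads to find its providers
cat /etc/dat.conf

# an OFED entry typically looks roughly like this:
#   OpenIB-cma u1.2 nonthreadsafe default /usr/lib64/libdaplcma.so dapl.1.2 "ib0 0" ""

# check that the provider library referenced there actually exists
ls -l /usr/lib64/libdapl*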

MPI over plain Gigabit Ethernet and IP networking works just fine, but over
InfiniBand it does not. The MPI libraries I am using for the test are
definitely compiled with IB support, and the tests have been run successfully
on this cluster before.
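
One thing I could also try, assuming this Open MPI build includes the openib
(verbs) BTL and not only the uDAPL one, is to select the transport explicitly
so that uDAPL is bypassed altogether; a sketch:

# force the verbs-based openib BTL (plus shared memory and self)
/usr/mpi/gcc4/openmpi-1.2.2-1/bin/mpirun --mca btl openib,sm,self -np 4 \
    -hostfile /home/sysgen/infiniband-mpi-test/machine \
    /usr/mpi/gcc4/openmpi-1.2.2-1/tests/IMB-2.3/IMB-MPI1

# or just exclude the uDAPL BTL and let Open MPI pick from the rest:
#   --mca btl ^udapl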

Any suggestions as to what is going wrong here?

Best regards and thanks for any help!

Michael





