[ofa-general] IBCM and ordering of OpenFabrics devices
Jeff Squyres
jsquyres at cisco.com
Fri Jul 18 14:23:23 PDT 2008
(re-post from http://www.open-mpi.org/community/lists/devel/2008/07/4371.php
; I got no reply on the OMPI mailing list)
I have a case where ib_cm_open_device() is failing for an odd reason:
I have 12 servers that contain both HCAs and iWARP NICs. In most
cases, everything is fine. But on one of these servers, IBCM refuses
to work -- ib_cm_open_device() fails with the following:
libibcm: unable to open /dev/infiniband/ucm1
Looking closer, this device does, indeed, exist:
[4:01] svbu-mpi044:~/mpi % ls -l /dev/infiniband/ucm*
crw-rw-rw- 1 root root 231, 224 Jul 16 04:30 /dev/infiniband/ucm0
crw-rw-rw- 1 root root 231, 225 Jul 16 04:30 /dev/infiniband/ucm1
[4:08] svbu-mpi044:~/mpi %
Granted, I had to create these devices manually because they are not
created automatically for me upon boot in RHEL4U4 and U6. These device
major/minor numbers work fine for me on all my other servers.
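For anyone who wants to check the same thing on their own nodes, here is a minimal sketch that reads the major/minor numbers straight from the device nodes rather than eyeballing ls output. The helper names (char_dev_numbers, decode_rdev) are hypothetical; the only calls used are from the Python standard library.

```python
import glob
import os
import stat


def decode_rdev(rdev):
    """Split a raw st_rdev value into a (major, minor) pair."""
    return os.major(rdev), os.minor(rdev)


def char_dev_numbers(path):
    """Return (major, minor) for a character device node.

    Returns None if the path exists but is not a character device.
    """
    st = os.stat(path)
    if not stat.S_ISCHR(st.st_mode):
        return None
    return decode_rdev(st.st_rdev)


if __name__ == "__main__":
    # /dev/infiniband/ucm* is the location shown in the ls output above.
    for node in sorted(glob.glob("/dev/infiniband/ucm*")):
        print(node, char_dev_numbers(node))
```

Running this on a working node and on the problematic one gives a quick way to confirm that the manually created nodes really carry identical numbers everywhere.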
So what's different between the 11 machines that work and the 1 that
doesn't? It seems that the kernel ordering of devices is what is
different. On most of the machines:
[4:10] svbu-mpi045:~ % ibv_devinfo
hca_id: mlx4_0
fw_ver: 2.3.000
node_guid: 0002:c903:0000:036c
sys_image_guid: 0002:c903:0000:036f
vendor_id: 0x02c9
vendor_part_id: 25418
hw_ver: 0xA0
board_id: MT_04A0110002
phys_port_cnt: 2
port: 1
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 2
port_lid: 7
port_lmc: 0x00
port: 2
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 2
port_lid: 52
port_lmc: 0x00
hca_id: nes0
node_guid: 0012:5502:63c0:0000
sys_image_guid: 0012:5502:63c0:0000
vendor_id: 0x0000
vendor_part_id: 0
hw_ver: 0x5
board_id: NES020 Board ID
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 0
port_lid: 1
port_lmc: 0x00
But on this one problematic machine:
[4:10] svbu-mpi044:~/mpi % ibv_devinfo
hca_id: nes0
node_guid: 0012:5502:63b8:0000
sys_image_guid: 0012:5502:63b8:0000
vendor_id: 0x0000
vendor_part_id: 0
hw_ver: 0x5
board_id: NES020 Board ID
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 0
port_lid: 1
port_lmc: 0x00
hca_id: mlx4_0
fw_ver: 2.3.000
node_guid: 0002:c903:0000:03b0
sys_image_guid: 0002:c903:0000:03b3
vendor_id: 0x02c9
vendor_part_id: 25418
hw_ver: 0xA0
board_id: MT_04A0110002
phys_port_cnt: 2
port: 1
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 2
port_lid: 6
port_lmc: 0x00
port: 2
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 2
port_lid: 136
port_lmc: 0x00
Notice that the ordering is different.
So I'm not enough of a kernel guy to know where the problem is:
1. Technically, mlx4_0 is the first IB device. Should it therefore be
using ucm0? I.e., is libibcm wrong for trying to use ucm1? (note that
OMPI's openib BTL is currently replicating the logic from libibcm to
check for the Right ucm* file so that we can silently fail before
ib_cm_open_device() fails with a warning message -- so if libibcm's
logic to find the Right ucm* file changes, we'll also need to change
OMPI's logic to mirror it. OMPI's logic becomes moot in newer libibcm
versions where Sean removed the warning message, though).
2. Or are my major/minor numbers incorrect for the devices that I
created manually? If the major/minor device numbers were created by
the OS upon bootup (as they should be -- there's an open OpenFabrics
bugzilla ticket about this), would they be correct?
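
One way to test both questions at once is to ask sysfs which IB device each ucmN belongs to and which major:minor the kernel assigned to it, then compare that against the hand-made nodes. The sketch below assumes the kernel exposes "ibdev" and "dev" attribute files under /sys/class/infiniband_cm/ucmN; that layout is my assumption, is not confirmed anywhere in this post, and may differ between kernel versions. All function names are hypothetical.

```python
import os


def parse_dev(text):
    """Parse the 'major:minor' text of a sysfs dev attribute into two ints."""
    major, minor = text.strip().split(":")
    return int(major), int(minor)


def ucm_sysfs_map(sysfs_root="/sys/class/infiniband_cm"):
    """Map 'ucmN' -> (ibdev name, (major, minor)) as reported by sysfs.

    ASSUMPTION: each ucmN directory holds 'ibdev' and 'dev' files.
    Entries that are missing or unreadable are skipped.
    """
    result = {}
    if not os.path.isdir(sysfs_root):
        return result
    for entry in sorted(os.listdir(sysfs_root)):
        if not entry.startswith("ucm"):
            continue
        base = os.path.join(sysfs_root, entry)
        try:
            with open(os.path.join(base, "ibdev")) as f:
                ibdev = f.read().strip()
            with open(os.path.join(base, "dev")) as f:
                numbers = parse_dev(f.read())
        except (OSError, ValueError):
            continue
        result[entry] = (ibdev, numbers)
    return result


def mismatches(sysfs_map, dev_root="/dev/infiniband"):
    """List (name, ibdev, expected, actual) where the /dev node disagrees.

    'actual' is None when the node is missing or cannot be stat()ed.
    """
    bad = []
    for name, (ibdev, expected) in sorted(sysfs_map.items()):
        try:
            st = os.stat(os.path.join(dev_root, name))
            actual = (os.major(st.st_rdev), os.minor(st.st_rdev))
        except OSError:
            actual = None
        if actual != expected:
            bad.append((name, ibdev, expected, actual))
    return bad


if __name__ == "__main__":
    mapping = ucm_sysfs_map()
    for name, (ibdev, numbers) in sorted(mapping.items()):
        print(name, "->", ibdev, numbers)
    for row in mismatches(mapping):
        print("MISMATCH:", row)
```

If the mapping on svbu-mpi044 shows mlx4_0 behind ucm1 and no mismatch is reported, the manual device numbers are probably fine and the question becomes one about device ordering; a reported mismatch would instead point at the nodes I created by hand.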
--
Jeff Squyres
Cisco Systems