[ewg] mlx4 and ibv_devinfo discrepancy?

Pradeep Satyanarayana pradeeps at linux.vnet.ibm.com
Tue Jul 7 09:48:55 PDT 2009


While debugging an IPoIB "multicast join failed" issue (using OFED-1.4.1 on ppc64 blades),
I discovered the discrepancy described below.

My setup consisted of two nodes, each with a dual-port ConnectX HCA. Port 1 on each node
was connected to one switch (switch1) and port 2 on each node to another switch (switch2).

The problem was that port 1 on each node would not join the multicast group, as shown:

ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -22
ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -22

If the same ports were connected to switch2, using the same cables, everything
worked fine. The cause was an MTU mismatch on switch1, so IPoIB did behave as expected.
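The "status -22" in the log lines above follows the kernel convention of reporting a negated errno value. A minimal Python sketch for decoding it (the helper name describe_status is hypothetical, and the mapping shown is the one a Linux host would give):

```python
import errno
import os


def describe_status(status):
    """Map a negative kernel status such as -22 to (symbolic name, message)."""
    code = -status
    # errno.errorcode maps numeric errno values to names such as "EINVAL".
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


name, message = describe_status(-22)
print(name, "-", message)
```

On Linux, 22 is EINVAL, which fits a join request whose parameters (here the MTU) do not match the group.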

However, the output of ibv_devinfo was misleading. This was the output when
port 1 was connected to switch1, which had the incorrect MTU:

[root@cluster-1 ~]# ibv_devinfo
hca_id: mlx4_0
        fw_ver:                         2.6.000
        node_guid:                      0002:c903:0001:2058
        sys_image_guid:                 0002:c903:0001:205b
        vendor_id:                      0x02c9
        vendor_part_id:                 25418
        hw_ver:                         0xA0
        board_id:                       IBM08A0000001
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 1
                        port_lid:               50
                        port_lmc:               0x00

                port:   2
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 1
                        port_lid:               11
                        port_lmc:               0x00


The HCA on the other node showed the same thing:

[root@cluster-2 ~]# ibv_devinfo
hca_id: mlx4_0
        fw_ver:                         2.6.000
        node_guid:                      0002:c903:0001:21e4
        sys_image_guid:                 0002:c903:0001:21e7
        vendor_id:                      0x02c9
        vendor_part_id:                 25418
        hw_ver:                         0xA0
        board_id:                       IBM08A0000001
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 1
                        port_lid:               51
                        port_lmc:               0x00

                port:   2
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 1
                        port_lid:               66
                        port_lmc:               0x00

[root@cluster-2 ~]#
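To compare the per-port fields across nodes without eyeballing them, the ibv_devinfo text above can be parsed. This is only a sketch that assumes the output layout shown in this message; parse_ports is a hypothetical helper, not part of any OFED tool.

```python
import re

PORT_RE = re.compile(r"^\s*port:\s*(\d+)\s*$")
FIELD_RE = re.compile(r"^\s*(\w+):\s+(.*\S)\s*$")


def parse_ports(text):
    """Return {(hca_id, port_number): {field: value}} from ibv_devinfo text."""
    ports = {}
    hca = None
    current = None
    for line in text.splitlines():
        m = PORT_RE.match(line)
        if m:
            # A "port: N" line starts a new per-port section.
            current = ports.setdefault((hca, int(m.group(1))), {})
            continue
        m = FIELD_RE.match(line)
        if not m:
            continue
        key, value = m.group(1), m.group(2)
        if key == "hca_id":
            # A new device starts; stop attaching fields to the previous port.
            hca, current = value, None
        elif current is not None:
            current[key] = value
    return ports
```

For the output above, parse_ports(text)[("mlx4_0", 1)]["state"] would be the string "PORT_ACTIVE (4)".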


"cat /sys/class/net/ib0/operstate" showed "down", which was my clue that something was amiss.
As shown below, the ib0 link was indeed down (NO-CARRIER).

[root@cluster-1 ~]# ip link show dev ib0
3: ib0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 65520 qdisc pfifo_fast qlen 256
    link/infiniband 80:00:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:01:20:59 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
[root@cluster-1 ~]# ip link show dev ib1
4: ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc pfifo_fast qlen 256
    link/infiniband 80:00:00:49:fe:80:00:00:00:00:00:00:00:02:c9:03:00:01:20:5a brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
[root@cluster-1 ~]#

[root@cluster-2 ~]# ip link show dev ib0
3: ib0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 65520 qdisc pfifo_fast qlen 256
    link/infiniband 80:00:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:01:21:e5 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
[root@cluster-2 ~]# ip link show dev ib1
4: ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast qlen 256
    link/infiniband 80:00:00:49:fe:80:00:00:00:00:00:00:00:02:c9:03:00:01:21:e6 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
[root@cluster-2 ~]#
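The mismatch between the verbs port state and the IPoIB netdev state can also be checked from sysfs. A minimal sketch, assuming ib0 sits on port 1 of mlx4_0 (the GUIDs above suggest this, but it is an assumption) and that the port state file reads like "4: ACTIVE"; all function names here are hypothetical.

```python
from pathlib import Path

PORT_ACTIVE = 4  # numeric code shown as "PORT_ACTIVE (4)" by ibv_devinfo


def read_ib_port_state(sysfs_root, hca, port):
    """Parse e.g. '4: ACTIVE' from <root>/class/infiniband/<hca>/ports/<port>/state."""
    path = Path(sysfs_root) / "class" / "infiniband" / hca / "ports" / str(port) / "state"
    code, _, name = path.read_text().strip().partition(":")
    return int(code), name.strip()


def read_operstate(sysfs_root, netdev):
    """Return the netdev operstate string, e.g. 'up' or 'down'."""
    path = Path(sysfs_root) / "class" / "net" / netdev / "operstate"
    return path.read_text().strip()


def active_but_no_carrier(sysfs_root, hca, port, netdev):
    """True when the IB port reports ACTIVE but the IPoIB interface is not up."""
    code, _ = read_ib_port_state(sysfs_root, hca, port)
    return code == PORT_ACTIVE and read_operstate(sysfs_root, netdev) != "up"
```

On a node in the state described here, active_but_no_carrier("/sys", "mlx4_0", 1, "ib0") would be expected to return True, since the port is ACTIVE while ib0 reports "down".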



Why does ibv_devinfo show port 1 as PORT_ACTIVE? Isn't that incorrect? Is this a known problem?

Pradeep



