[Users] IPoIB not working on Windows 2008 r2 - need help

Hal Rosenstock hal.rosenstock at gmail.com
Fri Jun 7 03:31:39 PDT 2013


On Thu, Jun 6, 2013 at 3:32 PM, Orion Poplawski <orion at cora.nwra.com> wrote:

> I'm trying for the first time to get IPoIB working on one of our Windows
> servers.  The network is working fine between some Linux machines.  Details:
>
> InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA]
> Windows Server 2008 r2
> MLNX_WinOF_VPI_2_1_2_win7_x64.msi (as recommended by the Mellanox
> download page for InfiniHost III adapters)
>
> I don't notice any errors, the adapter shows up fine and I can configure
> it with a static IP address.  After configuring it (or after boot) I can
> ping it from another machine for about 10 seconds before it stops
> responding.  When I ping out from the machine at this point, the icmp
> packets are being sent out the main ethernet interface (which is a
> different IP network) and I can see them get to our router.  ibdiagnet does
> not report any errors.  ipconfig and netstat -r seem fine.
>
> I see the following in my opensm log:
>
> Jun 06 11:51:50 600282 [29FC1700] 0x02 -> log_notice: Reporting Generic
> Notice type:1 num:128 (Link state change) from LID:2
> GID:fe80::1:6:6a00:d800:242
> Jun 06 11:51:51 011771 [1F5B0700] 0x02 -> osm_ucast_mgr_process: minhop
> tables configured on all switches
> Jun 06 11:51:51 016889 [1F5B0700] 0x02 -> log_notice: Reporting Generic
> Notice type:3 num:64 (GID in service) from LID:1 GID:fe80::1:5:ad00:c:5ced
> Jun 06 11:51:51 016899 [1F5B0700] 0x02 -> state_mgr_report_new_ports:
> Discovered new port with GUID:0x0005ad00000c5ced LID range [16,16] of node:
> Topspin DDR-HCAe LX x8
> Jun 06 11:51:51 027491 [1F5B0700] 0x02 -> SUBNET UP
> Jun 06 11:51:56 333829 [213B3700] 0x02 -> log_notice: Reporting Generic
> Notice type:3 num:66 (New mcast group created) from LID:1
> GID:ff12:1405:ffff::3333:0:1
> Jun 06 11:51:56 333875 [209B2700] 0x02 -> log_notice: Reporting Generic
> Notice type:3 num:66 (New mcast group created) from LID:1
> GID:ff12:1405:ffff::3333:ff76:9ac6
> Jun 06 11:51:56 603980 [295C0700] 0x02 -> log_notice: Reporting Generic
> Notice type:3 num:66 (New mcast group created) from LID:1
> GID:ff12:1405:ffff::3333:0:2
> Jun 06 11:51:56 604270 [245B8700] 0x02 -> log_notice: Reporting Generic
> Notice type:3 num:66 (New mcast group created) from LID:1
> GID:ff12:1405:ffff::3333:0:16
> Jun 06 11:52:15 854497 [263BB700] 0x02 -> log_notice: Reporting Generic
> Notice type:3 num:66 (New mcast group created) from LID:1
> GID:ff12:1405:ffff::3333:1:2
> Jun 06 11:52:15 857261 [213B3700] 0x02 -> log_notice: Reporting Generic
> Notice type:3 num:66 (New mcast group created) from LID:1
> GID:ff12:1405:ffff::3333:1:3
> Jun 06 11:52:15 857968 [209B2700] 0x02 -> log_notice: Reporting Generic
> Notice type:3 num:66 (New mcast group created) from LID:1
> GID:ff12:401b:ffff::16
> Jun 06 11:52:15 963577 [21DB4700] 0x02 -> log_notice: Reporting Generic
> Notice type:3 num:66 (New mcast group created) from LID:1
> GID:ff12:401b:ffff::fc
> Jun 06 12:04:56 535293 [26DBC700] 0x01 -> mcmr_rcv_leave_mgrp: ERR 1B25:
> Received an invalid delete request for MGID: ff12:1405:ffff::3333:0:1 for
> PortGID: fe80::5:ad00:c:5ced
> Jun 06 12:04:56 535870 [277BD700] 0x01 -> mcmr_rcv_leave_mgrp: ERR 1B25:
> Received an invalid delete request for MGID: ff12:1405:ffff::3333:1:3 for
> PortGID: fe80::5:ad00:c:5ced
> Jun 06 12:04:56 535908 [259BA700] 0x01 -> mcmr_rcv_leave_mgrp: ERR 1B25:
> Received an invalid delete request for MGID: ff12:1405:ffff::3333:1:2 for
> PortGID: fe80::5:ad00:c:5ced
> Jun 06 12:04:56 535942 [23BB7700] 0x01 -> mcmr_rcv_leave_mgrp: ERR 1B25:
> Received an invalid delete request for MGID: ff12:401b:ffff::fc for
> PortGID: fe80::5:ad00:c:5ced
> Jun 06 12:04:56 535970 [281BE700] 0x01 -> mcmr_rcv_leave_mgrp: ERR 1B25:
> Received an invalid delete request for MGID: ff12:401b:ffff::16 for
> PortGID: fe80::5:ad00:c:5ced
> Jun 06 12:04:56 536014 [277BD700] 0x01 -> mcmr_rcv_leave_mgrp: ERR 1B25:
> Received an invalid delete request for MGID: ff12:1405:ffff::3333:ff76:9ac6
> for PortGID: fe80::5:ad00:c:5ced
> Jun 06 12:04:56 536042 [209B2700] 0x01 -> mcmr_rcv_leave_mgrp: ERR 1B25:
> Received an invalid delete request for MGID: ff12:1405:ffff::3333:0:16 for
> PortGID: fe80::5:ad00:c:5ced
> Jun 06 12:04:56 536634 [227B5700] 0x01 -> mcmr_rcv_leave_mgrp: ERR 1B25:
> Received an invalid delete request for MGID: ff12:1405:ffff::3333:0:2 for
> PortGID: fe80::5:ad00:c:5ced
> Jun 06 12:06:29 959894 [295C0700] 0x01 -> mcmr_rcv_join_mgrp: ERR 1B13:
> validate_modify failed from port 0x0005ad00000c5ced (Topspin DDR-HCAe LX
> x8), sending IB_SA_MAD_STATUS_REQ_INVALID
> Jun 06 12:06:29 960518 [231B6700] 0x01 -> mcmr_rcv_leave_mgrp: ERR 1B25:
> Received an invalid delete request for MGID: ff12:1405:ffff::3333:1:2 for
> PortGID: fe80::5:ad00:c:5ced
> Jun 06 12:06:36 629355 [26DBC700] 0x01 -> mcmr_rcv_leave_mgrp: ERR 1B25:
> Received an invalid delete request for MGID: ff12:401b:ffff::1 for PortGID:
> fe80::5:ad00:c:5ced
> Jun 06 12:06:36 629416 [259BA700] 0x01 -> mcmr_rcv_leave_mgrp: ERR 1B25:
> Received an invalid delete request for MGID: ff12:401b:ffff::ffff:ffff for
> PortGID: fe80::5:ad00:c:5ced
> Jun 06 12:06:36 638659 [21DB4700] 0x01 -> mcmr_rcv_join_mgrp: ERR 1B13:
> validate_modify failed from port 0x0005ad00000c5ced (Topspin DDR-HCAe LX
> x8), sending IB_SA_MAD_STATUS_REQ_INVALID
> Jun 06 12:06:36 638853 [245B8700] 0x01 -> mcmr_rcv_leave_mgrp: ERR 1B25:
> Received an invalid delete request for MGID: ff12:401b:ffff::ffff:ffff for
> PortGID: fe80::5:ad00:c:5ced
>
> This last message repeats quite a bit within that second, and then stops.
>

IPoIB in Windows issues a delete for the IPoIB broadcast IB multicast group
before joining it, so if that port isn't a member of that MC group you will
see these errors. They aren't necessarily "bad" from an SM perspective.
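
To see which groups and ports those delete requests refer to, the ERR 1B25
lines can be tallied from the OpenSM log. The following is a minimal sketch,
assuming the message layout shown in the quoted log above; the function name
tally_invalid_leaves is hypothetical and not part of OpenSM.

```python
import re
from collections import Counter

# Pattern taken from the ERR 1B25 lines quoted in this thread.
LEAVE_ERR = re.compile(
    r"ERR 1B25: Received an invalid delete request for MGID: (\S+) "
    r"for PortGID: (\S+)"
)

def tally_invalid_leaves(log_text):
    """Count invalid delete requests per (MGID, PortGID) pair."""
    # Collapse all whitespace so that wrapped log lines still match.
    flat = " ".join(log_text.split())
    return Counter(LEAVE_ERR.findall(flat))
```

If every entry names the same PortGID and each MGID is joined shortly
afterwards, that fits the delete-before-join behaviour described above.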

How is your partition file for OpenSM set up? You should have the ipoib
flag on for the default partition.
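
As a rough way to check this, the sketch below scans partition file text for
an entry named "Default" and reports whether it carries the ipoib flag. It
assumes the usual "name=pkey, flags : members ;" entry layout (for example
"Default=0x7fff, ipoib : ALL=full ;") and is a simplified parser, not the one
OpenSM uses; the function name is hypothetical.

```python
def default_partition_has_ipoib(conf_text):
    """Return True if an entry named 'Default' lists the ipoib flag."""
    # Drop '#' comments, then treat the file as ';'-terminated entries.
    lines = [line.split("#", 1)[0] for line in conf_text.splitlines()]
    body = " ".join(lines)
    for entry in body.split(";"):
        if ":" not in entry:
            continue
        head, _members = entry.split(":", 1)
        fields = [f.strip() for f in head.split(",")]
        # First field is "name" or "name=pkey".
        name = fields[0].split("=", 1)[0].strip()
        if name.lower() != "default":
            continue
        # Remaining fields are flags, optionally with "=value".
        flags = [f.split("=", 1)[0].strip().lower() for f in fields[1:]]
        if "ipoib" in flags:
            return True
    return False
```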

Which OpenSM are you using here? Is it on a Windows or Linux node? Which version?

You should look at saquery -g to make sure that the IPoIB broadcast group
(ff12:401b:ffff::ffff:ffff) says MLID 0xc000, and then check what
saquery -m 0xc000 says.
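
For reference, the fields packed into an IPoIB MGID such as the broadcast
group can be pulled apart as below. This is a minimal sketch assuming the
RFC 4391 layout (0xff, flags/scope, 16-bit signature, 16-bit P_Key, group ID
in the low bits); the function name decode_ipoib_mgid is hypothetical.

```python
import ipaddress

def decode_ipoib_mgid(mgid):
    """Split an IPoIB MGID string into its header fields."""
    raw = ipaddress.IPv6Address(mgid).packed  # 16 bytes, network order
    if raw[0] != 0xFF:
        raise ValueError("not a multicast GID: " + mgid)
    return {
        "flags": raw[1] >> 4,
        "scope": raw[1] & 0x0F,
        # 0x401b marks IPv4 over IB in RFC 4391.
        "signature": int.from_bytes(raw[2:4], "big"),
        "pkey": int.from_bytes(raw[4:6], "big"),
        # Low 32 bits; the IPv4 group ID occupies the low 28 of these.
        "group_low32": int.from_bytes(raw[12:16], "big"),
    }
```

For ff12:401b:ffff::ffff:ffff this gives signature 0x401b and P_Key 0xffff,
i.e. the IPv4 broadcast group on the default partition.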

Also, is the HCA really 8x DDR, as the NodeDescription (Topspin
DDR-HCAe LX x8) suggests?

-- Hal


>
> Any ideas?
>
> --
> Orion Poplawski
> Technical Manager                     303-415-9701 x222
> NWRA, Boulder/CoRA Office             FAX: 303-415-9702
> 3380 Mitchell Lane                       orion at nwra.com
> Boulder, CO 80301                   http://www.nwra.com
> _______________________________________________
> Users mailing list
> Users at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/users
>