[Users] IPoIB not working on Windows 2008 r2 - need help

Orion Poplawski orion at cora.nwra.com
Fri Jun 7 09:49:49 PDT 2013


On 06/07/2013 10:09 AM, Hal Rosenstock wrote:
>
>     IPoIB is working fine among my Linux machines, I'm just trying to add
>     Windows to the mix.  I'm running opensm 3.3.15 on SL 6.
>
> What's SL 6 ?
>

Scientific Linux 6, a RHEL clone.

>
>     /etc/rdma/partitions:
>     Default=0x7fff, ipoib : ALL=full ;
>
>
> I have a theory as to what is going on. I think the IB port on Windows is too
> slow (IB rate * width) to join the already formed group but I thought you
> wrote something about IPoIB link being indicated by Windows IPoIB driver/net
> device.

It shouldn't be.  The switch is SDR, so everything should be running at that
rate, and ibstat reports that rate.

> Would you send me the output of an ibnetdiscover for your subnet ?

#
# Topology file: generated on Fri Jun  7 10:43:36 2013
#
# Initiated from node 0019bbffff005850 port 0019bbffff005851

vendid=0x66a
devid=0xb924
sysimgguid=0x66a00d8000242
switchguid=0x66a00d8000242(66a00d8000242)
Switch  24 "S-00066a00d8000242"         # "InfinIO 9024 Switch " enhanced port 0 lid 2 lmc 0
[1]     "H-0005ad00000c5c3c"[1](5ad00000c5c3d)          # "andrew mthca0" lid 15 4xSDR
[6]     "H-001708ffffd09df8"[1](1708ffffd09df9)                 # "alexandria2 HCA-1" lid 4 4xSDR
[8]     "H-001708ffffd09df8"[2](1708ffffd09dfa)                 # "alexandria2 HCA-1" lid 5 4xSDR
[10]    "H-0019bbffff005850"[1](19bbffff005851)                 # "saga mthca0" lid 1 4xSDR
[11]    "H-0019bbffff003898"[2](19bbffff00389a)                 # "sfcomp1 mthca0" lid 9 4xSDR
[12]    "H-001a4bffff0c20c8"[1](1a4bffff0c20c9)                 # "earth mthca0" lid 13 4xSDR
[20]    "H-0005ad00000c5cec"[1](5ad00000c5ced)          # "MT25204 InfiniHostLx Mellanox Technologies" lid 16 4xSDR
[23]    "H-0019bbffff003898"[1](19bbffff003899)                 # "sfcomp1 mthca0" lid 8 4xSDR

vendid=0x2c9
devid=0x6274
sysimgguid=0x5ad00000c5cef
caguid=0x5ad00000c5cec
Ca      1 "H-0005ad00000c5cec"          # "MT25204 InfiniHostLx Mellanox Technologies"
[1](5ad00000c5ced)      "S-00066a00d8000242"[20]                # lid 16 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR

vendid=0x1708
devid=0x6278
sysimgguid=0x1a4bffff0c20cb
caguid=0x1a4bffff0c20c8
Ca      2 "H-001a4bffff0c20c8"          # "earth mthca0"
[1](1a4bffff0c20c9)     "S-00066a00d8000242"[12]                # lid 13 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR

vendid=0x1708
devid=0x6278
sysimgguid=0x19bbffff00389b
caguid=0x19bbffff003898
Ca      2 "H-0019bbffff003898"          # "sfcomp1 mthca0"
[1](19bbffff003899)     "S-00066a00d8000242"[23]                # lid 8 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR
[2](19bbffff00389a)     "S-00066a00d8000242"[11]                # lid 9 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR

vendid=0x5ad
devid=0x6274
sysimgguid=0x5ad00000c5c3f
caguid=0x5ad00000c5c3c
Ca      1 "H-0005ad00000c5c3c"          # "andrew mthca0"
[1](5ad00000c5c3d)      "S-00066a00d8000242"[1]         # lid 15 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR

vendid=0x1708
devid=0x6278
sysimgguid=0x1708ffffd09dfb
caguid=0x1708ffffd09df8
Ca      2 "H-001708ffffd09df8"          # "alexandria2 HCA-1"
[1](1708ffffd09df9)     "S-00066a00d8000242"[6]         # lid 4 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR
[2](1708ffffd09dfa)     "S-00066a00d8000242"[8]         # lid 5 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR

vendid=0x1708
devid=0x6278
sysimgguid=0x19bbffff005853
caguid=0x19bbffff005850
Ca      2 "H-0019bbffff005850"          # "saga mthca0"
[1](19bbffff005851)     "S-00066a00d8000242"[10]                # lid 1 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR
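Since the question is whether any port is running below the rate of the rest of the fabric, the switch-port lines of the topology can be checked mechanically. The following is only a rough sketch: the regular expression is an assumption based on the line layout shown in this dump, it expects each port on a single unwrapped line, and the names `parse_ports` and `ports_not_at` are made up for illustration.

```python
import re

# Matches switch-port lines such as:
# [1]  "H-0005ad00000c5c3c"[1](5ad00000c5c3d)  # "andrew mthca0" lid 15 4xSDR
# The pattern is inferred from the dump in this message, not from a format spec.
PORT_LINE = re.compile(
    r'^\[(?P<port>\d+)\]\s+"(?P<peer>[HS]-[0-9a-f]+)"\[(?P<peer_port>\d+)\]'
    r'.*#\s+"(?P<name>[^"]*)"\s+lid\s+(?P<lid>\d+)\s+(?P<link>\S+)\s*$'
)

def parse_ports(text):
    """Yield (switch_port, node_name, link) for every switch-port line."""
    for line in text.splitlines():
        m = PORT_LINE.match(line)
        if m:
            yield int(m.group("port")), m.group("name"), m.group("link")

def ports_not_at(text, expected="4xSDR"):
    """Return the ports whose width/speed differs from the expected value."""
    return [p for p in parse_ports(text) if p[2] != expected]

# Two lines copied from the topology above.
SAMPLE = '''\
[1]     "H-0005ad00000c5c3c"[1](5ad00000c5c3d)          # "andrew mthca0" lid 15 4xSDR
[20]    "H-0005ad00000c5cec"[1](5ad00000c5ced)          # "MT25204 InfiniHostLx Mellanox Technologies" lid 16 4xSDR
'''

print(ports_not_at(SAMPLE))  # → []
```

An empty list means every matched port reports the expected width and speed, which agrees with what ibstat shows.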

>
>
>         You should check what saquery -m 0xc000 says after looking at
>         saquery -g to make sure that the IPoIB broadcast group
>         ( ff12:401b:ffff::ffff:ffff ) says MLID 0xc000.
>
>
>     That is fine on my opensm machine.  I don't seem to have saquery on the
>     Windows machine.
>
> You can run it from the opensm machine to see if Windows machine is part of
> the IPoIB broadcast group. I want to see both group parameters for MLID 0xc000
> and whether windows port GUID (0x0005ad00000c5ced) is listed as joined in that
> group.

MCMemberRecord group dump:
                 MGID....................ff12:401b:ffff::ffff:ffff
                 Mlid....................0xC000
                 Mtu.....................0x84
                 pkey....................0xFFFF
                 Rate....................0x83
                 SL......................0x0
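For reading the Mtu, Rate and pkey fields in that dump: the Mtu and Rate bytes pack a 2-bit selector in the top bits and a 6-bit code in the low bits, and the top bit of the pkey is the full-membership flag. A small sketch of the decoding follows; the tables list only the common InfiniBand codes as I understand them, and the function names are made up for illustration.

```python
# Selector/value decoding for the Mtu and Rate bytes of an MCMemberRecord.
# Code tables are an assumption limited to the commonly seen values.
MTU_BYTES = {1: 256, 2: 512, 3: 1024, 4: 2048, 5: 4096}
RATE_GBPS = {2: 2.5, 3: 10, 4: 30, 5: 5, 6: 20, 7: 40, 8: 60, 9: 80, 10: 120}
SELECTORS = {0: "greater than", 1: "less than", 2: "exactly", 3: "largest available"}

def decode(byte, table):
    """Split a byte into (selector meaning, decoded value or None if unknown)."""
    selector = byte >> 6      # top two bits
    code = byte & 0x3F        # low six bits
    return SELECTORS[selector], table.get(code)

def decode_pkey(pkey):
    """Return (15-bit base key, True if the full-membership bit is set)."""
    return pkey & 0x7FFF, bool(pkey & 0x8000)

print(decode(0x84, MTU_BYTES))   # → ('exactly', 2048)
print(decode(0x83, RATE_GBPS))   # → ('exactly', 10)
print(decode_pkey(0xFFFF))       # → (32767, True)
```

So the group asks for exactly a 2048-byte MTU and exactly 10 Gb/s, which is the 4x SDR rate, on the default partition (base key 0x7fff) with full membership.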

# saquery -m 0xc000
                 PortGid.................fe80::1:5:ad00:c:5c3d (Topspin DDR-HCAe LX x8)
                 PortGid.................fe80::1:19:bbff:ff00:5851 (saga mthca0)
                 PortGid.................fe80::1:19:bbff:ff00:3899 (sfcomp1 mthca0)
                 PortGid.................fe80::1:1a:4bff:ff0c:20c9 (HP Lion Cub 128MB)
                 PortGid.................fe80::5:ad00:c:5ced (MT25204 InfiniHostLx Mellanox Technologies)
                 PortGid.................fe80::1:17:8ff:ffd0:9df9 (alexandria2 HCA-1)
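To check whether a given port GUID is among these members, note that a port GID is the 64-bit subnet prefix followed by the 64-bit port GUID, so the low 64 bits of each PortGid can be compared directly. A minimal sketch using Python's standard `ipaddress` module; the helper names `split_gid` and `members_matching` are made up for illustration.

```python
import ipaddress

def split_gid(gid):
    """Return (subnet prefix, port GUID) as two 64-bit integers."""
    value = int(ipaddress.IPv6Address(gid))
    return value >> 64, value & 0xFFFFFFFFFFFFFFFF

def members_matching(gids, port_guid):
    """Return the GIDs whose low 64 bits equal the given port GUID."""
    return [g for g in gids if split_gid(g)[1] == port_guid]

# PortGid values copied from the saquery -m 0xc000 output above.
GIDS = [
    "fe80::1:5:ad00:c:5c3d",
    "fe80::1:19:bbff:ff00:5851",
    "fe80::1:19:bbff:ff00:3899",
    "fe80::1:1a:4bff:ff0c:20c9",
    "fe80::5:ad00:c:5ced",
    "fe80::1:17:8ff:ffd0:9df9",
]

print(members_matching(GIDS, 0x0005AD00000C5CED))  # → ['fe80::5:ad00:c:5ced']
```

`split_gid` also returns the prefix half, so the prefixes of the entries can be compared the same way if that is of interest.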


Seems like I may have two entries for the 5:ad00:c:5ced device?  Perhaps 
updating the firmware led to that (now it is MT25204 instead of Topspin).


> I suspect it may not be due to:
> Jun 06 12:06:29 959894 [295C0700] 0x01 -> mcmr_rcv_join_mgrp: ERR 1B13:
> validate_modify failed from port 0x0005ad00000c5ced (Topspin DDR-HCAe LX x8),
> sending IB_SA_MAD_STATUS_REQ_INVALID
> is worrisome...

After having to reboot many machines to clear up the mess, I am afraid to
re-enable the IPoIB interface on the Windows machine.


-- 
Orion Poplawski
Technical Manager                     303-415-9701 x222
NWRA, Boulder/CoRA Office             FAX: 303-415-9702
3380 Mitchell Lane                       orion at nwra.com
Boulder, CO 80301                   http://www.nwra.com


