[Users] IPoIB not working on Windows 2008 r2 - need help
Orion Poplawski
orion at cora.nwra.com
Fri Jun 7 09:49:49 PDT 2013
On 06/07/2013 10:09 AM, Hal Rosenstock wrote:
>
> IPoIB is working fine among my Linux machines, I'm just trying to add
> Windows to the mix. I'm running opensm 3.3.15 on SL 6.
>
> What's SL 6 ?
>
ScientificLinux - RHEL clone.
>
> /etc/rdma/partitions:
> Default=0x7fff, ipoib : ALL=full ;
>
>
> I have a theory as to what is going on. I think the IB port on Windows is too
> slow (IB rate * width) to join the already formed group but I thought you
> wrote something about IPoIB link being indicated by Windows IPoIB driver/net
> device.
It shouldn't be. The switch is SDR - so everything should be running at that
rate - and ibstat reports that rate.
> Would you send me the output of an ibnetdiscover for your subnet ?
#
# Topology file: generated on Fri Jun 7 10:43:36 2013
#
# Initiated from node 0019bbffff005850 port 0019bbffff005851
vendid=0x66a
devid=0xb924
sysimgguid=0x66a00d8000242
switchguid=0x66a00d8000242(66a00d8000242)
Switch 24 "S-00066a00d8000242" # "InfinIO 9024 Switch " enhanced port 0 lid 2 lmc 0
[1] "H-0005ad00000c5c3c"[1](5ad00000c5c3d) # "andrew mthca0" lid 15 4xSDR
[6] "H-001708ffffd09df8"[1](1708ffffd09df9) # "alexandria2 HCA-1" lid 4 4xSDR
[8] "H-001708ffffd09df8"[2](1708ffffd09dfa) # "alexandria2 HCA-1" lid 5 4xSDR
[10] "H-0019bbffff005850"[1](19bbffff005851) # "saga mthca0" lid 1 4xSDR
[11] "H-0019bbffff003898"[2](19bbffff00389a) # "sfcomp1 mthca0" lid 9 4xSDR
[12] "H-001a4bffff0c20c8"[1](1a4bffff0c20c9) # "earth mthca0" lid 13 4xSDR
[20] "H-0005ad00000c5cec"[1](5ad00000c5ced) # "MT25204 InfiniHostLx Mellanox Technologies" lid 16 4xSDR
[23] "H-0019bbffff003898"[1](19bbffff003899) # "sfcomp1 mthca0" lid 8 4xSDR
vendid=0x2c9
devid=0x6274
sysimgguid=0x5ad00000c5cef
caguid=0x5ad00000c5cec
Ca 1 "H-0005ad00000c5cec" # "MT25204 InfiniHostLx Mellanox Technologies"
[1](5ad00000c5ced) "S-00066a00d8000242"[20] # lid 16 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR
vendid=0x1708
devid=0x6278
sysimgguid=0x1a4bffff0c20cb
caguid=0x1a4bffff0c20c8
Ca 2 "H-001a4bffff0c20c8" # "earth mthca0"
[1](1a4bffff0c20c9) "S-00066a00d8000242"[12] # lid 13 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR
vendid=0x1708
devid=0x6278
sysimgguid=0x19bbffff00389b
caguid=0x19bbffff003898
Ca 2 "H-0019bbffff003898" # "sfcomp1 mthca0"
[1](19bbffff003899) "S-00066a00d8000242"[23] # lid 8 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR
[2](19bbffff00389a) "S-00066a00d8000242"[11] # lid 9 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR
vendid=0x5ad
devid=0x6274
sysimgguid=0x5ad00000c5c3f
caguid=0x5ad00000c5c3c
Ca 1 "H-0005ad00000c5c3c" # "andrew mthca0"
[1](5ad00000c5c3d) "S-00066a00d8000242"[1] # lid 15 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR
vendid=0x1708
devid=0x6278
sysimgguid=0x1708ffffd09dfb
caguid=0x1708ffffd09df8
Ca 2 "H-001708ffffd09df8" # "alexandria2 HCA-1"
[1](1708ffffd09df9) "S-00066a00d8000242"[6] # lid 4 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR
[2](1708ffffd09dfa) "S-00066a00d8000242"[8] # lid 5 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR
vendid=0x1708
devid=0x6278
sysimgguid=0x19bbffff005853
caguid=0x19bbffff005850
Ca 2 "H-0019bbffff005850" # "saga mthca0"
[1](19bbffff005851) "S-00066a00d8000242"[10] # lid 1 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR
>
>
> You should check what saquery -m 0xc000 says after looking at saquery -g to
> make sure that the IPoIB broadcast group ( ff12:401b:ffff::ffff:ffff ) says
> MLID 0xc000.
>
>
> That is fine on my opensm machine. I don't seem to have saquery on the
> Windows machine.
>
> You can run it from the opensm machine to see if Windows machine is part of
> the IPoIB broadcast group. I want to see both group parameters for MLID 0xc000
> and whether windows port GUID (0x0005ad00000c5ced) is listed as joined in that
> group.
MCMemberRecord group dump:
MGID....................ff12:401b:ffff::ffff:ffff
Mlid....................0xC000
Mtu.....................0x84
pkey....................0xFFFF
Rate....................0x83
SL......................0x0
# saquery -m 0xc000
PortGid.................fe80::1:5:ad00:c:5c3d (Topspin DDR-HCAe LX x8)
PortGid.................fe80::1:19:bbff:ff00:5851 (saga mthca0)
PortGid.................fe80::1:19:bbff:ff00:3899 (sfcomp1 mthca0)
PortGid.................fe80::1:1a:4bff:ff0c:20c9 (HP Lion Cub 128MB)
PortGid.................fe80::5:ad00:c:5ced (MT25204 InfiniHostLx Mellanox Technologies)
PortGid.................fe80::1:17:8ff:ffd0:9df9 (alexandria2 HCA-1)
Seems like I may have two entries for the 5:ad00:c:5ced device? Perhaps
updating the firmware led to that (now it is MT25204 instead of Topspin).
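
Checking by eye whether a given port GUID appears among the PortGid entries is error-prone, so here is a minimal sketch that does the comparison mechanically. It assumes the `saquery -m` output format shown above (a `PortGid` label, a run of dots, then the GID) and that the port GUID is the low 64 bits of the GID. The function names and the two-line sample are illustrative only.

```python
import ipaddress
import re

# Matches "PortGid......<gid>" and captures the GID text.
PORTGID_RE = re.compile(r"PortGid\.+([0-9a-fA-F:]+)")
GUID_MASK = 0xFFFFFFFFFFFFFFFF  # low 64 bits of the 128-bit GID


def port_guids(saquery_output):
    """Return the port GUID (low 64 bits) of every PortGid entry found."""
    guids = []
    for match in PORTGID_RE.finditer(saquery_output):
        gid = ipaddress.IPv6Address(match.group(1))
        guids.append(int(gid) & GUID_MASK)
    return guids


def is_member(saquery_output, port_guid):
    """True if port_guid appears among the PortGid entries."""
    return port_guid in port_guids(saquery_output)


if __name__ == "__main__":
    # Two entries copied from the listing above, wrapped lines joined.
    sample = (
        "PortGid.................fe80::1:5:ad00:c:5c3d (Topspin DDR-HCAe LX x8)\n"
        "PortGid.................fe80::5:ad00:c:5ced (MT25204 InfiniHostLx Mellanox Technologies)\n"
    )
    for guid in port_guids(sample):
        print(f"0x{guid:016x}")
    print(is_member(sample, 0x0005AD00000C5CED))
```

Only the low 64 bits are compared because the upper 64 bits of the GIDs in the listing are not identical across entries.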
> I suspect it may not be due to:
> Jun 06 12:06:29 959894 [295C0700] 0x01 -> mcmr_rcv_join_mgrp: ERR 1B13:
> validate_modify failed from port 0x0005ad00000c5ced (Topspin DDR-HCAe LX x8),
> sending IB_SA_MAD_STATUS_REQ_INVALID
> is worrisome...
After having to reboot many machines to clear up the mess, I'm afraid to
re-enable the IPoIB interface on the Windows machine.
--
Orion Poplawski
Technical Manager 303-415-9701 x222
NWRA, Boulder/CoRA Office FAX: 303-415-9702
3380 Mitchell Lane orion at nwra.com
Boulder, CO 80301 http://www.nwra.com