[Users] IPoIB not working on Windows 2008 r2 - need help
Hal Rosenstock
hal.rosenstock at gmail.com
Fri Jun 7 11:56:35 PDT 2013
On Fri, Jun 7, 2013 at 12:49 PM, Orion Poplawski <orion at cora.nwra.com> wrote:
> On 06/07/2013 10:09 AM, Hal Rosenstock wrote:
>
>>
>> IPoIB is working fine among my Linux machines, I'm just trying to add
>> Windows to the mix. I'm running opensm 3.3.15 on SL 6.
>>
>> What's SL 6 ?
>>
>>
> ScientificLinux - RHEL clone.
>
>
>
>> /etc/rdma/partitions:
>> Default=0x7fff, ipoib : ALL=full ;
>>
>>
>> I have a theory as to what is going on. I think the IB port on Windows is too
>> slow (IB rate * width) to join the already formed group but I thought you
>> wrote something about IPoIB link being indicated by Windows IPoIB driver/net
>> device.
>>
>
> It shouldn't be. The switch is SDR - so everything should be running at
> that rate - and ibstat reports that rate.
>
>
>
Yes, rate is fine (SDR * 4x = 10 Gbps).
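
For reference, the arithmetic behind that figure can be sketched as below. This is only an illustration: the function names are made up for this example, and the table lists the per-lane signaling rates for SDR, DDR and QDR. It also shows the usable data rate after 8b/10b encoding, which those three speeds use.

```python
# Per-lane signaling rates in Gb/s for the InfiniBand speeds relevant here.
LANE_GBPS = {"SDR": 2.5, "DDR": 5.0, "QDR": 10.0}


def link_rate_gbps(speed: str, width: int) -> float:
    """Signaling rate of a link: per-lane rate times the number of lanes."""
    return LANE_GBPS[speed] * width


def data_rate_gbps(speed: str, width: int) -> float:
    """Usable data rate after 8b/10b encoding (8 data bits per 10 line bits)."""
    return link_rate_gbps(speed, width) * 8 / 10


if __name__ == "__main__":
    print(link_rate_gbps("SDR", 4))  # signaling rate of a 4x SDR link
    print(data_rate_gbps("SDR", 4))  # data rate of the same link
```

A 4x SDR link therefore signals at 10 Gb/s and carries 8 Gb/s of data.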
> Would you send me the output of an ibnetdiscover for your subnet ?
>>
>
>
Which is SM host ?
> #
> # Topology file: generated on Fri Jun 7 10:43:36 2013
> #
> # Initiated from node 0019bbffff005850 port 0019bbffff005851
>
> vendid=0x66a
> devid=0xb924
> sysimgguid=0x66a00d8000242
> switchguid=0x66a00d8000242(66a00d8000242)
> Switch 24 "S-00066a00d8000242" # "InfinIO 9024 Switch " enhanced port 0 lid 2 lmc 0
> [1] "H-0005ad00000c5c3c"[1](5ad00000c5c3d) # "andrew mthca0" lid 15 4xSDR
> [6] "H-001708ffffd09df8"[1](1708ffffd09df9) # "alexandria2 HCA-1" lid 4 4xSDR
> [8] "H-001708ffffd09df8"[2](1708ffffd09dfa) # "alexandria2 HCA-1" lid 5 4xSDR
> [10] "H-0019bbffff005850"[1](19bbffff005851) # "saga mthca0" lid 1 4xSDR
> [11] "H-0019bbffff003898"[2](19bbffff00389a) # "sfcomp1 mthca0" lid 9 4xSDR
> [12] "H-001a4bffff0c20c8"[1](1a4bffff0c20c9) # "earth mthca0" lid 13 4xSDR
> [20] "H-0005ad00000c5cec"[1](5ad00000c5ced) # "MT25204 InfiniHostLx Mellanox Technologies" lid 16 4xSDR
> [23] "H-0019bbffff003898"[1](19bbffff003899) # "sfcomp1 mthca0" lid 8 4xSDR
>
> vendid=0x2c9
> devid=0x6274
> sysimgguid=0x5ad00000c5cef
> caguid=0x5ad00000c5cec
> Ca 1 "H-0005ad00000c5cec" # "MT25204 InfiniHostLx Mellanox Technologies"
> [1](5ad00000c5ced) "S-00066a00d8000242"[20] # lid 16 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR
>
> vendid=0x1708
> devid=0x6278
> sysimgguid=0x1a4bffff0c20cb
> caguid=0x1a4bffff0c20c8
> Ca 2 "H-001a4bffff0c20c8" # "earth mthca0"
> [1](1a4bffff0c20c9) "S-00066a00d8000242"[12] # lid 13 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR
>
> vendid=0x1708
> devid=0x6278
> sysimgguid=0x19bbffff00389b
> caguid=0x19bbffff003898
> Ca 2 "H-0019bbffff003898" # "sfcomp1 mthca0"
> [1](19bbffff003899) "S-00066a00d8000242"[23] # lid 8 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR
> [2](19bbffff00389a) "S-00066a00d8000242"[11] # lid 9 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR
>
> vendid=0x5ad
> devid=0x6274
> sysimgguid=0x5ad00000c5c3f
> caguid=0x5ad00000c5c3c
> Ca 1 "H-0005ad00000c5c3c" # "andrew mthca0"
> [1](5ad00000c5c3d) "S-00066a00d8000242"[1] # lid 15 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR
>
> vendid=0x1708
> devid=0x6278
> sysimgguid=0x1708ffffd09dfb
> caguid=0x1708ffffd09df8
> Ca 2 "H-001708ffffd09df8" # "alexandria2 HCA-1"
> [1](1708ffffd09df9) "S-00066a00d8000242"[6] # lid 4 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR
> [2](1708ffffd09dfa) "S-00066a00d8000242"[8] # lid 5 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR
>
> vendid=0x1708
> devid=0x6278
> sysimgguid=0x19bbffff005853
> caguid=0x19bbffff005850
> Ca 2 "H-0019bbffff005850" # "saga mthca0"
> [1](19bbffff005851) "S-00066a00d8000242"[10] # lid 1 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR
>
>
>
>>
>> You should check what saquery -m 0xc000 says after looking at saquery -g to
>> make sure that the IPoIB broadcast group ( ff12:401b:ffff::ffff:ffff ) says
>> MLID 0xc000.
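
As a rough sketch of where that broadcast group address comes from: RFC 4391 defines the IPv4 IPoIB broadcast GID as ff1x:401b:<P_Key>:0000:0000:0000:ffff:ffff, where x is the scope and the P_Key carries the full-membership bit. The helper name below is hypothetical, and the scope of 2 is taken from the MGID quoted in this thread.

```python
import ipaddress


def ipoib_broadcast_mgid(pkey: int, scope: int = 0x2) -> str:
    """Build the IPv4 IPoIB broadcast MGID for a partition key (RFC 4391 layout)."""
    full_pkey = pkey | 0x8000  # set the full-membership bit
    # Eight 16-bit groups: multicast prefix + scope, IPv4 signature 0x401b,
    # P_Key, three zero groups, then the 32-bit IPv4 broadcast address.
    groups = [0xFF10 | scope, 0x401B, full_pkey, 0, 0, 0, 0xFFFF, 0xFFFF]
    value = 0
    for group in groups:
        value = (value << 16) | group
    return ipaddress.IPv6Address(value).compressed


if __name__ == "__main__":
    # 0x7fff is the Default partition key from /etc/rdma/partitions in this thread.
    print(ipoib_broadcast_mgid(0x7FFF))
```

With the Default partition key 0x7fff this yields ff12:401b:ffff::ffff:ffff, the group named above.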
>>
>>
>> That is fine on my opensm machine. I don't seem to have saquery on the
>> Windows machine.
>>
>> You can run it from the opensm machine to see if Windows machine is part of
>> the IPoIB broadcast group. I want to see both group parameters for MLID 0xc000
>> and whether windows port GUID (0x0005ad00000c5ced) is listed as joined in that
>> group.
>>
>
> MCMemberRecord group dump:
> MGID....................ff12:401b:ffff::ffff:ffff
> Mlid....................0xC000
> Mtu.....................0x84
> pkey....................0xFFFF
> Rate....................0x83
> SL......................0x0
>
> # saquery -m 0xc000
> PortGid.................fe80::1:5:ad00:c:5c3d (Topspin DDR-HCAe LX x8)
> PortGid.................fe80::1:19:bbff:ff00:5851 (saga mthca0)
> PortGid.................fe80::1:19:bbff:ff00:3899 (sfcomp1 mthca0)
> PortGid.................fe80::1:1a:4bff:ff0c:20c9 (HP Lion Cub 128MB)
> PortGid.................fe80::5:ad00:c:5ced (MT25204 InfiniHostLx Mellanox Technologies)
> PortGid.................fe80::1:17:8ff:ffd0:9df9 (alexandria2 HCA-1)
>
>
> Seems like I may have two entries for the 5:ad00:c:5ced device?
That looks different to me from 5:ad00:c:5c3d, which is the Topspin one.
> Perhaps updating the firmware led to that (now it is MT25204 instead of
> Topspin).
>
>
It looks like your 2 subnets are "interconnected", so they're not really 2
disjoint subnets! Is your other subnet 0xfe80::5 ? Looking at your
ibnetdiscover file, there's only 1 switch, so are you running 2 SMs (one for
each subnet) over the same topology? If so, that doesn't work.
-- Hal
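
One way to check the subnet question is to split each PortGid into its two halves: the upper 64 bits are the subnet prefix and the lower 64 bits are the port GUID. The sketch below does this with Python's standard ipaddress module, using two GIDs from the saquery output in this thread. The helper name is hypothetical.

```python
import ipaddress


def split_gid(gid: str) -> tuple[int, int]:
    """Return (subnet prefix, port GUID) as 64-bit integers for a GID string."""
    value = int(ipaddress.IPv6Address(gid))
    prefix = value >> 64
    guid = value & ((1 << 64) - 1)
    return prefix, guid


if __name__ == "__main__":
    for gid in ("fe80::1:5:ad00:c:5c3d", "fe80::5:ad00:c:5ced"):
        prefix, guid = split_gid(gid)
        print(f"{gid}: prefix=0x{prefix:016x} guid=0x{guid:016x}")
```

The first GID splits into prefix 0xfe80000000000001 and GUID 0x0005ad00000c5c3d, the second into prefix 0xfe80000000000000 and GUID 0x0005ad00000c5ced, so the two ports report different subnet prefixes.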
>
>
> I suspect it may not be due to:
>> Jun 06 12:06:29 959894 [295C0700] 0x01 -> mcmr_rcv_join_mgrp: ERR 1B13:
>> validate_modify failed from port 0x0005ad00000c5ced (Topspin DDR-HCAe LX x8),
>> sending IB_SA_MAD_STATUS_REQ_INVALID
>> is worrisome...
>>
>
> After having to reboot many machines to clear up the mess, I am afraid to
> re-enable the IPoIB interface on the Windows machine.
>
>
>
> --
> Orion Poplawski
> Technical Manager 303-415-9701 x222
> NWRA, Boulder/CoRA Office FAX: 303-415-9702
> 3380 Mitchell Lane orion at nwra.com
> Boulder, CO 80301 http://www.nwra.com
>