[Users] IPoIB not working on Windows 2008 r2 - need help

Hal Rosenstock hal.rosenstock at gmail.com
Fri Jun 7 11:56:35 PDT 2013


On Fri, Jun 7, 2013 at 12:49 PM, Orion Poplawski <orion at cora.nwra.com> wrote:

> On 06/07/2013 10:09 AM, Hal Rosenstock wrote:
>
>>
>>     IPoIB is working fine among my Linux machines, I'm just trying to add
>>     Windows to the mix.  I'm running opensm 3.3.15 on SL 6.
>>
>> What's SL 6 ?
>>
>>
> ScientificLinux - RHEL clone.
>
>
>
>>     /etc/rdma/partitions:
>>     Default=0x7fff, ipoib : ALL=full ;
>>
>>
>> I have a theory as to what is going on. I think the IB port on Windows is
>> too slow (IB rate * width) to join the already formed group but I thought
>> you wrote something about IPoIB link being indicated by Windows IPoIB
>> driver/net device.
>>
>
> It shouldn't be.  The switch is SDR - so everything should be running at
> that rate - and ibstat reports that rate.
>
>
>
Yes, rate is fine (SDR * 4x = 10 Gbps).
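
A quick sketch of that arithmetic, assuming the usual per-lane signalling
rates for the older InfiniBand speeds (the table and function name below are
illustrative, not taken from any tool):

```python
# Per-lane signalling rate in Gbps (assumed values for SDR/DDR/QDR).
LANE_GBPS = {"SDR": 2.5, "DDR": 5.0, "QDR": 10.0}


def link_rate_gbps(speed: str, width: int) -> float:
    """Return the link signalling rate: per-lane rate times lane count."""
    return LANE_GBPS[speed] * width


# A 4x SDR port, as reported for every link in the ibnetdiscover output.
print(link_rate_gbps("SDR", 4))
```

For a 4x SDR link this gives 10 Gbps, which is the rate the existing
broadcast group requires.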


>  Would you send me the output of an ibnetdiscover for your subnet ?
>>
>
>
Which is SM host ?


> #
> # Topology file: generated on Fri Jun  7 10:43:36 2013
> #
> # Initiated from node 0019bbffff005850 port 0019bbffff005851
>
> vendid=0x66a
> devid=0xb924
> sysimgguid=0x66a00d8000242
> switchguid=0x66a00d8000242(66a00d8000242)
> Switch  24 "S-00066a00d8000242"         # "InfinIO 9024 Switch " enhanced port 0 lid 2 lmc 0
> [1]     "H-0005ad00000c5c3c"[1](5ad00000c5c3d)          # "andrew mthca0" lid 15 4xSDR
> [6]     "H-001708ffffd09df8"[1](1708ffffd09df9)                 # "alexandria2 HCA-1" lid 4 4xSDR
> [8]     "H-001708ffffd09df8"[2](1708ffffd09dfa)                 # "alexandria2 HCA-1" lid 5 4xSDR
> [10]    "H-0019bbffff005850"[1](19bbffff005851)                 # "saga mthca0" lid 1 4xSDR
> [11]    "H-0019bbffff003898"[2](19bbffff00389a)                 # "sfcomp1 mthca0" lid 9 4xSDR
> [12]    "H-001a4bffff0c20c8"[1](1a4bffff0c20c9)                 # "earth mthca0" lid 13 4xSDR
> [20]    "H-0005ad00000c5cec"[1](5ad00000c5ced)          # "MT25204 InfiniHostLx Mellanox Technologies" lid 16 4xSDR
> [23]    "H-0019bbffff003898"[1](19bbffff003899)                 # "sfcomp1 mthca0" lid 8 4xSDR
>
> vendid=0x2c9
> devid=0x6274
> sysimgguid=0x5ad00000c5cef
> caguid=0x5ad00000c5cec
> Ca      1 "H-0005ad00000c5cec"          # "MT25204 InfiniHostLx Mellanox Technologies"
> [1](5ad00000c5ced)      "S-00066a00d8000242"[20]                # lid 16 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR
>
> vendid=0x1708
> devid=0x6278
> sysimgguid=0x1a4bffff0c20cb
> caguid=0x1a4bffff0c20c8
> Ca      2 "H-001a4bffff0c20c8"          # "earth mthca0"
> [1](1a4bffff0c20c9)     "S-00066a00d8000242"[12]                # lid 13 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR
>
> vendid=0x1708
> devid=0x6278
> sysimgguid=0x19bbffff00389b
> caguid=0x19bbffff003898
> Ca      2 "H-0019bbffff003898"          # "sfcomp1 mthca0"
> [1](19bbffff003899)     "S-00066a00d8000242"[23]                # lid 8 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR
> [2](19bbffff00389a)     "S-00066a00d8000242"[11]                # lid 9 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR
>
> vendid=0x5ad
> devid=0x6274
> sysimgguid=0x5ad00000c5c3f
> caguid=0x5ad00000c5c3c
> Ca      1 "H-0005ad00000c5c3c"          # "andrew mthca0"
> [1](5ad00000c5c3d)      "S-00066a00d8000242"[1]         # lid 15 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR
>
> vendid=0x1708
> devid=0x6278
> sysimgguid=0x1708ffffd09dfb
> caguid=0x1708ffffd09df8
> Ca      2 "H-001708ffffd09df8"          # "alexandria2 HCA-1"
> [1](1708ffffd09df9)     "S-00066a00d8000242"[6]         # lid 4 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR
> [2](1708ffffd09dfa)     "S-00066a00d8000242"[8]         # lid 5 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR
>
> vendid=0x1708
> devid=0x6278
> sysimgguid=0x19bbffff005853
> caguid=0x19bbffff005850
> Ca      2 "H-0019bbffff005850"          # "saga mthca0"
> [1](19bbffff005851)     "S-00066a00d8000242"[10]                # lid 1 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR
>
>
>
>>
>>         You should check what saquery -m 0xc000 says after looking at
>>         saquery -g to make sure that the IPoIB broadcast group
>>         ( ff12:401b:ffff::ffff:ffff ) says MLID 0xc000.
>>
>>
>>     That is fine on my opensm machine.  I don't seem to have saquery on
>>     the Windows machine.
>>
>> You can run it from the opensm machine to see if Windows machine is part
>> of the IPoIB broadcast group. I want to see both group parameters for MLID
>> 0xc000 and whether windows port GUID (0x0005ad00000c5ced) is listed as
>> joined in that group.
>>
>
> MCMemberRecord group dump:
>                 MGID....................ff12:401b:ffff::ffff:ffff
>                 Mlid....................0xC000
>                 Mtu.....................0x84
>                 pkey....................0xFFFF
>                 Rate....................0x83
>                 SL......................0x0
>
> # saquery -m 0xc000
>                 PortGid.................fe80::1:5:ad00:c:5c3d (Topspin DDR-HCAe LX x8)
>                 PortGid.................fe80::1:19:bbff:ff00:5851 (saga mthca0)
>                 PortGid.................fe80::1:19:bbff:ff00:3899 (sfcomp1 mthca0)
>                 PortGid.................fe80::1:1a:4bff:ff0c:20c9 (HP Lion Cub 128MB)
>                 PortGid.................fe80::5:ad00:c:5ced (MT25204 InfiniHostLx Mellanox Technologies)
>                 PortGid.................fe80::1:17:8ff:ffd0:9df9 (alexandria2 HCA-1)
>
>
> Seems like I may have two entries for the 5:ad00:c:5ced device?


Those look different to me: 5:ad00:c:5ced is not the same as 5:ad00:c:5c3d,
which is the Topspin one.
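
For reference, the group parameters in the MCMemberRecord dump quoted above
can be decoded by hand. The sketch below assumes the usual InfiniBand
encoding, where the top two bits of the Mtu and Rate bytes are a selector and
the low six bits index a value table, and where the top bit of a P_Key marks
full membership. The table contents and helper names are mine, for
illustration only:

```python
# Value tables as I understand the InfiniBand encodings (assumption).
MTU_BYTES = {1: 256, 2: 512, 3: 1024, 4: 2048, 5: 4096}
RATE_GBPS = {2: 2.5, 3: 10, 4: 30, 5: 5, 6: 20, 7: 40, 8: 60, 9: 80, 10: 120}
SELECTOR = {0: "greater than", 1: "less than", 2: "exactly", 3: "largest available"}


def decode(field: int, table: dict) -> tuple:
    """Split a selector/value byte into (selector meaning, decoded value)."""
    selector = field >> 6   # top two bits
    value = field & 0x3F    # low six bits
    return SELECTOR[selector], table[value]


def full_member_pkey(base: int) -> int:
    """Set the membership bit (bit 15) on a 15-bit partition key."""
    return base | 0x8000


print(decode(0x84, MTU_BYTES))         # Mtu field from the dump
print(decode(0x83, RATE_GBPS))         # Rate field from the dump
print(hex(full_member_pkey(0x7FFF)))   # Default=0x7fff with ALL=full
```

Under those assumptions the group requires exactly a 2048-byte MTU and
exactly 10 Gbps, and the 0xFFFF pkey is just the Default partition 0x7fff
with full membership.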


> Perhaps updating the firmware led to that (now it is MT25204 instead of
> Topspin).
>
>
Looks like your 2 subnets are "interconnected", so they're not really 2
disjoint subnets! Is your other subnet 0xfe80::5? Looking at your
ibnetdiscover file, there's only 1 switch, so are you running 2 SMs (one for
each subnet) over the same topology? If so, that doesn't work.
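
One way to compare the subnet prefixes directly is to split each PortGid into
its upper 64 bits (subnet prefix) and lower 64 bits (port GUID). This is only
a sketch: the GIDs are the ones I read from the saquery output above, and
split_gid is a hypothetical helper, not part of any IB tool:

```python
import ipaddress
from collections import defaultdict


def split_gid(gid: str) -> tuple:
    """Return (subnet_prefix, port_guid) as integers for a 128-bit GID."""
    raw = int(ipaddress.IPv6Address(gid))
    return raw >> 64, raw & ((1 << 64) - 1)


# PortGids as read from the "saquery -m 0xc000" output (assumed transcription).
gids = [
    "fe80::1:5:ad00:c:5c3d",
    "fe80::1:19:bbff:ff00:5851",
    "fe80::1:19:bbff:ff00:3899",
    "fe80::1:1a:4bff:ff0c:20c9",
    "fe80::5:ad00:c:5ced",
    "fe80::1:17:8ff:ffd0:9df9",
]

# Group the port GUIDs by subnet prefix to see which ports disagree.
by_prefix = defaultdict(list)
for gid in gids:
    prefix, guid = split_gid(gid)
    by_prefix[prefix].append(guid)

for prefix, guids in by_prefix.items():
    print(f"{prefix:#018x}:", ", ".join(f"{g:#018x}" for g in guids))
```

Read this way, the 5ced port reports a different subnet prefix from the other
five members, even though they all sit behind the same switch.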

-- Hal


>
>
>  I suspect it may not be due to:
>> Jun 06 12:06:29 959894 [295C0700] 0x01 -> mcmr_rcv_join_mgrp: ERR 1B13:
>> validate_modify failed from port 0x0005ad00000c5ced (Topspin DDR-HCAe LX x8),
>> sending IB_SA_MAD_STATUS_REQ_INVALID
>> is worrisome...
>>
>
> After having to reboot many machines to clear up the mess, I'm afraid to
> re-enable the IPoIB interface on the Windows machine.
>
>
>
> --
> Orion Poplawski
> Technical Manager                     303-415-9701 x222
> NWRA, Boulder/CoRA Office             FAX: 303-415-9702
> 3380 Mitchell Lane                       orion at nwra.com
> Boulder, CO 80301                   http://www.nwra.com
>