<br><br><div class="gmail_quote">On Fri, Jun 7, 2013 at 12:49 PM, Orion Poplawski <span dir="ltr"><<a href="mailto:orion@cora.nwra.com" target="_blank">orion@cora.nwra.com</a>></span> wrote:<br><blockquote style="margin:0px 0px 0px 0.8ex;padding-left:1ex;border-left-color:rgb(204,204,204);border-left-width:1px;border-left-style:solid" class="gmail_quote">
<div class="im">On 06/07/2013 10:09 AM, Hal Rosenstock wrote:<br>
<blockquote style="margin:0px 0px 0px 0.8ex;padding-left:1ex;border-left-color:rgb(204,204,204);border-left-width:1px;border-left-style:solid" class="gmail_quote">
<br>
IPoIB is working fine among my Linux machines, I'm just trying to add<br>
Windows to the mix. I'm running opensm 3.3.15 on SL 6.<br>
<br>
What's SL 6 ?<br>
<br>
</blockquote>
<br></div>
ScientificLinux - RHEL clone.<div class="im"><br>
<br>
<blockquote style="margin:0px 0px 0px 0.8ex;padding-left:1ex;border-left-color:rgb(204,204,204);border-left-width:1px;border-left-style:solid" class="gmail_quote">
<br>
/etc/rdma/partitions:<br>
Default=0x7fff, ipoib : ALL=full ;<br>
<br>
<br>
I have a theory as to what is going on. I think the IB port on Windows is too<br>
slow (IB rate * width) to join the already formed group but I thought you<br>
wrote something about IPoIB link being indicated by Windows IPoIB driver/net<br>
device.<br>
</blockquote>
<br></div>
It shouldn't be. The switch is SDR - so everything should be running at that rate - and ibstat reports that rate.<div class="im"><br>
<br></div></blockquote><div> </div><div>Yes, rate is fine (SDR * 4x = 10 Gbps).</div><div> </div><blockquote style="margin:0px 0px 0px 0.8ex;padding-left:1ex;border-left-color:rgb(204,204,204);border-left-width:1px;border-left-style:solid" class="gmail_quote">
<div class="im">
<blockquote style="margin:0px 0px 0px 0.8ex;padding-left:1ex;border-left-color:rgb(204,204,204);border-left-width:1px;border-left-style:solid" class="gmail_quote">
Would you send me the output of an ibnetdiscover for your subnet ?<br>
</blockquote>
<br></div></blockquote><div> </div><div>Which host is the SM?</div><div> </div><blockquote style="margin:0px 0px 0px 0.8ex;padding-left:1ex;border-left-color:rgb(204,204,204);border-left-width:1px;border-left-style:solid" class="gmail_quote">
<div class="im"></div>
#<br>
# Topology file: generated on Fri Jun 7 10:43:36 2013<br>
#<br>
# Initiated from node 0019bbffff005850 port 0019bbffff005851<br>
<br>
vendid=0x66a<br>
devid=0xb924<br>
sysimgguid=0x66a00d8000242<br>
switchguid=0x66a00d8000242(66a00d8000242)<br>
Switch 24 "S-00066a00d8000242" # "InfinIO 9024 Switch " enhanced port 0 lid 2 lmc 0<br>
[1] "H-0005ad00000c5c3c"[1](5ad00000c5c3d) # "andrew mthca0" lid 15 4xSDR<br>
[6] "H-001708ffffd09df8"[1](1708ffffd09df9) # "alexandria2 HCA-1" lid 4 4xSDR<br>
[8] "H-001708ffffd09df8"[2](1708ffffd09dfa) # "alexandria2 HCA-1" lid 5 4xSDR<br>
[10] "H-0019bbffff005850"[1](19bbffff005851) # "saga mthca0" lid 1 4xSDR<br>
[11] "H-0019bbffff003898"[2](19bbffff00389a) # "sfcomp1 mthca0" lid 9 4xSDR<br>
[12] "H-001a4bffff0c20c8"[1](1a4bffff0c20c9) # "earth mthca0" lid 13 4xSDR<br>
[20] "H-0005ad00000c5cec"[1](5ad00000c5ced) # "MT25204 InfiniHostLx Mellanox Technologies" lid 16 4xSDR<br>
[23] "H-0019bbffff003898"[1](19bbffff003899) # "sfcomp1 mthca0" lid 8 4xSDR<br>
<br>
vendid=0x2c9<br>
devid=0x6274<br>
sysimgguid=0x5ad00000c5cef<br>
caguid=0x5ad00000c5cec<br>
Ca 1 "H-0005ad00000c5cec" # "MT25204 InfiniHostLx Mellanox Technologies"<br>
[1](5ad00000c5ced) "S-00066a00d8000242"[20] # lid 16 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR<br>
<br>
vendid=0x1708<br>
devid=0x6278<br>
sysimgguid=0x1a4bffff0c20cb<br>
caguid=0x1a4bffff0c20c8<br>
Ca 2 "H-001a4bffff0c20c8" # "earth mthca0"<br>
[1](1a4bffff0c20c9) "S-00066a00d8000242"[12] # lid 13 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR<br>
<br>
vendid=0x1708<br>
devid=0x6278<br>
sysimgguid=0x19bbffff00389b<br>
caguid=0x19bbffff003898<br>
Ca 2 "H-0019bbffff003898" # "sfcomp1 mthca0"<br>
[1](19bbffff003899) "S-00066a00d8000242"[23] # lid 8 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR<br>
[2](19bbffff00389a) "S-00066a00d8000242"[11] # lid 9 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR<br>
<br>
vendid=0x5ad<br>
devid=0x6274<br>
sysimgguid=0x5ad00000c5c3f<br>
caguid=0x5ad00000c5c3c<br>
Ca 1 "H-0005ad00000c5c3c" # "andrew mthca0"<br>
[1](5ad00000c5c3d) "S-00066a00d8000242"[1] # lid 15 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR<br>
<br>
vendid=0x1708<br>
devid=0x6278<br>
sysimgguid=0x1708ffffd09dfb<br>
caguid=0x1708ffffd09df8<br>
Ca 2 "H-001708ffffd09df8" # "alexandria2 HCA-1"<br>
[1](1708ffffd09df9) "S-00066a00d8000242"[6] # lid 4 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR<br>
[2](1708ffffd09dfa) "S-00066a00d8000242"[8] # lid 5 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR<br>
<br>
vendid=0x1708<br>
devid=0x6278<br>
sysimgguid=0x19bbffff005853<br>
caguid=0x19bbffff005850<br>
Ca 2 "H-0019bbffff005850" # "saga mthca0"<br>
[1](19bbffff005851) "S-00066a00d8000242"[10] # lid 1 lmc 0 "InfinIO 9024 Switch " lid 2 4xSDR<div class="im"><br>
<br>
<blockquote style="margin:0px 0px 0px 0.8ex;padding-left:1ex;border-left-color:rgb(204,204,204);border-left-width:1px;border-left-style:solid" class="gmail_quote">
<br>
<br>
You should check what saquery -m 0xc000 says after looking at saquery -g to<br>
make sure that the IPoIB broadcast group ( ff12:401b:ffff::ffff:ffff ) says<br>
MLID 0xc000.<br>
<br>
<br>
That is fine on my opensm machine. I don't seem to have saquery on the<br>
Windows machine.<br>
<br>
You can run it from the opensm machine to see if Windows machine is part of<br>
the IPoIB broadcast group. I want to see both group parameters for MLID 0xc000<br>
and whether windows port GUID (0x0005ad00000c5ced) is listed as joined in that<br>
group.<br>
</blockquote>
<br></div>
MCMemberRecord group dump:<br>
MGID....................ff12:401b:ffff::ffff:ffff<br>
Mlid....................0xC000<br>
Mtu.....................0x84<br>
pkey....................0xFFFF<br>
Rate....................0x83<br>
SL......................0x0<br>
<br>
# saquery -m 0xc000<br>
PortGid.................fe80::1:5:ad00:c:5c3d (Topspin DDR-HCAe LX x8)<div class="im"><br>
PortGid.................fe80::1:19:bbff:ff00:5851 (saga mthca0)<br></div>
PortGid.................fe80::1:19:bbff:ff00:3899 (sfcomp1 mthca0)<div class="im"><br>
PortGid.................fe80::1:1a:4bff:ff0c:20c9 (HP Lion Cub 128MB)<br>
PortGid.................fe80::5:ad00:c:5ced (MT25204 InfiniHostLx Mellanox Technologies)<br>
PortGid.................fe80::1:17:8ff:ffd0:9df9 (alexandria2 HCA-1)<br>
<br>
<br></div>
Seems like I may have two entries for the 5:ad00:c:5ced device? </blockquote><div> </div><div>That looks different to me from 5:ad00:c:5c3d, which is the Topspin one.</div><div> </div><blockquote style="margin:0px 0px 0px 0.8ex;padding-left:1ex;border-left-color:rgb(204,204,204);border-left-width:1px;border-left-style:solid" class="gmail_quote">
Perhaps updating the firmware led to that (now it is MT25204 instead of Topspin).<div class="im"><br></div></blockquote><div> </div><div>It looks like your 2 subnets are "interconnected", so they're not really 2 disjoint subnets! Is your other subnet 0xfe80::5? Looking at your ibnetdiscover file, there's only 1 switch, so are you running 2 SMs (one for each subnet) over the same topology? If so, that doesn't work.</div>
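<div> </div><div>One way to compare the subnet prefixes in the saquery output is to split each PortGid into its upper and lower 64 bits. Below is a minimal Python sketch, assuming the usual GID layout (subnet prefix in the upper half, port GUID in the lower half). The helper name split_gid is made up for illustration and is not part of any IB tool.</div>
<pre>
```python
import ipaddress


def split_gid(gid):
    """Split a port GID string into (subnet_prefix, port_guid) hex strings.

    Assumption: the GID is written in IPv6 notation, with the subnet
    prefix in the upper 64 bits and the port GUID in the lower 64 bits.
    """
    value = int(ipaddress.IPv6Address(gid))
    # divmod by 2**64 separates the upper and lower 64-bit halves.
    prefix, guid = divmod(value, 2 ** 64)
    # "#018x" prints "0x" followed by 16 zero-padded hex digits.
    return format(prefix, "#018x"), format(guid, "#018x")
```
</pre>
<div>For example, split_gid("fe80::5:ad00:c:5ced") returns ('0xfe80000000000000', '0x0005ad00000c5ced'), while split_gid("fe80::1:19:bbff:ff00:5851") returns ('0xfe80000000000001', '0x0019bbffff005851'), so the two entries can be compared prefix by prefix.</div>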
<div> </div><div>-- Hal</div><div> </div><blockquote style="margin:0px 0px 0px 0.8ex;padding-left:1ex;border-left-color:rgb(204,204,204);border-left-width:1px;border-left-style:solid" class="gmail_quote"><div class="im">
<br>
<br>
<blockquote style="margin:0px 0px 0px 0.8ex;padding-left:1ex;border-left-color:rgb(204,204,204);border-left-width:1px;border-left-style:solid" class="gmail_quote">
I suspect it may not be due to:<br>
Jun 06 12:06:29 959894 [295C0700] 0x01 -> mcmr_rcv_join_mgrp: ERR 1B13:<br>
validate_modify failed from port 0x0005ad00000c5ced (Topspin DDR-HCAe LX x8),<br>
sending IB_SA_MAD_STATUS_REQ_INVALID<br>
is worrisome...<br>
</blockquote>
<br></div>
After having to reboot many machines to clear up the mess, I'm afraid to re-enable the IPoIB interface on the Windows machine.<div class="HOEnZb"><div class="h5"><br>
<br>
<br>
-- <br>
Orion Poplawski<br>
Technical Manager <a href="tel:303-415-9701%20x222" target="_blank" value="+13034159701">303-415-9701 x222</a><br>
NWRA, Boulder/CoRA Office FAX: <a href="tel:303-415-9702" target="_blank" value="+13034159702">303-415-9702</a><br>
3380 Mitchell Lane <a href="mailto:orion@nwra.com" target="_blank">orion@nwra.com</a><br>
Boulder, CO 80301 <a href="http://www.nwra.com" target="_blank">http://www.nwra.com</a><br>
</div></div></blockquote></div><br>
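<div>For reference, the Mtu and Rate bytes in the quoted MCMemberRecord dump (0x84 and 0x83) can be decoded by hand. The sketch below assumes each byte packs a 2-bit selector in the top bits and a 6-bit enumerated value in the low bits, with the value tables as I recall them from the InfiniBand spec. The names decode_field, MTU_BYTES, RATE_GBPS and SELECTOR are illustrative only.</div>
<pre>
```python
# Enumerated value tables (assumed from the IB spec; check before relying on them).
MTU_BYTES = {1: 256, 2: 512, 3: 1024, 4: 2048, 5: 4096}
RATE_GBPS = {2: 2.5, 3: 10, 4: 30, 5: 5, 6: 20, 7: 40, 8: 60, 9: 80, 10: 120}
SELECTOR = {0: "greater than", 1: "less than", 2: "exactly", 3: "largest available"}


def decode_field(field, table):
    """Return (selector meaning, decoded value) for an MCMemberRecord Mtu/Rate byte."""
    # divmod by 64 separates the top 2 bits (selector) from the low 6 bits (value code).
    selector, code = divmod(field, 64)
    return SELECTOR[selector], table[code]
```
</pre>
<div>With these tables, decode_field(0x84, MTU_BYTES) gives ('exactly', 2048) and decode_field(0x83, RATE_GBPS) gives ('exactly', 10), i.e. a 2048-byte MTU and a 10 Gbps group rate, which is the same rate as the 4xSDR links shown by ibnetdiscover.</div>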