<br><br><div class="gmail_quote">On Tue, Apr 30, 2013 at 1:35 PM, Orion Poplawski <span dir="ltr"><<a href="mailto:orion@cora.nwra.com" target="_blank">orion@cora.nwra.com</a>></span> wrote:<br><blockquote style="margin:0px 0px 0px 0.8ex;padding-left:1ex;border-left-color:rgb(204,204,204);border-left-width:1px;border-left-style:solid" class="gmail_quote">
<div class="im">On 04/30/2013 11:16 AM, Hal Rosenstock wrote:<br>
</div><blockquote style="margin:0px 0px 0px 0.8ex;padding-left:1ex;border-left-color:rgb(204,204,204);border-left-width:1px;border-left-style:solid" class="gmail_quote"><div class="im">
<br>
Hi,<br>
On Tue, Apr 30, 2013 at 11:23 AM, Orion Poplawski <<a href="mailto:orion@cora.nwra.com" target="_blank">orion@cora.nwra.com</a>> wrote:<br></div><div class="im">
<br>
I'm going to have some overlapping IB networks,<br>
<br>
What do you mean by overlapping IB networks? Unlike IP subnets, when IB<br>
subnets overlap, they are one subnet (managed by one master SM, which assigns a<br>
single subnet prefix).<br>
</div></blockquote>
<br>
What I mean is a couple of machines connected to two different InfiniBand networks.<div class="im"><br></div></blockquote><div><div> </div><div>With two separate SMs, one for each subnet?</div> </div><blockquote style="margin:0px 0px 0px 0.8ex;padding-left:1ex;border-left-color:rgb(204,204,204);border-left-width:1px;border-left-style:solid" class="gmail_quote">
<div class="im">
<br>
<blockquote style="margin:0px 0px 0px 0.8ex;padding-left:1ex;border-left-color:rgb(204,204,204);border-left-width:1px;border-left-style:solid" class="gmail_quote">
and to shut up openmpi's warning about multiple ports with the default subnet,<br>
<br>
It looks like you may need two parallel, disjoint subnets, but I don't know Open MPI<br>
well enough to say whether there is some other way to configure it.<br>
</blockquote>
<br></div>
Open MPI complains if you have two ports connected to the same subnet prefix and that prefix is the default prefix. It's trying to be helpful about common networking mistakes.<div class="im"><br>
<br>
<blockquote style="margin:0px 0px 0px 0.8ex;padding-left:1ex;border-left-color:rgb(204,204,204);border-left-width:1px;border-left-style:solid" class="gmail_quote">
I'm trying to change the subnet_prefix to 0xfe80000000000001 (in<br>
/etc/rdma/opensm.conf). However, now things are not happy and I'm seeing<br>
the following in opensm.log:<br>
<br>
Apr 30 09:08:58 739460 [DA401700] 0x01 -> mcmr_rcv_join_mgrp: ERR 1B11:<br>
method = SubnAdmSet, scope_state = 0x1, component mask =<br>
0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID:<br>
ff12:401b:ffff::ffff:ffff from port 0x0019bbffff005851 (saga mthca0)<br>
Apr 30 09:09:03 372476 [D17F3700] 0x01 -> mcmr_rcv_join_mgrp: ERR 1B11:<br>
method = SubnAdmSet, scope_state = 0x1, component mask =<br>
0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID:<br>
ff12:401b:ffff::ffff:ffff from port 0x001708ffffd09df9 (alexandria2 HCA-1)<br>
<br>
OpenSM is complaining that the IPoIB broadcast group has not already been<br>
created, and these joins are insufficient to create it.<br>
</blockquote>
<br></div>
Thanks! So it was my messing around with partitions.conf that broke things! Adding the ipoib flag:<br>
<br>
Default=0x7fff, ipoib : ALL=full ;<br>
<br>
did the trick.<br>
<br>
<blockquote style="margin:0px 0px 0px 0.8ex;padding-left:1ex;border-left-color:rgb(204,204,204);border-left-width:1px;border-left-style:solid" class="gmail_quote"><div><div class="h5">
<br>
and I cannot ping remote IB IPs.<br>
<br>
Right; if the joins don't work, IPoIB connectivity won't work either.<br>
<br>
<br>
[root@saga ~]# ibstat<br>
CA 'mthca0'<br>
CA type: MT25208 (MT23108 compat mode)<br>
Number of ports: 2<br>
Firmware version: 4.7.400<br>
Hardware version: a0<br>
Node GUID: 0x0019bbffff005850<br>
System image GUID: 0x0019bbffff005853<br>
Port 1:<br>
State: Active<br>
Physical state: LinkUp<br>
Rate: 8<br>
Base lid: 1<br>
LMC: 0<br>
SM lid: 1<br>
Capability mask: 0x02510a6a<br>
Port GUID: 0x0019bbffff005851<br>
Link layer: InfiniBand<br>
Port 2:<br>
State: Active<br>
Physical state: LinkUp<br>
Rate: 8<br>
Base lid: 4<br>
LMC: 0<br>
SM lid: 1<br>
Capability mask: 0x02510a68<br>
Port GUID: 0x0019bbffff005852<br>
Link layer: InfiniBand<br>
[root@saga ~]# ip addr show dev ib0<br>
4: ib0: <BROADCAST,MULTICAST,UP> mtu 2044 qdisc pfifo_fast state UNKNOWN qlen 256<br>
link/infiniband 80:00:04:04:fe:80:00:00:00:00:00:01:00:19:bb:ff:ff:00:58:51 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff<br></div></div>
inet 192.168.2.12/24 brd 192.168.2.255 scope global ib0<div><div class="h5"><br>
<br>
[root@alexandria2 ~]# ibstat<br>
CA 'mthca0'<br>
CA type: MT25208 (MT23108 compat mode)<br>
Number of ports: 2<br>
Firmware version: 4.7.400<br>
Hardware version: a0<br>
Node GUID: 0x001708ffffd09df8<br>
System image GUID: 0x001708ffffd09dfb<br>
Port 1:<br>
State: Active<br>
Physical state: LinkUp<br>
Rate: 10<br>
Base lid: 9<br>
LMC: 0<br>
SM lid: 1<br>
Capability mask: 0x02510a68<br>
Port GUID: 0x001708ffffd09df9<br>
Link layer: InfiniBand<br>
Port 2:<br>
State: Active<br>
Physical state: LinkUp<br>
Rate: 10<br>
Base lid: 8<br>
LMC: 0<br>
SM lid: 1<br>
Capability mask: 0x02510a68<br>
Port GUID: 0x001708ffffd09dfa<br>
Link layer: InfiniBand<br>
[root@alexandria2 ~]# ip addr show dev ib0<br>
6: ib0: <BROADCAST,MULTICAST,UP> mtu 2044 qdisc pfifo_fast state UNKNOWN qlen 256<br>
link/infiniband 80:00:04:04:fe:80:00:00:00:00:00:01:00:17:08:ff:ff:d0:9d:f9 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff<br></div></div>
inet 192.168.2.16/24 brd 192.168.2.255 scope global ib0<div class="im"><br>
<br>
[root@alexandria2 ~]# ibping -G 0x0019bbffff005851<br>
Pong from saga.cora.nwra.com.(none) (Lid 1): time 0.133 ms<br>
Pong from saga.cora.nwra.com.(none) (Lid 1): time 0.103 ms<br>
^C<br>
--- saga.cora.nwra.com.(none) (Lid 1) ibping statistics ---<br>
2 packets transmitted, 2 received, 0% packet loss, time 1908 ms<br>
rtt min/avg/max = 0.103/0.118/0.133 ms<br>
[root@alexandria2 ~]# ibping -G 0x0019bbffff005852<br>
ibwarn: [3274] mad_rpc_rmpp: _do_madrpc failed; dport (Lid 4)<br>
ibwarn: [3274] mad_rpc_rmpp: _do_madrpc failed; dport (Lid 4)<br>
^C<br>
--- (Lid 4) ibping statistics ---<br>
2 packets transmitted, 0 received, 100% packet loss, time 7636 ms<br>
rtt min/avg/max = 0.000/0.000/0.000 ms<br>
<br>
<br>
I'm at a loss. Any ideas? Thanks!<br>
<br>
Without the topology, it's hard to tell what's going on.<br>
</div></blockquote><p>
<br>
At this point there is one switch and two machines, each with a dual-port card. Both ports of each card are connected to the switch.</p><p> </p></blockquote><div> </div></div><div>Just one switch, or one switch per subnet?</div>
<div> </div>