[Users] Trouble with subnet_prefix

Hal Rosenstock hal.rosenstock at gmail.com
Tue Apr 30 10:57:17 PDT 2013


On Tue, Apr 30, 2013 at 1:35 PM, Orion Poplawski <orion at cora.nwra.com> wrote:

> On 04/30/2013 11:16 AM, Hal Rosenstock wrote:
>
>>
>> Hi,
>> On Tue, Apr 30, 2013 at 11:23 AM, Orion Poplawski <orion at cora.nwra.com> wrote:
>>
>>     I'm going to have some overlapping IB networks,
>>
>> What do you mean by overlapping IB networks? Unlike IP subnets, when IB
>> subnets overlap, they are one subnet (managed by one master SM which
>> assigns a single subnet prefix).
>>
>
> What I mean is a couple of machines connected to two different InfiniBand
> networks.
>
>
With two separate SMs, one for each subnet?


>
>>     and to shut up openmpi's warning about multiple ports with the
>>     default subnet,
>>
>> Looks like you may need two parallel disjoint subnets but I don't know
>> openmpi well enough to be sure if there's some other way to configure
>> openmpi.
>>
>
> openmpi complains if you have two ports connected to the same subnet prefix
> and that prefix is the default prefix.  It's trying to be helpful about
> common networking mistakes.
>
>
>>     I'm trying to change the subnet_prefix to 0xfe80000000000001 (in
>>     /etc/rdma/opensm.conf).  However, now things are not happy and I'm
>>     seeing the following in opensm.log:
>>
>>     Apr 30 09:08:58 739460 [DA401700] 0x01 -> mcmr_rcv_join_mgrp: ERR 1B11:
>>     method = SubnAdmSet, scope_state = 0x1, component mask =
>>     0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID:
>>     ff12:401b:ffff::ffff:ffff from port 0x0019bbffff005851 (saga mthca0)
>>     Apr 30 09:09:03 372476 [D17F3700] 0x01 -> mcmr_rcv_join_mgrp: ERR 1B11:
>>     method = SubnAdmSet, scope_state = 0x1, component mask =
>>     0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID:
>>     ff12:401b:ffff::ffff:ffff from port 0x001708ffffd09df9 (alexandria2 HCA-1)
>>
>> OpenSM is complaining that the IPoIB broadcast group has not already been
>> created, and these join requests do not carry enough components to create it.
>>
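
To make the two masks in the ERR 1B11 lines easier to read, here is a minimal
Python sketch that names the bits set in each. The bit order is an assumption:
it is the MCMemberRecord component order as I recall it from the InfiniBand
spec, so please check it against the spec or the OpenSM headers before relying
on it. The helper name decode_comp_mask is made up for this illustration.

```python
# Assumed MCMemberRecord component-mask bit order, bit 0 first.
# Recalled from the InfiniBand SA attribute layout; verify before relying on it.
MCMR_COMPONENTS = [
    "MGID", "PortGID", "Q_Key", "MLID", "MTUSelector", "MTU",
    "TClass", "P_Key", "RateSelector", "Rate",
    "PacketLifeTimeSelector", "PacketLifeTime", "SL", "FlowLabel",
    "HopLimit", "Scope", "JoinState", "ProxyJoin",
]


def decode_comp_mask(mask):
    """Return the names of the components whose bits are set in mask."""
    return [name for bit, name in enumerate(MCMR_COMPONENTS)
            if mask & (1 << bit)]


# Values copied from the opensm.log lines quoted above.
sent = decode_comp_mask(0x0000000000010083)
expected = decode_comp_mask(0x00000000000130C7)
missing = [name for name in expected if name not in sent]

print("sent:    ", sent)
print("expected:", expected)
print("missing: ", missing)
```

Under that assumed bit order, the joins supply MGID, PortGID, P_Key and
JoinState, while the expected mask additionally asks for Q_Key, TClass, SL
and FlowLabel, which fits the explanation that the request is a plain join
rather than one able to create the group.
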
>
> Thanks!  So it was my messing around with partitions.conf that broke
> things! Adding the ipoib flag:
>
> Default=0x7fff, ipoib : ALL=full ;
>
> did the trick.
>
>
>>     and I cannot ping remote IB IPs.
>>
>> Right; if the joins don't work, IPoIB connectivity won't work either.
>>
>>
>>     [root at saga ~]# ibstat
>>     CA 'mthca0'
>>              CA type: MT25208 (MT23108 compat mode)
>>              Number of ports: 2
>>              Firmware version: 4.7.400
>>              Hardware version: a0
>>              Node GUID: 0x0019bbffff005850
>>              System image GUID: 0x0019bbffff005853
>>              Port 1:
>>                      State: Active
>>                      Physical state: LinkUp
>>                      Rate: 8
>>                      Base lid: 1
>>                      LMC: 0
>>                      SM lid: 1
>>                      Capability mask: 0x02510a6a
>>                      Port GUID: 0x0019bbffff005851
>>                      Link layer: InfiniBand
>>              Port 2:
>>                      State: Active
>>                      Physical state: LinkUp
>>                      Rate: 8
>>                      Base lid: 4
>>                      LMC: 0
>>                      SM lid: 1
>>                      Capability mask: 0x02510a68
>>                      Port GUID: 0x0019bbffff005852
>>                      Link layer: InfiniBand
>>     [root at saga ~]# ip addr show dev ib0
>>     4: ib0: <BROADCAST,MULTICAST,UP> mtu 2044 qdisc pfifo_fast state UNKNOWN qlen 256
>>         link/infiniband 80:00:04:04:fe:80:00:00:00:00:00:01:00:19:bb:ff:ff:00:58:51 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
>>         inet 192.168.2.12/24 brd 192.168.2.255 scope global ib0
>>
>>     [root at alexandria2 ~]# ibstat
>>     CA 'mthca0'
>>              CA type: MT25208 (MT23108 compat mode)
>>              Number of ports: 2
>>              Firmware version: 4.7.400
>>              Hardware version: a0
>>              Node GUID: 0x001708ffffd09df8
>>              System image GUID: 0x001708ffffd09dfb
>>              Port 1:
>>                      State: Active
>>                      Physical state: LinkUp
>>                      Rate: 10
>>                      Base lid: 9
>>                      LMC: 0
>>                      SM lid: 1
>>                      Capability mask: 0x02510a68
>>                      Port GUID: 0x001708ffffd09df9
>>                      Link layer: InfiniBand
>>              Port 2:
>>                      State: Active
>>                      Physical state: LinkUp
>>                      Rate: 10
>>                      Base lid: 8
>>                      LMC: 0
>>                      SM lid: 1
>>                      Capability mask: 0x02510a68
>>                      Port GUID: 0x001708ffffd09dfa
>>                      Link layer: InfiniBand
>>     [root at alexandria2 ~]# ip addr show dev ib0
>>     6: ib0: <BROADCAST,MULTICAST,UP> mtu 2044 qdisc pfifo_fast state UNKNOWN qlen 256
>>         link/infiniband 80:00:04:04:fe:80:00:00:00:00:00:01:00:17:08:ff:ff:d0:9d:f9 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
>>         inet 192.168.2.16/24 brd 192.168.2.255 scope global ib0
>>
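
A quick way to confirm that the new subnet_prefix actually reached the ports
is to split the 20-octet link/infiniband address into its parts. The sketch
below assumes the usual IPoIB layout (1 octet of flags, 3 octets of QPN, then
a 16-octet GID made of an 8-octet subnet prefix followed by the 8-octet port
GUID). The function name parse_ipoib_hwaddr is hypothetical, and the address
is the one reported for ib0 on alexandria2.

```python
def parse_ipoib_hwaddr(addr):
    """Split a colon-separated 20-octet IPoIB hardware address into fields.

    Assumed layout: flags (1 octet), QPN (3 octets), GID (16 octets),
    where the GID is subnet prefix (8 octets) + port GUID (8 octets).
    """
    octets = bytes(int(part, 16) for part in addr.split(":"))
    if len(octets) != 20:
        raise ValueError("expected 20 octets, got %d" % len(octets))
    return {
        "flags": octets[0],
        "qpn": int.from_bytes(octets[1:4], "big"),
        "subnet_prefix": "0x" + octets[4:12].hex(),
        "port_guid": "0x" + octets[12:20].hex(),
    }


# Address taken from the `ip addr show dev ib0` output on alexandria2.
info = parse_ipoib_hwaddr(
    "80:00:04:04:fe:80:00:00:00:00:00:01:00:17:08:ff:ff:d0:9d:f9"
)
print(info["subnet_prefix"], info["port_guid"])
```

For this address the prefix comes out as 0xfe80000000000001 and the GUID as
0x001708ffffd09df9, which matches the Port 1 GUID in the ibstat output, so the
prefix change itself appears to have taken effect.
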
>>     [root at alexandria2 ~]# ibping -G 0x0019bbffff005851
>>     Pong from saga.cora.nwra.com.(none) (Lid 1): time 0.133 ms
>>     Pong from saga.cora.nwra.com.(none) (Lid 1): time 0.103 ms
>>     ^C
>>     --- saga.cora.nwra.com.(none) (Lid 1) ibping statistics ---
>>     2 packets transmitted, 2 received, 0% packet loss, time 1908 ms
>>     rtt min/avg/max = 0.103/0.118/0.133 ms
>>     [root at alexandria2 ~]# ibping -G 0x0019bbffff005852
>>     ibwarn: [3274] mad_rpc_rmpp: _do_madrpc failed; dport (Lid 4)
>>     ibwarn: [3274] mad_rpc_rmpp: _do_madrpc failed; dport (Lid 4)
>>     ^C
>>     ---  (Lid 4) ibping statistics ---
>>     2 packets transmitted, 0 received, 100% packet loss, time 7636 ms
>>     rtt min/avg/max = 0.000/0.000/0.000 ms
>>
>>
>>     I'm at a loss.  Any ideas?  Thanks!
>>
>> Without the topology, it's hard to tell what's going on.
>>
>
> At this point there is one switch and two machines each with a dual port
> card.  Both ports of each card are connected to the switch.
>
>
>

Just one switch, or one switch per subnet?