[Users] Weird IPoIB issue

Robert LeBlanc robert_leblanc at byu.edu
Wed Nov 13 10:38:27 PST 2013


We are on the latest version of firmware for all of our switches (as of
last month). I guess I'll have to check with Oracle and see if they are
setting this parameter in their subnet manager. Just to confirm, using
smpdump (or similar) to change the value won't do any good because the
subnet manager will just change it back?

I think this is the cause of the problems, now to get it fixed.
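
For reference, the rule in the spec text quoted below can be written out as a
small check. This is only a sketch of that quoted text, not of any switch
firmware or OpenSM code. The function names are made up for illustration, and
values between 1 and 0xBFFE are reported as not covered because the quoted text
does not describe them.

```python
# Sketch only: interprets MulticastFDBTop as described in the quoted spec text.
# Function names are hypothetical; nothing here talks to real hardware.

MLID_BASE = 0xC000        # first multicast LID
NO_ENTRIES = 0xBFFF       # "no MulticastForwardingTable entries"


def mcast_fdb_top_status(top, mcast_fdb_cap):
    """Classify a MulticastFDBTop value against the switch's McastFdbCap."""
    if top == 0:
        return "ignored"          # switch shall ignore the component
    if top == NO_ENTRIES:
        return "no-entries"       # table treated as empty
    if top < MLID_BASE:
        return "not-covered"      # quoted text says nothing about this range
    if top < MLID_BASE + mcast_fdb_cap:
        return "valid"            # less than MulticastFDBCap + 0xC000
    return "invalid"


def mlid_within_top(mlid, top):
    """True if MulticastFDBTop does not place this MLID outside the table."""
    if top == 0:
        return True               # component ignored, so it imposes no bound
    return mlid <= top            # MLIDs greater than top are out of range


if __name__ == "__main__":
    # Values from the smpquery output for LID 52 in this thread.
    cap, top = 4096, 0xBFFF
    print(mcast_fdb_top_status(top, cap))
    print(mlid_within_top(0xC000, top))
```

With the values reported for LID 52 (McastFdbCap 4096, MulticastFDBTop 0xbfff),
the IPoIB broadcast MLID 0xC000 falls outside the range, which matches the
symptom described in this thread.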

Thanks,


Robert LeBlanc
OIT Infrastructure & Virtualization Engineer
Brigham Young University


On Wed, Nov 13, 2013 at 11:34 AM, Hal Rosenstock
<hal.rosenstock at gmail.com> wrote:

> In general, MulticastFDBTop should be 0 or some value above 0xc000.
>
>
> Indicates the upper bound of the range of the multicast forwarding
> table. Packets received with MLIDs greater than MulticastFDBTop are
> considered to be outside the range of the Multicast Forwarding Table
> (see 18.2.4.3.3 Required Multicast Relay on page 1072). A valid
> MulticastFDBTop is less than MulticastFDBCap + 0xC000. This component
> applies only to switches that implement the optional multicast
> forwarding service. A switch shall ignore the MulticastFDBTop
> component if it has the value zero. The initial value for
> MulticastFDBTop shall be set to zero. A value of 0xBFFF means there
> are no MulticastForwardingTable entries.
>
> It is set by OpenSM. There is a parameter to disable its use (use_mfttop)
> which can be set to FALSE. This may depend on which OpenSM version you are
> running. In order to get out of this state, you may need to reset any
> switches which have this parameter set like this.
>
> Any idea on the firmware versions in your various switches ?
>
> -- Hal
>
>
> On Wed, Nov 13, 2013 at 1:16 PM, Robert LeBlanc <robert_leblanc at byu.edu> wrote:
>
>> Sorry to take so long, I've been busy with other things. Here is the
>> output:
>>
>> [root at desxi003 ~]# smpquery si 52
>> # Switch info: Lid 52
>> LinearFdbCap:....................49152
>> RandomFdbCap:....................0
>> McastFdbCap:.....................4096
>> LinearFdbTop:....................189
>> DefPort:.........................0
>> DefMcastPrimPort:................255
>> DefMcastNotPrimPort:.............255
>> LifeTime:........................18
>> StateChange:.....................0
>> OptSLtoVLMapping:................1
>> LidsPerPort:.....................0
>> PartEnforceCap:..................32
>> InboundPartEnf:..................1
>> OutboundPartEnf:.................1
>> FilterRawInbound:................1
>> FilterRawOutbound:...............1
>> EnhancedPort0:...................0
>> MulticastFDBTop:.................0xbfff
>> [root at desxi003 ~]# smpquery pi 52 0
>> # Port info: Lid 52 port 0
>> Mkey:............................0x0000000000000000
>> GidPrefix:.......................0xfe80000000000000
>> Lid:.............................52
>> SMLid:...........................49
>> CapMask:.........................0x42500848
>>                                 IsTrapSupported
>>                                 IsSLMappingSupported
>>                                 IsSystemImageGUIDsupported
>>                                 IsVendorClassSupported
>>                                 IsCapabilityMaskNoticeSupported
>>                                 IsClientRegistrationSupported
>>                                 IsMulticastFDBTopSupported
>> DiagCode:........................0x0000
>> MkeyLeasePeriod:.................0
>> LocalPort:.......................1
>> LinkWidthEnabled:................1X or 4X
>> LinkWidthSupported:..............1X or 4X
>> LinkWidthActive:.................4X
>> LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
>> LinkState:.......................Active
>> PhysLinkState:...................LinkUp
>> LinkDownDefState:................Polling
>> ProtectBits:.....................0
>> LMC:.............................0
>> LinkSpeedActive:.................10.0 Gbps
>> LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
>> NeighborMTU:.....................4096
>> SMSL:............................0
>> VLCap:...........................VL0
>> InitType:........................0x00
>> VLHighLimit:.....................0
>> VLArbHighCap:....................0
>> VLArbLowCap:.....................0
>> InitReply:.......................0x00
>> MtuCap:..........................4096
>> VLStallCount:....................0
>> HoqLife:.........................0
>> OperVLs:.........................VL0
>> PartEnforceInb:..................0
>> PartEnforceOutb:.................0
>> FilterRawInb:....................0
>> FilterRawOutb:...................0
>> MkeyViolations:..................0
>> PkeyViolations:..................0
>> QkeyViolations:..................0
>> GuidCap:.........................1
>> ClientReregister:................0
>> McastPkeyTrapSuppressionEnabled:.0
>> SubnetTimeout:...................18
>> RespTimeVal:.....................20
>> LocalPhysErr:....................0
>> OverrunErr:......................0
>> MaxCreditHint:...................0
>> RoundTrip:.......................0
>>
>> From what I've read in the Mellanox Release Notes, MulticastFDBTop=0xBFFF
>> is supposed to discard MC traffic. The question is, how do I set this value
>> to something else and what should it be set to?
>>
>> Thanks,
>>
>>
>> Robert LeBlanc
>> OIT Infrastructure & Virtualization Engineer
>> Brigham Young University
>>
>>
>> On Wed, Oct 30, 2013 at 12:28 PM, Hal Rosenstock <
>> hal.rosenstock at gmail.com> wrote:
>>
>>>  Determine LID of switch (in the below say switch is lid x)
>>> Then:
>>>
>>> smpquery si x
>>> (of interest are McastFdbCap and MulticastFDBTop)
>>>  smpquery pi x 0
>>> (of interest is CapMask)
>>> ibroute -M x
>>>
>>>
>>>
>>> On Tue, Oct 29, 2013 at 3:56 PM, Robert LeBlanc <robert_leblanc at byu.edu> wrote:
>>>
>>>> Both ports show up in the "saquery MCMR" results with a JoinState of
>>>> 0x1.
>>>>
>>>> How can I dump the parameters of a non-managed switch so that I can
>>>> confirm that multicast is not turned off on the Dell chassis IB switches?
>>>>
>>>>
>>>> Robert LeBlanc
>>>> OIT Infrastructure & Virtualization Engineer
>>>> Brigham Young University
>>>>
>>>>
>>>> On Mon, Oct 28, 2013 at 5:04 PM, Coulter, Susan K <skc at lanl.gov> wrote:
>>>>
>>>>>
>>>>>  /sys/class/net should give you the details on your devices, like
>>>>> this:
>>>>>
>>>>>  -bash-4.1# cd /sys/class/net
>>>>> -bash-4.1# ls -l
>>>>> total 0
>>>>> lrwxrwxrwx 1 root root 0 Oct 23 12:59 eth0 ->
>>>>> ../../devices/pci0000:00/0000:00:02.0/0000:04:00.0/net/eth0
>>>>> lrwxrwxrwx 1 root root 0 Oct 23 12:59 eth1 ->
>>>>> ../../devices/pci0000:00/0000:00:02.0/0000:04:00.1/net/eth1
>>>>> lrwxrwxrwx 1 root root 0 Oct 23 15:42 ib0 ->
>>>>> ../../devices/pci0000:40/0000:40:0c.0/0000:47:00.0/net/ib0
>>>>> lrwxrwxrwx 1 root root 0 Oct 23 15:42 ib1 ->
>>>>> ../../devices/pci0000:40/0000:40:0c.0/0000:47:00.0/net/ib1
>>>>> lrwxrwxrwx 1 root root 0 Oct 23 15:42 ib2 ->
>>>>> ../../devices/pci0000:c0/0000:c0:0c.0/0000:c7:00.0/net/ib2
>>>>> lrwxrwxrwx 1 root root 0 Oct 23 15:42 ib3 ->
>>>>> ../../devices/pci0000:c0/0000:c0:0c.0/0000:c7:00.0/net/ib3
>>>>>
>>>>>  Then use "lspci | grep Mell"  to get the pci device numbers.
>>>>>
>>>>>  47:00.0 Network controller: Mellanox Technologies MT26428 [ConnectX
>>>>> VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
>>>>> c7:00.0 Network controller: Mellanox Technologies MT26428 [ConnectX
>>>>> VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
>>>>>
>>>>>  In this example, ib0 and ib1 are referencing the device at 47:00.0
>>>>> And ib2 and ib3 are referencing the device at c7:00.0
>>>>>
>>>>>  That said, if you only have one card - this is probably not the
>>>>> problem.
>>>>> Additionally, since the arp requests are being seen going out ib0,
>>>>> your emulation appears to be working.
>>>>>
>>>>>  If those arp requests are not being seen on the other end, it seems
>>>>> like a problem with the mgids.
>>>>> Like maybe the port you are trying to reach is not in the IPoIB
>>>>> multicast group?
>>>>>
>>>>>  You can look at all the multicast member records with "saquery MCMR".
>>>>> Or - you can grep for mcmr_rcv_join_mgrp references in your SM logs …
>>>>>
>>>>>  HTH
>>>>>
>>>>>
>>>>>
>>>>>  On Oct 28, 2013, at 1:08 PM, Robert LeBlanc <robert_leblanc at byu.edu>
>>>>> wrote:
>>>>>
>>>>>  I can ibping between both hosts just fine.
>>>>>
>>>>>  [root at desxi003 ~]# ibping 0x37
>>>>> Pong from desxi004.(none) (Lid 55): time 0.111 ms
>>>>> Pong from desxi004.(none) (Lid 55): time 0.189 ms
>>>>> Pong from desxi004.(none) (Lid 55): time 0.189 ms
>>>>> Pong from desxi004.(none) (Lid 55): time 0.179 ms
>>>>> ^C
>>>>> --- desxi004.(none) (Lid 55) ibping statistics ---
>>>>> 4 packets transmitted, 4 received, 0% packet loss, time 3086 ms
>>>>> rtt min/avg/max = 0.111/0.167/0.189 ms
>>>>>
>>>>>  [root at desxi004 ~]# ibping 0x2d
>>>>> Pong from desxi003.(none) (Lid 45): time 0.156 ms
>>>>> Pong from desxi003.(none) (Lid 45): time 0.175 ms
>>>>> Pong from desxi003.(none) (Lid 45): time 0.176 ms
>>>>> ^C
>>>>> --- desxi003.(none) (Lid 45) ibping statistics ---
>>>>> 3 packets transmitted, 3 received, 0% packet loss, time 2302 ms
>>>>> rtt min/avg/max = 0.156/0.169/0.176 ms
>>>>>
>>>>>  When I do an Ethernet ping to the IPoIB address, tcpdump only shows
>>>>> the outgoing ARP request.
>>>>>
>>>>>  [root at desxi003 ~]# tcpdump -i ib0
>>>>> tcpdump: verbose output suppressed, use -v or -vv for full protocol
>>>>> decode
>>>>> listening on ib0, link-type LINUX_SLL (Linux cooked), capture size
>>>>> 65535 bytes
>>>>> 19:00:08.950320 ARP, Request who-has 192.168.9.4 tell 192.168.9.3,
>>>>> length 56
>>>>> 19:00:09.950320 ARP, Request who-has 192.168.9.4 tell 192.168.9.3,
>>>>> length 56
>>>>> 19:00:10.950307 ARP, Request who-has 192.168.9.4 tell 192.168.9.3,
>>>>> length 56
>>>>>
>>>>>  Running tcpdump on the rack servers I don't see the ARP request
>>>>> there which I should.
>>>>>
>>>>>  From what I've read, ib0 should be mapped to the first port and ib1
>>>>> should be mapped to the second port. We have one IB card with two ports.
>>>>> The modprobe is the default installed with the Mellanox drivers.
>>>>>
>>>>>  [root at desxi003 etc]# cat modprobe.d/ib_ipoib.conf
>>>>> # install ib_ipoib modprobe --ignore-install ib_ipoib &&
>>>>> /sbin/ib_ipoib_sysctl load
>>>>> # remove ib_ipoib /sbin/ib_ipoib_sysctl unload ; modprobe -r
>>>>> --ignore-remove ib_ipoib
>>>>> alias ib0 ib_ipoib
>>>>> alias ib1 ib_ipoib
>>>>>
>>>>>  Can you give me some pointers on digging into the device layer to
>>>>> make sure IPoIB is connected correctly? Would I look in /sys or /proc for
>>>>> that?
>>>>>
>>>>>  Dell has not been able to replicate the problem in their environment
>>>>> and they only support Red Hat and won't work with my CentOS live CD. These
>>>>> blades don't have internal hard drives so it makes it hard to install any
>>>>> OS. I don't know if I can engage Mellanox since they build the switch
>>>>> hardware and driver stack we are using.
>>>>>
>>>>>  I really appreciate all the help you guys have given thus far, I'm
>>>>> learning a lot as this progresses. I'm reading through
>>>>> https://tools.ietf.org/html/rfc4391 trying to understand IPoIB from
>>>>> top to bottom.
>>>>>
>>>>>  Thanks,
>>>>>
>>>>>
>>>>>  Robert LeBlanc
>>>>> OIT Infrastructure & Virtualization Engineer
>>>>> Brigham Young University
>>>>>
>>>>>
>>>>> On Mon, Oct 28, 2013 at 12:53 PM, Coulter, Susan K <skc at lanl.gov> wrote:
>>>>>
>>>>>>
>>>>>>  If you are not seeing any packets leave the ib0 interface, it sounds
>>>>>> like the emulation layer is not connected to the right device.
>>>>>>
>>>>>>  If ib_ipoib kernel module is loaded, and a simple native IB test
>>>>>> works between those blades - (like ib_read_bw) you need to dig into the
>>>>>> device layer and ensure ipoib is "connected" to the right device.
>>>>>>
>>>>>>  Do you have more than 1 IB card?
>>>>>> What does your modprobe config look like for ipoib?
>>>>>>
>>>>>>
>>>>>>   On Oct 28, 2013, at 12:38 PM, Robert LeBlanc <
>>>>>> robert_leblanc at byu.edu>
>>>>>>   wrote:
>>>>>>
>>>>>>  These ESX hosts (2 blade server and 2 rack servers) are booted into
>>>>>> a CentOS 6.2 Live CD that I built. Right now everything I'm trying to get
>>>>>> working is CentOS 6.2. All of our other hosts are running ESXi and have
>>>>>> IPoIB interfaces, but none of them are configured and I'm not trying to get
>>>>>> those working right now.
>>>>>>
>>>>>>  Ideally, we would like our ESX hosts to communicate with each other
>>>>>> for vMotion and protected VM traffic as well as with our Commvault backup
>>>>>> servers (Windows) over IPoIB (or Oracle's PVI which is very similar).
>>>>>>
>>>>>>
>>>>>>  Robert LeBlanc
>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>> Brigham Young University
>>>>>>
>>>>>>
>>>>>> On Mon, Oct 28, 2013 at 12:33 PM, Hal Rosenstock <
>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>
>>>>>>> Are those ESXi IPoIB interfaces ? Do some of these work and others
>>>>>>> not ? Are there normal Linux IPoIB interfaces ? Do they work ?
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Oct 28, 2013 at 2:24 PM, Robert LeBlanc <
>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>
>>>>>>>> Yes, I can not ping them over the IPoIB interface. It is a very
>>>>>>>> simple network set-up.
>>>>>>>>
>>>>>>>>  desxi003
>>>>>>>>  8: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc
>>>>>>>> pfifo_fast state UP qlen 256
>>>>>>>>     link/infiniband
>>>>>>>> 80:20:00:54:fe:80:00:00:00:00:00:00:f0:4d:a2:90:97:78:e7:d1 brd
>>>>>>>> 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
>>>>>>>>     inet 192.168.9.3/24 brd 192.168.9.255 scope global ib0
>>>>>>>>     inet6 fe80::f24d:a290:9778:e7d1/64 scope link
>>>>>>>>        valid_lft forever preferred_lft forever
>>>>>>>>
>>>>>>>>  desxi004
>>>>>>>>  8: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc
>>>>>>>> pfifo_fast state UP qlen 256
>>>>>>>>     link/infiniband
>>>>>>>> 80:20:00:54:fe:80:00:00:00:00:00:00:f0:4d:a2:90:97:78:e7:15 brd
>>>>>>>> 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
>>>>>>>>     inet 192.168.9.4/24 brd 192.168.9.255 scope global ib0
>>>>>>>>     inet6 fe80::f24d:a290:9778:e715/64 scope link
>>>>>>>>        valid_lft forever preferred_lft forever
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>  Robert LeBlanc
>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>> Brigham Young University
>>>>>>>>
>>>>>>>>
>>>>>>>>  On Mon, Oct 28, 2013 at 12:22 PM, Hal Rosenstock <
>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> So these 2 hosts have trouble talking IPoIB to each other ?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Oct 28, 2013 at 2:16 PM, Robert LeBlanc <
>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>
>>>>>>>>>> I was just wondering about that. It seems reasonable that the
>>>>>>>>>> broadcast traffic would go over multicast, but effectively channels would
>>>>>>>>>> be created for node to node communication, otherwise the entire multicast
>>>>>>>>>> group would be limited to 10 Gbps (in this instance) for the whole group.
>>>>>>>>>> That doesn't scale very well.
>>>>>>>>>>
>>>>>>>>>>  The things I've read about IPoIB performance tuning seem
>>>>>>>>>> pretty vague, and the changes most people recommend seem to be already in
>>>>>>>>>> place on the systems I'm using. Some people said, try using a newer version
>>>>>>>>>> of Ubuntu, but ultimately, I have very little control over VMware. Once I
>>>>>>>>>> can get the Linux machines to communicate IPoIB between the racks and
>>>>>>>>>> blades, then I'm going to turn my attention over to performance
>>>>>>>>>> optimization. It doesn't seem to make much sense to spend time there when
>>>>>>>>>> it is not working at all for most machines.
>>>>>>>>>>
>>>>>>>>>>  I've done ibtracert between the two nodes, is that what you
>>>>>>>>>> mean by walking the route?
>>>>>>>>>>
>>>>>>>>>>  [root at desxi003 ~]# ibtracert -m 0xc000 0x2d 0x37
>>>>>>>>>> From ca 0xf04da2909778e7d0 port 1 lid 45-45 "localhost HCA-1"
>>>>>>>>>> [1] -> switch 0x2c90200448ec8[17] lid 51 "Infiniscale-IV Mellanox
>>>>>>>>>> Technologies"
>>>>>>>>>> [18] -> ca 0xf04da2909778e714[1] lid 55 "localhost HCA-1"
>>>>>>>>>> To ca 0xf04da2909778e714 port 1 lid 55-55 "localhost HCA-1"
>>>>>>>>>>
>>>>>>>>>>  [root at desxi004 ~]# ibtracert -m 0xc000 0x37 0x2d
>>>>>>>>>> From ca 0xf04da2909778e714 port 1 lid 55-55 "localhost HCA-1"
>>>>>>>>>> [1] -> switch 0x2c90200448ec8[18] lid 51 "Infiniscale-IV Mellanox
>>>>>>>>>> Technologies"
>>>>>>>>>> [17] -> ca 0xf04da2909778e7d0[1] lid 45 "localhost HCA-1"
>>>>>>>>>> To ca 0xf04da2909778e7d0 port 1 lid 45-45 "localhost HCA-1"
>>>>>>>>>>
>>>>>>>>>>  As you can see, the route is on the same switch, the blades are
>>>>>>>>>> right next to each other.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  Robert LeBlanc
>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>> Brigham Young University
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  On Mon, Oct 28, 2013 at 12:05 PM, Hal Rosenstock <
>>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>>  Which mystery is explained ? The 10 Gbps is a multicast only
>>>>>>>>>>> limit and does not apply to unicast. The BW limitation you're seeing is due
>>>>>>>>>>> to other factors. There's been much written about IPoIB performance.
>>>>>>>>>>>
>>>>>>>>>>> If all the MC members are joined and routed, then the IPoIB
>>>>>>>>>>> connectivity issue is some other issue. Are you sure this is the case ? Did
>>>>>>>>>>> you walk the route between 2 nodes where you have a connectivity issue ?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Oct 28, 2013 at 1:58 PM, Robert LeBlanc <
>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Well, that explains one mystery, now I need to figure out why
>>>>>>>>>>>> it seems the Dell blades are not passing the traffic.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>  Robert LeBlanc
>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>  On Mon, Oct 28, 2013 at 11:51 AM, Hal Rosenstock <
>>>>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>  Yes, that's the IPoIB IPv4 broadcast group for the default
>>>>>>>>>>>>> (0xffff) partition. 0x80 part of mtu and rate just means "is equal to". mtu
>>>>>>>>>>>>> 0x04 is 2K (2048) and rate 0x3 is 10 Gb/sec. These are indeed the defaults.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 1:45 PM, Robert LeBlanc <
>>>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The info for that MGID is:
>>>>>>>>>>>>>> MCMemberRecord group dump:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> MGID....................ff12:401b:ffff::ffff:ffff
>>>>>>>>>>>>>>                 Mlid....................0xC000
>>>>>>>>>>>>>>                 Mtu.....................0x84
>>>>>>>>>>>>>>                 pkey....................0xFFFF
>>>>>>>>>>>>>>                 Rate....................0x83
>>>>>>>>>>>>>>                 SL......................0x0
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  I don't understand the MTU and Rate (132 and 131 dec). When
>>>>>>>>>>>>>> I run iperf between the two hosts over IPoIB in connected mode with MTU
>>>>>>>>>>>>>> 65520, I've tried multiple threads, but the sum is still 10 Gbps.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  Robert LeBlanc
>>>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  On Mon, Oct 28, 2013 at 11:40 AM, Hal Rosenstock <
>>>>>>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  saquery -g should show what MGID is mapped to MLID 0xc000
>>>>>>>>>>>>>>> and the group parameters.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  When you say 10 Gbps max, is that multicast or unicast ?
>>>>>>>>>>>>>>> That limit is only on the multicast.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 1:28 PM, Robert LeBlanc <
>>>>>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Well, that can explain why I'm only able to get 10 Gbps max
>>>>>>>>>>>>>>>> from the two hosts that are working.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  I have tried updn and dnup and they didn't help either. I
>>>>>>>>>>>>>>>> think the only thing that will help is Automatic Path Migration, as it tries
>>>>>>>>>>>>>>>> very hard to route the alternative LIDs through different systemguids. I
>>>>>>>>>>>>>>>> suspect it would require re-LIDing everything which would mean an outage.
>>>>>>>>>>>>>>>> I'm still trying to get answers from Oracle if that is even a possibility.
>>>>>>>>>>>>>>>> I've tried seeding some of the algorithms with information like root nodes,
>>>>>>>>>>>>>>>> etc, but none of them worked better.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  The MLID 0xc000 exists and I can see all the nodes joined
>>>>>>>>>>>>>>>> to the group using saquery. I've checked the route using ibtracert
>>>>>>>>>>>>>>>> specifying the MLID. The only thing I'm not sure how to check is the group
>>>>>>>>>>>>>>>> parameters. What tool would I use for that?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  Thanks,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  Robert LeBlanc
>>>>>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  On Mon, Oct 28, 2013 at 11:16 AM, Hal Rosenstock <
>>>>>>>>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  Xsigo's SM is not "straight" OpenSM. They have some
>>>>>>>>>>>>>>>>> proprietary enhancements and it may be based on old vintage of OpenSM. You
>>>>>>>>>>>>>>>>> will likely need to work with them/Oracle now on issues.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Lack of a partitions file does mean default partition and
>>>>>>>>>>>>>>>>> default rate (10 Gbps) so from what I saw all ports had sufficient rate to
>>>>>>>>>>>>>>>>> join MC group.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> There are certain topology requirements for running
>>>>>>>>>>>>>>>>> various routing algorithms. Did you try updn or dnup ?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The key is determining whether the IPoIB broadcast group
>>>>>>>>>>>>>>>>> is setup correctly. What MLID is the group built on (usually 0xc000) ? What
>>>>>>>>>>>>>>>>> are the group parameters (rate, MTU) ? Are all members that are running
>>>>>>>>>>>>>>>>> IPoIB joined ? Is the group routed to all such members ? There are
>>>>>>>>>>>>>>>>> infiniband-diags for all of this.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 12:19 PM, Robert LeBlanc <
>>>>>>>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> OpenSM (the SM runs on Xsigo so they manage it) is using
>>>>>>>>>>>>>>>>>> minhop. I've loaded the ibnetdiscover output into ibsim and run all the
>>>>>>>>>>>>>>>>>> different routing algorithms against it with and without scatter ports.
>>>>>>>>>>>>>>>>>> Minhop had 50% of our hosts running all paths through a single IS5030
>>>>>>>>>>>>>>>>>> switch (at least the LIDs we need which represent Ethernet and Fibre
>>>>>>>>>>>>>>>>>> Channel cards the hosts should communicate with). Ftree, dor, and dfsssp
>>>>>>>>>>>>>>>>>> failed back to minhop, the others routed more paths through the same IS5030
>>>>>>>>>>>>>>>>>> in some cases increasing our host count with single point of failure to
>>>>>>>>>>>>>>>>>> 75%.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  As far as I can tell there is no partitions.conf file
>>>>>>>>>>>>>>>>>> so I assume we are using the default partition. There is an opensm.opts
>>>>>>>>>>>>>>>>>> file, but it only specifies logging information.
>>>>>>>>>>>>>>>>>>  # SA database file name
>>>>>>>>>>>>>>>>>> sa_db_file /var/log/opensm-sa.dump
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  # If TRUE causes OpenSM to dump SA database at the end
>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>> # every light sweep, regardless of the verbosity level
>>>>>>>>>>>>>>>>>> sa_db_dump TRUE
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  # The directory to hold the file OpenSM dumps
>>>>>>>>>>>>>>>>>> dump_files_dir /var/log/
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  The SM node is:
>>>>>>>>>>>>>>>>>>  xsigoa:/opt/xsigo/xsigos/current/ofed/etc# ibaddr
>>>>>>>>>>>>>>>>>> GID fe80::13:9702:100:979 LID start 0x1 end 0x1
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  We do have Switch-X in two of the Dell m1000e chassis
>>>>>>>>>>>>>>>>>> but the cards, ports 17-32, are FDR10 (the switch may be straight FDR, but
>>>>>>>>>>>>>>>>>> I'm not 100% sure). The IS5030 are QDR which the Switch-X are connected to,
>>>>>>>>>>>>>>>>>> the switches in the Xsigo directors are QDR, but the Ethernet and Fibre
>>>>>>>>>>>>>>>>>> Channel cards are DDR. The DDR cards will not be running IPoIB (at least to
>>>>>>>>>>>>>>>>>> my knowledge they don't have the ability), only the hosts should be
>>>>>>>>>>>>>>>>>> leveraging IPoIB. I hope that clears up some of your questions. If you have
>>>>>>>>>>>>>>>>>> more, I will try to answer them.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  Robert LeBlanc
>>>>>>>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  On Mon, Oct 28, 2013 at 9:57 AM, Hal Rosenstock <
>>>>>>>>>>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>  What routing algorithm is configured in OpenSM ? What
>>>>>>>>>>>>>>>>>>> does your partitions.conf file look like ? Which node is your OpenSM ?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Also, I only see QDR and DDR links although you have
>>>>>>>>>>>>>>>>>>> Switch-X so I assume all FDR ports are connected to slower (QDR) devices. I
>>>>>>>>>>>>>>>>>>> don't see any FDR-10 ports but maybe they're also connected to QDR ports so
>>>>>>>>>>>>>>>>>>> show up as QDR in the topology.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> There are DDR CAs in Xsigo box but not sure whether or
>>>>>>>>>>>>>>>>>>> not they run IPoIB.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> -- Hal
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>  On Sun, Oct 27, 2013 at 9:46 PM, Robert LeBlanc <
>>>>>>>>>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>  Since you guys are amazingly helpful, I thought I
>>>>>>>>>>>>>>>>>>>> would pick your brains in a new problem.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>  We have two Xsigo directors cross connected to four
>>>>>>>>>>>>>>>>>>>> Mellanox IS5030 switches. Connected to those we have four Dell m1000e
>>>>>>>>>>>>>>>>>>>> chassis each with two IB switches (two chassis have QDR and two have
>>>>>>>>>>>>>>>>>>>> FDR10). We have 9 dual-port rack servers connected to the IS5030 switches.
>>>>>>>>>>>>>>>>>>>> For testing purposes we have an additional Dell m1000e QDR chassis
>>>>>>>>>>>>>>>>>>>> connected to one Xsigo director and two dual-port FDR10 rack servers
>>>>>>>>>>>>>>>>>>>> connected to the other Xsigo director.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>  I can get IPoIB to work between the two test rack
>>>>>>>>>>>>>>>>>>>> servers connected to the one Xsigo director. But I can not get IPoIB to
>>>>>>>>>>>>>>>>>>>> work between any blades either right next to each other or to the working
>>>>>>>>>>>>>>>>>>>> rack servers. I'm using the same exact live CentOS ISO on all four servers.
>>>>>>>>>>>>>>>>>>>> I've checked opensm and the blades have joined the multicast group 0xc000
>>>>>>>>>>>>>>>>>>>> properly. tcpdump basically says that traffic is not leaving the blades.
>>>>>>>>>>>>>>>>>>>> tcpdump also shows no traffic entering the blades from the rack servers. An
>>>>>>>>>>>>>>>>>>>> ibtracert using 0xc000 mlid shows that routing exists between hosts.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>  I've read about MulticastFDBTop=0xBFFF but I don't
>>>>>>>>>>>>>>>>>>>> know how to set it and I doubt it would have been set by default.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>  Anyone have some ideas on troubleshooting steps to
>>>>>>>>>>>>>>>>>>>> try? I think Google is tired of me asking questions about it.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>  Thanks,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>  Robert LeBlanc
>>>>>>>>>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>  _______________________________________________
>>>>>>>>>>>>>>>>>>>> Users mailing list
>>>>>>>>>>>>>>>>>>>> Users at lists.openfabrics.org
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/users
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>  ====================================
>>>>>>
>>>>>>  Susan Coulter
>>>>>> HPC-3 Network/Infrastructure
>>>>>> 505-667-8425
>>>>>> Increase the Peace...
>>>>>> An eye for an eye leaves the whole world blind
>>>>>> ====================================
>>>>>>
>>>>>>
>>>>>
>>>>>  ====================================
>>>>>
>>>>>  Susan Coulter
>>>>> HPC-3 Network/Infrastructure
>>>>> 505-667-8425
>>>>> Increase the Peace...
>>>>> An eye for an eye leaves the whole world blind
>>>>> ====================================
>>>>>
>>>>>
>>>>
>>>
>>
>
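
As a footnote to the MCMemberRecord dump quoted above (Mtu 0x84, Rate 0x83):
the sketch below splits each byte into its selector bits and its value,
following Hal's explanation that 0x80 means "is equal to", MTU 0x04 is 2048 and
rate 0x3 is 10 Gb/sec. The two-bit/six-bit split and the function names are
assumptions made for illustration, and only the encodings named in the thread
are filled in.

```python
# Sketch only: decodes the Mtu and Rate bytes from the saquery -g output.
# Tables hold just the encodings mentioned in the thread; names are hypothetical.

SELECTORS = {2: "equal to"}          # 0x80 >> 6 == 2, "is equal to"
MTU_BYTES = {0x04: 2048}             # MTU 0x04 is 2K
RATE_GBPS = {0x03: 10}               # rate 0x3 is 10 Gb/sec


def split_selector(byte):
    """Return (selector, value): top two bits and low six bits of the byte."""
    return (byte >> 6) & 0x3, byte & 0x3F


def describe_mtu(byte):
    selector, value = split_selector(byte)
    sel = SELECTORS.get(selector, "selector %d" % selector)
    mtu = MTU_BYTES.get(value)
    if mtu is None:
        return "%s MTU code 0x%02x (not covered here)" % (sel, value)
    return "%s %d bytes" % (sel, mtu)


def describe_rate(byte):
    selector, value = split_selector(byte)
    sel = SELECTORS.get(selector, "selector %d" % selector)
    rate = RATE_GBPS.get(value)
    if rate is None:
        return "%s rate code 0x%02x (not covered here)" % (sel, value)
    return "%s %d Gb/sec" % (sel, rate)


if __name__ == "__main__":
    print(describe_mtu(0x84))    # Mtu from the group dump
    print(describe_rate(0x83))   # Rate from the group dump
```

Read this way, 0x84 and 0x83 are not large decimal values but a selector
combined with the default group MTU and rate.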