[Users] Weird IPoIB issue

Robert LeBlanc robert_leblanc at byu.edu
Wed Nov 13 10:16:11 PST 2013


Sorry to take so long, I've been busy with other things. Here is the output:

[root at desxi003 ~]# smpquery si 52
# Switch info: Lid 52
LinearFdbCap:....................49152
RandomFdbCap:....................0
McastFdbCap:.....................4096
LinearFdbTop:....................189
DefPort:.........................0
DefMcastPrimPort:................255
DefMcastNotPrimPort:.............255
LifeTime:........................18
StateChange:.....................0
OptSLtoVLMapping:................1
LidsPerPort:.....................0
PartEnforceCap:..................32
InboundPartEnf:..................1
OutboundPartEnf:.................1
FilterRawInbound:................1
FilterRawOutbound:...............1
EnhancedPort0:...................0
MulticastFDBTop:.................0xbfff
[root at desxi003 ~]# smpquery pi 52 0
# Port info: Lid 52 port 0
Mkey:............................0x0000000000000000
GidPrefix:.......................0xfe80000000000000
Lid:.............................52
SMLid:...........................49
CapMask:.........................0x42500848
                                IsTrapSupported
                                IsSLMappingSupported
                                IsSystemImageGUIDsupported
                                IsVendorClassSupported
                                IsCapabilityMaskNoticeSupported
                                IsClientRegistrationSupported
                                IsMulticastFDBTopSupported
DiagCode:........................0x0000
MkeyLeasePeriod:.................0
LocalPort:.......................1
LinkWidthEnabled:................1X or 4X
LinkWidthSupported:..............1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkState:.......................Active
PhysLinkState:...................LinkUp
LinkDownDefState:................Polling
ProtectBits:.....................0
LMC:.............................0
LinkSpeedActive:.................10.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
NeighborMTU:.....................4096
SMSL:............................0
VLCap:...........................VL0
InitType:........................0x00
VLHighLimit:.....................0
VLArbHighCap:....................0
VLArbLowCap:.....................0
InitReply:.......................0x00
MtuCap:..........................4096
VLStallCount:....................0
HoqLife:.........................0
OperVLs:.........................VL0
PartEnforceInb:..................0
PartEnforceOutb:.................0
FilterRawInb:....................0
FilterRawOutb:...................0
MkeyViolations:..................0
PkeyViolations:..................0
QkeyViolations:..................0
GuidCap:.........................1
ClientReregister:................0
McastPkeyTrapSuppressionEnabled:.0
SubnetTimeout:...................18
RespTimeVal:.....................20
LocalPhysErr:....................0
OverrunErr:......................0
MaxCreditHint:...................0
RoundTrip:.......................0

From what I've read in the Mellanox Release Notes, MulticastFDBTop=0xBFFF is
supposed to discard MC traffic. The question is, how do I set this value to
something else, and what should it be set to?
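
(As a quick sanity check on that reading: the IPv4 broadcast group sits on MLID
0xc000, which is above 0xbfff, so multicast for that MLID would be dropped at
the switch. A minimal bash sketch against the switch LID 52 shown above; adjust
the LID for your own fabric:)

# pull MulticastFDBTop out of SwitchInfo and compare it to the broadcast MLID
top=$(smpquery si 52 | awk -F. '/MulticastFDBTop/ {print $NF}')
mlid=0xc000
if [ $(( mlid > top )) -eq 1 ]; then
    echo "MLID $mlid is above MulticastFDBTop $top, so the switch discards it"
fi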

Thanks,


Robert LeBlanc
OIT Infrastructure & Virtualization Engineer
Brigham Young University


On Wed, Oct 30, 2013 at 12:28 PM, Hal Rosenstock
<hal.rosenstock at gmail.com> wrote:

>  Determine LID of switch (in the below say switch is lid x)
> Then:
>
> smpquery si x
> (of interest are McastFdbCap and MulticastFDBTop)
>  smpquery pi x 0
> (of interest is CapMask)
> ibroute -M x
>
>
>
> On Tue, Oct 29, 2013 at 3:56 PM, Robert LeBlanc <robert_leblanc at byu.edu> wrote:
>
>> Both ports show up in the "saquery MCMR" results with a JoinState of 0x1.
>>
>> How can I dump the parameters of a non-managed switch so that I can
>> confirm that multicast is not turned off on the Dell chassis IB switches?
>>
>>
>> Robert LeBlanc
>> OIT Infrastructure & Virtualization Engineer
>> Brigham Young University
>>
>>
>> On Mon, Oct 28, 2013 at 5:04 PM, Coulter, Susan K <skc at lanl.gov> wrote:
>>
>>>
>>>  /sys/class/net should give you the details on your devices, like this:
>>>
>>>  -bash-4.1# cd /sys/class/net
>>> -bash-4.1# ls -l
>>> total 0
>>> lrwxrwxrwx 1 root root 0 Oct 23 12:59 eth0 ->
>>> ../../devices/pci0000:00/0000:00:02.0/0000:04:00.0/net/eth0
>>> lrwxrwxrwx 1 root root 0 Oct 23 12:59 eth1 ->
>>> ../../devices/pci0000:00/0000:00:02.0/0000:04:00.1/net/eth1
>>> lrwxrwxrwx 1 root root 0 Oct 23 15:42 ib0 ->
>>> ../../devices/pci0000:40/0000:40:0c.0/0000:47:00.0/net/ib0
>>> lrwxrwxrwx 1 root root 0 Oct 23 15:42 ib1 ->
>>> ../../devices/pci0000:40/0000:40:0c.0/0000:47:00.0/net/ib1
>>> lrwxrwxrwx 1 root root 0 Oct 23 15:42 ib2 ->
>>> ../../devices/pci0000:c0/0000:c0:0c.0/0000:c7:00.0/net/ib2
>>> lrwxrwxrwx 1 root root 0 Oct 23 15:42 ib3 ->
>>> ../../devices/pci0000:c0/0000:c0:0c.0/0000:c7:00.0/net/ib3
>>>
>>>  Then use "lspci | grep Mell"  to get the pci device numbers.
>>>
>>>  47:00.0 Network controller: Mellanox Technologies MT26428 [ConnectX
>>> VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
>>> c7:00.0 Network controller: Mellanox Technologies MT26428 [ConnectX VPI
>>> PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
>>>
>>>  In this example, ib0 and ib1 reference the device at 47:00.0, and ib2
>>> and ib3 reference the device at c7:00.0.
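>>>
>>>  (Equivalently, a one-liner sketch that resolves each interface's device
>>> symlink, which gives the same PCI mapping in one shot:)
>>>
>>> for i in /sys/class/net/ib*; do echo "$i -> $(readlink -f $i/device)"; done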
>>>
>>>  That said, if you only have one card - this is probably not the
>>> problem.
>>> Additionally, since the arp requests are being seen going out ib0, your
>>> emulation appears to be working.
>>>
>>>  If those ARP requests are not being seen on the other end, it seems
>>> like a problem with the MGIDs. Maybe the port you are trying to reach is
>>> not in the IPoIB multicast group?
>>>
>>>  You can look at all the multicast member records with "saquery MCMR".
>>> Or - you can grep for mcmr_rcv_join_mgrp references in your SM logs …
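>>>
>>>  (For example, roughly like this; the GID fragment is the tail of the
>>> desxi003 port GUID quoted below, and the log path will vary by install:)
>>>
>>> # is this port listed among the MCMemberRecords?
>>> saquery MCMR | grep -i -B 8 '9778:e7d1'
>>> # or look for its join requests in the SM log
>>> grep mcmr_rcv_join_mgrp /var/log/opensm.log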
>>>
>>>  HTH
>>>
>>>
>>>
>>>  On Oct 28, 2013, at 1:08 PM, Robert LeBlanc <robert_leblanc at byu.edu>
>>> wrote:
>>>
>>>  I can ibping between both hosts just fine.
>>>
>>>  [root at desxi003 ~]# ibping 0x37
>>> Pong from desxi004.(none) (Lid 55): time 0.111 ms
>>> Pong from desxi004.(none) (Lid 55): time 0.189 ms
>>> Pong from desxi004.(none) (Lid 55): time 0.189 ms
>>> Pong from desxi004.(none) (Lid 55): time 0.179 ms
>>> ^C
>>> --- desxi004.(none) (Lid 55) ibping statistics ---
>>> 4 packets transmitted, 4 received, 0% packet loss, time 3086 ms
>>> rtt min/avg/max = 0.111/0.167/0.189 ms
>>>
>>>  [root at desxi004 ~]# ibping 0x2d
>>> Pong from desxi003.(none) (Lid 45): time 0.156 ms
>>> Pong from desxi003.(none) (Lid 45): time 0.175 ms
>>> Pong from desxi003.(none) (Lid 45): time 0.176 ms
>>> ^C
>>> --- desxi003.(none) (Lid 45) ibping statistics ---
>>> 3 packets transmitted, 3 received, 0% packet loss, time 2302 ms
>>> rtt min/avg/max = 0.156/0.169/0.176 ms
>>>
>>>  When I do an Ethernet ping to the IPoIB address, tcpdump only shows
>>> the outgoing ARP request.
>>>
>>>  [root at desxi003 ~]# tcpdump -i ib0
>>> tcpdump: verbose output suppressed, use -v or -vv for full protocol
>>> decode
>>> listening on ib0, link-type LINUX_SLL (Linux cooked), capture size 65535
>>> bytes
>>> 19:00:08.950320 ARP, Request who-has 192.168.9.4 tell 192.168.9.3,
>>> length 56
>>> 19:00:09.950320 ARP, Request who-has 192.168.9.4 tell 192.168.9.3,
>>> length 56
>>> 19:00:10.950307 ARP, Request who-has 192.168.9.4 tell 192.168.9.3,
>>> length 56
>>>
>>>  Running tcpdump on the rack servers, I don't see the ARP request
>>> arrive there, which I should.
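>>>
>>>  (A filtered capture on the receiving rack server makes it easy to confirm
>>> whether the broadcasts arrive at all; a minimal example:)
>>>
>>> # capture only ARP on ib0, no name resolution, print link-level headers
>>> tcpdump -i ib0 -n -e arp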
>>>
>>>  From what I've read, ib0 should be mapped to the first port and ib1
>>> should be mapped to the second port. We have one IB card with two ports.
>>> The modprobe is the default installed with the Mellanox drivers.
>>>
>>>  [root at desxi003 etc]# cat modprobe.d/ib_ipoib.conf
>>> # install ib_ipoib modprobe --ignore-install ib_ipoib &&
>>> /sbin/ib_ipoib_sysctl load
>>> # remove ib_ipoib /sbin/ib_ipoib_sysctl unload ; modprobe -r
>>> --ignore-remove ib_ipoib
>>> alias ib0 ib_ipoib
>>> alias ib1 ib_ipoib
>>>
>>>  Can you give me some pointers on digging into the device layer to make
>>> sure IPoIB is connected correctly? Would I look in /sys or /proc for that?
>>>
>>>  Dell has not been able to replicate the problem in their environment,
>>> and they only support Red Hat and won't work with my CentOS live CD. These
>>> blades don't have internal hard drives, which makes it hard to install any
>>> OS. I don't know if I can engage Mellanox, since they build the switch
>>> hardware and driver stack we are using.
>>>
>>>  I really appreciate all the help you guys have given thus far, I'm
>>> learning a lot as this progresses. I'm reading through
>>> https://tools.ietf.org/html/rfc4391 trying to understand IPoIB from top
>>> to bottom.
>>>
>>>  Thanks,
>>>
>>>
>>>  Robert LeBlanc
>>> OIT Infrastructure & Virtualization Engineer
>>> Brigham Young University
>>>
>>>
>>> On Mon, Oct 28, 2013 at 12:53 PM, Coulter, Susan K <skc at lanl.gov> wrote:
>>>
>>>>
>>>>  If you are not seeing any packets leave the ib0 interface, it sounds
>>>> like the emulation layer is not connected to the right device.
>>>>
>>>>  If the ib_ipoib kernel module is loaded, and a simple native IB test
>>>> (like ib_read_bw) works between those blades, you need to dig into the
>>>> device layer and ensure ipoib is "connected" to the right device.
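>>>>
>>>>  (A rough sketch of such a test with the perftest tools; the device name
>>>> and the peer address are placeholders, and the address is only used for
>>>> the out-of-band connection setup:)
>>>>
>>>> # on blade A, start the server side
>>>> ib_read_bw -d mlx4_0 -i 1
>>>> # on blade B, point the client at blade A over any working network
>>>> ib_read_bw -d mlx4_0 -i 1 <blade_A_address>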
>>>>
>>>>  Do you have more than 1 IB card?
>>>> What does your modprobe config look like for ipoib?
>>>>
>>>>
>>>>   On Oct 28, 2013, at 12:38 PM, Robert LeBlanc <robert_leblanc at byu.edu>
>>>>   wrote:
>>>>
>>>>  These ESX hosts (2 blade servers and 2 rack servers) are booted into a
>>>> CentOS 6.2 Live CD that I built. Right now everything I'm trying to get
>>>> working is CentOS 6.2. All of our other hosts are running ESXi and have
>>>> IPoIB interfaces, but none of them are configured and I'm not trying to get
>>>> those working right now.
>>>>
>>>>  Ideally, we would like our ESX hosts to communicate with each other
>>>> for vMotion and protected VM traffic as well as with our Commvault backup
>>>> servers (Windows) over IPoIB (or Oracle's PVI which is very similar).
>>>>
>>>>
>>>>  Robert LeBlanc
>>>> OIT Infrastructure & Virtualization Engineer
>>>> Brigham Young University
>>>>
>>>>
>>>> On Mon, Oct 28, 2013 at 12:33 PM, Hal Rosenstock <
>>>> hal.rosenstock at gmail.com> wrote:
>>>>
>>>>> Are those ESXi IPoIB interfaces ? Do some of these work and others not
>>>>> ? Are there normal Linux IPoIB interfaces ? Do they work ?
>>>>>
>>>>>
>>>>> On Mon, Oct 28, 2013 at 2:24 PM, Robert LeBlanc <
>>>>> robert_leblanc at byu.edu> wrote:
>>>>>
>>>>>> Yes, I can not ping them over the IPoIB interface. It is a very
>>>>>> simple network set-up.
>>>>>>
>>>>>>  desxi003
>>>>>>  8: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc
>>>>>> pfifo_fast state UP qlen 256
>>>>>>     link/infiniband
>>>>>> 80:20:00:54:fe:80:00:00:00:00:00:00:f0:4d:a2:90:97:78:e7:d1 brd
>>>>>> 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
>>>>>>     inet 192.168.9.3/24 brd 192.168.9.255 scope global ib0
>>>>>>     inet6 fe80::f24d:a290:9778:e7d1/64 scope link
>>>>>>        valid_lft forever preferred_lft forever
>>>>>>
>>>>>>  desxi004
>>>>>>  8: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc
>>>>>> pfifo_fast state UP qlen 256
>>>>>>     link/infiniband
>>>>>> 80:20:00:54:fe:80:00:00:00:00:00:00:f0:4d:a2:90:97:78:e7:15 brd
>>>>>> 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
>>>>>>     inet 192.168.9.4/24 brd 192.168.9.255 scope global ib0
>>>>>>     inet6 fe80::f24d:a290:9778:e715/64 scope link
>>>>>>        valid_lft forever preferred_lft forever
>>>>>>
>>>>>>
>>>>>>
>>>>>>  Robert LeBlanc
>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>> Brigham Young University
>>>>>>
>>>>>>
>>>>>>  On Mon, Oct 28, 2013 at 12:22 PM, Hal Rosenstock <
>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>
>>>>>>> So these 2 hosts have trouble talking IPoIB to each other ?
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Oct 28, 2013 at 2:16 PM, Robert LeBlanc <
>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>
>>>>>>>> I was just wondering about that. It seems reasonable that broadcast
>>>>>>>> traffic would go over multicast while separate channels are effectively
>>>>>>>> created for node-to-node communication; otherwise the entire multicast
>>>>>>>> group would be limited to 10 Gbps (in this instance), which doesn't
>>>>>>>> scale very well.
>>>>>>>>
>>>>>>>>  The things I've read about IPoIB performance tuning seem pretty
>>>>>>>> vague, and the changes most people recommend are already in place on
>>>>>>>> the systems I'm using. Some people said to try a newer version of
>>>>>>>> Ubuntu, but ultimately I have very little control over VMware. Once I
>>>>>>>> can get the Linux machines to communicate over IPoIB between the racks
>>>>>>>> and blades, I'll turn my attention to performance optimization. It
>>>>>>>> doesn't make much sense to spend time there when it isn't working at
>>>>>>>> all for most machines.
>>>>>>>>
>>>>>>>>  I've done ibtracert between the two nodes, is that what you mean
>>>>>>>> by walking the route?
>>>>>>>>
>>>>>>>>  [root at desxi003 ~]# ibtracert -m 0xc000 0x2d 0x37
>>>>>>>> From ca 0xf04da2909778e7d0 port 1 lid 45-45 "localhost HCA-1"
>>>>>>>> [1] -> switch 0x2c90200448ec8[17] lid 51 "Infiniscale-IV Mellanox
>>>>>>>> Technologies"
>>>>>>>> [18] -> ca 0xf04da2909778e714[1] lid 55 "localhost HCA-1"
>>>>>>>> To ca 0xf04da2909778e714 port 1 lid 55-55 "localhost HCA-1"
>>>>>>>>
>>>>>>>>  [root at desxi004 ~]# ibtracert -m 0xc000 0x37 0x2d
>>>>>>>> From ca 0xf04da2909778e714 port 1 lid 55-55 "localhost HCA-1"
>>>>>>>> [1] -> switch 0x2c90200448ec8[18] lid 51 "Infiniscale-IV Mellanox
>>>>>>>> Technologies"
>>>>>>>> [17] -> ca 0xf04da2909778e7d0[1] lid 45 "localhost HCA-1"
>>>>>>>> To ca 0xf04da2909778e7d0 port 1 lid 45-45 "localhost HCA-1"
>>>>>>>>
>>>>>>>>  As you can see, the route is on the same switch, the blades are
>>>>>>>> right next to each other.
>>>>>>>>
>>>>>>>>
>>>>>>>>  Robert LeBlanc
>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>> Brigham Young University
>>>>>>>>
>>>>>>>>
>>>>>>>>  On Mon, Oct 28, 2013 at 12:05 PM, Hal Rosenstock <
>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>
>>>>>>>>>  Which mystery is explained ? The 10 Gbps is a multicast only
>>>>>>>>> limit and does not apply to unicast. The BW limitation you're seeing is due
>>>>>>>>> to other factors. There's been much written about IPoIB performance.
>>>>>>>>>
>>>>>>>>> If all the MC members are joined and routed, then the IPoIB
>>>>>>>>> connectivity issue is some other issue. Are you sure this is the case ? Did
>>>>>>>>> you walk the route between 2 nodes where you have a connectivity issue ?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Oct 28, 2013 at 1:58 PM, Robert LeBlanc <
>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>
>>>>>>>>>> Well, that explains one mystery, now I need to figure out why it
>>>>>>>>>> seems the Dell blades are not passing the traffic.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  Robert LeBlanc
>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>> Brigham Young University
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  On Mon, Oct 28, 2013 at 11:51 AM, Hal Rosenstock <
>>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>>  Yes, that's the IPoIB IPv4 broadcast group for the default
>>>>>>>>>>> (0xffff) partition. 0x80 part of mtu and rate just means "is equal to". mtu
>>>>>>>>>>> 0x04 is 2K (2048) and rate 0x3 is 10 Gb/sec. These are indeed the defaults.
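>>>>>>>>>>>
>>>>>>>>>>> (If it helps, per the spec the field is a 2-bit selector in the top
>>>>>>>>>>> bits plus a 6-bit code, so the decode can be checked with shell
>>>>>>>>>>> arithmetic:)
>>>>>>>>>>>
>>>>>>>>>>> # 0x84: selector 2 ("exactly"), MTU code 4 = 2048 bytes
>>>>>>>>>>> printf 'selector=%d code=%d\n' $(( 0x84 >> 6 )) $(( 0x84 & 0x3f ))
>>>>>>>>>>> # 0x83: selector 2 ("exactly"), rate code 3 = 10 Gb/sec
>>>>>>>>>>> printf 'selector=%d code=%d\n' $(( 0x83 >> 6 )) $(( 0x83 & 0x3f ))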
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Oct 28, 2013 at 1:45 PM, Robert LeBlanc <
>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> The info for that MGID is:
>>>>>>>>>>>> MCMemberRecord group dump:
>>>>>>>>>>>>
>>>>>>>>>>>> MGID....................ff12:401b:ffff::ffff:ffff
>>>>>>>>>>>>                 Mlid....................0xC000
>>>>>>>>>>>>                 Mtu.....................0x84
>>>>>>>>>>>>                 pkey....................0xFFFF
>>>>>>>>>>>>                 Rate....................0x83
>>>>>>>>>>>>                 SL......................0x0
>>>>>>>>>>>>
>>>>>>>>>>>>  I don't understand the MTU and Rate values (132 and 131 decimal).
>>>>>>>>>>>> When I run iperf between the two hosts over IPoIB in connected mode
>>>>>>>>>>>> with MTU 65520, the sum is still 10 Gbps even with multiple threads.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>  Robert LeBlanc
>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>  On Mon, Oct 28, 2013 at 11:40 AM, Hal Rosenstock <
>>>>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>  saquery -g should show what MGID is mapped to MLID 0xc000
>>>>>>>>>>>>> and the group parameters.
>>>>>>>>>>>>>
>>>>>>>>>>>>>  When you say 10 Gbps max, is that multicast or unicast ?
>>>>>>>>>>>>> That limit is only on the multicast.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 1:28 PM, Robert LeBlanc <
>>>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Well, that can explain why I'm only able to get 10 Gbps max
>>>>>>>>>>>>>> from the two hosts that are working.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  I have tried updn and dnup and they didn't help either. I think
>>>>>>>>>>>>>> the only thing that will help is Automatic Path Migration, as it
>>>>>>>>>>>>>> tries very hard to route the alternative LIDs through different
>>>>>>>>>>>>>> system GUIDs. I suspect it would require re-LIDing everything,
>>>>>>>>>>>>>> which would mean an outage. I'm still trying to get answers from
>>>>>>>>>>>>>> Oracle about whether that is even a possibility. I've tried seeding
>>>>>>>>>>>>>> some of the algorithms with information like root nodes, etc., but
>>>>>>>>>>>>>> none of them worked better.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  The MLID 0xc000 exists and I can see all the nodes joined
>>>>>>>>>>>>>> to the group using saquery. I've checked the route using ibtracert
>>>>>>>>>>>>>> specifying the MLID. The only thing I'm not sure how to check is the group
>>>>>>>>>>>>>> parameters. What tool would I use for that?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  Thanks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  Robert LeBlanc
>>>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  On Mon, Oct 28, 2013 at 11:16 AM, Hal Rosenstock <
>>>>>>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  Xsigo's SM is not "straight" OpenSM. They have some proprietary
>>>>>>>>>>>>>>> enhancements, and it may be based on an old vintage of OpenSM. You
>>>>>>>>>>>>>>> will likely need to work with them/Oracle now on issues.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Lack of a partitions file does mean the default partition and
>>>>>>>>>>>>>>> default rate (10 Gbps), so from what I saw all ports had sufficient
>>>>>>>>>>>>>>> rate to join the MC group.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> There are certain topology requirements for running various
>>>>>>>>>>>>>>> routing algorithms. Did you try updn or dnup ?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The key is determining whether the IPoIB broadcast group is
>>>>>>>>>>>>>>> setup correctly. What MLID is the group built on (usually 0xc000) ? What
>>>>>>>>>>>>>>> are the group parameters (rate, MTU) ? Are all members that are running
>>>>>>>>>>>>>>> IPoIB joined ? Is the group routed to all such members ? There are
>>>>>>>>>>>>>>> infiniband-diags for all of this.
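>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> (Roughly one diag per question; the LIDs below are placeholders to
>>>>>>>>>>>>>>> fill in from your fabric:)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> saquery -g                        # MGID mapped to MLID 0xc000, plus its rate/MTU
>>>>>>>>>>>>>>> saquery MCMR                      # member records, one per joined port
>>>>>>>>>>>>>>> ibroute -M <switch_lid>           # multicast forwarding table programmed on a switch
>>>>>>>>>>>>>>> ibtracert -m 0xc000 <slid> <dlid> # is the group routed between two given ports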
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 12:19 PM, Robert LeBlanc <
>>>>>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> OpenSM (the SM runs on the Xsigo, so they manage it) is using
>>>>>>>>>>>>>>>> minhop. I've loaded the ibnetdiscover output into ibsim and run all
>>>>>>>>>>>>>>>> the different routing algorithms against it, with and without
>>>>>>>>>>>>>>>> scatter ports. Minhop had 50% of our hosts running all paths through
>>>>>>>>>>>>>>>> a single IS5030 switch (at least for the LIDs we need, which
>>>>>>>>>>>>>>>> represent the Ethernet and Fibre Channel cards the hosts should
>>>>>>>>>>>>>>>> communicate with). Ftree, dor, and dfsssp fell back to minhop; the
>>>>>>>>>>>>>>>> others routed more paths through the same IS5030, in some cases
>>>>>>>>>>>>>>>> increasing the number of hosts with a single point of failure to 75%.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  As far as I can tell there is no partitions.conf file so
>>>>>>>>>>>>>>>> I assume we are using the default partition. There is an opensm.opts file,
>>>>>>>>>>>>>>>> but it only specifies logging information.
>>>>>>>>>>>>>>>>  # SA database file name
>>>>>>>>>>>>>>>> sa_db_file /var/log/opensm-sa.dump
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  # If TRUE causes OpenSM to dump SA database at the end of
>>>>>>>>>>>>>>>> # every light sweep, regardless of the verbosity level
>>>>>>>>>>>>>>>> sa_db_dump TRUE
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  # The directory to hold the file OpenSM dumps
>>>>>>>>>>>>>>>> dump_files_dir /var/log/
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  The SM node is:
>>>>>>>>>>>>>>>>  xsigoa:/opt/xsigo/xsigos/current/ofed/etc# ibaddr
>>>>>>>>>>>>>>>> GID fe80::13:9702:100:979 LID start 0x1 end 0x1
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  We do have Switch-X in two of the Dell m1000e chassis, but the
>>>>>>>>>>>>>>>> cards, on ports 17-32, are FDR10 (the switch may be straight FDR,
>>>>>>>>>>>>>>>> but I'm not 100% sure). The IS5030s that the Switch-X are connected
>>>>>>>>>>>>>>>> to are QDR, the switches in the Xsigo directors are QDR, and the
>>>>>>>>>>>>>>>> Ethernet and Fibre Channel cards are DDR. The DDR cards will not be
>>>>>>>>>>>>>>>> running IPoIB (to my knowledge they don't have the ability); only
>>>>>>>>>>>>>>>> the hosts should be leveraging IPoIB. I hope that clears up some of
>>>>>>>>>>>>>>>> your questions. If you have more, I will try to answer them.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  Robert LeBlanc
>>>>>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  On Mon, Oct 28, 2013 at 9:57 AM, Hal Rosenstock <
>>>>>>>>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  What routing algorithm is configured in OpenSM ? What
>>>>>>>>>>>>>>>>> does your partitions.conf file look like ? Which node is your OpenSM ?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Also, I only see QDR and DDR links even though you have Switch-X,
>>>>>>>>>>>>>>>>> so I assume all FDR ports are connected to slower (QDR) devices. I
>>>>>>>>>>>>>>>>> don't see any FDR-10 ports, but maybe they're also connected to QDR
>>>>>>>>>>>>>>>>> ports and so show up as QDR in the topology.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> There are DDR CAs in the Xsigo box, but I'm not sure whether or not
>>>>>>>>>>>>>>>>> they run IPoIB.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -- Hal
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  On Sun, Oct 27, 2013 at 9:46 PM, Robert LeBlanc <
>>>>>>>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  Since you guys are amazingly helpful, I thought I would
>>>>>>>>>>>>>>>>>> pick your brains on a new problem.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  We have two Xsigo directors cross connected to four
>>>>>>>>>>>>>>>>>> Mellanox IS5030 switches. Connected to those we have four Dell m1000e
>>>>>>>>>>>>>>>>>> chassis each with two IB switches (two chassis have QDR and two have
>>>>>>>>>>>>>>>>>> FDR10). We have 9 dual-port rack servers connected to the IS5030 switches.
>>>>>>>>>>>>>>>>>> For testing purposes we have an additional Dell m1000e QDR chassis
>>>>>>>>>>>>>>>>>> connected to one Xsigo director and two dual-port FDR10 rack servers
>>>>>>>>>>>>>>>>>> connected to the other Xsigo director.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  I can get IPoIB to work between the two test rack servers
>>>>>>>>>>>>>>>>>> connected to the one Xsigo director, but I cannot get IPoIB to work
>>>>>>>>>>>>>>>>>> between any blades, either right next to each other or to the working
>>>>>>>>>>>>>>>>>> rack servers. I'm using the exact same live CentOS ISO on all four
>>>>>>>>>>>>>>>>>> servers. I've checked opensm, and the blades have joined the multicast
>>>>>>>>>>>>>>>>>> group 0xc000 properly. tcpdump basically says that traffic is not
>>>>>>>>>>>>>>>>>> leaving the blades, and it also shows no traffic entering the blades
>>>>>>>>>>>>>>>>>> from the rack servers. An ibtracert using the 0xc000 MLID shows that
>>>>>>>>>>>>>>>>>> routing exists between hosts.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  I've read about MulticastFDBTop=0xBFFF but I don't know
>>>>>>>>>>>>>>>>>> how to set it and I doubt it would have been set by default.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  Anyone have some ideas on troubleshooting steps to try?
>>>>>>>>>>>>>>>>>> I think Google is tired of me asking questions about it.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  Thanks,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  Robert LeBlanc
>>>>>>>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  _______________________________________________
>>>>>>>>>>>>>>>>>> Users mailing list
>>>>>>>>>>>>>>>>>> Users at lists.openfabrics.org
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/users
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>  _______________________________________________
>>>> Users mailing list
>>>> Users at lists.openfabrics.org
>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/users
>>>>
>>>>
>>>>  ====================================
>>>>
>>>>  Susan Coulter
>>>> HPC-3 Network/Infrastructure
>>>> 505-667-8425
>>>> Increase the Peace...
>>>> An eye for an eye leaves the whole world blind
>>>> ====================================
>>>>
>>>>
>>>
>>>  ====================================
>>>
>>>  Susan Coulter
>>> HPC-3 Network/Infrastructure
>>> 505-667-8425
>>> Increase the Peace...
>>> An eye for an eye leaves the whole world blind
>>> ====================================
>>>
>>>
>>
>