[Users] Weird IPoIB issue
Robert LeBlanc
robert_leblanc at byu.edu
Wed Nov 13 10:46:22 PST 2013
Thanks for all the help so far; this is a great community! I've fed all
this info back to Oracle and I'll have to see what they say.
Thanks,
Robert LeBlanc
OIT Infrastructure & Virtualization Engineer
Brigham Young University
On Wed, Nov 13, 2013 at 11:40 AM, Hal Rosenstock
<hal.rosenstock at gmail.com>wrote:
> Yes, this is the cause of the issues.
>
> smpdump (and smpquery) merely query (get) and don't set parameters; in any
> case, the SM would overwrite the value whenever it thought it needed to update it.
> It's an SM and/or firmware issue.
>
>
> On Wed, Nov 13, 2013 at 1:38 PM, Robert LeBlanc <robert_leblanc at byu.edu>wrote:
>
>> We are on the latest version of firmware for all of our switches (as of
>> last month). I guess I'll have to check with Oracle and see if they are
>> setting this parameter in their subnet manager. Just to confirm, using
>> smpdump (or similar) to change the value won't do any good because the
>> subnet manager will just change it back?
>>
>> I think this is the cause of the problems; now to get it fixed.
>>
>> Thanks,
>>
>>
>> Robert LeBlanc
>> OIT Infrastructure & Virtualization Engineer
>> Brigham Young University
>>
>>
>> On Wed, Nov 13, 2013 at 11:34 AM, Hal Rosenstock <
>> hal.rosenstock at gmail.com> wrote:
>>
>>> In general, MulticastFDBTop should be 0 or some value above 0xc000.
>>>
>>> Indicates the upper bound of the range of the multicast forwarding
>>> table. Packets received with MLIDs greater than MulticastFDBTop are
>>> considered to be outside the range of the Multicast Forwarding Table
>>> (see 18.2.4.3.3 Required Multicast Relay on page 1072). A valid
>>> MulticastFDBTop is less than MulticastFDBCap + 0xC000. This component
>>> applies only to switches that implement the optional multicast
>>> forwarding service. A switch shall ignore the MulticastFDBTop component
>>> if it has the value zero. The initial value for MulticastFDBTop shall be
>>> set to zero. A value of 0xBFFF means there are no
>>> MulticastForwardingTable entries.
>>> It is set by OpenSM. There is a parameter to disable its use
>>> (use_mfttop) which can be set to FALSE; this may depend on which OpenSM
>>> version you are running. In order to get out of this state, you may need to
>>> reset any switches which have this parameter set this way.
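>>>
>>> As a rough sketch (assuming a stock OpenSM that reads an options file;
>>> the Xsigo-bundled SM may not expose this at all), disabling it looks like:
>>>
>>> # in /etc/opensm/opensm.conf (or whatever options file your SM reads)
>>> use_mfttop FALSE
>>>
>>> # then restart the SM so it re-reads its options and resweeps the fabric
>>> # (service name assumed; opensmd on a typical OFED install)
>>> /etc/init.d/opensmd restart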
>>>
>>> Any idea on the firmware versions in your various switches ?
>>>
>>> -- Hal
>>>
>>>
>>> On Wed, Nov 13, 2013 at 1:16 PM, Robert LeBlanc <robert_leblanc at byu.edu>wrote:
>>>
>>>> Sorry to take so long, I've been busy with other things. Here is the
>>>> output:
>>>>
>>>> [root at desxi003 ~]# smpquery si 52
>>>> # Switch info: Lid 52
>>>> LinearFdbCap:....................49152
>>>> RandomFdbCap:....................0
>>>> McastFdbCap:.....................4096
>>>> LinearFdbTop:....................189
>>>> DefPort:.........................0
>>>> DefMcastPrimPort:................255
>>>> DefMcastNotPrimPort:.............255
>>>> LifeTime:........................18
>>>> StateChange:.....................0
>>>> OptSLtoVLMapping:................1
>>>> LidsPerPort:.....................0
>>>> PartEnforceCap:..................32
>>>> InboundPartEnf:..................1
>>>> OutboundPartEnf:.................1
>>>> FilterRawInbound:................1
>>>> FilterRawOutbound:...............1
>>>> EnhancedPort0:...................0
>>>> MulticastFDBTop:.................0xbfff
>>>> [root at desxi003 ~]# smpquery pi 52 0
>>>> # Port info: Lid 52 port 0
>>>> Mkey:............................0x0000000000000000
>>>> GidPrefix:.......................0xfe80000000000000
>>>> Lid:.............................52
>>>> SMLid:...........................49
>>>> CapMask:.........................0x42500848
>>>> IsTrapSupported
>>>> IsSLMappingSupported
>>>> IsSystemImageGUIDsupported
>>>> IsVendorClassSupported
>>>> IsCapabilityMaskNoticeSupported
>>>> IsClientRegistrationSupported
>>>> IsMulticastFDBTopSupported
>>>> DiagCode:........................0x0000
>>>> MkeyLeasePeriod:.................0
>>>> LocalPort:.......................1
>>>> LinkWidthEnabled:................1X or 4X
>>>> LinkWidthSupported:..............1X or 4X
>>>> LinkWidthActive:.................4X
>>>> LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
>>>> LinkState:.......................Active
>>>> PhysLinkState:...................LinkUp
>>>> LinkDownDefState:................Polling
>>>> ProtectBits:.....................0
>>>> LMC:.............................0
>>>> LinkSpeedActive:.................10.0 Gbps
>>>> LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
>>>> NeighborMTU:.....................4096
>>>> SMSL:............................0
>>>> VLCap:...........................VL0
>>>> InitType:........................0x00
>>>> VLHighLimit:.....................0
>>>> VLArbHighCap:....................0
>>>> VLArbLowCap:.....................0
>>>> InitReply:.......................0x00
>>>> MtuCap:..........................4096
>>>> VLStallCount:....................0
>>>> HoqLife:.........................0
>>>> OperVLs:.........................VL0
>>>> PartEnforceInb:..................0
>>>> PartEnforceOutb:.................0
>>>> FilterRawInb:....................0
>>>> FilterRawOutb:...................0
>>>> MkeyViolations:..................0
>>>> PkeyViolations:..................0
>>>> QkeyViolations:..................0
>>>> GuidCap:.........................1
>>>> ClientReregister:................0
>>>> McastPkeyTrapSuppressionEnabled:.0
>>>> SubnetTimeout:...................18
>>>> RespTimeVal:.....................20
>>>> LocalPhysErr:....................0
>>>> OverrunErr:......................0
>>>> MaxCreditHint:...................0
>>>> RoundTrip:.......................0
>>>>
>>>> From what I've read in the Mellanox Release
>>>> Notes, MulticastFDBTop=0xBFFF is supposed to discard MC traffic. The
>>>> question is, how do I set this value to something else, and what should it
>>>> be set to?
>>>>
>>>> Thanks,
>>>>
>>>>
>>>> Robert LeBlanc
>>>> OIT Infrastructure & Virtualization Engineer
>>>> Brigham Young University
>>>>
>>>>
>>>> On Wed, Oct 30, 2013 at 12:28 PM, Hal Rosenstock <
>>>> hal.rosenstock at gmail.com> wrote:
>>>>
>>>>> Determine LID of switch (in the below say switch is lid x)
>>>>> Then:
>>>>>
>>>>> smpquery si x
>>>>> (of interest are McastFdbCap and MulticastFDBTop)
>>>>> smpquery pi x 0
>>>>> (of interest is CapMask)
>>>>> ibroute -M x
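>>>>>
>>>>> If you need the switch LID in the first place, "ibswitches" from
>>>>> infiniband-diags lists each switch with its GUID, description, and LID,
>>>>> e.g. (the grep pattern is just an example for this fabric):
>>>>>
>>>>> ibswitches | grep -i infiniscale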
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Oct 29, 2013 at 3:56 PM, Robert LeBlanc <
>>>>> robert_leblanc at byu.edu> wrote:
>>>>>
>>>>>> Both ports show up in the "saquery MCMR" results with a JoinState of
>>>>>> 0x1.
>>>>>>
>>>>>> How can I dump the parameters of a non-managed switch so that I can
>>>>>> confirm that multicast is not turned off on the Dell chassis IB switches?
>>>>>>
>>>>>>
>>>>>> Robert LeBlanc
>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>> Brigham Young University
>>>>>>
>>>>>>
>>>>>> On Mon, Oct 28, 2013 at 5:04 PM, Coulter, Susan K <skc at lanl.gov>wrote:
>>>>>>
>>>>>>>
>>>>>>> /sys/class/net should give you the details on your devices, like
>>>>>>> this:
>>>>>>>
>>>>>>> -bash-4.1# cd /sys/class/net
>>>>>>> -bash-4.1# ls -l
>>>>>>> total 0
>>>>>>> lrwxrwxrwx 1 root root 0 Oct 23 12:59 eth0 ->
>>>>>>> ../../devices/pci0000:00/0000:00:02.0/0000:04:00.0/net/eth0
>>>>>>> lrwxrwxrwx 1 root root 0 Oct 23 12:59 eth1 ->
>>>>>>> ../../devices/pci0000:00/0000:00:02.0/0000:04:00.1/net/eth1
>>>>>>> lrwxrwxrwx 1 root root 0 Oct 23 15:42 ib0 ->
>>>>>>> ../../devices/pci0000:40/0000:40:0c.0/0000:47:00.0/net/ib0
>>>>>>> lrwxrwxrwx 1 root root 0 Oct 23 15:42 ib1 ->
>>>>>>> ../../devices/pci0000:40/0000:40:0c.0/0000:47:00.0/net/ib1
>>>>>>> lrwxrwxrwx 1 root root 0 Oct 23 15:42 ib2 ->
>>>>>>> ../../devices/pci0000:c0/0000:c0:0c.0/0000:c7:00.0/net/ib2
>>>>>>> lrwxrwxrwx 1 root root 0 Oct 23 15:42 ib3 ->
>>>>>>> ../../devices/pci0000:c0/0000:c0:0c.0/0000:c7:00.0/net/ib3
>>>>>>>
>>>>>>> Then use "lspci | grep Mell" to get the pci device numbers.
>>>>>>>
>>>>>>> 47:00.0 Network controller: Mellanox Technologies MT26428
>>>>>>> [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
>>>>>>> c7:00.0 Network controller: Mellanox Technologies MT26428 [ConnectX
>>>>>>> VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
>>>>>>>
>>>>>>> In this example, ib0 and 1 are referencing the device at 47:00.0
>>>>>>> And ib2 and ib3 are referencing the device at c7:00.0
>>>>>>>
>>>>>>> That said, if you only have one card - this is probably not the
>>>>>>> problem.
>>>>>>> Additionally, since the arp requests are being seen going out ib0,
>>>>>>> your emulation appears to be working.
>>>>>>>
>>>>>>> If those arp requests are not being seen on the other end, it
>>>>>>> seems like a problem with the mgids.
>>>>>>> Like maybe the port you are trying to reach is not in the IPoIB
>>>>>>> multicast group?
>>>>>>>
>>>>>>> You can look at all the multicast member records with "saquery
>>>>>>> MCMR".
>>>>>>> Or - you can grep for mcmr_rcv_join_mgrp references in your SM logs …
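>>>>>>>
>>>>>>> For example (the log path is an assumption; use wherever your SM actually
>>>>>>> writes its log, and substitute the port GUID you care about, here
>>>>>>> desxi003's ib0):
>>>>>>>
>>>>>>> grep mcmr_rcv_join_mgrp /var/log/opensm.log | grep -i 0xf04da2909778e7d1
>>>>>>>
>>>>>>> which shows each join request the SM received for that port.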
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Oct 28, 2013, at 1:08 PM, Robert LeBlanc <robert_leblanc at byu.edu>
>>>>>>> wrote:
>>>>>>>
>>>>>>> I can ibping between both hosts just fine.
>>>>>>>
>>>>>>> [root at desxi003 ~]# ibping 0x37
>>>>>>> Pong from desxi004.(none) (Lid 55): time 0.111 ms
>>>>>>> Pong from desxi004.(none) (Lid 55): time 0.189 ms
>>>>>>> Pong from desxi004.(none) (Lid 55): time 0.189 ms
>>>>>>> Pong from desxi004.(none) (Lid 55): time 0.179 ms
>>>>>>> ^C
>>>>>>> --- desxi004.(none) (Lid 55) ibping statistics ---
>>>>>>> 4 packets transmitted, 4 received, 0% packet loss, time 3086 ms
>>>>>>> rtt min/avg/max = 0.111/0.167/0.189 ms
>>>>>>>
>>>>>>> [root at desxi004 ~]# ibping 0x2d
>>>>>>> Pong from desxi003.(none) (Lid 45): time 0.156 ms
>>>>>>> Pong from desxi003.(none) (Lid 45): time 0.175 ms
>>>>>>> Pong from desxi003.(none) (Lid 45): time 0.176 ms
>>>>>>> ^C
>>>>>>> --- desxi003.(none) (Lid 45) ibping statistics ---
>>>>>>> 3 packets transmitted, 3 received, 0% packet loss, time 2302 ms
>>>>>>> rtt min/avg/max = 0.156/0.169/0.176 ms
>>>>>>>
>>>>>>> When I do an Ethernet ping to the IPoIB address, tcpdump only
>>>>>>> shows the outgoing ARP request.
>>>>>>>
>>>>>>> [root at desxi003 ~]# tcpdump -i ib0
>>>>>>> tcpdump: verbose output suppressed, use -v or -vv for full protocol
>>>>>>> decode
>>>>>>> listening on ib0, link-type LINUX_SLL (Linux cooked), capture size
>>>>>>> 65535 bytes
>>>>>>> 19:00:08.950320 ARP, Request who-has 192.168.9.4 tell 192.168.9.3,
>>>>>>> length 56
>>>>>>> 19:00:09.950320 ARP, Request who-has 192.168.9.4 tell 192.168.9.3,
>>>>>>> length 56
>>>>>>> 19:00:10.950307 ARP, Request who-has 192.168.9.4 tell 192.168.9.3,
>>>>>>> length 56
>>>>>>>
>>>>>>> Running tcpdump on the rack servers, I don't see the ARP request
>>>>>>> there, which I should.
>>>>>>>
>>>>>>> From what I've read, ib0 should be mapped to the first port and
>>>>>>> ib1 should be mapped to the second port. We have one IB card with two
>>>>>>> ports. The modprobe is the default installed with the Mellanox drivers.
>>>>>>>
>>>>>>> [root at desxi003 etc]# cat modprobe.d/ib_ipoib.conf
>>>>>>> # install ib_ipoib modprobe --ignore-install ib_ipoib &&
>>>>>>> /sbin/ib_ipoib_sysctl load
>>>>>>> # remove ib_ipoib /sbin/ib_ipoib_sysctl unload ; modprobe -r
>>>>>>> --ignore-remove ib_ipoib
>>>>>>> alias ib0 ib_ipoib
>>>>>>> alias ib1 ib_ipoib
>>>>>>>
>>>>>>> Can you give me some pointers on digging into the device layer to
>>>>>>> make sure IPoIB is connected correctly? Would I look in /sys or /proc for
>>>>>>> that?
>>>>>>>
>>>>>>> Dell has not been able to replicate the problem in their
>>>>>>> environment; they only support Red Hat and won't work with my CentOS
>>>>>>> live CD. These blades don't have internal hard drives, which makes it hard
>>>>>>> to install any OS. I don't know if I can engage Mellanox since they build
>>>>>>> the switch hardware and driver stack we are using.
>>>>>>>
>>>>>>> I really appreciate all the help you guys have given thus far; I'm
>>>>>>> learning a lot as this progresses. I'm reading through
>>>>>>> https://tools.ietf.org/html/rfc4391 trying to understand IPoIB from
>>>>>>> top to bottom.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>>
>>>>>>> Robert LeBlanc
>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>> Brigham Young University
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Oct 28, 2013 at 12:53 PM, Coulter, Susan K <skc at lanl.gov>wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> If you are not seeing any packets leave the ib0 interface, it
>>>>>>>> sounds like the emulation layer is not connected to the right device.
>>>>>>>>
>>>>>>>> If the ib_ipoib kernel module is loaded, and a simple native IB test
>>>>>>>> (like ib_read_bw) works between those blades, you need to dig into the
>>>>>>>> device layer and ensure IPoIB is "connected" to the right device.
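>>>>>>>>
>>>>>>>> A minimal ib_read_bw run looks roughly like this (a sketch only: it needs
>>>>>>>> a working IP path for its TCP handshake, so use a management-network
>>>>>>>> address if IPoIB is broken, and the mlx4_0 device name is an assumption
>>>>>>>> based on the ConnectX HCAs in this thread):
>>>>>>>>
>>>>>>>> # on one blade (server side)
>>>>>>>> ib_read_bw -d mlx4_0 -i 1
>>>>>>>> # on the other blade (client side)
>>>>>>>> ib_read_bw -d mlx4_0 -i 1 <server mgmt IP>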
>>>>>>>>
>>>>>>>> Do you have more than 1 IB card?
>>>>>>>> What does your modprobe config look like for ipoib?
>>>>>>>>
>>>>>>>>
>>>>>>>> On Oct 28, 2013, at 12:38 PM, Robert LeBlanc <
>>>>>>>> robert_leblanc at byu.edu>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> These ESX hosts (2 blade servers and 2 rack servers) are booted
>>>>>>>> into a CentOS 6.2 Live CD that I built. Right now everything I'm trying to
>>>>>>>> get working is CentOS 6.2. All of our other hosts are running ESXi and have
>>>>>>>> IPoIB interfaces, but none of them are configured and I'm not trying to get
>>>>>>>> those working right now.
>>>>>>>>
>>>>>>>> Ideally, we would like our ESX hosts to communicate with each
>>>>>>>> other for vMotion and protected VM traffic as well as with our Commvault
>>>>>>>> backup servers (Windows) over IPoIB (or Oracle's PVI which is very similar).
>>>>>>>>
>>>>>>>>
>>>>>>>> Robert LeBlanc
>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>> Brigham Young University
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Oct 28, 2013 at 12:33 PM, Hal Rosenstock <
>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Are those ESXi IPoIB interfaces ? Do some of these work and others
>>>>>>>>> not ? Are there normal Linux IPoIB interfaces ? Do they work ?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Oct 28, 2013 at 2:24 PM, Robert LeBlanc <
>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>
>>>>>>>>>> Yes, I can not ping them over the IPoIB interface. It is a very
>>>>>>>>>> simple network set-up.
>>>>>>>>>>
>>>>>>>>>> desxi003
>>>>>>>>>> 8: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc
>>>>>>>>>> pfifo_fast state UP qlen 256
>>>>>>>>>> link/infiniband
>>>>>>>>>> 80:20:00:54:fe:80:00:00:00:00:00:00:f0:4d:a2:90:97:78:e7:d1 brd
>>>>>>>>>> 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
>>>>>>>>>> inet 192.168.9.3/24 brd 192.168.9.255 scope global ib0
>>>>>>>>>> inet6 fe80::f24d:a290:9778:e7d1/64 scope link
>>>>>>>>>> valid_lft forever preferred_lft forever
>>>>>>>>>>
>>>>>>>>>> desxi004
>>>>>>>>>> 8: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc
>>>>>>>>>> pfifo_fast state UP qlen 256
>>>>>>>>>> link/infiniband
>>>>>>>>>> 80:20:00:54:fe:80:00:00:00:00:00:00:f0:4d:a2:90:97:78:e7:15 brd
>>>>>>>>>> 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
>>>>>>>>>> inet 192.168.9.4/24 brd 192.168.9.255 scope global ib0
>>>>>>>>>> inet6 fe80::f24d:a290:9778:e715/64 scope link
>>>>>>>>>> valid_lft forever preferred_lft forever
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Robert LeBlanc
>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>> Brigham Young University
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 28, 2013 at 12:22 PM, Hal Rosenstock <
>>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> So these 2 hosts have trouble talking IPoIB to each other ?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Oct 28, 2013 at 2:16 PM, Robert LeBlanc <
>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I was just wondering about that. It seems reasonable that the
>>>>>>>>>>>> broadcast traffic would go over multicast, but effectively channels would
>>>>>>>>>>>> be created for node-to-node communication; otherwise the entire multicast
>>>>>>>>>>>> group would be limited to 10 Gbps (in this instance).
>>>>>>>>>>>> That doesn't scale very well.
>>>>>>>>>>>>
>>>>>>>>>>>> The things I've read about IPoIB performance tuning seem
>>>>>>>>>>>> pretty vague, and the changes most people recommend seem to be already in
>>>>>>>>>>>> place on the systems I'm using. Some people said, try using a newer version
>>>>>>>>>>>> of Ubuntu, but ultimately, I have very little control over VMware. Once I
>>>>>>>>>>>> can get the Linux machines to communicate IPoIB between the racks and
>>>>>>>>>>>> blades, then I'm going to turn my attention over to performance
>>>>>>>>>>>> optimization. It doesn't seem to make much sense to spend time there when
>>>>>>>>>>>> it is not working at all for most machines.
>>>>>>>>>>>>
>>>>>>>>>>>> I've done an ibtracert between the two nodes; is that what you
>>>>>>>>>>>> mean by walking the route?
>>>>>>>>>>>>
>>>>>>>>>>>> [root at desxi003 ~]# ibtracert -m 0xc000 0x2d 0x37
>>>>>>>>>>>> From ca 0xf04da2909778e7d0 port 1 lid 45-45 "localhost HCA-1"
>>>>>>>>>>>> [1] -> switch 0x2c90200448ec8[17] lid 51 "Infiniscale-IV
>>>>>>>>>>>> Mellanox Technologies"
>>>>>>>>>>>> [18] -> ca 0xf04da2909778e714[1] lid 55 "localhost HCA-1"
>>>>>>>>>>>> To ca 0xf04da2909778e714 port 1 lid 55-55 "localhost HCA-1"
>>>>>>>>>>>>
>>>>>>>>>>>> [root at desxi004 ~]# ibtracert -m 0xc000 0x37 0x2d
>>>>>>>>>>>> From ca 0xf04da2909778e714 port 1 lid 55-55 "localhost HCA-1"
>>>>>>>>>>>> [1] -> switch 0x2c90200448ec8[18] lid 51 "Infiniscale-IV
>>>>>>>>>>>> Mellanox Technologies"
>>>>>>>>>>>> [17] -> ca 0xf04da2909778e7d0[1] lid 45 "localhost HCA-1"
>>>>>>>>>>>> To ca 0xf04da2909778e7d0 port 1 lid 45-45 "localhost HCA-1"
>>>>>>>>>>>>
>>>>>>>>>>>> As you can see, the route is on the same switch; the blades
>>>>>>>>>>>> are right next to each other.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Oct 28, 2013 at 12:05 PM, Hal Rosenstock <
>>>>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Which mystery is explained ? The 10 Gbps is a multicast only
>>>>>>>>>>>>> limit and does not apply to unicast. The BW limitation you're seeing is due
>>>>>>>>>>>>> to other factors. There's been much written about IPoIB performance.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If all the MC members are joined and routed, then the IPoIB
>>>>>>>>>>>>> connectivity problem is something else. Are you sure this is the case ? Did
>>>>>>>>>>>>> you walk the route between 2 nodes where you have a connectivity issue ?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 1:58 PM, Robert LeBlanc <
>>>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Well, that explains one mystery; now I need to figure out why
>>>>>>>>>>>>>> it seems the Dell blades are not passing the traffic.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 11:51 AM, Hal Rosenstock <
>>>>>>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, that's the IPoIB IPv4 broadcast group for the default
>>>>>>>>>>>>>>> (0xffff) partition. The 0x80 part of mtu and rate just means "is equal to"; mtu
>>>>>>>>>>>>>>> 0x04 is 2K (2048) and rate 0x3 is 10 Gb/sec. These are indeed the defaults.
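>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Spelled out against the dump above (my reading of the MCMemberRecord
>>>>>>>>>>>>>>> encoding):
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Mtu  0x84 = 0x80 (selector "exactly") | 0x04 (MTU code 4  = 2048 bytes)
>>>>>>>>>>>>>>> Rate 0x83 = 0x80 (selector "exactly") | 0x03 (rate code 3 = 10 Gb/s)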
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 1:45 PM, Robert LeBlanc <
>>>>>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The info for that MGID is:
>>>>>>>>>>>>>>>> MCMemberRecord group dump:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> MGID....................ff12:401b:ffff::ffff:ffff
>>>>>>>>>>>>>>>> Mlid....................0xC000
>>>>>>>>>>>>>>>> Mtu.....................0x84
>>>>>>>>>>>>>>>> pkey....................0xFFFF
>>>>>>>>>>>>>>>> Rate....................0x83
>>>>>>>>>>>>>>>> SL......................0x0
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I don't understand the MTU and Rate (132 and 131 decimal).
>>>>>>>>>>>>>>>> When I run iperf between the two hosts over IPoIB in connected mode with MTU
>>>>>>>>>>>>>>>> 65520, I've tried multiple threads, but the sum is still 10 Gbps.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 11:40 AM, Hal Rosenstock <
>>>>>>>>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> saquery -g should show what MGID is mapped to MLID
>>>>>>>>>>>>>>>>> 0xc000 and the group parameters.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> When you say 10 Gbps max, is that multicast or unicast ?
>>>>>>>>>>>>>>>>> That limit is only on the multicast.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 1:28 PM, Robert LeBlanc <
>>>>>>>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Well, that can explain why I'm only able to get 10 Gbps
>>>>>>>>>>>>>>>>>> max from the two hosts that are working.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I have tried updn and dnup and they didn't help either.
>>>>>>>>>>>>>>>>>> I think the only thing that will help is Automatic Path Migration, if it
>>>>>>>>>>>>>>>>>> tries very hard to route the alternative LIDs through different
>>>>>>>>>>>>>>>>>> systemguids. I suspect it would require re-LIDing everything which would
>>>>>>>>>>>>>>>>>> mean an outage. I'm still trying to get answers from Oracle if that is even
>>>>>>>>>>>>>>>>>> a possibility. I've tried seeding some of the algorithms with information
>>>>>>>>>>>>>>>>>> like root nodes, etc, but none of them worked better.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The MLID 0xc000 exists and I can see all the nodes
>>>>>>>>>>>>>>>>>> joined to the group using saquery. I've checked the route using ibtracert
>>>>>>>>>>>>>>>>>> specifying the MLID. The only thing I'm not sure how to check is the group
>>>>>>>>>>>>>>>>>> parameters. What tool would I use for that?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 11:16 AM, Hal Rosenstock <
>>>>>>>>>>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Xsigo's SM is not "straight" OpenSM. They have some
>>>>>>>>>>>>>>>>>>> proprietary enhancements, and it may be based on an old vintage of OpenSM. You
>>>>>>>>>>>>>>>>>>> will likely need to work with them/Oracle now on issues.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Lack of a partitions file does mean the default partition
>>>>>>>>>>>>>>>>>>> and default rate (10 Gbps), so from what I saw all ports had sufficient rate
>>>>>>>>>>>>>>>>>>> to join the MC group.
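>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> For reference, a partitions.conf that raised the IPoIB group rate would
>>>>>>>>>>>>>>>>>>> look something like the line below. This is only a sketch for a stock
>>>>>>>>>>>>>>>>>>> OpenSM, and any port slower than the new rate (such as the DDR Xsigo
>>>>>>>>>>>>>>>>>>> cards here) could then no longer join the group:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> # rate 7 = 40 Gb/s, mtu 5 = 4096 bytes
>>>>>>>>>>>>>>>>>>> Default=0x7fff, ipoib, rate=7, mtu=5 : ALL=full;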
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> There are certain topology requirements for running
>>>>>>>>>>>>>>>>>>> various routing algorithms. Did you try updn or dnup ?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The key is determining whether the IPoIB broadcast group
>>>>>>>>>>>>>>>>>>> is set up correctly. What MLID is the group built on (usually 0xc000) ? What
>>>>>>>>>>>>>>>>>>> are the group parameters (rate, MTU) ? Are all members that are running
>>>>>>>>>>>>>>>>>>> IPoIB joined ? Is the group routed to all such members ? There are
>>>>>>>>>>>>>>>>>>> infiniband-diags for all of this.
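>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Roughly (the switch LID below is a placeholder):
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> saquery -g                # group records: MGID, MLID, rate, MTU
>>>>>>>>>>>>>>>>>>> saquery MCMR              # member records; check every IPoIB port is joined
>>>>>>>>>>>>>>>>>>> ibroute -M <switch lid>   # multicast forwarding table on each switch in the path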
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 12:19 PM, Robert LeBlanc <
>>>>>>>>>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> OpenSM (the SM runs on Xsigo so they manage it) is
>>>>>>>>>>>>>>>>>>>> using minhop. I've loaded the ibnetdiscover output into ibsim and run all
>>>>>>>>>>>>>>>>>>>> the different routing algorithms against it with and without scatter ports.
>>>>>>>>>>>>>>>>>>>> Minhop had 50% of our hosts running all paths through a single IS5030
>>>>>>>>>>>>>>>>>>>> switch (at least the LIDs we need which represent Ethernet and Fibre
>>>>>>>>>>>>>>>>>>>> Channel cards the hosts should communicate with). Ftree, dor, and dfsssp
>>>>>>>>>>>>>>>>>>>> fell back to minhop; the others routed more paths through the same IS5030,
>>>>>>>>>>>>>>>>>>>> in some cases increasing the number of hosts with a single point of failure
>>>>>>>>>>>>>>>>>>>> to 75%.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> As far as I can tell there is no partitions.conf file
>>>>>>>>>>>>>>>>>>>> so I assume we are using the default partition. There is an opensm.opts
>>>>>>>>>>>>>>>>>>>> file, but it only specifies logging information.
>>>>>>>>>>>>>>>>>>>> # SA database file name
>>>>>>>>>>>>>>>>>>>> sa_db_file /var/log/opensm-sa.dump
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> # If TRUE causes OpenSM to dump SA database at the
>>>>>>>>>>>>>>>>>>>> end of
>>>>>>>>>>>>>>>>>>>> # every light sweep, regardless of the verbosity level
>>>>>>>>>>>>>>>>>>>> sa_db_dump TRUE
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> # The directory to hold the file OpenSM dumps
>>>>>>>>>>>>>>>>>>>> dump_files_dir /var/log/
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The SM node is:
>>>>>>>>>>>>>>>>>>>> xsigoa:/opt/xsigo/xsigos/current/ofed/etc# ibaddr
>>>>>>>>>>>>>>>>>>>> GID fe80::13:9702:100:979 LID start 0x1 end 0x1
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> We do have Switch-X in two of the Dell m1000e chassis
>>>>>>>>>>>>>>>>>>>> but the cards, ports 17-32, are FDR10 (the switch may be straight FDR, but
>>>>>>>>>>>>>>>>>>>> I'm not 100% sure). The IS5030s that the Switch-X are connected to are QDR,
>>>>>>>>>>>>>>>>>>>> the switches in the Xsigo directors are QDR, but the Ethernet and Fibre
>>>>>>>>>>>>>>>>>>>> Channel cards are DDR. The DDR cards will not be running IPoIB (at least to
>>>>>>>>>>>>>>>>>>>> my knowledge they don't have the ability), only the hosts should be
>>>>>>>>>>>>>>>>>>>> leveraging IPoIB. I hope that clears up some of your questions. If you have
>>>>>>>>>>>>>>>>>>>> more, I will try to answer them.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 9:57 AM, Hal Rosenstock <
>>>>>>>>>>>>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> What routing algorithm is configured in OpenSM ?
>>>>>>>>>>>>>>>>>>>>> What does your partitions.conf file look like ? Which node is your OpenSM ?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Also, I only see QDR and DDR links although you have
>>>>>>>>>>>>>>>>>>>>> Switch-X so I assume all FDR ports are connected to slower (QDR) devices. I
>>>>>>>>>>>>>>>>>>>>> don't see any FDR-10 ports but maybe they're also connected to QDR ports so
>>>>>>>>>>>>>>>>>>>>> show up as QDR in the topology.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> There are DDR CAs in Xsigo box but not sure whether or
>>>>>>>>>>>>>>>>>>>>> not they run IPoIB.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> -- Hal
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Sun, Oct 27, 2013 at 9:46 PM, Robert LeBlanc <
>>>>>>>>>>>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Since you guys are amazingly helpful, I thought I
>>>>>>>>>>>>>>>>>>>>>> would pick your brains on a new problem.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> We have two Xsigo directors cross connected to four
>>>>>>>>>>>>>>>>>>>>>> Mellanox IS5030 switches. Connected to those we have four Dell m1000e
>>>>>>>>>>>>>>>>>>>>>> chassis each with two IB switches (two chassis have QDR and two have
>>>>>>>>>>>>>>>>>>>>>> FDR10). We have 9 dual-port rack servers connected to the IS5030 switches.
>>>>>>>>>>>>>>>>>>>>>> For testing purposes we have an additional Dell m1000e QDR chassis
>>>>>>>>>>>>>>>>>>>>>> connected to one Xsigo director and two dual-port FDR10 rack servers
>>>>>>>>>>>>>>>>>>>>>> connected to the other Xsigo director.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I can get IPoIB to work between the two test rack
>>>>>>>>>>>>>>>>>>>>>> servers connected to the one Xsigo director. But I can not get IPoIB to
>>>>>>>>>>>>>>>>>>>>>> work between any blades either right next to each other or to the working
>>>>>>>>>>>>>>>>>>>>>> rack servers. I'm using the same exact live CentOS ISO on all four servers.
>>>>>>>>>>>>>>>>>>>>>> I've checked opensm and the blades have joined the multicast group 0xc000
>>>>>>>>>>>>>>>>>>>>>> properly. tcpdump basically says that traffic is not leaving the blades.
>>>>>>>>>>>>>>>>>>>>>> tcpdump also shows no traffic entering the blades from the rack servers. An
>>>>>>>>>>>>>>>>>>>>>> ibtracert using 0xc000 mlid shows that routing exists between hosts.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I've read about MulticastFDBTop=0xBFFF but I don't
>>>>>>>>>>>>>>>>>>>>>> know how to set it and I doubt it would have been set by default.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Anyone have some ideas on troubleshooting steps to
>>>>>>>>>>>>>>>>>>>>>> try? I think Google is tired of me asking questions about it.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> ====================================
>>>>>>>>
>>>>>>>> Susan Coulter
>>>>>>>> HPC-3 Network/Infrastructure
>>>>>>>> 505-667-8425
>>>>>>>> Increase the Peace...
>>>>>>>> An eye for an eye leaves the whole world blind
>>>>>>>> ====================================
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>