[Users] Weird IPoIB issue
Hal Rosenstock
hal.rosenstock at gmail.com
Wed Nov 13 10:52:07 PST 2013
What's the latest firmware version ?
Can you determine the firmware version of the switches ? vendstat -N
<switch lid> might work to show this.
This is important...
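
For example, if the switch of interest turns out to be the one at LID 52 in the
smpquery output below (and assuming vendstat from infiniband-diags is installed
on the node where you run the other diags), something like this might show it,
although the output is vendor-specific:

vendstat -N 52
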
Thanks.
-- Hal
On Wed, Nov 13, 2013 at 1:46 PM, Robert LeBlanc <robert_leblanc at byu.edu> wrote:
> Thanks for all the help so far, this is a great community! I've fed all
> this info back to Oracle and I'll have to see what they say.
>
> Thanks,
>
>
> Robert LeBlanc
> OIT Infrastructure & Virtualization Engineer
> Brigham Young University
>
>
> On Wed, Nov 13, 2013 at 11:40 AM, Hal Rosenstock <hal.rosenstock at gmail.com
> > wrote:
>
>> Yes, this is the cause of the issues.
>>
>> smpdump (and smpquery) merely query (get) parameters; they don't set them.
>> Even if you could change the value by hand, the SM would overwrite it
>> whenever it decided an update was needed. It's an SM and/or firmware issue.
>>
>>
>> On Wed, Nov 13, 2013 at 1:38 PM, Robert LeBlanc <robert_leblanc at byu.edu> wrote:
>>
>>> We are on the latest version of firmware for all of our switches (as of
>>> last month). I guess I'll have to check with Oracle and see if they are
>>> setting this parameter in their subnet manager. Just to confirm, using
>>> smpdump (or similar) to change the value won't do any good because the
>>> subnet manager will just change it back?
>>>
>>> I think this is the cause of the problems, now to get it fixed.
>>>
>>> Thanks,
>>>
>>>
>>> Robert LeBlanc
>>> OIT Infrastructure & Virtualization Engineer
>>> Brigham Young University
>>>
>>>
>>> On Wed, Nov 13, 2013 at 11:34 AM, Hal Rosenstock <
>>> hal.rosenstock at gmail.com> wrote:
>>>
>>>> In general, MulticastFDBTop should be 0 or some value above 0xc000.
>>>>
>>>>
>>>> Indicates the upper bound of the range of the multicast forwarding
>>>> table. Packets received with MLIDs greater than MulticastFDBTop are
>>>> considered to be outside the range of the Multicast Forwarding Table
>>>> (see 18.2.4.3.3 Required Multicast Relay on page 1072). A valid
>>>> MulticastFDBTop is less than MulticastFDBCap + 0xC000. This component
>>>> applies only to switches that implement the optional multicast
>>>> forwarding service. A switch shall ignore the MulticastFDBTop component
>>>> if it has the value zero. The initial value for MulticastFDBTop shall be
>>>> set to zero. A value of 0xBFFF means there are no
>>>> MulticastForwardingTable entries.
>>>>
>>>> It is set by OpenSM. There is a parameter to disable its use
>>>> (use_mfttop) which can be set to FALSE; whether it is available may depend
>>>> on which OpenSM version you are running. In order to get out of this state,
>>>> you may need to reset any switches that have MulticastFDBTop set this way.
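>>>>
>>>> For example, with a stock OpenSM this would be a config-file option (whether
>>>> the Xsigo/Oracle SM exposes it, and the exact options file it uses, is an
>>>> assumption you would need to confirm with them):
>>>>
>>>> # in opensm's options file (e.g. opensm.conf / opensm.opts), if supported:
>>>> use_mfttop FALSE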
>>>>
>>>> Any idea on the firmware versions in your various switches ?
>>>>
>>>> -- Hal
>>>>
>>>>
>>>> On Wed, Nov 13, 2013 at 1:16 PM, Robert LeBlanc <robert_leblanc at byu.edu
>>>> > wrote:
>>>>
>>>>> Sorry to take so long, I've been busy with other things. Here is the
>>>>> output:
>>>>>
>>>>> [root at desxi003 ~]# smpquery si 52
>>>>> # Switch info: Lid 52
>>>>> LinearFdbCap:....................49152
>>>>> RandomFdbCap:....................0
>>>>> McastFdbCap:.....................4096
>>>>> LinearFdbTop:....................189
>>>>> DefPort:.........................0
>>>>> DefMcastPrimPort:................255
>>>>> DefMcastNotPrimPort:.............255
>>>>> LifeTime:........................18
>>>>> StateChange:.....................0
>>>>> OptSLtoVLMapping:................1
>>>>> LidsPerPort:.....................0
>>>>> PartEnforceCap:..................32
>>>>> InboundPartEnf:..................1
>>>>> OutboundPartEnf:.................1
>>>>> FilterRawInbound:................1
>>>>> FilterRawOutbound:...............1
>>>>> EnhancedPort0:...................0
>>>>> MulticastFDBTop:.................0xbfff
>>>>> [root at desxi003 ~]# smpquery pi 52 0
>>>>> # Port info: Lid 52 port 0
>>>>> Mkey:............................0x0000000000000000
>>>>> GidPrefix:.......................0xfe80000000000000
>>>>> Lid:.............................52
>>>>> SMLid:...........................49
>>>>> CapMask:.........................0x42500848
>>>>> IsTrapSupported
>>>>> IsSLMappingSupported
>>>>> IsSystemImageGUIDsupported
>>>>> IsVendorClassSupported
>>>>> IsCapabilityMaskNoticeSupported
>>>>> IsClientRegistrationSupported
>>>>> IsMulticastFDBTopSupported
>>>>> DiagCode:........................0x0000
>>>>> MkeyLeasePeriod:.................0
>>>>> LocalPort:.......................1
>>>>> LinkWidthEnabled:................1X or 4X
>>>>> LinkWidthSupported:..............1X or 4X
>>>>> LinkWidthActive:.................4X
>>>>> LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
>>>>> LinkState:.......................Active
>>>>> PhysLinkState:...................LinkUp
>>>>> LinkDownDefState:................Polling
>>>>> ProtectBits:.....................0
>>>>> LMC:.............................0
>>>>> LinkSpeedActive:.................10.0 Gbps
>>>>> LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
>>>>> NeighborMTU:.....................4096
>>>>> SMSL:............................0
>>>>> VLCap:...........................VL0
>>>>> InitType:........................0x00
>>>>> VLHighLimit:.....................0
>>>>> VLArbHighCap:....................0
>>>>> VLArbLowCap:.....................0
>>>>> InitReply:.......................0x00
>>>>> MtuCap:..........................4096
>>>>> VLStallCount:....................0
>>>>> HoqLife:.........................0
>>>>> OperVLs:.........................VL0
>>>>> PartEnforceInb:..................0
>>>>> PartEnforceOutb:.................0
>>>>> FilterRawInb:....................0
>>>>> FilterRawOutb:...................0
>>>>> MkeyViolations:..................0
>>>>> PkeyViolations:..................0
>>>>> QkeyViolations:..................0
>>>>> GuidCap:.........................1
>>>>> ClientReregister:................0
>>>>> McastPkeyTrapSuppressionEnabled:.0
>>>>> SubnetTimeout:...................18
>>>>> RespTimeVal:.....................20
>>>>> LocalPhysErr:....................0
>>>>> OverrunErr:......................0
>>>>> MaxCreditHint:...................0
>>>>> RoundTrip:.......................0
>>>>>
>>>>> From what I've read in the Mellanox Release Notes,
>>>>> MulticastFDBTop=0xBFFF is supposed to discard MC traffic. The question
>>>>> is: how do I set this value to something else, and what should it be
>>>>> set to?
>>>>>
>>>>> Thanks,
>>>>>
>>>>>
>>>>> Robert LeBlanc
>>>>> OIT Infrastructure & Virtualization Engineer
>>>>> Brigham Young University
>>>>>
>>>>>
>>>>> On Wed, Oct 30, 2013 at 12:28 PM, Hal Rosenstock <
>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>
>>>>>> Determine LID of switch (in the below say switch is lid x)
>>>>>> Then:
>>>>>>
>>>>>> smpquery si x
>>>>>> (of interest are McastFdbCap and MulticastFDBTop)
>>>>>> smpquery pi x 0
>>>>>> (of interest is CapMask)
>>>>>> ibroute -M x
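>>>>>>
>>>>>> For example, assuming standard infiniband-diags and taking LID 52 as the
>>>>>> switch (substitute your own; ibswitches will list every switch with its
>>>>>> LID if you need to find it):
>>>>>>
>>>>>> ibswitches            # list switches and their LIDs
>>>>>> smpquery si 52        # of interest: McastFdbCap, MulticastFDBTop
>>>>>> smpquery pi 52 0      # of interest: CapMask
>>>>>> ibroute -M 52         # dump the switch's multicast forwarding table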
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Oct 29, 2013 at 3:56 PM, Robert LeBlanc <
>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>
>>>>>>> Both ports show up in the "saquery MCMR" results with a JoinState of
>>>>>>> 0x1.
>>>>>>>
>>>>>>> How can I dump the parameters of a non-managed switch so that I can
>>>>>>> confirm that multicast is not turned off on the Dell chassis IB switches?
>>>>>>>
>>>>>>>
>>>>>>> Robert LeBlanc
>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>> Brigham Young University
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Oct 28, 2013 at 5:04 PM, Coulter, Susan K <skc at lanl.gov> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> /sys/class/net should give you the details on your devices, like
>>>>>>>> this:
>>>>>>>>
>>>>>>>> -bash-4.1# cd /sys/class/net
>>>>>>>> -bash-4.1# ls -l
>>>>>>>> total 0
>>>>>>>> lrwxrwxrwx 1 root root 0 Oct 23 12:59 eth0 ->
>>>>>>>> ../../devices/pci0000:00/0000:00:02.0/0000:04:00.0/net/eth0
>>>>>>>> lrwxrwxrwx 1 root root 0 Oct 23 12:59 eth1 ->
>>>>>>>> ../../devices/pci0000:00/0000:00:02.0/0000:04:00.1/net/eth1
>>>>>>>> lrwxrwxrwx 1 root root 0 Oct 23 15:42 ib0 ->
>>>>>>>> ../../devices/pci0000:40/0000:40:0c.0/0000:47:00.0/net/ib0
>>>>>>>> lrwxrwxrwx 1 root root 0 Oct 23 15:42 ib1 ->
>>>>>>>> ../../devices/pci0000:40/0000:40:0c.0/0000:47:00.0/net/ib1
>>>>>>>> lrwxrwxrwx 1 root root 0 Oct 23 15:42 ib2 ->
>>>>>>>> ../../devices/pci0000:c0/0000:c0:0c.0/0000:c7:00.0/net/ib2
>>>>>>>> lrwxrwxrwx 1 root root 0 Oct 23 15:42 ib3 ->
>>>>>>>> ../../devices/pci0000:c0/0000:c0:0c.0/0000:c7:00.0/net/ib3
>>>>>>>>
>>>>>>>> Then use "lspci | grep Mell" to get the pci device numbers.
>>>>>>>>
>>>>>>>> 47:00.0 Network controller: Mellanox Technologies MT26428
>>>>>>>> [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
>>>>>>>> c7:00.0 Network controller: Mellanox Technologies MT26428 [ConnectX
>>>>>>>> VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
>>>>>>>>
>>>>>>>> In this example, ib0 and ib1 are referencing the device at 47:00.0,
>>>>>>>> and ib2 and ib3 are referencing the device at c7:00.0.
>>>>>>>>
>>>>>>>> That said, if you only have one card - this is probably not the
>>>>>>>> problem.
>>>>>>>> Additionally, since the ARP requests are being seen going out ib0,
>>>>>>>> your emulation appears to be working.
>>>>>>>>
>>>>>>>> If those ARP requests are not being seen on the other end, it seems
>>>>>>>> like a problem with the MGIDs; maybe the port you are trying to reach
>>>>>>>> is not in the IPoIB multicast group?
>>>>>>>>
>>>>>>>> You can look at all the multicast member records with "saquery
>>>>>>>> MCMR".
>>>>>>>> Or you can grep for mcmr_rcv_join_mgrp references in your SM logs.
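>>>>>>>>
>>>>>>>> For example, assuming the SM writes to the usual default log location
>>>>>>>> (the Xsigo SM may well log somewhere else):
>>>>>>>>
>>>>>>>> grep mcmr_rcv_join_mgrp /var/log/opensm.log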
>>>>>>>>
>>>>>>>> HTH
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Oct 28, 2013, at 1:08 PM, Robert LeBlanc <
>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>
>>>>>>>> I can ibping between both hosts just fine.
>>>>>>>>
>>>>>>>> [root at desxi003 ~]# ibping 0x37
>>>>>>>> Pong from desxi004.(none) (Lid 55): time 0.111 ms
>>>>>>>> Pong from desxi004.(none) (Lid 55): time 0.189 ms
>>>>>>>> Pong from desxi004.(none) (Lid 55): time 0.189 ms
>>>>>>>> Pong from desxi004.(none) (Lid 55): time 0.179 ms
>>>>>>>> ^C
>>>>>>>> --- desxi004.(none) (Lid 55) ibping statistics ---
>>>>>>>> 4 packets transmitted, 4 received, 0% packet loss, time 3086 ms
>>>>>>>> rtt min/avg/max = 0.111/0.167/0.189 ms
>>>>>>>>
>>>>>>>> [root at desxi004 ~]# ibping 0x2d
>>>>>>>> Pong from desxi003.(none) (Lid 45): time 0.156 ms
>>>>>>>> Pong from desxi003.(none) (Lid 45): time 0.175 ms
>>>>>>>> Pong from desxi003.(none) (Lid 45): time 0.176 ms
>>>>>>>> ^C
>>>>>>>> --- desxi003.(none) (Lid 45) ibping statistics ---
>>>>>>>> 3 packets transmitted, 3 received, 0% packet loss, time 2302 ms
>>>>>>>> rtt min/avg/max = 0.156/0.169/0.176 ms
>>>>>>>>
>>>>>>>> When I ping the IPoIB address, tcpdump only shows the outgoing ARP
>>>>>>>> request.
>>>>>>>>
>>>>>>>> [root at desxi003 ~]# tcpdump -i ib0
>>>>>>>> tcpdump: verbose output suppressed, use -v or -vv for full protocol
>>>>>>>> decode
>>>>>>>> listening on ib0, link-type LINUX_SLL (Linux cooked), capture size
>>>>>>>> 65535 bytes
>>>>>>>> 19:00:08.950320 ARP, Request who-has 192.168.9.4 tell 192.168.9.3,
>>>>>>>> length 56
>>>>>>>> 19:00:09.950320 ARP, Request who-has 192.168.9.4 tell 192.168.9.3,
>>>>>>>> length 56
>>>>>>>> 19:00:10.950307 ARP, Request who-has 192.168.9.4 tell 192.168.9.3,
>>>>>>>> length 56
>>>>>>>>
>>>>>>>> Running tcpdump on the rack servers, I don't see the ARP request
>>>>>>>> arriving there, which I should.
>>>>>>>>
>>>>>>>> From what I've read, ib0 should be mapped to the first port and
>>>>>>>> ib1 should be mapped to the second port. We have one IB card with two
>>>>>>>> ports. The modprobe config is the default installed with the Mellanox drivers.
>>>>>>>>
>>>>>>>> [root at desxi003 etc]# cat modprobe.d/ib_ipoib.conf
>>>>>>>> # install ib_ipoib modprobe --ignore-install ib_ipoib &&
>>>>>>>> /sbin/ib_ipoib_sysctl load
>>>>>>>> # remove ib_ipoib /sbin/ib_ipoib_sysctl unload ; modprobe -r
>>>>>>>> --ignore-remove ib_ipoib
>>>>>>>> alias ib0 ib_ipoib
>>>>>>>> alias ib1 ib_ipoib
>>>>>>>>
>>>>>>>> Can you give me some pointers on digging into the device layer to
>>>>>>>> make sure IPoIB is connected correctly? Would I look in /sys or /proc for
>>>>>>>> that?
>>>>>>>>
>>>>>>>> Dell has not been able to replicate the problem in their environment;
>>>>>>>> they only support Red Hat and won't work with my CentOS live CD. These
>>>>>>>> blades don't have internal hard drives, which makes it hard to install
>>>>>>>> any OS. I don't know if I can engage Mellanox, since they build the
>>>>>>>> switch hardware and driver stack we are using.
>>>>>>>>
>>>>>>>> I really appreciate all the help you guys have given thus far;
>>>>>>>> I'm learning a lot as this progresses. I'm reading through
>>>>>>>> https://tools.ietf.org/html/rfc4391 trying to understand IPoIB
>>>>>>>> from top to bottom.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>>
>>>>>>>> Robert LeBlanc
>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>> Brigham Young University
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Oct 28, 2013 at 12:53 PM, Coulter, Susan K <skc at lanl.gov> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> If you are not seeing any packets leave the ib0 interface, it
>>>>>>>>> sounds like the emulation layer is not connected to the right device.
>>>>>>>>>
>>>>>>>>> If the ib_ipoib kernel module is loaded and a simple native IB test
>>>>>>>>> (like ib_read_bw) works between those blades, you need to dig into the
>>>>>>>>> device layer and ensure IPoIB is "connected" to the right device.
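>>>>>>>>>
>>>>>>>>> A minimal sketch of such a test, assuming the perftest package is on
>>>>>>>>> both blades (the perftest tools exchange setup info over an out-of-band
>>>>>>>>> TCP connection, so point the client at an address that already works,
>>>>>>>>> e.g. the management Ethernet):
>>>>>>>>>
>>>>>>>>> # on one blade (server side)
>>>>>>>>> ib_read_bw
>>>>>>>>> # on the other blade (client side)
>>>>>>>>> ib_read_bw <server mgmt IP>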
>>>>>>>>>
>>>>>>>>> Do you have more than 1 IB card?
>>>>>>>>> What does your modprobe config look like for ipoib?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Oct 28, 2013, at 12:38 PM, Robert LeBlanc <
>>>>>>>>> robert_leblanc at byu.edu>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> These ESX hosts (2 blade servers and 2 rack servers) are booted
>>>>>>>>> into a CentOS 6.2 Live CD that I built. Right now everything I'm trying to
>>>>>>>>> get working is CentOS 6.2. All of our other hosts are running ESXi and have
>>>>>>>>> IPoIB interfaces, but none of them are configured and I'm not trying to get
>>>>>>>>> those working right now.
>>>>>>>>>
>>>>>>>>> Ideally, we would like our ESX hosts to communicate with each
>>>>>>>>> other for vMotion and protected VM traffic as well as with our Commvault
>>>>>>>>> backup servers (Windows) over IPoIB (or Oracle's PVI which is very similar).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Robert LeBlanc
>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>> Brigham Young University
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Oct 28, 2013 at 12:33 PM, Hal Rosenstock <
>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Are those ESXi IPoIB interfaces ? Do some of these work and
>>>>>>>>>> others not ? Are there normal Linux IPoIB interfaces ? Do they work ?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 28, 2013 at 2:24 PM, Robert LeBlanc <
>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>
>>>>>>>>>>> Yes, I cannot ping them over the IPoIB interface. It is a very
>>>>>>>>>>> simple network set-up.
>>>>>>>>>>>
>>>>>>>>>>> desxi003
>>>>>>>>>>> 8: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc
>>>>>>>>>>> pfifo_fast state UP qlen 256
>>>>>>>>>>> link/infiniband
>>>>>>>>>>> 80:20:00:54:fe:80:00:00:00:00:00:00:f0:4d:a2:90:97:78:e7:d1 brd
>>>>>>>>>>> 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
>>>>>>>>>>> inet 192.168.9.3/24 brd 192.168.9.255 scope global ib0
>>>>>>>>>>> inet6 fe80::f24d:a290:9778:e7d1/64 scope link
>>>>>>>>>>> valid_lft forever preferred_lft forever
>>>>>>>>>>>
>>>>>>>>>>> desxi004
>>>>>>>>>>> 8: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc
>>>>>>>>>>> pfifo_fast state UP qlen 256
>>>>>>>>>>> link/infiniband
>>>>>>>>>>> 80:20:00:54:fe:80:00:00:00:00:00:00:f0:4d:a2:90:97:78:e7:15 brd
>>>>>>>>>>> 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
>>>>>>>>>>> inet 192.168.9.4/24 brd 192.168.9.255 scope global ib0
>>>>>>>>>>> inet6 fe80::f24d:a290:9778:e715/64 scope link
>>>>>>>>>>> valid_lft forever preferred_lft forever
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Oct 28, 2013 at 12:22 PM, Hal Rosenstock <
>>>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> So these 2 hosts have trouble talking IPoIB to each other ?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Oct 28, 2013 at 2:16 PM, Robert LeBlanc <
>>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I was just wondering about that. It seems reasonable that the
>>>>>>>>>>>>> broadcast traffic would go over multicast but that, effectively, channels
>>>>>>>>>>>>> would be created for node-to-node communication; otherwise the entire
>>>>>>>>>>>>> multicast group would be limited to 10 Gbps (in this instance), which
>>>>>>>>>>>>> doesn't scale very well.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The things I've read about IPoIB performance tuning seem
>>>>>>>>>>>>> pretty vague, and the changes most people recommend are already in
>>>>>>>>>>>>> place on the systems I'm using. Some people said to try a newer version
>>>>>>>>>>>>> of Ubuntu, but ultimately I have very little control over VMware. Once I
>>>>>>>>>>>>> can get the Linux machines to communicate over IPoIB between the racks and
>>>>>>>>>>>>> blades, then I'm going to turn my attention to performance optimization.
>>>>>>>>>>>>> It doesn't seem to make much sense to spend time there when it is not
>>>>>>>>>>>>> working at all for most machines.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I've run ibtracert between the two nodes; is that what you
>>>>>>>>>>>>> mean by walking the route?
>>>>>>>>>>>>>
>>>>>>>>>>>>> [root at desxi003 ~]# ibtracert -m 0xc000 0x2d 0x37
>>>>>>>>>>>>> From ca 0xf04da2909778e7d0 port 1 lid 45-45 "localhost HCA-1"
>>>>>>>>>>>>> [1] -> switch 0x2c90200448ec8[17] lid 51 "Infiniscale-IV
>>>>>>>>>>>>> Mellanox Technologies"
>>>>>>>>>>>>> [18] -> ca 0xf04da2909778e714[1] lid 55 "localhost HCA-1"
>>>>>>>>>>>>> To ca 0xf04da2909778e714 port 1 lid 55-55 "localhost HCA-1"
>>>>>>>>>>>>>
>>>>>>>>>>>>> [root at desxi004 ~]# ibtracert -m 0xc000 0x37 0x2d
>>>>>>>>>>>>> From ca 0xf04da2909778e714 port 1 lid 55-55 "localhost HCA-1"
>>>>>>>>>>>>> [1] -> switch 0x2c90200448ec8[18] lid 51 "Infiniscale-IV
>>>>>>>>>>>>> Mellanox Technologies"
>>>>>>>>>>>>> [17] -> ca 0xf04da2909778e7d0[1] lid 45 "localhost HCA-1"
>>>>>>>>>>>>> To ca 0xf04da2909778e7d0 port 1 lid 45-45 "localhost HCA-1"
>>>>>>>>>>>>>
>>>>>>>>>>>>> As you can see, the route is on the same switch; the blades
>>>>>>>>>>>>> are right next to each other.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 12:05 PM, Hal Rosenstock <
>>>>>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Which mystery is explained ? The 10 Gbps is a multicast-only
>>>>>>>>>>>>>> limit and does not apply to unicast. The BW limitation you're seeing
>>>>>>>>>>>>>> is due to other factors. There's been much written about IPoIB performance.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If all the MC members are joined and routed, then the IPoIB
>>>>>>>>>>>>>> connectivity problem is some other issue. Are you sure this is the case ? Did
>>>>>>>>>>>>>> you walk the route between 2 nodes where you have a connectivity issue ?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 1:58 PM, Robert LeBlanc <
>>>>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Well, that explains one mystery, now I need to figure out
>>>>>>>>>>>>>>> why it seems the Dell blades are not passing the traffic.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 11:51 AM, Hal Rosenstock <
>>>>>>>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yes, that's the IPoIB IPv4 broadcast group for the
>>>>>>>>>>>>>>>> default (0xffff) partition. The 0x80 part of the mtu and rate values just
>>>>>>>>>>>>>>>> means "is equal to"; mtu 0x04 is 2K (2048) and rate 0x3 is 10 Gb/sec. These
>>>>>>>>>>>>>>>> are indeed the defaults.
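>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Spelling out that decoding (my reading of the encoding): 0x84 = 0x80
>>>>>>>>>>>>>>>> ("equal to") + 0x04 (MTU code for 2048 bytes), and 0x83 = 0x80 ("equal
>>>>>>>>>>>>>>>> to") + 0x03 (rate code for 10 Gb/sec).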
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 1:45 PM, Robert LeBlanc <
>>>>>>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The info for that MGID is:
>>>>>>>>>>>>>>>>> MCMemberRecord group dump:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> MGID....................ff12:401b:ffff::ffff:ffff
>>>>>>>>>>>>>>>>> Mlid....................0xC000
>>>>>>>>>>>>>>>>> Mtu.....................0x84
>>>>>>>>>>>>>>>>> pkey....................0xFFFF
>>>>>>>>>>>>>>>>> Rate....................0x83
>>>>>>>>>>>>>>>>> SL......................0x0
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I don't understand the MTU and Rate values (0x84 and 0x83,
>>>>>>>>>>>>>>>>> i.e. 132 and 131 decimal). When I run iperf between the two hosts over IPoIB
>>>>>>>>>>>>>>>>> in connected mode with MTU 65520, even with multiple threads, the sum is
>>>>>>>>>>>>>>>>> still 10 Gbps.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 11:40 AM, Hal Rosenstock <
>>>>>>>>>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> saquery -g should show what MGID is mapped to MLID
>>>>>>>>>>>>>>>>>> 0xc000 and the group parameters.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> When you say 10 Gbps max, is that multicast or unicast
>>>>>>>>>>>>>>>>>> ? That limit is only on the multicast.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 1:28 PM, Robert LeBlanc <
>>>>>>>>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Well, that can explain why I'm only able to get 10 Gbps
>>>>>>>>>>>>>>>>>>> max from the two hosts that are working.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I have tried updn and dnup and they didn't help
>>>>>>>>>>>>>>>>>>> either. I think the only thing that will help is Automatic Path Migration,
>>>>>>>>>>>>>>>>>>> as it tries very hard to route the alternative LIDs through different
>>>>>>>>>>>>>>>>>>> system GUIDs. I suspect it would require re-LIDing everything, which would
>>>>>>>>>>>>>>>>>>> mean an outage. I'm still trying to get answers from Oracle on whether that
>>>>>>>>>>>>>>>>>>> is even a possibility. I've tried seeding some of the algorithms with
>>>>>>>>>>>>>>>>>>> information like root nodes, etc., but none of them worked better.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The MLID 0xc000 exists and I can see all the nodes
>>>>>>>>>>>>>>>>>>> joined to the group using saquery. I've checked the route using ibtracert
>>>>>>>>>>>>>>>>>>> specifying the MLID. The only thing I'm not sure how to check is the group
>>>>>>>>>>>>>>>>>>> parameters. What tool would I use for that?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 11:16 AM, Hal Rosenstock <
>>>>>>>>>>>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Xsigo's SM is not "straight" OpenSM. They have some
>>>>>>>>>>>>>>>>>>>> proprietary enhancements and it may be based on an old vintage of OpenSM.
>>>>>>>>>>>>>>>>>>>> You will likely need to work with them/Oracle now on issues.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Lack of a partitions file does mean the default partition
>>>>>>>>>>>>>>>>>>>> and default rate (10 Gbps), so from what I saw all ports had sufficient
>>>>>>>>>>>>>>>>>>>> rate to join the MC group.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> There are certain topology requirements for running
>>>>>>>>>>>>>>>>>>>> various routing algorithms. Did you try updn or dnup ?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The key is determining whether the IPoIB broadcast
>>>>>>>>>>>>>>>>>>>> group is set up correctly. What MLID is the group built on (usually 0xc000)
>>>>>>>>>>>>>>>>>>>> ? What are the group parameters (rate, MTU) ? Are all members that are
>>>>>>>>>>>>>>>>>>>> running IPoIB joined ? Is the group routed to all such members ? There are
>>>>>>>>>>>>>>>>>>>> infiniband-diags for all of this.
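>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> For example, assuming standard infiniband-diags (adjust the LIDs to your
>>>>>>>>>>>>>>>>>>>> fabric):
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> saquery -g                          # MGID/MLID mapping and group rate/MTU
>>>>>>>>>>>>>>>>>>>> saquery MCMR                        # all multicast member records (joins)
>>>>>>>>>>>>>>>>>>>> ibtracert -m 0xc000 <slid> <dlid>   # verify the group is routed between members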
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 12:19 PM, Robert LeBlanc <
>>>>>>>>>>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> OpenSM (the SM runs on Xsigo so they manage it) is
>>>>>>>>>>>>>>>>>>>>> using minhop. I've loaded the ibnetdiscover output into ibsim and run all
>>>>>>>>>>>>>>>>>>>>> the different routing algorithms against it with and without scatter ports.
>>>>>>>>>>>>>>>>>>>>> Minhop had 50% of our hosts running all paths through a single IS5030
>>>>>>>>>>>>>>>>>>>>> switch (at least for the LIDs we need, which represent the Ethernet and
>>>>>>>>>>>>>>>>>>>>> Fibre Channel cards the hosts should communicate with). Ftree, dor, and
>>>>>>>>>>>>>>>>>>>>> dfsssp fell back to minhop; the others routed more paths through the same
>>>>>>>>>>>>>>>>>>>>> IS5030, in some cases increasing the number of hosts with a single point
>>>>>>>>>>>>>>>>>>>>> of failure to 75%.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> As far as I can tell there is no partitions.conf
>>>>>>>>>>>>>>>>>>>>> file so I assume we are using the default partition. There is an
>>>>>>>>>>>>>>>>>>>>> opensm.opts file, but it only specifies logging information.
>>>>>>>>>>>>>>>>>>>>> # SA database file name
>>>>>>>>>>>>>>>>>>>>> sa_db_file /var/log/opensm-sa.dump
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> # If TRUE causes OpenSM to dump SA database at the
>>>>>>>>>>>>>>>>>>>>> end of
>>>>>>>>>>>>>>>>>>>>> # every light sweep, regardless of the verbosity level
>>>>>>>>>>>>>>>>>>>>> sa_db_dump TRUE
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> # The directory to hold the file OpenSM dumps
>>>>>>>>>>>>>>>>>>>>> dump_files_dir /var/log/
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> The SM node is:
>>>>>>>>>>>>>>>>>>>>> xsigoa:/opt/xsigo/xsigos/current/ofed/etc# ibaddr
>>>>>>>>>>>>>>>>>>>>> GID fe80::13:9702:100:979 LID start 0x1 end 0x1
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> We do have Switch-X in two of the Dell m1000e
>>>>>>>>>>>>>>>>>>>>> chassis, but the cards on ports 17-32 are FDR10 (the switch may be straight
>>>>>>>>>>>>>>>>>>>>> FDR, but I'm not 100% sure). The IS5030 switches, which the Switch-X are
>>>>>>>>>>>>>>>>>>>>> connected to, are QDR; the switches in the Xsigo directors are QDR, but the
>>>>>>>>>>>>>>>>>>>>> Ethernet and Fibre Channel cards are DDR. The DDR cards will not be running
>>>>>>>>>>>>>>>>>>>>> IPoIB (at least to my knowledge they don't have the ability); only the hosts
>>>>>>>>>>>>>>>>>>>>> should be leveraging IPoIB. I hope that clears up some of your questions.
>>>>>>>>>>>>>>>>>>>>> If you have more, I will try to answer them.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 9:57 AM, Hal Rosenstock <
>>>>>>>>>>>>>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> What routing algorithm is configured in OpenSM ?
>>>>>>>>>>>>>>>>>>>>>> What does your partitions.conf file look like ? Which node is your OpenSM ?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Also, I only see QDR and DDR links although you have
>>>>>>>>>>>>>>>>>>>>>> Switch-X, so I assume all FDR ports are connected to slower (QDR) devices.
>>>>>>>>>>>>>>>>>>>>>> I don't see any FDR-10 ports, but maybe they're also connected to QDR ports
>>>>>>>>>>>>>>>>>>>>>> and so show up as QDR in the topology.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> There are DDR CAs in Xsigo box but not sure whether
>>>>>>>>>>>>>>>>>>>>>> or not they run IPoIB.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> -- Hal
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Sun, Oct 27, 2013 at 9:46 PM, Robert LeBlanc <
>>>>>>>>>>>>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Since you guys are amazingly helpful, I thought I
>>>>>>>>>>>>>>>>>>>>>>> would pick your brains on a new problem.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> We have two Xsigo directors cross-connected to
>>>>>>>>>>>>>>>>>>>>>>> four Mellanox IS5030 switches. Connected to those we have four Dell m1000e
>>>>>>>>>>>>>>>>>>>>>>> chassis each with two IB switches (two chassis have QDR and two have
>>>>>>>>>>>>>>>>>>>>>>> FDR10). We have 9 dual-port rack servers connected to the IS5030 switches.
>>>>>>>>>>>>>>>>>>>>>>> For testing purposes we have an additional Dell m1000e QDR chassis
>>>>>>>>>>>>>>>>>>>>>>> connected to one Xsigo director and two dual-port FDR10 rack servers
>>>>>>>>>>>>>>>>>>>>>>> connected to the other Xsigo director.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I can get IPoIB to work between the two test rack
>>>>>>>>>>>>>>>>>>>>>>> servers connected to the one Xsigo director. But I cannot get IPoIB to
>>>>>>>>>>>>>>>>>>>>>>> work from any blade, either to a blade right next to it or to the working
>>>>>>>>>>>>>>>>>>>>>>> rack servers. I'm using the exact same live CentOS ISO on all four servers.
>>>>>>>>>>>>>>>>>>>>>>> I've checked opensm and the blades have joined the multicast group 0xc000
>>>>>>>>>>>>>>>>>>>>>>> properly. tcpdump basically says that traffic is not leaving the blades.
>>>>>>>>>>>>>>>>>>>>>>> tcpdump also shows no traffic entering the blades from the rack servers. An
>>>>>>>>>>>>>>>>>>>>>>> ibtracert using the 0xc000 MLID shows that routing exists between hosts.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I've read about MulticastFDBTop=0xBFFF but I don't
>>>>>>>>>>>>>>>>>>>>>>> know how to set it and I doubt it would have been set by default.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Anyone have some ideas on troubleshooting steps to
>>>>>>>>>>>>>>>>>>>>>>> try? I think Google is tired of me asking questions about it.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>>>>>>>>> Users mailing list
>>>>>>>>>>>>>>>>>>>>>>> Users at lists.openfabrics.org
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/users
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Users mailing list
>>>>>>>>> Users at lists.openfabrics.org
>>>>>>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/users
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ====================================
>>>>>>>>>
>>>>>>>>> Susan Coulter
>>>>>>>>> HPC-3 Network/Infrastructure
>>>>>>>>> 505-667-8425
>>>>>>>>> Increase the Peace...
>>>>>>>>> An eye for an eye leaves the whole world blind
>>>>>>>>> ====================================
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> ====================================
>>>>>>>>
>>>>>>>> Susan Coulter
>>>>>>>> HPC-3 Network/Infrastructure
>>>>>>>> 505-667-8425
>>>>>>>> Increase the Peace...
>>>>>>>> An eye for an eye leaves the whole world blind
>>>>>>>> ====================================
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>