[Users] Weird IPoIB issue

Robert LeBlanc robert_leblanc at byu.edu
Wed Nov 13 10:55:22 PST 2013


[root@desxi003 ~]# flint -d /dev/mst/SW_MT48438_0x2c90200448e28_lid-0x0034 q
Image type:      FS2
FW Version:      7.4.0
Device ID:       48438
Description:     Node             Sys image
GUIDs:           0002c90200448e28 0002c90200448e2b
Board ID:        n/a (DEL08D0110003)
VSD:             n/a
PSID:            DEL08D0110003



Robert LeBlanc
OIT Infrastructure & Virtualization Engineer
Brigham Young University


On Wed, Nov 13, 2013 at 11:52 AM, Hal Rosenstock
<hal.rosenstock at gmail.com>wrote:

> What's the latest firmware version ?
>
> Can you determine the firmware version of the switches ? vendstat -N
> <switch lid> might work to show this.
>
> This is important...
>
> Thanks.
>
> -- Hal
>
>
> On Wed, Nov 13, 2013 at 1:46 PM, Robert LeBlanc <robert_leblanc at byu.edu>wrote:
>
>> Thanks for all the help so far; this is a great community! I've fed all of
>> this info back to Oracle, and I'll have to see what they say.
>>
>> Thanks,
>>
>>
>> Robert LeBlanc
>> OIT Infrastructure & Virtualization Engineer
>> Brigham Young University
>>
>>
>> On Wed, Nov 13, 2013 at 11:40 AM, Hal Rosenstock <
>> hal.rosenstock at gmail.com> wrote:
>>
>>> Yes, this is the cause of the issues.
>>>
>>> smpdump (and smpquery) merely query (get) parameters and don't set them;
>>> in any case, the SM would overwrite the value whenever it decided it needed
>>> to update it. It's an SM and/or firmware issue.
>>>
>>>
>>> On Wed, Nov 13, 2013 at 1:38 PM, Robert LeBlanc <robert_leblanc at byu.edu>wrote:
>>>
>>>> We are on the latest version of firmware for all of our switches (as of
>>>> last month). I guess I'll have to check with Oracle and see if they are
>>>> setting this parameter in their subnet manager. Just to confirm, using
>>>> smpdump (or similar) to change the value won't do any good because the
>>>> subnet manager will just change it back?
>>>>
>>>> I think this is the cause of the problems, now to get it fixed.
>>>>
>>>> Thanks,
>>>>
>>>>
>>>> Robert LeBlanc
>>>> OIT Infrastructure & Virtualization Engineer
>>>> Brigham Young University
>>>>
>>>>
>>>> On Wed, Nov 13, 2013 at 11:34 AM, Hal Rosenstock <
>>>> hal.rosenstock at gmail.com> wrote:
>>>>
>>>>> In general, MulticastFDBTop should be 0 or some value above 0xc000.
>>>>>
>>>>>
>>>>> Indicates the upper bound of the range of the multicast forwarding table.
>>>>> Packets received with MLIDs greater than MulticastFDBTop are considered to
>>>>> be outside the range of the Multicast Forwarding Table (see 18.2.4.3.3
>>>>> Required Multicast Relay on page 1072). A valid MulticastFDBTop is less
>>>>> than MulticastFDBCap + 0xC000. This component applies only to switches
>>>>> that implement the optional multicast forwarding service. A switch shall
>>>>> ignore the MulticastFDBTop component if it has the value zero. The initial
>>>>> value for MulticastFDBTop shall be set to zero. A value of 0xBFFF means
>>>>> there are no MulticastForwardingTable entries.
>>>>>
>>>>> It is set by OpenSM. There is a parameter to disable its use (use_mfttop),
>>>>> which can be set to FALSE; whether it is available may depend on which
>>>>> OpenSM version you are running. In order to get out of this state, you may
>>>>> need to reset any switches that currently have this parameter set this way.
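>>>>>
>>>>> For illustration only, a minimal sketch of how that option is usually set on
>>>>> a stock OpenSM; the config file path, and whether Xsigo's SM exposes the
>>>>> option at all, are assumptions you would need to verify:
>>>>>
>>>>> # /etc/opensm/opensm.conf (location may differ on the Xsigo director)
>>>>> use_mfttop FALSE
>>>>>
>>>>> followed by restarting the SM (or forcing a heavy sweep) so it reprograms
>>>>> the switches.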
>>>>>
>>>>> Any idea on the firmware versions in your various switches ?
>>>>>
>>>>> -- Hal
>>>>>
>>>>>
>>>>> On Wed, Nov 13, 2013 at 1:16 PM, Robert LeBlanc <
>>>>> robert_leblanc at byu.edu> wrote:
>>>>>
>>>>>> Sorry to take so long, I've been busy with other things. Here is the
>>>>>> output:
>>>>>>
>>>>>> [root@desxi003 ~]# smpquery si 52
>>>>>> # Switch info: Lid 52
>>>>>> LinearFdbCap:....................49152
>>>>>> RandomFdbCap:....................0
>>>>>> McastFdbCap:.....................4096
>>>>>> LinearFdbTop:....................189
>>>>>> DefPort:.........................0
>>>>>> DefMcastPrimPort:................255
>>>>>> DefMcastNotPrimPort:.............255
>>>>>> LifeTime:........................18
>>>>>> StateChange:.....................0
>>>>>> OptSLtoVLMapping:................1
>>>>>> LidsPerPort:.....................0
>>>>>> PartEnforceCap:..................32
>>>>>> InboundPartEnf:..................1
>>>>>> OutboundPartEnf:.................1
>>>>>> FilterRawInbound:................1
>>>>>> FilterRawOutbound:...............1
>>>>>> EnhancedPort0:...................0
>>>>>> MulticastFDBTop:.................0xbfff
>>>>>> [root@desxi003 ~]# smpquery pi 52 0
>>>>>> # Port info: Lid 52 port 0
>>>>>> Mkey:............................0x0000000000000000
>>>>>> GidPrefix:.......................0xfe80000000000000
>>>>>> Lid:.............................52
>>>>>> SMLid:...........................49
>>>>>> CapMask:.........................0x42500848
>>>>>>                                 IsTrapSupported
>>>>>>                                 IsSLMappingSupported
>>>>>>                                 IsSystemImageGUIDsupported
>>>>>>                                 IsVendorClassSupported
>>>>>>                                  IsCapabilityMaskNoticeSupported
>>>>>>                                 IsClientRegistrationSupported
>>>>>>                                 IsMulticastFDBTopSupported
>>>>>> DiagCode:........................0x0000
>>>>>> MkeyLeasePeriod:.................0
>>>>>> LocalPort:.......................1
>>>>>> LinkWidthEnabled:................1X or 4X
>>>>>> LinkWidthSupported:..............1X or 4X
>>>>>> LinkWidthActive:.................4X
>>>>>> LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
>>>>>> LinkState:.......................Active
>>>>>> PhysLinkState:...................LinkUp
>>>>>> LinkDownDefState:................Polling
>>>>>> ProtectBits:.....................0
>>>>>> LMC:.............................0
>>>>>> LinkSpeedActive:.................10.0 Gbps
>>>>>> LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
>>>>>> NeighborMTU:.....................4096
>>>>>> SMSL:............................0
>>>>>> VLCap:...........................VL0
>>>>>> InitType:........................0x00
>>>>>> VLHighLimit:.....................0
>>>>>> VLArbHighCap:....................0
>>>>>> VLArbLowCap:.....................0
>>>>>> InitReply:.......................0x00
>>>>>> MtuCap:..........................4096
>>>>>> VLStallCount:....................0
>>>>>> HoqLife:.........................0
>>>>>> OperVLs:.........................VL0
>>>>>> PartEnforceInb:..................0
>>>>>> PartEnforceOutb:.................0
>>>>>> FilterRawInb:....................0
>>>>>> FilterRawOutb:...................0
>>>>>> MkeyViolations:..................0
>>>>>> PkeyViolations:..................0
>>>>>> QkeyViolations:..................0
>>>>>> GuidCap:.........................1
>>>>>> ClientReregister:................0
>>>>>> McastPkeyTrapSuppressionEnabled:.0
>>>>>> SubnetTimeout:...................18
>>>>>> RespTimeVal:.....................20
>>>>>> LocalPhysErr:....................0
>>>>>> OverrunErr:......................0
>>>>>> MaxCreditHint:...................0
>>>>>> RoundTrip:.......................0
>>>>>>
>>>>>> From what I've read in the Mellanox Release Notes,
>>>>>> MulticastFDBTop=0xBFFF is supposed to discard MC traffic. The question is,
>>>>>> how do I set this value to something else, and what should it be set to?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>>
>>>>>> Robert LeBlanc
>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>> Brigham Young University
>>>>>>
>>>>>>
>>>>>> On Wed, Oct 30, 2013 at 12:28 PM, Hal Rosenstock <
>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>
>>>>>>>  Determine LID of switch (in the below say switch is lid x)
>>>>>>> Then:
>>>>>>>
>>>>>>> smpquery si x
>>>>>>> (of interest are McastFdbCap and MulticastFDBTop)
>>>>>>>  smpquery pi x 0
>>>>>>> (of interest is CapMask)
>>>>>>> ibroute -M x
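>>>>>>>
>>>>>>> (If it helps, a quick way to list the switch LIDs is ibswitches from
>>>>>>> infiniband-diags; each line ends with the switch's base LID, e.g. lid 51
>>>>>>> for the Infiniscale-IV switch seen in the ibtracert output further down
>>>>>>> this thread. The exact output format can vary by version.)
>>>>>>>
>>>>>>> ibswitches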
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Oct 29, 2013 at 3:56 PM, Robert LeBlanc <
>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>
>>>>>>>> Both ports show up in the "saquery MCMR" results with a JoinState
>>>>>>>> of 0x1.
>>>>>>>>
>>>>>>>> How can I dump the parameters of a non-managed switch so that I can
>>>>>>>> confirm that multicast is not turned off on the Dell chassis IB switches?
>>>>>>>>
>>>>>>>>
>>>>>>>> Robert LeBlanc
>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>> Brigham Young University
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Oct 28, 2013 at 5:04 PM, Coulter, Susan K <skc at lanl.gov>wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>  /sys/class/net should give you the details on your devices, like
>>>>>>>>> this:
>>>>>>>>>
>>>>>>>>>  -bash-4.1# cd /sys/class/net
>>>>>>>>> -bash-4.1# ls -l
>>>>>>>>> total 0
>>>>>>>>> lrwxrwxrwx 1 root root 0 Oct 23 12:59 eth0 ->
>>>>>>>>> ../../devices/pci0000:00/0000:00:02.0/0000:04:00.0/net/eth0
>>>>>>>>> lrwxrwxrwx 1 root root 0 Oct 23 12:59 eth1 ->
>>>>>>>>> ../../devices/pci0000:00/0000:00:02.0/0000:04:00.1/net/eth1
>>>>>>>>> lrwxrwxrwx 1 root root 0 Oct 23 15:42 ib0 ->
>>>>>>>>> ../../devices/pci0000:40/0000:40:0c.0/0000:47:00.0/net/ib0
>>>>>>>>> lrwxrwxrwx 1 root root 0 Oct 23 15:42 ib1 ->
>>>>>>>>> ../../devices/pci0000:40/0000:40:0c.0/0000:47:00.0/net/ib1
>>>>>>>>> lrwxrwxrwx 1 root root 0 Oct 23 15:42 ib2 ->
>>>>>>>>> ../../devices/pci0000:c0/0000:c0:0c.0/0000:c7:00.0/net/ib2
>>>>>>>>> lrwxrwxrwx 1 root root 0 Oct 23 15:42 ib3 ->
>>>>>>>>> ../../devices/pci0000:c0/0000:c0:0c.0/0000:c7:00.0/net/ib3
>>>>>>>>>
>>>>>>>>>  Then use "lspci | grep Mell"  to get the pci device numbers.
>>>>>>>>>
>>>>>>>>>  47:00.0 Network controller: Mellanox Technologies MT26428
>>>>>>>>> [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
>>>>>>>>> c7:00.0 Network controller: Mellanox Technologies MT26428
>>>>>>>>> [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
>>>>>>>>>
>>>>>>>>>  In this example, ib0 and ib1 are referencing the device at 47:00.0,
>>>>>>>>> and ib2 and ib3 are referencing the device at c7:00.0.
>>>>>>>>>
>>>>>>>>>  That said, if you only have one card - this is probably not the
>>>>>>>>> problem.
>>>>>>>>> Additionally, since the arp requests are being seen going out ib0,
>>>>>>>>> your emulation appears to be working.
>>>>>>>>>
>>>>>>>>>  If those arp requests are not being seen on the other end, it
>>>>>>>>> seems like a problem with the mgids.
>>>>>>>>> Like maybe the port you are trying to reach is not in the IPoIB
>>>>>>>>> multicast group?
>>>>>>>>>
>>>>>>>>>  You can look at all the multicast member records with "saquery
>>>>>>>>> MCMR".
>>>>>>>>> Or - you can grep for mcmr_rcv_join_mgrp references in your SM
>>>>>>>>> logs …
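>>>>>>>>>
>>>>>>>>>  For example (a sketch only; the SM log path is an assumption and will
>>>>>>>>> be different on the Xsigo director, and <your port GID> is just a
>>>>>>>>> placeholder):
>>>>>>>>>
>>>>>>>>> saquery MCMR | grep -B2 -A8 <your port GID>
>>>>>>>>> grep mcmr_rcv_join_mgrp /var/log/opensm.log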
>>>>>>>>>
>>>>>>>>>  HTH
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  On Oct 28, 2013, at 1:08 PM, Robert LeBlanc <
>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>
>>>>>>>>>  I can ibping between both hosts just fine.
>>>>>>>>>
>>>>>>>>>  [root@desxi003 ~]# ibping 0x37
>>>>>>>>> Pong from desxi004.(none) (Lid 55): time 0.111 ms
>>>>>>>>> Pong from desxi004.(none) (Lid 55): time 0.189 ms
>>>>>>>>> Pong from desxi004.(none) (Lid 55): time 0.189 ms
>>>>>>>>> Pong from desxi004.(none) (Lid 55): time 0.179 ms
>>>>>>>>> ^C
>>>>>>>>> --- desxi004.(none) (Lid 55) ibping statistics ---
>>>>>>>>> 4 packets transmitted, 4 received, 0% packet loss, time 3086 ms
>>>>>>>>> rtt min/avg/max = 0.111/0.167/0.189 ms
>>>>>>>>>
>>>>>>>>>  [root@desxi004 ~]# ibping 0x2d
>>>>>>>>> Pong from desxi003.(none) (Lid 45): time 0.156 ms
>>>>>>>>> Pong from desxi003.(none) (Lid 45): time 0.175 ms
>>>>>>>>> Pong from desxi003.(none) (Lid 45): time 0.176 ms
>>>>>>>>> ^C
>>>>>>>>> --- desxi003.(none) (Lid 45) ibping statistics ---
>>>>>>>>> 3 packets transmitted, 3 received, 0% packet loss, time 2302 ms
>>>>>>>>> rtt min/avg/max = 0.156/0.169/0.176 ms
>>>>>>>>>
>>>>>>>>>  When I do an Ethernet ping to the IPoIB address, tcpdump only
>>>>>>>>> shows the outgoing ARP request.
>>>>>>>>>
>>>>>>>>>  [root@desxi003 ~]# tcpdump -i ib0
>>>>>>>>> tcpdump: verbose output suppressed, use -v or -vv for full
>>>>>>>>> protocol decode
>>>>>>>>> listening on ib0, link-type LINUX_SLL (Linux cooked), capture size
>>>>>>>>> 65535 bytes
>>>>>>>>> 19:00:08.950320 ARP, Request who-has 192.168.9.4 tell 192.168.9.3,
>>>>>>>>> length 56
>>>>>>>>> 19:00:09.950320 ARP, Request who-has 192.168.9.4 tell 192.168.9.3,
>>>>>>>>> length 56
>>>>>>>>> 19:00:10.950307 ARP, Request who-has 192.168.9.4 tell 192.168.9.3,
>>>>>>>>> length 56
>>>>>>>>>
>>>>>>>>>  Running tcpdump on the rack servers, I don't see the ARP request
>>>>>>>>> there, which I should.
>>>>>>>>>
>>>>>>>>>  From what I've read, ib0 should be mapped to the first port and
>>>>>>>>> ib1 should be mapped to the second port. We have one IB card with two
>>>>>>>>> ports. The modprobe is the default installed with the Mellanox drivers.
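>>>>>>>>>
>>>>>>>>>  (One way I can double-check that ib0/ib1-to-port mapping: the last 8
>>>>>>>>> bytes of the ibX link-layer address shown by "ip addr" are the port GUID,
>>>>>>>>> so they can be compared against the per-port GUIDs that ibstat reports.
>>>>>>>>> Newer kernels also expose the zero-based port index in
>>>>>>>>> /sys/class/net/ibX/dev_id, but I'm not certain that is present on this
>>>>>>>>> 6.2 live CD.)
>>>>>>>>>
>>>>>>>>> ibstat | grep -i 'port guid'
>>>>>>>>> ip addr show ib0 | grep link/infiniband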
>>>>>>>>>
>>>>>>>>>  [root@desxi003 etc]# cat modprobe.d/ib_ipoib.conf
>>>>>>>>> # install ib_ipoib modprobe --ignore-install ib_ipoib &&
>>>>>>>>> /sbin/ib_ipoib_sysctl load
>>>>>>>>> # remove ib_ipoib /sbin/ib_ipoib_sysctl unload ; modprobe -r
>>>>>>>>> --ignore-remove ib_ipoib
>>>>>>>>> alias ib0 ib_ipoib
>>>>>>>>> alias ib1 ib_ipoib
>>>>>>>>>
>>>>>>>>>  Can you give me some pointers on digging into the device layer
>>>>>>>>> to make sure IPoIB is connected correctly? Would I look in /sys or /proc
>>>>>>>>> for that?
>>>>>>>>>
>>>>>>>>>  Dell has not been able to replicate the problem in their
>>>>>>>>> environment; they only support Red Hat and won't work with my CentOS
>>>>>>>>> live CD. These blades don't have internal hard drives, which makes it
>>>>>>>>> hard to install any OS. I don't know if I can engage Mellanox, since they
>>>>>>>>> build the switch hardware and driver stack we are using.
>>>>>>>>>
>>>>>>>>>  I really appreciate all the help you guys have given thus far;
>>>>>>>>> I'm learning a lot as this progresses. I'm reading through
>>>>>>>>> https://tools.ietf.org/html/rfc4391 trying to understand IPoIB
>>>>>>>>> from top to bottom.
>>>>>>>>>
>>>>>>>>>  Thanks,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  Robert LeBlanc
>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>> Brigham Young University
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Oct 28, 2013 at 12:53 PM, Coulter, Susan K <skc at lanl.gov>wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  If you are not seeing any packets leave the ib0 interface, it
>>>>>>>>>> sounds like the emulation layer is not connected to the right device.
>>>>>>>>>>
>>>>>>>>>>  If the ib_ipoib kernel module is loaded, and a simple native IB
>>>>>>>>>> test (like ib_read_bw) works between those blades, you need to dig into
>>>>>>>>>> the device layer and ensure ipoib is "connected" to the right device.
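>>>>>>>>>>
>>>>>>>>>>  (A rough sketch of such a test with the perftest tools; the device
>>>>>>>>>> name mlx4_0 is an assumption, and the client side connects to the server
>>>>>>>>>> over any IP that already works, e.g. the management Ethernet:)
>>>>>>>>>>
>>>>>>>>>> # on blade A (server side)
>>>>>>>>>> ib_read_bw -d mlx4_0 -i 1
>>>>>>>>>> # on blade B (client side)
>>>>>>>>>> ib_read_bw -d mlx4_0 -i 1 <blade A's reachable IP>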
>>>>>>>>>>
>>>>>>>>>>  Do you have more than 1 IB card?
>>>>>>>>>> What does your modprobe config look like for ipoib?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>   On Oct 28, 2013, at 12:38 PM, Robert LeBlanc <
>>>>>>>>>> robert_leblanc at byu.edu>
>>>>>>>>>>   wrote:
>>>>>>>>>>
>>>>>>>>>>  These ESX hosts (2 blade servers and 2 rack servers) are booted
>>>>>>>>>> into a CentOS 6.2 Live CD that I built. Right now everything I'm trying to
>>>>>>>>>> get working is CentOS 6.2. All of our other hosts are running ESXi and have
>>>>>>>>>> IPoIB interfaces, but none of them are configured and I'm not trying to get
>>>>>>>>>> those working right now.
>>>>>>>>>>
>>>>>>>>>>  Ideally, we would like our ESX hosts to communicate with each
>>>>>>>>>> other for vMotion and protected VM traffic as well as with our Commvault
>>>>>>>>>> backup servers (Windows) over IPoIB (or Oracle's PVI which is very similar).
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  Robert LeBlanc
>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>> Brigham Young University
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 28, 2013 at 12:33 PM, Hal Rosenstock <
>>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Are those ESXi IPoIB interfaces ? Do some of these work and
>>>>>>>>>>> others not ? Are there normal Linux IPoIB interfaces ? Do they work ?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Oct 28, 2013 at 2:24 PM, Robert LeBlanc <
>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Yes, I cannot ping them over the IPoIB interface. It is a very
>>>>>>>>>>>> simple network setup.
>>>>>>>>>>>>
>>>>>>>>>>>>  desxi003
>>>>>>>>>>>>  8: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc
>>>>>>>>>>>> pfifo_fast state UP qlen 256
>>>>>>>>>>>>     link/infiniband
>>>>>>>>>>>> 80:20:00:54:fe:80:00:00:00:00:00:00:f0:4d:a2:90:97:78:e7:d1 brd
>>>>>>>>>>>> 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
>>>>>>>>>>>>     inet 192.168.9.3/24 brd 192.168.9.255 scope global ib0
>>>>>>>>>>>>     inet6 fe80::f24d:a290:9778:e7d1/64 scope link
>>>>>>>>>>>>        valid_lft forever preferred_lft forever
>>>>>>>>>>>>
>>>>>>>>>>>>  desxi004
>>>>>>>>>>>>  8: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc
>>>>>>>>>>>> pfifo_fast state UP qlen 256
>>>>>>>>>>>>     link/infiniband
>>>>>>>>>>>> 80:20:00:54:fe:80:00:00:00:00:00:00:f0:4d:a2:90:97:78:e7:15 brd
>>>>>>>>>>>> 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
>>>>>>>>>>>>     inet 192.168.9.4/24 brd 192.168.9.255 scope global ib0
>>>>>>>>>>>>     inet6 fe80::f24d:a290:9778:e715/64 scope link
>>>>>>>>>>>>        valid_lft forever preferred_lft forever
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>  Robert LeBlanc
>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>  On Mon, Oct 28, 2013 at 12:22 PM, Hal Rosenstock <
>>>>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> So these 2 hosts have trouble talking IPoIB to each other ?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 2:16 PM, Robert LeBlanc <
>>>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I was just wondering about that. It seems reasonable that the
>>>>>>>>>>>>>> broadcast traffic would go over multicast, but that channels would
>>>>>>>>>>>>>> effectively be created for node-to-node communication; otherwise the
>>>>>>>>>>>>>> entire multicast group would be limited to 10 Gbps (in this instance),
>>>>>>>>>>>>>> which doesn't scale very well.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  The things I've read about IPoIB performance tuning seem
>>>>>>>>>>>>>> pretty vague, and the changes most people recommend are already in
>>>>>>>>>>>>>> place on the systems I'm using. Some people suggested trying a newer
>>>>>>>>>>>>>> version of Ubuntu, but ultimately I have very little control over VMware.
>>>>>>>>>>>>>> Once I can get the Linux machines to communicate over IPoIB between the
>>>>>>>>>>>>>> racks and blades, I'm going to turn my attention to performance
>>>>>>>>>>>>>> optimization. It doesn't seem to make much sense to spend time there when
>>>>>>>>>>>>>> it is not working at all for most machines.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  I've done ibtracert between the two nodes; is that what you
>>>>>>>>>>>>>> mean by walking the route?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  [root@desxi003 ~]# ibtracert -m 0xc000 0x2d 0x37
>>>>>>>>>>>>>> From ca 0xf04da2909778e7d0 port 1 lid 45-45 "localhost HCA-1"
>>>>>>>>>>>>>> [1] -> switch 0x2c90200448ec8[17] lid 51 "Infiniscale-IV
>>>>>>>>>>>>>> Mellanox Technologies"
>>>>>>>>>>>>>> [18] -> ca 0xf04da2909778e714[1] lid 55 "localhost HCA-1"
>>>>>>>>>>>>>> To ca 0xf04da2909778e714 port 1 lid 55-55 "localhost HCA-1"
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  [root@desxi004 ~]# ibtracert -m 0xc000 0x37 0x2d
>>>>>>>>>>>>>> From ca 0xf04da2909778e714 port 1 lid 55-55 "localhost HCA-1"
>>>>>>>>>>>>>> [1] -> switch 0x2c90200448ec8[18] lid 51 "Infiniscale-IV
>>>>>>>>>>>>>> Mellanox Technologies"
>>>>>>>>>>>>>> [17] -> ca 0xf04da2909778e7d0[1] lid 45 "localhost HCA-1"
>>>>>>>>>>>>>> To ca 0xf04da2909778e7d0 port 1 lid 45-45 "localhost HCA-1"
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  As you can see, the route is on the same switch, the blades
>>>>>>>>>>>>>> are right next to each other.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  Robert LeBlanc
>>>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  On Mon, Oct 28, 2013 at 12:05 PM, Hal Rosenstock <
>>>>>>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  Which mystery is explained ? The 10 Gbps is a multicast
>>>>>>>>>>>>>>> only limit and does not apply to unicast. The BW limitation you're seeing
>>>>>>>>>>>>>>> is due to other factors. There's been much written about IPoIB performance.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If all the MC members are joined and routed, then the IPoIB
>>>>>>>>>>>>>>> connectivity issue is some other issue. Are you sure this is the case ? Did
>>>>>>>>>>>>>>> you walk the route between 2 nodes where you have a connectivity issue ?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 1:58 PM, Robert LeBlanc <
>>>>>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Well, that explains one mystery, now I need to figure out
>>>>>>>>>>>>>>>> why it seems the Dell blades are not passing the traffic.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  Robert LeBlanc
>>>>>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  On Mon, Oct 28, 2013 at 11:51 AM, Hal Rosenstock <
>>>>>>>>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  Yes, that's the IPoIB IPv4 broadcast group for the
>>>>>>>>>>>>>>>>> default (0xffff) partition. The 0x80 part of the mtu and rate values just
>>>>>>>>>>>>>>>>> means "is equal to"; mtu 0x04 is 2K (2048) and rate 0x3 is 10 Gb/sec. These
>>>>>>>>>>>>>>>>> are indeed the defaults.
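>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Decoding the two values from your dump below:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Mtu  0x84 = 0x80 (selector: "is equal to") | 0x04 -> 2048 bytes
>>>>>>>>>>>>>>>>> Rate 0x83 = 0x80 (selector: "is equal to") | 0x03 -> 10 Gb/sec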
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 1:45 PM, Robert LeBlanc <
>>>>>>>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The info for that MGID is:
>>>>>>>>>>>>>>>>>> MCMemberRecord group dump:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> MGID....................ff12:401b:ffff::ffff:ffff
>>>>>>>>>>>>>>>>>>                 Mlid....................0xC000
>>>>>>>>>>>>>>>>>>                 Mtu.....................0x84
>>>>>>>>>>>>>>>>>>                 pkey....................0xFFFF
>>>>>>>>>>>>>>>>>>                 Rate....................0x83
>>>>>>>>>>>>>>>>>>                 SL......................0x0
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  I don't understand the MTU and Rate values (130 and 131
>>>>>>>>>>>>>>>>>> decimal). When I run iperf between the two hosts over IPoIB in connected
>>>>>>>>>>>>>>>>>> mode with MTU 65520, I top out around 10 Gbps. I've tried multiple threads,
>>>>>>>>>>>>>>>>>> but the sum is still 10 Gbps.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  Robert LeBlanc
>>>>>>>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  On Mon, Oct 28, 2013 at 11:40 AM, Hal Rosenstock <
>>>>>>>>>>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>  saquery -g should show what MGID is mapped to MLID
>>>>>>>>>>>>>>>>>>> 0xc000 and the group parameters.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>  When you say 10 Gbps max, is that multicast or unicast
>>>>>>>>>>>>>>>>>>> ? That limit is only on the multicast.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 1:28 PM, Robert LeBlanc <
>>>>>>>>>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Well, that can explain why I'm only able to get 10 Gbps
>>>>>>>>>>>>>>>>>>>> max from the two hosts that are working.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>  I have tried updn and dnup and they didn't help
>>>>>>>>>>>>>>>>>>>> either. I think the only thing that will help is Automatic Path Migration,
>>>>>>>>>>>>>>>>>>>> as it tries very hard to route the alternative LIDs through different
>>>>>>>>>>>>>>>>>>>> systemguids. I suspect it would require re-LIDing everything, which would
>>>>>>>>>>>>>>>>>>>> mean an outage. I'm still trying to get answers from Oracle on whether that
>>>>>>>>>>>>>>>>>>>> is even a possibility. I've tried seeding some of the algorithms with
>>>>>>>>>>>>>>>>>>>> information like root nodes, etc., but none of them worked better.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>  The MLID 0xc000 exists and I can see all the nodes
>>>>>>>>>>>>>>>>>>>> joined to the group using saquery. I've checked the route using ibtracert
>>>>>>>>>>>>>>>>>>>> specifying the MLID. The only thing I'm not sure how to check is the group
>>>>>>>>>>>>>>>>>>>> parameters. What tool would I use for that?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>  Thanks,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>  Robert LeBlanc
>>>>>>>>>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>  On Mon, Oct 28, 2013 at 11:16 AM, Hal Rosenstock <
>>>>>>>>>>>>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>  Xsigo's SM is not "straight" OpenSM. They have some
>>>>>>>>>>>>>>>>>>>>> proprietary enhancements, and it may be based on an old vintage of OpenSM.
>>>>>>>>>>>>>>>>>>>>> You will likely need to work with them/Oracle now on issues.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Lack of a partitions file does mean the default partition
>>>>>>>>>>>>>>>>>>>>> and default rate (10 Gbps), so from what I saw all ports had sufficient
>>>>>>>>>>>>>>>>>>>>> rate to join the MC group.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> There are certain topology requirements for running
>>>>>>>>>>>>>>>>>>>>> various routing algorithms. Did you try updn or dnup ?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> The key is determining whether the IPoIB broadcast
>>>>>>>>>>>>>>>>>>>>> group is set up correctly. What MLID is the group built on (usually 0xc000)
>>>>>>>>>>>>>>>>>>>>> ? What are the group parameters (rate, MTU) ? Are all members that are
>>>>>>>>>>>>>>>>>>>>> running IPoIB joined ? Is the group routed to all such members ? There are
>>>>>>>>>>>>>>>>>>>>> infiniband-diags for all of this.
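>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> For instance (a rough sketch; exact options may vary with your diags version):
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> saquery -g                        # multicast group records (MGID, MLID, MTU, rate)
>>>>>>>>>>>>>>>>>>>>> saquery MCMR                      # multicast member records (who has joined)
>>>>>>>>>>>>>>>>>>>>> ibtracert -m 0xc000 <slid> <dlid> # walk the multicast route between two LIDs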
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 12:19 PM, Robert LeBlanc <
>>>>>>>>>>>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> OpenSM (the SM runs on the Xsigo, so they manage it) is
>>>>>>>>>>>>>>>>>>>>>> using minhop. I've loaded the ibnetdiscover output into ibsim and run all
>>>>>>>>>>>>>>>>>>>>>> the different routing algorithms against it, with and without scatter ports.
>>>>>>>>>>>>>>>>>>>>>> Minhop had 50% of our hosts running all paths through a single IS5030
>>>>>>>>>>>>>>>>>>>>>> switch (at least for the LIDs we need, which represent the Ethernet and
>>>>>>>>>>>>>>>>>>>>>> Fibre Channel cards the hosts should communicate with). Ftree, dor, and
>>>>>>>>>>>>>>>>>>>>>> dfsssp fell back to minhop; the others routed more paths through the same
>>>>>>>>>>>>>>>>>>>>>> IS5030, in some cases increasing the number of hosts with a single point of
>>>>>>>>>>>>>>>>>>>>>> failure to 75%.
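>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> (In case it's useful to anyone, the rough recipe for that kind of simulation;
>>>>>>>>>>>>>>>>>>>>>> the file name and preload path below are placeholders for whatever your ibsim
>>>>>>>>>>>>>>>>>>>>>> build installed: feed ibsim the ibnetdiscover output, then run opensm against
>>>>>>>>>>>>>>>>>>>>>> the simulated fabric through the umad2sim preload, choosing the routing engine
>>>>>>>>>>>>>>>>>>>>>> with -R:)
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> ibsim fabric.ibnetdiscover
>>>>>>>>>>>>>>>>>>>>>> LD_PRELOAD=/usr/lib64/umad2sim/libumad2sim.so opensm -R ftree -o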
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>  As far as I can tell there is no partitions.conf
>>>>>>>>>>>>>>>>>>>>>> file so I assume we are using the default partition. There is an
>>>>>>>>>>>>>>>>>>>>>> opensm.opts file, but it only specifies logging information.
>>>>>>>>>>>>>>>>>>>>>>  # SA database file name
>>>>>>>>>>>>>>>>>>>>>> sa_db_file /var/log/opensm-sa.dump
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>  # If TRUE causes OpenSM to dump SA database at the
>>>>>>>>>>>>>>>>>>>>>> end of
>>>>>>>>>>>>>>>>>>>>>> # every light sweep, regardless of the verbosity level
>>>>>>>>>>>>>>>>>>>>>> sa_db_dump TRUE
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>  # The directory to hold the file OpenSM dumps
>>>>>>>>>>>>>>>>>>>>>> dump_files_dir /var/log/
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>  The SM node is:
>>>>>>>>>>>>>>>>>>>>>>  xsigoa:/opt/xsigo/xsigos/current/ofed/etc# ibaddr
>>>>>>>>>>>>>>>>>>>>>> GID fe80::13:9702:100:979 LID start 0x1 end 0x1
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>  We do have Switch-X in two of the Dell m1000e
>>>>>>>>>>>>>>>>>>>>>> chassis, but the cards, ports 17-32, are FDR10 (the switch may be straight
>>>>>>>>>>>>>>>>>>>>>> FDR, but I'm not 100% sure). The IS5030 switches, which the Switch-X are
>>>>>>>>>>>>>>>>>>>>>> connected to, are QDR; the switches in the Xsigo directors are QDR, but the
>>>>>>>>>>>>>>>>>>>>>> Ethernet and Fibre Channel cards are DDR. The DDR cards will not be running
>>>>>>>>>>>>>>>>>>>>>> IPoIB (at least to my knowledge they don't have the ability); only the hosts
>>>>>>>>>>>>>>>>>>>>>> should be leveraging IPoIB. I hope that clears up some of your questions.
>>>>>>>>>>>>>>>>>>>>>> If you have more, I will try to answer them.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>  Robert LeBlanc
>>>>>>>>>>>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>  On Mon, Oct 28, 2013 at 9:57 AM, Hal Rosenstock <
>>>>>>>>>>>>>>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>  What routing algorithm is configured in OpenSM ?
>>>>>>>>>>>>>>>>>>>>>>> What does your partitions.conf file look like ? Which node is your OpenSM ?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Also, I only see QDR and DDR links although you have
>>>>>>>>>>>>>>>>>>>>>>> Switch-X so I assume all FDR ports are connected to slower (QDR) devices. I
>>>>>>>>>>>>>>>>>>>>>>> don't see any FDR-10 ports but maybe they're also connected to QDR ports so
>>>>>>>>>>>>>>>>>>>>>>> show up as QDR in the topology.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> There are DDR CAs in Xsigo box but not sure whether
>>>>>>>>>>>>>>>>>>>>>>> or not they run IPoIB.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> -- Hal
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>  On Sun, Oct 27, 2013 at 9:46 PM, Robert LeBlanc <
>>>>>>>>>>>>>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>  Since you guys are amazingly helpful, I thought I
>>>>>>>>>>>>>>>>>>>>>>>> would pick your brains in a new problem.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>  We have two Xsigo directors cross connected to
>>>>>>>>>>>>>>>>>>>>>>>> four Mellanox IS5030 switches. Connected to those we have four Dell m1000e
>>>>>>>>>>>>>>>>>>>>>>>> chassis each with two IB switches (two chassis have QDR and two have
>>>>>>>>>>>>>>>>>>>>>>>> FDR10). We have 9 dual-port rack servers connected to the IS5030 switches.
>>>>>>>>>>>>>>>>>>>>>>>> For testing purposes we have an additional Dell m1000e QDR chassis
>>>>>>>>>>>>>>>>>>>>>>>> connected to one Xsigo director and two dual-port FDR10 rack servers
>>>>>>>>>>>>>>>>>>>>>>>> connected to the other Xsigo director.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>  I can get IPoIB to work between the two test rack
>>>>>>>>>>>>>>>>>>>>>>>> servers connected to the one Xsigo director, but I cannot get IPoIB to
>>>>>>>>>>>>>>>>>>>>>>>> work from any blade, either to a blade right next to it or to the working
>>>>>>>>>>>>>>>>>>>>>>>> rack servers. I'm using the exact same live CentOS ISO on all four servers.
>>>>>>>>>>>>>>>>>>>>>>>> I've checked opensm and the blades have joined the multicast group 0xc000
>>>>>>>>>>>>>>>>>>>>>>>> properly. tcpdump basically says that traffic is not leaving the blades.
>>>>>>>>>>>>>>>>>>>>>>>> tcpdump also shows no traffic entering the blades from the rack servers. An
>>>>>>>>>>>>>>>>>>>>>>>> ibtracert using 0xc000 mlid shows that routing exists between hosts.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>  I've read about MulticastFDBTop=0xBFFF but I
>>>>>>>>>>>>>>>>>>>>>>>> don't know how to set it and I doubt it would have been set by default.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>  Anyone have some ideas on troubleshooting steps
>>>>>>>>>>>>>>>>>>>>>>>> to try? I think Google is tired of me asking questions about it.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>  Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>  Robert LeBlanc
>>>>>>>>>>>>>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>>>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  ====================================
>>>>>>>>>>
>>>>>>>>>>  Susan Coulter
>>>>>>>>>> HPC-3 Network/Infrastructure
>>>>>>>>>> 505-667-8425
>>>>>>>>>> Increase the Peace...
>>>>>>>>>> An eye for an eye leaves the whole world blind
>>>>>>>>>> ====================================
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  ====================================
>>>>>>>>>
>>>>>>>>>  Susan Coulter
>>>>>>>>> HPC-3 Network/Infrastructure
>>>>>>>>> 505-667-8425
>>>>>>>>> Increase the Peace...
>>>>>>>>> An eye for an eye leaves the whole world blind
>>>>>>>>> ====================================
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>