[Users] Weird IPoIB issue
Hal Rosenstock
hal.rosenstock at gmail.com
Mon Oct 28 11:53:03 PDT 2013
Doesn't look to me like this is an SM issue at this point. Some host/ESXi
debug is needed.
On Mon, Oct 28, 2013 at 2:38 PM, Robert LeBlanc <robert_leblanc at byu.edu> wrote:
> These ESX hosts (2 blade server and 2 rack servers) are booted into a
> CentOS 6.2 Live CD that I built. Right now everything I'm trying to get
> working is CentOS 6.2. All of our other hosts are running ESXi and have
> IPoIB interfaces, but none of them are configured and I'm not trying to get
> those working right now.
>
> Ideally, we would like our ESX hosts to communicate with each other for
> vMotion and protected VM traffic as well as with our Commvault backup
> servers (Windows) over IPoIB (or Oracle's PVI which is very similar).
>
>
> Robert LeBlanc
> OIT Infrastructure & Virtualization Engineer
> Brigham Young University
>
>
> On Mon, Oct 28, 2013 at 12:33 PM, Hal Rosenstock <hal.rosenstock at gmail.com
> > wrote:
>
>> Are those ESXi IPoIB interfaces ? Do some of these work and others not ?
>> Are there normal Linux IPoIB interfaces ? Do they work ?
>>
>>
>> On Mon, Oct 28, 2013 at 2:24 PM, Robert LeBlanc <robert_leblanc at byu.edu> wrote:
>>
>>> Yes, I cannot ping them over the IPoIB interface. It is a very simple
>>> network setup.
>>>
>>> desxi003
>>> 8: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc pfifo_fast
>>> state UP qlen 256
>>> link/infiniband
>>> 80:20:00:54:fe:80:00:00:00:00:00:00:f0:4d:a2:90:97:78:e7:d1 brd
>>> 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
>>> inet 192.168.9.3/24 brd 192.168.9.255 scope global ib0
>>> inet6 fe80::f24d:a290:9778:e7d1/64 scope link
>>> valid_lft forever preferred_lft forever
>>>
>>> desxi004
>>> 8: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc pfifo_fast
>>> state UP qlen 256
>>> link/infiniband
>>> 80:20:00:54:fe:80:00:00:00:00:00:00:f0:4d:a2:90:97:78:e7:15 brd
>>> 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
>>> inet 192.168.9.4/24 brd 192.168.9.255 scope global ib0
>>> inet6 fe80::f24d:a290:9778:e715/64 scope link
>>> valid_lft forever preferred_lft forever
>>>
>>>
>>>
>>> Robert LeBlanc
>>> OIT Infrastructure & Virtualization Engineer
>>> Brigham Young University
>>>
>>>
>>> On Mon, Oct 28, 2013 at 12:22 PM, Hal Rosenstock <
>>> hal.rosenstock at gmail.com> wrote:
>>>
>>>> So these 2 hosts have trouble talking IPoIB to each other ?
>>>>
>>>>
>>>> On Mon, Oct 28, 2013 at 2:16 PM, Robert LeBlanc <robert_leblanc at byu.edu
>>>> > wrote:
>>>>
>>>>> I was just wondering about that. It seems reasonable that the
>>>>> broadcast traffic would go over multicast, but that channels would
>>>>> effectively be created for node-to-node communication; otherwise the
>>>>> entire multicast group would be limited to 10 Gbps (in this instance).
>>>>> That doesn't scale very well.
>>>>>
>>>>> The things I've read about IPoIB performance tuning seem pretty
>>>>> vague, and the changes most people recommend are already in place on
>>>>> the systems I'm using. Some people said to try a newer version of
>>>>> Ubuntu, but ultimately, I have very little control over VMware. Once I can
>>>>> get the Linux machines to communicate IPoIB between the racks and blades,
>>>>> then I'm going to turn my attention over to performance optimization. It
>>>>> doesn't seem to make much sense to spend time there when it is not working
>>>>> at all for most machines.
>>>>>
>>>>> I've done ibtracert between the two nodes, is that what you mean by
>>>>> walking the route?
>>>>>
>>>>> [root@desxi003 ~]# ibtracert -m 0xc000 0x2d 0x37
>>>>> From ca 0xf04da2909778e7d0 port 1 lid 45-45 "localhost HCA-1"
>>>>> [1] -> switch 0x2c90200448ec8[17] lid 51 "Infiniscale-IV Mellanox
>>>>> Technologies"
>>>>> [18] -> ca 0xf04da2909778e714[1] lid 55 "localhost HCA-1"
>>>>> To ca 0xf04da2909778e714 port 1 lid 55-55 "localhost HCA-1"
>>>>>
>>>>> [root@desxi004 ~]# ibtracert -m 0xc000 0x37 0x2d
>>>>> From ca 0xf04da2909778e714 port 1 lid 55-55 "localhost HCA-1"
>>>>> [1] -> switch 0x2c90200448ec8[18] lid 51 "Infiniscale-IV Mellanox
>>>>> Technologies"
>>>>> [17] -> ca 0xf04da2909778e7d0[1] lid 45 "localhost HCA-1"
>>>>> To ca 0xf04da2909778e7d0 port 1 lid 45-45 "localhost HCA-1"
>>>>>
>>>>> As you can see, the route is on the same switch, the blades are right
>>>>> next to each other.
>>>>>
>>>>>
>>>>> Robert LeBlanc
>>>>> OIT Infrastructure & Virtualization Engineer
>>>>> Brigham Young University
>>>>>
>>>>>
>>>>> On Mon, Oct 28, 2013 at 12:05 PM, Hal Rosenstock <
>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>
>>>>>> Which mystery is explained ? The 10 Gbps is a multicast only limit
>>>>>> and does not apply to unicast. The BW limitation you're seeing is due to
>>>>>> other factors. There's been much written about IPoIB performance.
>>>>>>
>>>>>> If all the MC members are joined and routed, then the IPoIB
>>>>>> connectivity issue is some other issue. Are you sure this is the case ? Did
>>>>>> you walk the route between 2 nodes where you have a connectivity issue ?
>>>>>>
>>>>>>
>>>>>> On Mon, Oct 28, 2013 at 1:58 PM, Robert LeBlanc <
>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>
>>>>>>> Well, that explains one mystery, now I need to figure out why it
>>>>>>> seems the Dell blades are not passing the traffic.
>>>>>>>
>>>>>>>
>>>>>>> Robert LeBlanc
>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>> Brigham Young University
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Oct 28, 2013 at 11:51 AM, Hal Rosenstock <
>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>
>>>>>>>> Yes, that's the IPoIB IPv4 broadcast group for the default (0xffff)
>>>>>>>> partition. The 0x80 part of mtu and rate just means "is equal to". mtu
>>>>>>>> 0x04 is 2K (2048) and rate 0x3 is 10 Gb/sec. These are indeed the defaults.
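The packing Hal describes (a selector in the top two bits, the value in the low six, per the IBA MCMemberRecord layout) can be sketched in Python. This decoder is illustrative only and not part of infiniband-diags; the enum tables are the common IBA encodings:

```python
# Hypothetical decoder for the packed Mtu and Rate bytes in an SA
# MCMemberRecord dump: top two bits are a selector (2 = "is equal to"),
# low six bits are the encoded value.

MTU_BYTES = {1: 256, 2: 512, 3: 1024, 4: 2048, 5: 4096}
RATE_GBPS = {2: 2.5, 3: 10, 4: 30, 5: 5, 6: 20, 7: 40, 8: 60, 9: 80, 10: 120}
SELECTOR = {0: "greater than", 1: "less than", 2: "equal to", 3: "largest"}

def decode_mtu(field):
    """Decode an MCMemberRecord Mtu byte such as 0x84."""
    return SELECTOR[field >> 6], MTU_BYTES[field & 0x3F]

def decode_rate(field):
    """Decode an MCMemberRecord Rate byte such as 0x83."""
    return SELECTOR[field >> 6], RATE_GBPS[field & 0x3F]

print(decode_mtu(0x84))   # ('equal to', 2048)
print(decode_rate(0x83))  # ('equal to', 10)
```

This is why 0x84 and 0x83 look odd in decimal: they are not plain numbers but selector+value pairs.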
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Oct 28, 2013 at 1:45 PM, Robert LeBlanc <
>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>
>>>>>>>>> The info for that MGID is:
>>>>>>>>> MCMemberRecord group dump:
>>>>>>>>> MGID....................ff12:401b:ffff::ffff:ffff
>>>>>>>>> Mlid....................0xC000
>>>>>>>>> Mtu.....................0x84
>>>>>>>>> pkey....................0xFFFF
>>>>>>>>> Rate....................0x83
>>>>>>>>> SL......................0x0
>>>>>>>>>
>>>>>>>>> I don't understand the MTU and Rate (132 and 131 decimal). When I
>>>>>>>>> run iperf between the two hosts over IPoIB in connected mode with
>>>>>>>>> MTU 65520, the sum is still 10 Gbps even with multiple threads.
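For reference, the kind of multi-stream test described here can be run with classic iperf2; the address below is the ib0 address of desxi004 from earlier in the thread:

```shell
# On desxi004 (192.168.9.4), start the iperf server:
iperf -s

# On desxi003, drive four parallel TCP streams over the IPoIB interface:
iperf -c 192.168.9.4 -P 4
```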
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Robert LeBlanc
>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>> Brigham Young University
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Oct 28, 2013 at 11:40 AM, Hal Rosenstock <
>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> saquery -g should show what MGID is mapped to MLID 0xc000 and the
>>>>>>>>>> group parameters.
>>>>>>>>>>
>>>>>>>>>> When you say 10 Gbps max, is that multicast or unicast ? That
>>>>>>>>>> limit is only on the multicast.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 28, 2013 at 1:28 PM, Robert LeBlanc <
>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>
>>>>>>>>>>> Well, that can explain why I'm only able to get 10 Gbps max from
>>>>>>>>>>> the two hosts that are working.
>>>>>>>>>>>
>>>>>>>>>>> I have tried updn and dnup and they didn't help either. I think
>>>>>>>>>>> the only thing that will help is Automatic Path Migration, as it
>>>>>>>>>>> tries very hard to route the alternative LIDs through different
>>>>>>>>>>> system GUIDs. I suspect it would require re-LIDing everything,
>>>>>>>>>>> which would mean an outage. I'm still trying to get answers from
>>>>>>>>>>> Oracle on whether that is even a possibility. I've tried seeding
>>>>>>>>>>> some of the algorithms with information like root nodes, etc.,
>>>>>>>>>>> but none of them worked better.
>>>>>>>>>>>
>>>>>>>>>>> The MLID 0xc000 exists and I can see all the nodes joined to the
>>>>>>>>>>> group using saquery. I've checked the route using ibtracert specifying the
>>>>>>>>>>> MLID. The only thing I'm not sure how to check is the group parameters.
>>>>>>>>>>> What tool would I use for that?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Oct 28, 2013 at 11:16 AM, Hal Rosenstock <
>>>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Xsigo's SM is not "straight" OpenSM. They have some proprietary
>>>>>>>>>>>> enhancements and it may be based on old vintage of OpenSM. You will likely
>>>>>>>>>>>> need to work with them/Oracle now on issues.
>>>>>>>>>>>>
>>>>>>>>>>>> Lack of a partitions file does mean the default partition and
>>>>>>>>>>>> default rate (10 Gbps), so from what I saw all ports had
>>>>>>>>>>>> sufficient rate to join the MC group.
>>>>>>>>>>>>
>>>>>>>>>>>> There are certain topology requirements for running various
>>>>>>>>>>>> routing algorithms. Did you try updn or dnup ?
>>>>>>>>>>>>
>>>>>>>>>>>> The key is determining whether the IPoIB broadcast group is
>>>>>>>>>>>> setup correctly. What MLID is the group built on (usually 0xc000) ? What
>>>>>>>>>>>> are the group parameters (rate, MTU) ? Are all members that are running
>>>>>>>>>>>> IPoIB joined ? Is the group routed to all such members ? There are
>>>>>>>>>>>> infiniband-diags for all of this.
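The checks Hal lists can all be done with infiniband-diags; a sketch (the command names are real, while the MLID and LIDs are the ones from this thread and should be replaced with your own):

```shell
# 1. Dump all multicast groups known to the SA, including the IPv4
#    broadcast group for the default partition, and their MTU/rate:
saquery -g

# 2. List the members joined to the broadcast group's MLID:
saquery -m 0xc000

# 3. Walk the multicast route for that MLID between two hosts, by LID:
ibtracert -m 0xc000 0x2d 0x37
```

If a host shows up in step 2 but step 3 finds no route to it, the group is joined but not routed, which matches the symptom of silent IPoIB blackholing.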
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Oct 28, 2013 at 12:19 PM, Robert LeBlanc <
>>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> OpenSM (the SM runs on Xsigo so they manage it) is using
>>>>>>>>>>>>> minhop. I've loaded the ibnetdiscover output into ibsim and run all the
>>>>>>>>>>>>> different routing algorithms against it with and without scatter ports.
>>>>>>>>>>>>> Minhop had 50% of our hosts running all paths through a single IS5030
>>>>>>>>>>>>> switch (at least the LIDs we need which represent Ethernet and Fibre
>>>>>>>>>>>>> Channel cards the hosts should communicate with). Ftree, dor, and
>>>>>>>>>>>>> dfsssp fell back to minhop; the others routed more paths through the
>>>>>>>>>>>>> same IS5030, in some cases increasing the share of hosts with a
>>>>>>>>>>>>> single point of failure to 75%.
>>>>>>>>>>>>>
>>>>>>>>>>>>> As far as I can tell there is no partitions.conf file so I
>>>>>>>>>>>>> assume we are using the default partition. There is an opensm.opts file,
>>>>>>>>>>>>> but it only specifies logging information.
>>>>>>>>>>>>> # SA database file name
>>>>>>>>>>>>> sa_db_file /var/log/opensm-sa.dump
>>>>>>>>>>>>>
>>>>>>>>>>>>> # If TRUE causes OpenSM to dump SA database at the end of
>>>>>>>>>>>>> # every light sweep, regardless of the verbosity level
>>>>>>>>>>>>> sa_db_dump TRUE
>>>>>>>>>>>>>
>>>>>>>>>>>>> # The directory to hold the file OpenSM dumps
>>>>>>>>>>>>> dump_files_dir /var/log/
>>>>>>>>>>>>>
>>>>>>>>>>>>> The SM node is:
>>>>>>>>>>>>> xsigoa:/opt/xsigo/xsigos/current/ofed/etc# ibaddr
>>>>>>>>>>>>> GID fe80::13:9702:100:979 LID start 0x1 end 0x1
>>>>>>>>>>>>>
>>>>>>>>>>>>> We do have Switch-X in two of the Dell m1000e chassis but the
>>>>>>>>>>>>> cards, ports 17-32, are FDR10 (the switch may be straight FDR, but I'm not
>>>>>>>>>>>>> 100% sure). The IS5030 are QDR which the Switch-X are connected to, the
>>>>>>>>>>>>> switches in the Xsigo directors are QDR, but the Ethernet and Fibre Channel
>>>>>>>>>>>>> cards are DDR. The DDR cards will not be running IPoIB (at least to my
>>>>>>>>>>>>> knowledge they don't have the ability), only the hosts should be leveraging
>>>>>>>>>>>>> IPoIB. I hope that clears up some of your questions. If you have more, I
>>>>>>>>>>>>> will try to answer them.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 9:57 AM, Hal Rosenstock <
>>>>>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> What routing algorithm is configured in OpenSM ? What does
>>>>>>>>>>>>>> your partitions.conf file look like ? Which node is your OpenSM ?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Also, I only see QDR and DDR links although you have Switch-X
>>>>>>>>>>>>>> so I assume all FDR ports are connected to slower (QDR) devices. I don't
>>>>>>>>>>>>>> see any FDR-10 ports but maybe they're also connected to QDR ports so show
>>>>>>>>>>>>>> up as QDR in the topology.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> There are DDR CAs in Xsigo box but not sure whether or not
>>>>>>>>>>>>>> they run IPoIB.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -- Hal
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sun, Oct 27, 2013 at 9:46 PM, Robert LeBlanc <
>>>>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Since you guys are amazingly helpful, I thought I would pick
>>>>>>>>>>>>>>> your brains on a new problem.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We have two Xsigo directors cross connected to four Mellanox
>>>>>>>>>>>>>>> IS5030 switches. Connected to those we have four Dell m1000e chassis each
>>>>>>>>>>>>>>> with two IB switches (two chassis have QDR and two have FDR10). We have 9
>>>>>>>>>>>>>>> dual-port rack servers connected to the IS5030 switches. For testing
>>>>>>>>>>>>>>> purposes we have an additional Dell m1000e QDR chassis connected to one
>>>>>>>>>>>>>>> Xsigo director and two dual-port FDR10 rack servers connected to the other
>>>>>>>>>>>>>>> Xsigo director.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I can get IPoIB to work between the two test rack servers
>>>>>>>>>>>>>>> connected to the one Xsigo director. But I cannot get IPoIB to work
>>>>>>>>>>>>>>> between any blades either right next to each other or to the working rack
>>>>>>>>>>>>>>> servers. I'm using the same exact live CentOS ISO on all four servers. I've
>>>>>>>>>>>>>>> checked opensm and the blades have joined the multicast group 0xc000
>>>>>>>>>>>>>>> properly. tcpdump basically says that traffic is not leaving the blades.
>>>>>>>>>>>>>>> tcpdump also shows no traffic entering the blades from the rack servers. An
>>>>>>>>>>>>>>> ibtracert using 0xc000 mlid shows that routing exists between hosts.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I've read about MulticastFDBTop=0xBFFF but I don't know how
>>>>>>>>>>>>>>> to set it and I doubt it would have been set by default.
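One way to rule out a capped multicast forwarding table is to read the switch's SwitchInfo with smpquery from infiniband-diags. A sketch; LID 51 is the switch seen in the ibtracert output later in the thread, and the exact field names may vary by diags version:

```shell
# Query SwitchInfo on the switch at LID 51 and look at the FDB fields:
smpquery switchinfo 51 | grep -i fdb

# A nonzero MulticastFDBTop below 0xC000 would mean the switch is not
# forwarding the 0xc000 broadcast group at all.
```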
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Anyone have some ideas on troubleshooting steps to try? I
>>>>>>>>>>>>>>> think Google is tired of me asking questions about it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>> Users mailing list
>>>>>>>>>>>>>>> Users at lists.openfabrics.org
>>>>>>>>>>>>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/users
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>