[Users] Weird IPoIB issue

Hal Rosenstock hal.rosenstock at gmail.com
Mon Oct 28 11:22:08 PDT 2013


So these two hosts have trouble talking IPoIB to each other?


On Mon, Oct 28, 2013 at 2:16 PM, Robert LeBlanc <robert_leblanc at byu.edu> wrote:

> I was just wondering about that. It seems reasonable that the broadcast
> traffic would go over multicast, but that channels would effectively be
> created for node-to-node communication; otherwise the entire multicast group
> would be limited to 10 Gbps (in this instance) for the whole group, and that
> doesn't scale very well.
>
> The things I've read about IPoIB performance tuning seem pretty vague,
> and the changes most people recommend are already in place on the
> systems I'm using. Some people suggested trying a newer version of Ubuntu,
> but ultimately I have very little control over VMware. Once I can get the
> Linux machines to communicate over IPoIB between the racks and blades, I'll
> turn my attention to performance optimization. It doesn't seem to make much
> sense to spend time there when it is not working at all for most machines.
>
> I've run ibtracert between the two nodes; is that what you mean by
> walking the route?
>
> [root at desxi003 ~]# ibtracert -m 0xc000 0x2d 0x37
> From ca 0xf04da2909778e7d0 port 1 lid 45-45 "localhost HCA-1"
> [1] -> switch 0x2c90200448ec8[17] lid 51 "Infiniscale-IV Mellanox
> Technologies"
> [18] -> ca 0xf04da2909778e714[1] lid 55 "localhost HCA-1"
> To ca 0xf04da2909778e714 port 1 lid 55-55 "localhost HCA-1"
>
> [root at desxi004 ~]# ibtracert -m 0xc000 0x37 0x2d
> From ca 0xf04da2909778e714 port 1 lid 55-55 "localhost HCA-1"
> [1] -> switch 0x2c90200448ec8[18] lid 51 "Infiniscale-IV Mellanox
> Technologies"
> [17] -> ca 0xf04da2909778e7d0[1] lid 45 "localhost HCA-1"
> To ca 0xf04da2909778e7d0 port 1 lid 45-45 "localhost HCA-1"
>
> As you can see, the route stays on the same switch; the blades are right
> next to each other.
>
>
> Robert LeBlanc
> OIT Infrastructure & Virtualization Engineer
> Brigham Young University
>
>
> On Mon, Oct 28, 2013 at 12:05 PM, Hal Rosenstock <hal.rosenstock at gmail.com> wrote:
>
>> Which mystery is explained? The 10 Gbps figure is a multicast-only limit
>> and does not apply to unicast. The bandwidth limitation you're seeing is due
>> to other factors. There's been much written about IPoIB performance.
>>
>> If all the MC members are joined and routed, then the IPoIB connectivity
>> problem is something else. Are you sure that's the case? Did you walk the
>> route between two nodes where you have a connectivity issue?
>>
>>
>> On Mon, Oct 28, 2013 at 1:58 PM, Robert LeBlanc <robert_leblanc at byu.edu> wrote:
>>
>>> Well, that explains one mystery; now I need to figure out why the
>>> Dell blades don't seem to be passing the traffic.
>>>
>>>
>>> Robert LeBlanc
>>> OIT Infrastructure & Virtualization Engineer
>>> Brigham Young University
>>>
>>>
>>> On Mon, Oct 28, 2013 at 11:51 AM, Hal Rosenstock <
>>> hal.rosenstock at gmail.com> wrote:
>>>
>>>> Yes, that's the IPoIB IPv4 broadcast group for the default (0xffff)
>>>> partition. The 0x80 part of the MTU and rate just means "is equal to". MTU
>>>> 0x04 is 2K (2048 bytes) and rate 0x3 is 10 Gb/sec. These are indeed the
>>>> defaults.
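>>>>
>>>> To decode these yourself: the selector is the top two bits of the field
>>>> and the value is the low six bits, so a quick shell-arithmetic sketch is:
>>>>
>>>> $ printf 'selector=%d value=%d\n' $(( (0x84 >> 6) & 0x3 )) $(( 0x84 & 0x3f ))
>>>> selector=2 value=4
>>>>
>>>> Selector 2 means "equal to" and MTU code 4 means 2048 bytes; the same
>>>> decoding applied to rate 0x83 gives rate code 3, i.e. 10 Gb/sec.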
>>>>
>>>>
>>>> On Mon, Oct 28, 2013 at 1:45 PM, Robert LeBlanc <robert_leblanc at byu.edu> wrote:
>>>>
>>>>> The info for that MGID is:
>>>>> MCMemberRecord group dump:
>>>>>                 MGID....................ff12:401b:ffff::ffff:ffff
>>>>>                 Mlid....................0xC000
>>>>>                 Mtu.....................0x84
>>>>>                 pkey....................0xFFFF
>>>>>                 Rate....................0x83
>>>>>                 SL......................0x0
>>>>>
>>>>> I don't understand the MTU and Rate values (132 and 131 decimal). I've
>>>>> been running iperf between the two hosts over IPoIB in connected mode with
>>>>> an MTU of 65520. I've tried multiple threads, but the sum is still 10 Gbps.
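>>>>>
>>>>> For reference, the test looks something like this (the address is just a
>>>>> placeholder for the other host's IPoIB interface IP):
>>>>>
>>>>> [root at desxi004 ~]# iperf -s
>>>>> [root at desxi003 ~]# iperf -c 10.0.0.4 -P 4 -t 30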
>>>>>
>>>>>
>>>>> Robert LeBlanc
>>>>> OIT Infrastructure & Virtualization Engineer
>>>>> Brigham Young University
>>>>>
>>>>>
>>>>> On Mon, Oct 28, 2013 at 11:40 AM, Hal Rosenstock <
>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>
>>>>>> saquery -g should show what MGID is mapped to MLID 0xc000 and the
>>>>>> group parameters.
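>>>>>>
>>>>>> For example, something like this picks out the 0xC000 group:
>>>>>>
>>>>>> # saquery -g | grep -B 2 -A 4 0xC000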
>>>>>>
>>>>>> When you say 10 Gbps max, is that multicast or unicast? That limit
>>>>>> applies only to multicast.
>>>>>>
>>>>>>
>>>>>> On Mon, Oct 28, 2013 at 1:28 PM, Robert LeBlanc <
>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>
>>>>>>> Well, that could explain why I'm only able to get 10 Gbps max between
>>>>>>> the two hosts that are working.
>>>>>>>
>>>>>>> I have tried updn and dnup and they didn't help either. I think the
>>>>>>> only thing that will help is Automatic Path Migration, as it tries very hard
>>>>>>> to route the alternative LIDs through different system GUIDs. I suspect it
>>>>>>> would require re-LIDing everything, which would mean an outage. I'm still
>>>>>>> trying to get an answer from Oracle about whether that is even a possibility.
>>>>>>> I've tried seeding some of the algorithms with information like root nodes,
>>>>>>> etc., but none of them worked better.
>>>>>>>
>>>>>>> The MLID 0xc000 exists and I can see all the nodes joined to the
>>>>>>> group using saquery. I've checked the route using ibtracert, specifying the
>>>>>>> MLID. The only thing I'm not sure how to check is the group parameters.
>>>>>>> What tool would I use for that?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>>
>>>>>>> Robert LeBlanc
>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>> Brigham Young University
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Oct 28, 2013 at 11:16 AM, Hal Rosenstock <
>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>
>>>>>>>> Xsigo's SM is not "straight" OpenSM. They have some proprietary
>>>>>>>> enhancements and it may be based on an old vintage of OpenSM. You will
>>>>>>>> likely need to work with them/Oracle now on issues.
>>>>>>>>
>>>>>>>> Lack of a partitions file does mean the default partition and default
>>>>>>>> rate (10 Gbps), so from what I saw all ports had sufficient rate to join
>>>>>>>> the MC group.
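>>>>>>>>
>>>>>>>> If you ever wanted to raise the broadcast group rate, the usual
>>>>>>>> mechanism is a partitions.conf entry along these lines (a sketch; check
>>>>>>>> the opensm documentation for your version's exact syntax):
>>>>>>>>
>>>>>>>> Default=0x7fff, ipoib, rate=7, mtu=5 : ALL=full;
>>>>>>>>
>>>>>>>> Here rate=7 encodes 40 Gb/sec and mtu=5 encodes 4096 bytes.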
>>>>>>>>
>>>>>>>> There are certain topology requirements for running various routing
>>>>>>>> algorithms. Did you try updn or dnup?
>>>>>>>>
>>>>>>>> The key is determining whether the IPoIB broadcast group is set up
>>>>>>>> correctly. What MLID is the group built on (usually 0xc000)? What are the
>>>>>>>> group parameters (rate, MTU)? Are all members that are running IPoIB
>>>>>>>> joined? Is the group routed to all such members? There are
>>>>>>>> infiniband-diags for all of this.
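>>>>>>>>
>>>>>>>> A rough sketch of that checklist (<slid> and <dlid> are placeholders
>>>>>>>> for the LIDs of the two endpoints being checked):
>>>>>>>>
>>>>>>>> # saquery -g                          # group parameters (MGID, MLID, rate, MTU)
>>>>>>>> # saquery -m 0xc000                   # members joined to the group
>>>>>>>> # ibtracert -m 0xc000 <slid> <dlid>   # multicast route between two members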
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Oct 28, 2013 at 12:19 PM, Robert LeBlanc <
>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>
>>>>>>>>> OpenSM (the SM runs on Xsigo so they manage it) is using minhop.
>>>>>>>>> I've loaded the ibnetdiscover output into ibsim and run all the different
>>>>>>>>> routing algorithms against it, with and without scatter ports. Minhop had
>>>>>>>>> 50% of our hosts running all paths through a single IS5030 switch (at least
>>>>>>>>> the LIDs we need, which represent the Ethernet and Fibre Channel cards the
>>>>>>>>> hosts should communicate with). Ftree, dor, and dfsssp fell back to minhop;
>>>>>>>>> the others routed more paths through the same IS5030, in some cases
>>>>>>>>> increasing the number of hosts with a single point of failure to 75%.
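>>>>>>>>>
>>>>>>>>> Roughly the pattern I used for the simulation (the file name is just
>>>>>>>>> an example):
>>>>>>>>>
>>>>>>>>> # ibsim fabric.topo &
>>>>>>>>> # LD_PRELOAD=libumad2sim.so opensm -R updn -o
>>>>>>>>>
>>>>>>>>> where fabric.topo is the saved ibnetdiscover output and -o makes
>>>>>>>>> opensm exit after one sweep so the resulting routing can be inspected.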
>>>>>>>>>
>>>>>>>>> As far as I can tell there is no partitions.conf file, so I assume
>>>>>>>>> we are using the default partition. There is an opensm.opts file, but it
>>>>>>>>> only specifies logging information:
>>>>>>>>> # SA database file name
>>>>>>>>> sa_db_file /var/log/opensm-sa.dump
>>>>>>>>>
>>>>>>>>> # If TRUE causes OpenSM to dump SA database at the end of
>>>>>>>>> # every light sweep, regardless of the verbosity level
>>>>>>>>> sa_db_dump TRUE
>>>>>>>>>
>>>>>>>>> # The directory to hold the file OpenSM dumps
>>>>>>>>> dump_files_dir /var/log/
>>>>>>>>>
>>>>>>>>> The SM node is:
>>>>>>>>> xsigoa:/opt/xsigo/xsigos/current/ofed/etc# ibaddr
>>>>>>>>> GID fe80::13:9702:100:979 LID start 0x1 end 0x1
>>>>>>>>>
>>>>>>>>> We do have Switch-X in two of the Dell m1000e chassis, but the
>>>>>>>>> cards, on ports 17-32, are FDR10 (the switch itself may be straight FDR,
>>>>>>>>> but I'm not 100% sure). The IS5030s that the Switch-X switches connect to
>>>>>>>>> are QDR, and the switches in the Xsigo directors are QDR, but the Ethernet
>>>>>>>>> and Fibre Channel cards are DDR. The DDR cards will not be running IPoIB
>>>>>>>>> (at least to my knowledge they don't have the ability); only the hosts
>>>>>>>>> should be leveraging IPoIB. I hope that clears up some of your questions.
>>>>>>>>> If you have more, I will try to answer them.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Robert LeBlanc
>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>> Brigham Young University
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Oct 28, 2013 at 9:57 AM, Hal Rosenstock <
>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> What routing algorithm is configured in OpenSM? What does your
>>>>>>>>>> partitions.conf file look like? Which node is your OpenSM?
>>>>>>>>>>
>>>>>>>>>> Also, I only see QDR and DDR links, although you have Switch-X, so
>>>>>>>>>> I assume all FDR ports are connected to slower (QDR) devices. I don't see
>>>>>>>>>> any FDR-10 ports, but maybe they're also connected to QDR ports and so
>>>>>>>>>> show up as QDR in the topology.
>>>>>>>>>>
>>>>>>>>>> There are DDR CAs in the Xsigo box, but I'm not sure whether or not
>>>>>>>>>> they run IPoIB.
>>>>>>>>>>
>>>>>>>>>> -- Hal
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sun, Oct 27, 2013 at 9:46 PM, Robert LeBlanc <
>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>
>>>>>>>>>>> Since you guys are amazingly helpful, I thought I would pick
>>>>>>>>>>> your brains on a new problem.
>>>>>>>>>>>
>>>>>>>>>>> We have two Xsigo directors cross connected to four Mellanox
>>>>>>>>>>> IS5030 switches. Connected to those we have four Dell m1000e chassis each
>>>>>>>>>>> with two IB switches (two chassis have QDR and two have FDR10). We have 9
>>>>>>>>>>> dual-port rack servers connected to the IS5030 switches. For testing
>>>>>>>>>>> purposes we have an additional Dell m1000e QDR chassis connected to one
>>>>>>>>>>> Xsigo director and two dual-port FDR10 rack servers connected to the other
>>>>>>>>>>> Xsigo director.
>>>>>>>>>>>
>>>>>>>>>>> I can get IPoIB to work between the two test rack servers
>>>>>>>>>>> connected to the one Xsigo director. But I cannot get IPoIB to work
>>>>>>>>>>> between any blades, either ones right next to each other or the working
>>>>>>>>>>> rack servers. I'm using the exact same live CentOS ISO on all four
>>>>>>>>>>> servers. I've checked opensm, and the blades have joined the multicast
>>>>>>>>>>> group 0xc000 properly. tcpdump basically says that traffic is not leaving
>>>>>>>>>>> the blades. tcpdump also shows no traffic entering the blades from the
>>>>>>>>>>> rack servers. An ibtracert using the 0xc000 MLID shows that routing
>>>>>>>>>>> exists between hosts.
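>>>>>>>>>>>
>>>>>>>>>>> (I'm watching with something along the lines of
>>>>>>>>>>>
>>>>>>>>>>> # tcpdump -i ib0 -n icmp
>>>>>>>>>>>
>>>>>>>>>>> on each side while pinging across, so "not leaving" means no ICMP
>>>>>>>>>>> requests show up on the sender's ib0.)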
>>>>>>>>>>>
>>>>>>>>>>> I've read about MulticastFDBTop=0xBFFF, but I don't know how to
>>>>>>>>>>> set it and I doubt it would have been set by default.
>>>>>>>>>>>
>>>>>>>>>>> Anyone have some ideas on troubleshooting steps to try? I think
>>>>>>>>>>> Google is tired of me asking questions about it.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>