[Users] Weird IPoIB issue

Robert LeBlanc robert_leblanc at byu.edu
Mon Oct 28 10:58:04 PDT 2013


Well, that explains one mystery; now I need to figure out why the Dell
blades don't seem to be passing the traffic.


Robert LeBlanc
OIT Infrastructure & Virtualization Engineer
Brigham Young University


On Mon, Oct 28, 2013 at 11:51 AM, Hal Rosenstock
<hal.rosenstock at gmail.com> wrote:

> Yes, that's the IPoIB IPv4 broadcast group for the default (0xffff)
> partition. The 0x80 part of the MTU and rate just means "is equal to". MTU
> 0x04 is 2K (2048) and rate 0x3 is 10 Gb/sec. These are indeed the defaults.
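>
> If you want to sanity-check the encoding yourself: the top two bits of each
> field are the selector and the low six bits are the value. A quick
> illustration with shell arithmetic (bash):
>
>     # top two bits = selector, low six bits = value
>     echo $(( 0x84 >> 6 )) $(( 0x84 & 0x3f ))  # 2 4 -> exactly, MTU 2048
>     echo $(( 0x83 >> 6 )) $(( 0x83 & 0x3f ))  # 2 3 -> exactly, 10 Gb/sec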
>
>
> On Mon, Oct 28, 2013 at 1:45 PM, Robert LeBlanc <robert_leblanc at byu.edu> wrote:
>
>> The info for that MGID is:
>> MCMemberRecord group dump:
>>                 MGID....................ff12:401b:ffff::ffff:ffff
>>                 Mlid....................0xC000
>>                 Mtu.....................0x84
>>                 pkey....................0xFFFF
>>                 Rate....................0x83
>>                 SL......................0x0
>>
>> I don't understand the MTU and Rate (132 and 131 decimal). When I run iperf
>> between the two hosts over IPoIB in connected mode with MTU 65520, I've
>> tried multiple threads, but the sum is still 10 Gbps.
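>>
>> For reference, this is roughly what I'm checking and running on the two
>> hosts (the host name is a placeholder):
>>
>>     cat /sys/class/net/ib0/mode   # expect "connected"
>>     cat /sys/class/net/ib0/mtu    # expect 65520
>>     iperf -s                      # on one host
>>     iperf -c hostA -P 4           # on the other, four parallel streams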
>>
>>
>> Robert LeBlanc
>> OIT Infrastructure & Virtualization Engineer
>> Brigham Young University
>>
>>
>> On Mon, Oct 28, 2013 at 11:40 AM, Hal Rosenstock <
>> hal.rosenstock at gmail.com> wrote:
>>
>>> saquery -g should show what MGID is mapped to MLID 0xc000 and the group
>>> parameters.
>>>
>>> When you say 10 Gbps max, is that multicast or unicast ? That limit is
>>> only on the multicast.
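>>>
>>> Roughly like this (both are in infiniband-diags; exact output formatting
>>> varies a bit by version):
>>>
>>>     saquery -g   # multicast group records: MGID, MLID, MTU, rate, pkey
>>>     saquery -m   # multicast member records: which ports joined which group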
>>>
>>>
>>> On Mon, Oct 28, 2013 at 1:28 PM, Robert LeBlanc <robert_leblanc at byu.edu> wrote:
>>>
>>>> Well, that can explain why I'm only able to get 10 Gbps max from the
>>>> two hosts that are working.
>>>>
>>>> I have tried updn and dnup and they didn't help either. I think the only
>>>> thing that will help is Automatic Path Migration, as it tries very hard to
>>>> route the alternative LIDs through different system GUIDs. I suspect it
>>>> would require re-LIDing everything, which would mean an outage. I'm still
>>>> trying to get an answer from Oracle on whether that is even a possibility.
>>>> I've tried seeding some of the algorithms with information like root nodes,
>>>> etc., but none of them worked better.
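>>>>
>>>> (For reference, the kind of seeding I've been testing against the ibsim
>>>> copy of the fabric is the equivalent of this on a stock OpenSM; the file
>>>> path is hypothetical:
>>>>
>>>>     # root_guids.txt lists one root switch GUID per line
>>>>     opensm -R updn --root_guid_file /etc/opensm/root_guids.txt
>>>> )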
>>>>
>>>> The MLID 0xc000 exists and I can see all the nodes joined to the group
>>>> using saquery. I've checked the route using ibtracert specifying the MLID.
>>>> The only thing I'm not sure how to check is the group parameters. What tool
>>>> would I use for that?
>>>>
>>>> Thanks,
>>>>
>>>>
>>>> Robert LeBlanc
>>>> OIT Infrastructure & Virtualization Engineer
>>>> Brigham Young University
>>>>
>>>>
>>>> On Mon, Oct 28, 2013 at 11:16 AM, Hal Rosenstock <
>>>> hal.rosenstock at gmail.com> wrote:
>>>>
>>>>> Xsigo's SM is not "straight" OpenSM. They have some proprietary
>>>>> enhancements and it may be based on an old vintage of OpenSM. You will
>>>>> likely need to work with them/Oracle now on issues.
>>>>>
>>>>> Lack of a partitions file does mean the default partition and default rate
>>>>> (10 Gbps), so from what I saw all ports had sufficient rate to join the MC
>>>>> group.
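>>>>>
>>>>> For what it's worth, if you ever wanted a higher group rate it would go in
>>>>> partitions.conf on whichever node runs the SM; untested here, but the
>>>>> stock syntax looks something like:
>>>>>
>>>>>     Default=0x7fff, ipoib, rate=7, mtu=4 : ALL=full;
>>>>>
>>>>> (rate=7 is 40 Gb/sec and mtu=4 is 2048; every port that joins the
>>>>> broadcast group has to support whatever rate you pick, and this only
>>>>> governs the multicast group, not unicast bandwidth.)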
>>>>>
>>>>> There are certain topology requirements for running various routing
>>>>> algorithms. Did you try updn or dnup ?
>>>>>
>>>>> The key is determining whether the IPoIB broadcast group is set up
>>>>> correctly. What MLID is the group built on (usually 0xc000) ? What are the
>>>>> group parameters (rate, MTU) ? Are all members that are running IPoIB
>>>>> joined ? Is the group routed to all such members ? There are
>>>>> infiniband-diags for all of this.
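>>>>>
>>>>> For the "is it routed" part you can also dump a switch's multicast
>>>>> forwarding table directly (the switch LID is a placeholder):
>>>>>
>>>>>     ibroute -M <switch_lid>   # MFT dump; check the ports set for MLID 0xc000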
>>>>>
>>>>>
>>>>> On Mon, Oct 28, 2013 at 12:19 PM, Robert LeBlanc <
>>>>> robert_leblanc at byu.edu> wrote:
>>>>>
>>>>>> OpenSM (the SM runs on the Xsigo so they manage it) is using minhop. I've
>>>>>> loaded the ibnetdiscover output into ibsim and run all the different
>>>>>> routing algorithms against it, with and without scatter ports. Minhop had
>>>>>> 50% of our hosts running all paths through a single IS5030 switch (at
>>>>>> least for the LIDs we need, which represent the Ethernet and Fibre Channel
>>>>>> cards the hosts should communicate with). Ftree, dor, and dfsssp fell back
>>>>>> to minhop; the others routed more paths through the same IS5030, in some
>>>>>> cases raising the share of hosts with a single point of failure to 75%.
>>>>>>
>>>>>> As far as I can tell there is no partitions.conf file so I assume we
>>>>>> are using the default partition. There is an opensm.opts file, but it only
>>>>>> specifies logging information.
>>>>>> # SA database file name
>>>>>> sa_db_file /var/log/opensm-sa.dump
>>>>>>
>>>>>> # If TRUE causes OpenSM to dump SA database at the end of
>>>>>> # every light sweep, regardless of the verbosity level
>>>>>> sa_db_dump TRUE
>>>>>>
>>>>>> # The directory to hold the file OpenSM dumps
>>>>>> dump_files_dir /var/log/
>>>>>>
>>>>>> The SM node is:
>>>>>> xsigoa:/opt/xsigo/xsigos/current/ofed/etc# ibaddr
>>>>>> GID fe80::13:9702:100:979 LID start 0x1 end 0x1
>>>>>>
>>>>>> We do have Switch-X in two of the Dell m1000e chassis, but the cards on
>>>>>> ports 17-32 are FDR10 (the switch may be straight FDR, but I'm not 100%
>>>>>> sure). The IS5030s that the Switch-X switches connect to are QDR, and the
>>>>>> switches in the Xsigo directors are QDR, but the Ethernet and Fibre Channel
>>>>>> cards are DDR. The DDR cards will not be running IPoIB (at least to my
>>>>>> knowledge they don't have the ability); only the hosts should be leveraging
>>>>>> IPoIB. I hope that clears up some of your questions. If you have more, I
>>>>>> will try to answer them.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Robert LeBlanc
>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>> Brigham Young University
>>>>>>
>>>>>>
>>>>>> On Mon, Oct 28, 2013 at 9:57 AM, Hal Rosenstock <
>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>
>>>>>>> What routing algorithm is configured in OpenSM ? What does your
>>>>>>> partitions.conf file look like ? Which node is your OpenSM ?
>>>>>>>
>>>>>>> Also, I only see QDR and DDR links even though you have Switch-X, so I
>>>>>>> assume all FDR ports are connected to slower (QDR) devices. I don't see any
>>>>>>> FDR-10 ports, but maybe they're also connected to QDR ports and so show up
>>>>>>> as QDR in the topology.
>>>>>>>
>>>>>>> There are DDR CAs in the Xsigo box, but I'm not sure whether or not they
>>>>>>> run IPoIB.
>>>>>>>
>>>>>>> -- Hal
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Oct 27, 2013 at 9:46 PM, Robert LeBlanc <
>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>
>>>>>>>> Since you guys are amazingly helpful, I thought I would pick your
>>>>>>>> brains on a new problem.
>>>>>>>>
>>>>>>>> We have two Xsigo directors cross-connected to four Mellanox IS5030
>>>>>>>> switches. Connected to those we have four Dell m1000e chassis, each with
>>>>>>>> two IB switches (two chassis have QDR and two have FDR10). We have 9
>>>>>>>> dual-port rack servers connected to the IS5030 switches. For testing
>>>>>>>> purposes we have an additional Dell m1000e QDR chassis connected to one
>>>>>>>> Xsigo director and two dual-port FDR10 rack servers connected to the other
>>>>>>>> Xsigo director.
>>>>>>>>
>>>>>>>> I can get IPoIB to work between the two test rack servers connected
>>>>>>>> to the one Xsigo director. But I cannot get IPoIB to work between any
>>>>>>>> blades, either right next to each other or to the working rack servers.
>>>>>>>> I'm using the exact same live CentOS ISO on all four servers. I've checked
>>>>>>>> opensm, and the blades have joined the multicast group 0xc000 properly.
>>>>>>>> tcpdump basically says that traffic is not leaving the blades, and it also
>>>>>>>> shows no traffic entering the blades from the rack servers. An ibtracert
>>>>>>>> using the 0xc000 MLID shows that routing exists between the hosts.
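>>>>>>>>
>>>>>>>> (Concretely, the sort of thing I've been running on the blades, with the
>>>>>>>> interface name and LIDs as placeholders:
>>>>>>>>
>>>>>>>>     tcpdump -n -i ib0                              # while pinging from the far end
>>>>>>>>     ibtracert -m 0xc000 <blade_lid> <server_lid>
>>>>>>>> )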
>>>>>>>>
>>>>>>>> I've read about MulticastFDBTop=0xBFFF but I don't know how to set
>>>>>>>> it and I doubt it would have been set by default.
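>>>>>>>>
>>>>>>>> (My understanding is that MulticastFDBTop is a field in each switch's
>>>>>>>> SwitchInfo attribute, so something like the following should at least show
>>>>>>>> the per-switch FDB info, though I'm not sure every smpquery version decodes
>>>>>>>> that particular field; the LID is a placeholder:
>>>>>>>>
>>>>>>>>     ibswitches                         # list switches and their LIDs
>>>>>>>>     smpquery switchinfo <switch_lid>
>>>>>>>> )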
>>>>>>>>
>>>>>>>> Anyone have some ideas on troubleshooting steps to try? I think
>>>>>>>> Google is tired of me asking questions about it.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Robert LeBlanc
>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>> Brigham Young University
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Users mailing list
>>>>>>>> Users at lists.openfabrics.org
>>>>>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/users
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>