[Users] Weird IPoIB issue

Hal Rosenstock hal.rosenstock at gmail.com
Mon Oct 28 11:33:42 PDT 2013


Are those ESXi IPoIB interfaces ? Do some of these work and others not ?
Are there normal Linux IPoIB interfaces ? Do they work ?


On Mon, Oct 28, 2013 at 2:24 PM, Robert LeBlanc <robert_leblanc at byu.edu> wrote:

> Yes, I cannot ping them over the IPoIB interface. It is a very simple
> network set-up.
>
> desxi003
> 8: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc pfifo_fast state
> UP qlen 256
>     link/infiniband
> 80:20:00:54:fe:80:00:00:00:00:00:00:f0:4d:a2:90:97:78:e7:d1 brd
> 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
>     inet 192.168.9.3/24 brd 192.168.9.255 scope global ib0
>     inet6 fe80::f24d:a290:9778:e7d1/64 scope link
>        valid_lft forever preferred_lft forever
>
> desxi004
> 8: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc pfifo_fast state
> UP qlen 256
>     link/infiniband
> 80:20:00:54:fe:80:00:00:00:00:00:00:f0:4d:a2:90:97:78:e7:15 brd
> 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
>     inet 192.168.9.4/24 brd 192.168.9.255 scope global ib0
>     inet6 fe80::f24d:a290:9778:e715/64 scope link
>        valid_lft forever preferred_lft forever
>
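> (For what it's worth, the test itself is nothing fancy -- roughly this,
> run from desxi003 with desxi004 listening:
>
> [root at desxi003 ~]# ping -c 3 -I ib0 192.168.9.4
> [root at desxi004 ~]# tcpdump -ni ib0 icmp
>
> and no replies come back.)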
>
>
> Robert LeBlanc
> OIT Infrastructure & Virtualization Engineer
> Brigham Young University
>
>
> On Mon, Oct 28, 2013 at 12:22 PM, Hal Rosenstock <hal.rosenstock at gmail.com
> > wrote:
>
>> So these 2 hosts have trouble talking IPoIB to each other ?
>>
>>
>> On Mon, Oct 28, 2013 at 2:16 PM, Robert LeBlanc <robert_leblanc at byu.edu> wrote:
>>
>>> I was just wondering about that. It seems reasonable that the broadcast
>>> traffic would go over multicast, but that separate channels would be
>>> created for node-to-node communication; otherwise the entire multicast
>>> group would be limited to 10 Gbps (in this instance), which doesn't
>>> scale very well.
>>>
>>> The things I've read about IPoIB performance tuning seem pretty vague,
>>> and the changes most people recommend already appear to be in place on
>>> the systems I'm using. Some people suggested trying a newer version of
>>> Ubuntu, but ultimately I have very little control over VMware. Once I
>>> can get the Linux machines to communicate over IPoIB between the racks
>>> and blades, I'm going to turn my attention to performance optimization.
>>> It doesn't make much sense to spend time there when it isn't working at
>>> all for most machines.
>>>
>>> I've done an ibtracert between the two nodes; is that what you mean by
>>> walking the route?
>>>
>>> [root at desxi003 ~]# ibtracert -m 0xc000 0x2d 0x37
>>> From ca 0xf04da2909778e7d0 port 1 lid 45-45 "localhost HCA-1"
>>> [1] -> switch 0x2c90200448ec8[17] lid 51 "Infiniscale-IV Mellanox
>>> Technologies"
>>> [18] -> ca 0xf04da2909778e714[1] lid 55 "localhost HCA-1"
>>> To ca 0xf04da2909778e714 port 1 lid 55-55 "localhost HCA-1"
>>>
>>> [root at desxi004 ~]# ibtracert -m 0xc000 0x37 0x2d
>>> From ca 0xf04da2909778e714 port 1 lid 55-55 "localhost HCA-1"
>>> [1] -> switch 0x2c90200448ec8[18] lid 51 "Infiniscale-IV Mellanox
>>> Technologies"
>>> [17] -> ca 0xf04da2909778e7d0[1] lid 45 "localhost HCA-1"
>>> To ca 0xf04da2909778e7d0 port 1 lid 45-45 "localhost HCA-1"
>>>
>>> As you can see, the route is on the same switch, the blades are right
>>> next to each other.
>>>
>>>
>>> Robert LeBlanc
>>> OIT Infrastructure & Virtualization Engineer
>>> Brigham Young University
>>>
>>>
>>> On Mon, Oct 28, 2013 at 12:05 PM, Hal Rosenstock <
>>> hal.rosenstock at gmail.com> wrote:
>>>
>>>> Which mystery is explained ? The 10 Gbps is a multicast-only limit and
>>>> does not apply to unicast. The BW limitation you're seeing is due to
>>>> other factors. There's been much written about IPoIB performance.
>>>>
>>>> If all the MC members are joined and routed, then the IPoIB
>>>> connectivity problem is something else. Are you sure this is the case ?
>>>> Did you walk the route between 2 nodes where you have a connectivity
>>>> issue ?
>>>>
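>>>> ibtracert will walk it hop by hop; roughly (substituting the LIDs of
>>>> the two problem nodes):
>>>>
>>>> ibtracert <src lid> <dst lid>             # unicast route
>>>> ibtracert -m 0xc000 <src lid> <dst lid>   # multicast route on the broadcast MLID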
>>>>
>>>> On Mon, Oct 28, 2013 at 1:58 PM, Robert LeBlanc <robert_leblanc at byu.edu
>>>> > wrote:
>>>>
>>>>> Well, that explains one mystery; now I need to figure out why the
>>>>> Dell blades don't seem to be passing the traffic.
>>>>>
>>>>>
>>>>> Robert LeBlanc
>>>>> OIT Infrastructure & Virtualization Engineer
>>>>> Brigham Young University
>>>>>
>>>>>
>>>>> On Mon, Oct 28, 2013 at 11:51 AM, Hal Rosenstock <
>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>
>>>>>> Yes, that's the IPoIB IPv4 broadcast group for the default (0xffff)
>>>>>> partition. The 0x80 part of the mtu and rate fields is the selector
>>>>>> and just means "is equal to"; mtu 0x04 is 2K (2048) and rate 0x3 is
>>>>>> 10 Gb/sec. These are indeed the defaults.
>>>>>>
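>>>>>> In other words, the high bits are the selector and the low bits are
>>>>>> the value:
>>>>>>
>>>>>> Mtu  0x84 = 0x80 ("exactly") + 0x04 (2048 bytes)
>>>>>> Rate 0x83 = 0x80 ("exactly") + 0x03 (10 Gb/sec)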
>>>>>>
>>>>>> On Mon, Oct 28, 2013 at 1:45 PM, Robert LeBlanc <
>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>
>>>>>>> The info for that MGID is:
>>>>>>> MCMemberRecord group dump:
>>>>>>>                 MGID....................ff12:401b:ffff::ffff:ffff
>>>>>>>                 Mlid....................0xC000
>>>>>>>                 Mtu.....................0x84
>>>>>>>                 pkey....................0xFFFF
>>>>>>>                 Rate....................0x83
>>>>>>>                 SL......................0x0
>>>>>>>
>>>>>>> I don't understand the MTU and Rate (130 and 131 dec). Running iperf
>>>>>>> between the two hosts over IPoIB in connected mode with MTU 65520,
>>>>>>> I've tried multiple threads, but the sum is still 10 Gbps.
>>>>>>>
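>>>>>>> (Roughly: iperf -s on one host and something like iperf -c <other
>>>>>>> host> -P 4 on the other; the parallel streams together still total
>>>>>>> about 10 Gbps.)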
>>>>>>>
>>>>>>> Robert LeBlanc
>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>> Brigham Young University
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Oct 28, 2013 at 11:40 AM, Hal Rosenstock <
>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>
>>>>>>>> saquery -g should show what MGID is mapped to MLID 0xc000 and the
>>>>>>>> group parameters.
>>>>>>>>
>>>>>>>> When you say 10 Gbps max, is that multicast or unicast ? That limit
>>>>>>>> is only on the multicast.
>>>>>>>>
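>>>>>>>> (To see what the unicast path itself can do, independent of IPoIB,
>>>>>>>> the perftest tools are handy -- assuming the package is installed,
>>>>>>>> something like ib_write_bw on one node and ib_write_bw <peer> on
>>>>>>>> the other.)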
>>>>>>>>
>>>>>>>> On Mon, Oct 28, 2013 at 1:28 PM, Robert LeBlanc <
>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>
>>>>>>>>> Well, that can explain why I'm only able to get 10 Gbps max from
>>>>>>>>> the two hosts that are working.
>>>>>>>>>
>>>>>>>>> I have tried updn and dnup and they didn't help either. I think
>>>>>>>>> the only thing that will help is Automatic Path Migration, as it
>>>>>>>>> tries very hard to route the alternative LIDs through different
>>>>>>>>> systemguids. I suspect it would require re-LIDing everything,
>>>>>>>>> which would mean an outage. I'm still trying to get answers from
>>>>>>>>> Oracle about whether that is even a possibility. I've tried
>>>>>>>>> seeding some of the algorithms with information like root nodes,
>>>>>>>>> etc., but none of them worked better.
>>>>>>>>>
>>>>>>>>> The MLID 0xc000 exists and I can see all the nodes joined to the
>>>>>>>>> group using saquery. I've checked the route using ibtracert specifying the
>>>>>>>>> MLID. The only thing I'm not sure how to check is the group parameters.
>>>>>>>>> What tool would I use for that?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Robert LeBlanc
>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>> Brigham Young University
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Oct 28, 2013 at 11:16 AM, Hal Rosenstock <
>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Xsigo's SM is not "straight" OpenSM. They have some proprietary
>>>>>>>>>> enhancements, and it may be based on an old vintage of OpenSM.
>>>>>>>>>> You will likely need to work with them/Oracle on issues now.
>>>>>>>>>>
>>>>>>>>>> Lack of a partitions file does mean the default partition and
>>>>>>>>>> default rate (10 Gbps), so from what I saw all ports had
>>>>>>>>>> sufficient rate to join the MC group.
>>>>>>>>>>
>>>>>>>>>> There are certain topology requirements for running various
>>>>>>>>>> routing algorithms. Did you try updn or dnup ?
>>>>>>>>>>
>>>>>>>>>> The key is determining whether the IPoIB broadcast group is set up
>>>>>>>>>> correctly. What MLID is the group built on (usually 0xc000) ? What are the
>>>>>>>>>> group parameters (rate, MTU) ? Are all members that are running IPoIB
>>>>>>>>>> joined ? Is the group routed to all such members ? There are
>>>>>>>>>> infiniband-diags for all of this.
>>>>>>>>>>
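>>>>>>>>>> Off the top of my head, roughly:
>>>>>>>>>>
>>>>>>>>>> saquery -g                          # group parameters (MGID, MLID, MTU, rate)
>>>>>>>>>> saquery -m                          # which ports have joined
>>>>>>>>>> ibtracert -m 0xc000 <lid1> <lid2>   # is the group routed between two members
>>>>>>>>>> ibroute -M <switch lid>             # multicast forwarding table of a switch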
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 28, 2013 at 12:19 PM, Robert LeBlanc <
>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>
>>>>>>>>>>> OpenSM (the SM runs on the Xsigo, so they manage it) is using
>>>>>>>>>>> minhop. I've loaded the ibnetdiscover output into ibsim and run
>>>>>>>>>>> all the different routing algorithms against it, with and
>>>>>>>>>>> without scatter ports. Minhop had 50% of our hosts running all
>>>>>>>>>>> paths through a single IS5030 switch (at least for the LIDs we
>>>>>>>>>>> need, which represent the Ethernet and Fibre Channel cards the
>>>>>>>>>>> hosts should communicate with). Ftree, dor, and dfsssp fell back
>>>>>>>>>>> to minhop; the others routed more paths through the same IS5030,
>>>>>>>>>>> in some cases increasing the number of hosts with a single point
>>>>>>>>>>> of failure to 75%.
>>>>>>>>>>>
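>>>>>>>>>>> (The simulation setup was roughly: capture the fabric with
>>>>>>>>>>> ibnetdiscover > fabric.topo, point ibsim at that file, run opensm
>>>>>>>>>>> against the simulated fabric with -R <engine> for each routing
>>>>>>>>>>> engine, and compare the forwarding-table dumps it produces.)
>>>>>>>>>>>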
>>>>>>>>>>> As far as I can tell there is no partitions.conf file so I
>>>>>>>>>>> assume we are using the default partition. There is an opensm.opts file,
>>>>>>>>>>> but it only specifies logging information.
>>>>>>>>>>> # SA database file name
>>>>>>>>>>> sa_db_file /var/log/opensm-sa.dump
>>>>>>>>>>>
>>>>>>>>>>> # If TRUE causes OpenSM to dump SA database at the end of
>>>>>>>>>>> # every light sweep, regardless of the verbosity level
>>>>>>>>>>> sa_db_dump TRUE
>>>>>>>>>>>
>>>>>>>>>>> # The directory to hold the file OpenSM dumps
>>>>>>>>>>> dump_files_dir /var/log/
>>>>>>>>>>>
>>>>>>>>>>> The SM node is:
>>>>>>>>>>> xsigoa:/opt/xsigo/xsigos/current/ofed/etc# ibaddr
>>>>>>>>>>> GID fe80::13:9702:100:979 LID start 0x1 end 0x1
>>>>>>>>>>>
>>>>>>>>>>> We do have Switch-X in two of the Dell m1000e chassis, but the
>>>>>>>>>>> cards (ports 17-32) are FDR10 (the switch may be straight FDR,
>>>>>>>>>>> but I'm not 100% sure). The IS5030s, which the Switch-X are
>>>>>>>>>>> connected to, are QDR; the switches in the Xsigo directors are
>>>>>>>>>>> QDR, but the Ethernet and Fibre Channel cards are DDR. The DDR
>>>>>>>>>>> cards will not be running IPoIB (at least to my knowledge they
>>>>>>>>>>> don't have the ability); only the hosts should be leveraging
>>>>>>>>>>> IPoIB. I hope that clears up some of your questions. If you have
>>>>>>>>>>> more, I will try to answer them.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Oct 28, 2013 at 9:57 AM, Hal Rosenstock <
>>>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> What routing algorithm is configured in OpenSM ? What does your
>>>>>>>>>>>> partitions.conf file look like ? Which node is your OpenSM ?
>>>>>>>>>>>>
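>>>>>>>>>>>> (If the routing engine is configured in the conf/opts file, it
>>>>>>>>>>>> will be the routing_engine line; sminfo will report which port
>>>>>>>>>>>> the master SM is running on.)
>>>>>>>>>>>>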
>>>>>>>>>>>> Also, I only see QDR and DDR links although you have Switch-X,
>>>>>>>>>>>> so I assume all FDR ports are connected to slower (QDR) devices.
>>>>>>>>>>>> I don't see any FDR-10 ports, but maybe they're also connected
>>>>>>>>>>>> to QDR ports and so show up as QDR in the topology.
>>>>>>>>>>>>
>>>>>>>>>>>> There are DDR CAs in the Xsigo box, but I'm not sure whether or
>>>>>>>>>>>> not they run IPoIB.
>>>>>>>>>>>>
>>>>>>>>>>>> -- Hal
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Oct 27, 2013 at 9:46 PM, Robert LeBlanc <
>>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Since you guys are amazingly helpful, I thought I would pick
>>>>>>>>>>>>> your brains on a new problem.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We have two Xsigo directors cross connected to four Mellanox
>>>>>>>>>>>>> IS5030 switches. Connected to those we have four Dell m1000e chassis each
>>>>>>>>>>>>> with two IB switches (two chassis have QDR and two have FDR10). We have 9
>>>>>>>>>>>>> dual-port rack servers connected to the IS5030 switches. For testing
>>>>>>>>>>>>> purposes we have an additional Dell m1000e QDR chassis connected to one
>>>>>>>>>>>>> Xsigo director and two dual-port FDR10 rack servers connected to the other
>>>>>>>>>>>>> Xsigo director.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I can get IPoIB to work between the two test rack servers
>>>>>>>>>>>>> connected to the one Xsigo director, but I cannot get IPoIB to
>>>>>>>>>>>>> work between any blades, either right next to each other or to
>>>>>>>>>>>>> the working rack servers. I'm using the exact same live CentOS
>>>>>>>>>>>>> ISO on all four servers. I've checked opensm, and the blades
>>>>>>>>>>>>> have joined the multicast group 0xc000 properly. tcpdump
>>>>>>>>>>>>> basically says that traffic is not leaving the blades, and it
>>>>>>>>>>>>> also shows no traffic entering the blades from the rack
>>>>>>>>>>>>> servers. An ibtracert using the 0xc000 MLID shows that routing
>>>>>>>>>>>>> exists between hosts.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I've read about MulticastFDBTop=0xBFFF but I don't know how to
>>>>>>>>>>>>> set it and I doubt it would have been set by default.
>>>>>>>>>>>>>
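>>>>>>>>>>>>> (If it helps, my understanding is that MulticastFDBTop lives in
>>>>>>>>>>>>> each switch's SwitchInfo record, so a reasonably recent
>>>>>>>>>>>>> smpquery switchinfo <switch lid> should show whether it has
>>>>>>>>>>>>> been set to 0xBFFF or left at 0 -- but I may be off on that.)
>>>>>>>>>>>>>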
>>>>>>>>>>>>> Anyone have some ideas on troubleshooting steps to try? I
>>>>>>>>>>>>> think Google is tired of me asking questions about it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>>
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> Users mailing list
>>>>>>>>>>>>> Users at lists.openfabrics.org
>>>>>>>>>>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/users
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>