[Users] Weird IPoIB issue
Robert LeBlanc
robert_leblanc at byu.edu
Mon Oct 28 11:24:22 PDT 2013
Yes, I cannot ping them over the IPoIB interface. It is a very simple
network setup.
desxi003
8: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc pfifo_fast state UP qlen 256
    link/infiniband 80:20:00:54:fe:80:00:00:00:00:00:00:f0:4d:a2:90:97:78:e7:d1 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 192.168.9.3/24 brd 192.168.9.255 scope global ib0
    inet6 fe80::f24d:a290:9778:e7d1/64 scope link
       valid_lft forever preferred_lft forever
desxi004
8: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc pfifo_fast state UP qlen 256
    link/infiniband 80:20:00:54:fe:80:00:00:00:00:00:00:f0:4d:a2:90:97:78:e7:15 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 192.168.9.4/24 brd 192.168.9.255 scope global ib0
    inet6 fe80::f24d:a290:9778:e715/64 scope link
       valid_lft forever preferred_lft forever
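
The failing test is just pinging one blade's ib0 address from the other over
that interface, e.g. from desxi003:

ping -I ib0 -c 3 192.168.9.4

No replies come back, while IPoIB works fine between the two test rack
servers.
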
Robert LeBlanc
OIT Infrastructure & Virtualization Engineer
Brigham Young University
On Mon, Oct 28, 2013 at 12:22 PM, Hal Rosenstock
<hal.rosenstock at gmail.com> wrote:
> So these 2 hosts have trouble talking IPoIB to each other ?
>
>
> On Mon, Oct 28, 2013 at 2:16 PM, Robert LeBlanc <robert_leblanc at byu.edu> wrote:
>
>> I was just wondering about that. It seems reasonable that the broadcast
>> traffic would go over multicast, but that channels would effectively be
>> created for node-to-node communication; otherwise the entire multicast
>> group would be limited to 10 Gbps (in this instance) for the whole group,
>> which doesn't scale very well.
>>
>> The things I've read about IPoIB performance tuning seem pretty vague,
>> and the changes most people recommend already appear to be in place on the
>> systems I'm using. Some people said to try a newer version of Ubuntu, but
>> ultimately I have very little control over VMware. Once I can get the
>> Linux machines to communicate over IPoIB between the racks and blades,
>> then I'm going to turn my attention to performance optimization. It
>> doesn't make much sense to spend time there when it is not working at all
>> for most machines.
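>>
>> For what it's worth, the usual IPoIB knobs already look right on each
>> host; a quick sanity check is something like this (standard IPoIB sysfs
>> and iproute paths, so just a sketch):
>>
>> cat /sys/class/net/ib0/mode   # "connected"
>> ip link show ib0              # mtu 65520
>>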
>>
>> I've run ibtracert between the two nodes; is that what you mean by
>> walking the route?
>>
>> [root@desxi003 ~]# ibtracert -m 0xc000 0x2d 0x37
>> From ca 0xf04da2909778e7d0 port 1 lid 45-45 "localhost HCA-1"
>> [1] -> switch 0x2c90200448ec8[17] lid 51 "Infiniscale-IV Mellanox
>> Technologies"
>> [18] -> ca 0xf04da2909778e714[1] lid 55 "localhost HCA-1"
>> To ca 0xf04da2909778e714 port 1 lid 55-55 "localhost HCA-1"
>>
>> [root@desxi004 ~]# ibtracert -m 0xc000 0x37 0x2d
>> From ca 0xf04da2909778e714 port 1 lid 55-55 "localhost HCA-1"
>> [1] -> switch 0x2c90200448ec8[18] lid 51 "Infiniscale-IV Mellanox
>> Technologies"
>> [17] -> ca 0xf04da2909778e7d0[1] lid 45 "localhost HCA-1"
>> To ca 0xf04da2909778e7d0 port 1 lid 45-45 "localhost HCA-1"
>>
>> As you can see, the route is on the same switch; the blades are right
>> next to each other.
>>
>>
>> Robert LeBlanc
>> OIT Infrastructure & Virtualization Engineer
>> Brigham Young University
>>
>>
>> On Mon, Oct 28, 2013 at 12:05 PM, Hal Rosenstock <
>> hal.rosenstock at gmail.com> wrote:
>>
>>> Which mystery is explained ? The 10 Gbps is a multicast-only limit and
>>> does not apply to unicast. The BW limitation you're seeing is due to other
>>> factors. There's been much written about IPoIB performance.
>>>
>>> If all the MC members are joined and routed, then the IPoIB connectivity
>>> problem is something else. Are you sure this is the case ? Did you walk the
>>> route between the 2 nodes where you have a connectivity issue ?
>>>
>>>
>>> On Mon, Oct 28, 2013 at 1:58 PM, Robert LeBlanc <robert_leblanc at byu.edu> wrote:
>>>
>>>> Well, that explains one mystery; now I need to figure out why the Dell
>>>> blades don't seem to be passing the traffic.
>>>>
>>>>
>>>> Robert LeBlanc
>>>> OIT Infrastructure & Virtualization Engineer
>>>> Brigham Young University
>>>>
>>>>
>>>> On Mon, Oct 28, 2013 at 11:51 AM, Hal Rosenstock <
>>>> hal.rosenstock at gmail.com> wrote:
>>>>
>>>>> Yes, that's the IPoIB IPv4 broadcast group for the default (0xffff)
>>>>> partition. The 0x80 part of the mtu and rate just means "is equal to";
>>>>> mtu 0x04 is 2K (2048) and rate 0x3 is 10 Gb/sec. These are indeed the defaults.
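>>>>>
>>>>> In other words, the selector lives in the top two bits and the value in
>>>>> the low six, so you can decode the fields with a bit of shell, e.g.:
>>>>>
>>>>> printf 'mtu  sel=%d val=%d\n' $(( 0x84 >> 6 )) $(( 0x84 & 0x3f ))  # sel 2 = equal to, val 4 = 2048
>>>>> printf 'rate sel=%d val=%d\n' $(( 0x83 >> 6 )) $(( 0x83 & 0x3f ))  # sel 2 = equal to, val 3 = 10 Gb/sec
>>>>>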
>>>>>
>>>>>
>>>>> On Mon, Oct 28, 2013 at 1:45 PM, Robert LeBlanc <
>>>>> robert_leblanc at byu.edu> wrote:
>>>>>
>>>>>> The info for that MGID is:
>>>>>> MCMemberRecord group dump:
>>>>>> MGID....................ff12:401b:ffff::ffff:ffff
>>>>>> Mlid....................0xC000
>>>>>> Mtu.....................0x84
>>>>>> pkey....................0xFFFF
>>>>>> Rate....................0x83
>>>>>> SL......................0x0
>>>>>>
>>>>>> I don't understand the MTU and Rate values (132 and 131 decimal). Running
>>>>>> iperf between the two hosts over IPoIB in connected mode with MTU 65520,
>>>>>> I've tried multiple threads, but the sum is still 10 Gbps.
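>>>>>>
>>>>>> For reference, the iperf runs look roughly like this (just a sketch;
>>>>>> substitute the other host's ib0 address):
>>>>>>
>>>>>> iperf -s                             # on one host
>>>>>> iperf -c <other ib0 IP> -P 4 -t 30   # on the other; -P = parallel streams
>>>>>>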
>>>>>>
>>>>>>
>>>>>> Robert LeBlanc
>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>> Brigham Young University
>>>>>>
>>>>>>
>>>>>> On Mon, Oct 28, 2013 at 11:40 AM, Hal Rosenstock <
>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>
>>>>>>> saquery -g should show what MGID is mapped to MLID 0xc000 and the
>>>>>>> group parameters.
>>>>>>>
>>>>>>> When you say 10 Gbps max, is that multicast or unicast ? That limit
>>>>>>> applies only to multicast.
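>>>>>>>
>>>>>>> Something like this should pull out the group built on 0xc000 along
>>>>>>> with its parameters (assuming a reasonably current saquery):
>>>>>>>
>>>>>>> saquery -g | grep -i -B 2 -A 6 'Mlid.*0xc000'
>>>>>>>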
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Oct 28, 2013 at 1:28 PM, Robert LeBlanc <
>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>
>>>>>>>> Well, that can explain why I'm only able to get 10 Gbps max from
>>>>>>>> the two hosts that are working.
>>>>>>>>
>>>>>>>> I have tried updn and dnup and they didn't help either. I think the
>>>>>>>> only thing that will help is Automatic Path Migration, as it tries very hard
>>>>>>>> to route the alternative LIDs through different system GUIDs. I suspect it
>>>>>>>> would require re-LIDing everything, which would mean an outage. I'm still
>>>>>>>> trying to get answers from Oracle on whether that is even a possibility. I've
>>>>>>>> tried seeding some of the algorithms with information like root nodes, etc.,
>>>>>>>> but none of them worked better.
>>>>>>>>
>>>>>>>> The MLID 0xc000 exists and I can see all the nodes joined to the
>>>>>>>> group using saquery. I've checked the route using ibtracert specifying the
>>>>>>>> MLID. The only thing I'm not sure how to check is the group parameters.
>>>>>>>> What tool would I use for that?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>>
>>>>>>>> Robert LeBlanc
>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>> Brigham Young University
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Oct 28, 2013 at 11:16 AM, Hal Rosenstock <
>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Xsigo's SM is not "straight" OpenSM. They have some proprietary
>>>>>>>>> enhancements, and it may be based on an old vintage of OpenSM. You will
>>>>>>>>> likely need to work with them/Oracle now on issues.
>>>>>>>>>
>>>>>>>>> Lack of a partitions file does mean the default partition and default
>>>>>>>>> rate (10 Gbps), so from what I saw all ports had sufficient rate to join
>>>>>>>>> the MC group.
>>>>>>>>>
>>>>>>>>> There are certain topology requirements for running various
>>>>>>>>> routing algorithms. Did you try updn or dnup ?
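>>>>>>>>>
>>>>>>>>> On a stock OpenSM that's just the routing engine option (how the Xsigo
>>>>>>>>> SM exposes it may differ), e.g.:
>>>>>>>>>
>>>>>>>>> opensm -R updn   # or "routing_engine updn" in the opensm config file
>>>>>>>>>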
>>>>>>>>>
>>>>>>>>> The key is determining whether the IPoIB broadcast group is set up
>>>>>>>>> correctly. What MLID is the group built on (usually 0xc000) ? What are the
>>>>>>>>> group parameters (rate, MTU) ? Are all members that are running IPoIB
>>>>>>>>> joined ? Is the group routed to all such members ? There are
>>>>>>>>> infiniband-diags for all of this.
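>>>>>>>>>
>>>>>>>>> Roughly, with stock infiniband-diags (exact options may vary by version):
>>>>>>>>>
>>>>>>>>> saquery -g                                # group parameters (MGID, MLID, MTU, rate)
>>>>>>>>> saquery -m                                # member GIDs, to check everyone joined
>>>>>>>>> ibtracert -m 0xc000 <src_lid> <dst_lid>   # is the group routed between two members
>>>>>>>>>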
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Oct 28, 2013 at 12:19 PM, Robert LeBlanc <
>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>
>>>>>>>>>> OpenSM (the SM runs on Xsigo, so they manage it) is using minhop.
>>>>>>>>>> I've loaded the ibnetdiscover output into ibsim and run all the different
>>>>>>>>>> routing algorithms against it, with and without scatter ports. Minhop had
>>>>>>>>>> 50% of our hosts running all paths through a single IS5030 switch (at least
>>>>>>>>>> for the LIDs we need, which represent the Ethernet and Fibre Channel cards
>>>>>>>>>> the hosts should communicate with). Ftree, dor, and dfsssp fell back to
>>>>>>>>>> minhop; the others routed more paths through the same IS5030, in some cases
>>>>>>>>>> increasing the share of hosts with a single point of failure to 75%.
>>>>>>>>>>
>>>>>>>>>> As far as I can tell there is no partitions.conf file, so I assume
>>>>>>>>>> we are using the default partition. There is an opensm.opts file, but it
>>>>>>>>>> only specifies logging information:
>>>>>>>>>> # SA database file name
>>>>>>>>>> sa_db_file /var/log/opensm-sa.dump
>>>>>>>>>>
>>>>>>>>>> # If TRUE causes OpenSM to dump SA database at the end of
>>>>>>>>>> # every light sweep, regardless of the verbosity level
>>>>>>>>>> sa_db_dump TRUE
>>>>>>>>>>
>>>>>>>>>> # The directory to hold the file OpenSM dumps
>>>>>>>>>> dump_files_dir /var/log/
>>>>>>>>>>
>>>>>>>>>> The SM node is:
>>>>>>>>>> xsigoa:/opt/xsigo/xsigos/current/ofed/etc# ibaddr
>>>>>>>>>> GID fe80::13:9702:100:979 LID start 0x1 end 0x1
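>>>>>>>>>>
>>>>>>>>>> Running sminfo from one of the hosts should confirm this is the master
>>>>>>>>>> SM (it reports the SM's LID, GUID, priority, and state):
>>>>>>>>>>
>>>>>>>>>> sminfo
>>>>>>>>>>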
>>>>>>>>>>
>>>>>>>>>> We do have Switch-X in two of the Dell m1000e chassis, but the
>>>>>>>>>> cards (ports 17-32) are FDR10 (the switch may be straight FDR, but I'm not
>>>>>>>>>> 100% sure). The IS5030s, which the Switch-X are connected to, are QDR; the
>>>>>>>>>> switches in the Xsigo directors are QDR; and the Ethernet and Fibre Channel
>>>>>>>>>> cards are DDR. The DDR cards will not be running IPoIB (at least to my
>>>>>>>>>> knowledge they don't have the ability); only the hosts should be leveraging
>>>>>>>>>> IPoIB. I hope that clears up some of your questions. If you have more, I
>>>>>>>>>> will try to answer them.
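>>>>>>>>>>
>>>>>>>>>> If it helps, I can dump the negotiated widths and speeds for every link
>>>>>>>>>> with something like:
>>>>>>>>>>
>>>>>>>>>> iblinkinfo   # per-port width (4x) and speed (DDR/QDR/FDR10, depending on diags version)
>>>>>>>>>>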
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Robert LeBlanc
>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>> Brigham Young University
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 28, 2013 at 9:57 AM, Hal Rosenstock <
>>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> What routing algorithm is configured in OpenSM ? What does your
>>>>>>>>>>> partitions.conf file look like ? Which node is your OpenSM ?
>>>>>>>>>>>
>>>>>>>>>>> Also, I only see QDR and DDR links although you have Switch-X, so
>>>>>>>>>>> I assume all FDR ports are connected to slower (QDR) devices. I don't see
>>>>>>>>>>> any FDR-10 ports, but maybe they're also connected to QDR ports and so show
>>>>>>>>>>> up as QDR in the topology.
>>>>>>>>>>>
>>>>>>>>>>> There are DDR CAs in the Xsigo box, but I'm not sure whether or not
>>>>>>>>>>> they run IPoIB.
>>>>>>>>>>>
>>>>>>>>>>> -- Hal
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Oct 27, 2013 at 9:46 PM, Robert LeBlanc <
>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Since you guys are amazingly helpful, I thought I would pick
>>>>>>>>>>>> your brains on a new problem.
>>>>>>>>>>>>
>>>>>>>>>>>> We have two Xsigo directors cross connected to four Mellanox
>>>>>>>>>>>> IS5030 switches. Connected to those we have four Dell m1000e chassis each
>>>>>>>>>>>> with two IB switches (two chassis have QDR and two have FDR10). We have 9
>>>>>>>>>>>> dual-port rack servers connected to the IS5030 switches. For testing
>>>>>>>>>>>> purposes we have an additional Dell m1000e QDR chassis connected to one
>>>>>>>>>>>> Xsigo director and two dual-port FDR10 rack servers connected to the other
>>>>>>>>>>>> Xsigo director.
>>>>>>>>>>>>
>>>>>>>>>>>> I can get IPoIB to work between the two test rack servers
>>>>>>>>>>>> connected to the one Xsigo director, but I cannot get IPoIB to work
>>>>>>>>>>>> between any blades, either right next to each other or to the working rack
>>>>>>>>>>>> servers. I'm using the exact same live CentOS ISO on all four servers. I've
>>>>>>>>>>>> checked opensm and the blades have joined the multicast group 0xc000
>>>>>>>>>>>> properly. tcpdump basically shows that traffic is not leaving the blades,
>>>>>>>>>>>> and it also shows no traffic entering the blades from the rack servers. An
>>>>>>>>>>>> ibtracert using the 0xc000 MLID shows that routing exists between the hosts.
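>>>>>>>>>>>>
>>>>>>>>>>>> The tcpdump check is just watching ib0 on both ends while pinging,
>>>>>>>>>>>> roughly:
>>>>>>>>>>>>
>>>>>>>>>>>> tcpdump -i ib0 -n arp or icmp
>>>>>>>>>>>>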
>>>>>>>>>>>>
>>>>>>>>>>>> I've read about MulticastFDBTop=0xBFFF, but I don't know how to
>>>>>>>>>>>> set it, and I doubt it would have been set by default.
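>>>>>>>>>>>>
>>>>>>>>>>>> If anyone knows the right way to inspect it, I'm guessing something
>>>>>>>>>>>> like this against the switch LID would show whether it's a factor:
>>>>>>>>>>>>
>>>>>>>>>>>> smpquery switchinfo <switch_lid>   # newer diags print MulticastFDBTop here
>>>>>>>>>>>> ibroute -M <switch_lid>            # dump the switch's multicast forwarding table
>>>>>>>>>>>>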
>>>>>>>>>>>>
>>>>>>>>>>>> Anyone have some ideas on troubleshooting steps to try? I think
>>>>>>>>>>>> Google is tired of me asking questions about it.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> Users mailing list
>>>>>>>>>>>> Users at lists.openfabrics.org
>>>>>>>>>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/users
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>