[Users] Weird IPoIB issue

Robert LeBlanc robert_leblanc at byu.edu
Sun Oct 27 20:29:57 PDT 2013


That is something to try. I would think either the QDR rate (the IS5030
switches are QDR) or the FDR rate would work, but the only hosts that work are
the two rack mounts that don't have to go through an uplink. Then again, the
blades are on the same switch and don't have to go through an uplink either.
I'm willing to try anything at this point.
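
If it helps, my reading of the opensm man page (so treat this as a sketch, not
gospel) is that the rate goes on the partition line in
/etc/opensm/partitions.conf on the node running opensm, something like:

    # mtu=4 is 2048 bytes; rate=3 is 10 Gb/s, a floor that both the QDR and
    # the FDR10 ports should be able to satisfy
    Default=0x7fff, ipoib, mtu=4, rate=3 : ALL=full;

followed by restarting (or resweeping) opensm so the 0xc000 group is recreated
with the new rate.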

I was wondering if the mesh-type fabric could be causing some problems, but
it seems that multicast should route the same way unicast does. The two chassis
switches are cross-connected to two of the IS5030s to try to provide maximum
redundancy (which, in a completely ironic turn of events, actually reduces
redundancy: simulating different routing algorithms in ibsim sends 50% or more
of the paths we care about through a single switch at some point along the
route). I think the only options at this point are to get Oracle to leverage
Automatic Path Migration, to write a new routing algorithm, or to change the
physical cabling to remove the cross connections.
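
For anyone curious, the ibsim runs I mentioned look roughly like this (paths
and flags from memory, so double-check against your distro's packages):

    # capture the live topology, then feed it to the simulator
    ibnetdiscover > fabric.topo
    ibsim fabric.topo &

    # run opensm against the simulated fabric with a different routing engine;
    # -o sweeps once and exits, -f picks the log file to compare afterwards
    LD_PRELOAD=/usr/lib64/umad2sim/libumad2sim.so \
        opensm -R updn -o -f /tmp/osm-updn.log

Swapping the -R engine (minhop, updn, ftree, ...) and comparing the resulting
routes is how I arrived at the 50% figure above.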

Thanks for the suggestion! Please keep them coming.


Robert LeBlanc
OIT Infrastructure & Virtualization Engineer
Brigham Young University


On Sun, Oct 27, 2013 at 9:10 PM, Narayan Desai <narayan.desai at gmail.com> wrote:

> Have you set a rate for IPoIB in opensm (specifically, in partitions.conf)?
> This controls the speed of the multicast group associated with IPoIB. I've
> seen issues when a node can't satisfy that rate; that will cause it not to
> work properly over IPoIB, but still show good connectivity on raw IB. (This
> is what your comments suggest: ibtracert works, but IPoIB doesn't.)
> Considering your mix of link speeds, I bet this (or something like it) is
> it.
>
> I'm not sure that tcpdump would ever show you anything useful for IPoIB
> when it isn't working at that low a level. Can anyone say for sure?
> -nld
>
>
> On Sun, Oct 27, 2013 at 8:46 PM, Robert LeBlanc <robert_leblanc at byu.edu> wrote:
>
>> Since you guys are amazingly helpful, I thought I would pick your brains
>> on a new problem.
>>
>> We have two Xsigo directors cross-connected to four Mellanox IS5030
>> switches. Connected to those, we have four Dell m1000e chassis, each with
>> two IB switches (two chassis have QDR and two have FDR10). We have nine
>> dual-port rack servers connected to the IS5030 switches. For testing
>> purposes, we have an additional Dell m1000e QDR chassis connected to one
>> Xsigo director and two dual-port FDR10 rack servers connected to the other
>> Xsigo director.
>>
>> I can get IPoIB to work between the two test rack servers connected to
>> the one Xsigo director, but I cannot get IPoIB to work from any blade,
>> either to a blade right next to it or to the working rack servers. I'm
>> using the exact same live CentOS ISO on all four servers. I've checked
>> opensm, and the blades have joined the multicast group 0xc000 properly.
>> tcpdump basically says that traffic is not leaving the blades, and it also
>> shows no traffic entering the blades from the rack servers. An ibtracert
>> using the 0xc000 MLID shows that routing exists between hosts.
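>>
>> For what it's worth, the checks above were along these lines (from memory,
>> so the exact flags may be slightly off; <src-lid> and <dst-lid> are
>> placeholders):
>>
>>     saquery -g      # multicast group info; look for the 0xc000 broadcast group
>>     saquery -m      # multicast member records; the blades' port GUIDs show up here
>>     ibtracert -m 0xc000 <src-lid> <dst-lid>   # multicast trace between two hosts
>>     tcpdump -ni ib0 icmp or arp               # watch for IPoIB traffic while pinging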
>>
>> I've read about MulticastFDBTop=0xBFFF, but I don't know how to set it,
>> and I doubt it would have been set by default.
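>>
>> (If someone wants to check a switch, I believe the value, when set, lives
>> in the switch's SwitchInfo, so something like the following should show it,
>> though I'm not sure which infiniband-diags versions actually print the
>> MulticastFDBTop field:)
>>
>>     # read SwitchInfo from a switch LID; <switch-lid> is a placeholder
>>     smpquery switchinfo <switch-lid>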
>>
>> Anyone have some ideas on troubleshooting steps to try? I think Google is
>> tired of me asking questions about it.
>>
>> Thanks,
>>
>> Robert LeBlanc
>> OIT Infrastructure & Virtualization Engineer
>> Brigham Young University
>>
>> _______________________________________________
>> Users mailing list
>> Users at lists.openfabrics.org
>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/users
>>
>>
>

