[ewg] Infiniband Interoperability

David Brean david.brean at oracle.com
Wed Jul 7 17:04:25 PDT 2010


Correct, an SM hasn't been released for OpenSolaris yet.

Looks like a very unusual multicast address because it doesn't have the IPoIB or Subnet Administrator signature.
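
The signature would sit in bytes 2-3 of the MGID, i.e. its second colon
group; per RFC 4391, IPoIB groups carry 0x401b (IPv4) or 0x601b (IPv6)
there. A quick way to eyeball the group from your log:

    # Print the signature field of the MGID; an IPoIB group would show
    # 401b or 601b here, but this group shows 1405 instead.
    echo "ff12:1405:ffff::3333:1:2" | cut -d: -f2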

-David

On 7/7/10 6:37 PM, Matt Breitbach wrote:
> We disconnected one port on the IB card that was a dual-port card. The
> second port was not configured, so I can't imagine it caused problems,
> but it is completely disconnected now.
>
> As for a Subnet Manager on OpenSolaris - there isn't one. I believe they
> do have one for Solaris, but I do not believe that it's been released to
> OpenSolaris, and I can't find it anywhere on our system.
>
> ------------------------------------------------------------------------
>
> *From:* richard at informatix-sol.com [mailto:richard at informatix-sol.com]
> *Sent:* Thursday, July 01, 2010 12:54 AM
> *To:* Matt Breitbach; ewg at lists.openfabrics.org
> *Subject:* Re: [ewg] Infiniband Interoperability
>
> When I had multiple SMs running, none reported it as a problem.
> Sun developed their own SM for Solaris; I can't recall now what they called it.
>
> The other possibility I've seen cause problems with IPoIB is having two
> ports on the same IP subnet. Either bond them or disable ARP responses
> on one port. This is due to the way broadcast is emulated across multicast.
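>
> On Linux, for example, the per-interface arp_ignore/arp_announce sysctls
> can suppress the cross-port ARP replies (a rough sketch; the interface
> name ib1 is just an example):
>
>     # Make ib1 answer ARP only for addresses configured on ib1 itself,
>     # and prefer the matching source address in its own requests.
>     sysctl -w net.ipv4.conf.ib1.arp_ignore=1
>     sysctl -w net.ipv4.conf.ib1.arp_announce=2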
>
>
> Richard
>
> ----- Reply message -----
> From: "Matt Breitbach" <matthewb at flash.shanje.com>
> Date: Wed, Jun 30, 2010 19:32
> Subject: [ewg] Infiniband Interoperability
> To: <richard at informatix-sol.com>, <ewg at lists.openfabrics.org>
>
> The Mellanox switch, as far as I can tell, does not have an SM running. It
> is a pretty dumb switch, and there really isn't much to configure on it.
>
>
>
> LID 6 is the LID that OpenSM is running on, which is our CentOS 5.5 blade.
> I believe it's reporting the issue since it's the Subnet Manager.
>
>
>
> The only other possibility is that there is a subnet manager running on the
> OpenSolaris box, but I have not been able to find one to blame this on. I
> would also expect to find reports of some sort of SM election in the
> OpenSM.log file if there were multiple SMs running.
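>
> (A quick check, assuming the infiniband-diags tools from OFED: sminfo asks
> the fabric which SM is master, and the OpenSM log can be grepped for the
> state-change chatter a second SM would cause.)
>
>     # Report the master SM's LID, GUID, priority, and state.
>     sminfo
>     # Look for handover/standby transitions in the OpenSM log
>     # (the log path may differ per install).
>     grep -iE "master|standby|handover" /var/log/opensm.log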
>
>
>
> LID listing :
>
>
>
> LID 1 - SuperMicro 4U running OpenSolaris (InfiniHost III EX PCI-E card w/
> 128MB RAM)
>
> LID 2 - Blade Server currently running CentOS 5.5 and Xen (ConnectX
> Mezzanine card)
>
> LID 3 - InfiniScale III Switch
>
> LID 4 - SuperMicro 4U running OpenSolaris (InfiniHost III EX PCI-E card w/
> 128MB RAM - 2nd port)
>
> LID 5 - Blade Server running Windows 2008R2 (ConnectX Mezzanine card)
>
> LID 6 - Blade Server running CentOS 5.5 and OpenSM (ConnectX Mezzanine card)
>
> LID 7 - Blade Server running Windows 2008 (InfiniHost III EX Mezzanine card)
>
>
>
> As for toggling the enable state: according to ibdiagnet, the lowest
> connected member's rate is 20Gbps, but the network is only operating at
> 10Gbps. I'm not sure which system I would toggle the enable state for.
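>
> (If it helps narrow it down, ibnetdiscover from infiniband-diags annotates
> every link with its negotiated width and speed, e.g. 4xSDR = 10Gbps vs
> 4xDDR = 20Gbps, so it shows whether any port actually linked below 20Gbps:)
>
>     # Dump the fabric topology; each link line carries its negotiated
>     # width/speed annotation (e.g. 4xSDR, 4xDDR).
>     ibnetdiscover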
>
>
>
> -Matt Breitbach
>
> _____
>
> From: richard at informatix-sol.com [mailto:richard at informatix-sol.com]
> Sent: Wednesday, June 30, 2010 1:14 PM
> To: Matt Breitbach; ewg at lists.openfabrics.org
> Subject: Re: [ewg] Infiniband Interoperability
>
>
>
> I'm still suspicious that you have more than one SM running. Mellanox
> switches have it enabled by default.
> It's common for ARP requests, such as those triggered by ping, to result
> in multicast group activity.
> InfiniBand creates these groups on demand and tears them down when they
> have no remaining members. There is no broadcast address; broadcast is
> emulated over a dedicated MC group.
> They all seem to originate from LID 6, so you can trace the source.
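>
> (For example, smpquery from infiniband-diags can map a LID back to a node
> description:)
>
>     # Ask the node behind LID 6 for its NodeDescription string.
>     smpquery nodedesc 6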
>
> If you have ports running at non-optimal speeds, try toggling their enable
> state. This often fixes it.
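>
> (With infiniband-diags that is roughly the following, filling in the LID
> and port number of the slow link:)
>
>     # Bounce the port so the link renegotiates width and speed.
>     ibportstate <lid> <port> disable
>     ibportstate <lid> <port> enable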
>
> Richard
>
> ----- Reply message -----
> From: "Matt Breitbach" <matthewb at flash.shanje.com>
> Date: Wed, Jun 30, 2010 15:33
> Subject: [ewg] Infiniband Interoperability
> To: <ewg at lists.openfabrics.org>
>
> Well, let me throw out a little about the environment :
>
>
>
> We are running one SuperMicro 4U system with a Mellanox InfiniHost III EX
> card w/ 128MB RAM. This box is the OpenSolaris box. It's running the
> OpenSolaris InfiniBand stack, but no SM. Both ports are cabled to ports 1
> and 2 of the IB switch.
>
>
>
> The other systems are in a SuperMicro Bladecenter. The switch in the
> BladeCenter is an InfiniScale III switch with 10 internal ports and 10
> external ports.
>
>
>
> Three blades are connected with Mellanox ConnectX Mezzanine cards; one blade
> is connected with an InfiniHost III EX Mezzanine card.
>
>
>
> One of the blades is running CentOS and the 1.5.1 OFED release. OpenSM is
> running on that system, and is the only SM running on the network. This
> blade is using a ConnectX Mezzanine card.
>
>
>
> One blade is running Windows 2008 with the latest OFED drivers installed.
> It is using an InfiniHost III EX Mezzanine card.
>
>
>
> One blade is running Windows 2008 R2 with the latest OFED drivers installed.
> It is using a ConnectX Mezzanine card.
>
>
>
> One blade has been switching between Windows 2008 R2 and CentOS with Xen.
> Under Windows 2008 R2 it runs the latest OFED drivers; under CentOS it runs
> the 1.5.2 RC2. That blade is using a ConnectX Mezzanine card.
>
>
>
> All of the firmware has been updated on the Mezzanine cards, the PCI-E
> InfiniHost III EX card, and the switch. All of the Windows boxes are
> configured to use Connected mode. I have not changed any other settings on
> the Linux boxes.
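>
> (On the Linux side, the IPoIB mode is visible per interface through sysfs,
> so checking or changing it is e.g.:)
>
>     # Show whether ib0 is in datagram or connected mode.
>     cat /sys/class/net/ib0/mode
>     # Switch to connected mode to match the Windows hosts, if desired.
>     echo connected > /sys/class/net/ib0/mode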
>
>
>
> As of right now, the network seems stable. I've been running pings for the
> last 12 hours, and nothing has dropped.
>
>
>
> I did notice some odd entries in the OpenSM log, though, that I do not
> believe belong there.
>
>
>
> Jun 30 06:56:26 832438 [B5723B90] 0x02 -> log_notice: Reporting Generic
> Notice type:3 num:67 (Mcast group deleted) from LID:6
> GID:ff12:1405:ffff::3333:1:2
>
> Jun 30 06:57:53 895990 [B5723B90] 0x02 -> log_notice: Reporting Generic
> Notice type:3 num:66 (New mcast group created) from LID:6
> GID:ff12:1405:ffff::3333:1:2
>
> Jun 30 07:18:06 770861 [B6124B90] 0x02 -> log_notice: Reporting Generic
> Notice type:3 num:67 (Mcast group deleted) from LID:6
> GID:ff12:1405:ffff::3333:1:2
>
> Jun 30 07:19:14 835273 [B5723B90] 0x02 -> log_notice: Reporting Generic
> Notice type:3 num:66 (New mcast group created) from LID:6
> GID:ff12:1405:ffff::3333:1:2
>
>
>
>
>
> I would not think that mcast groups should be created or deleted when no
> new adapters are being added to the network, especially in a network this
> small. Is it odd to see those messages?
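>
> (One way to watch that churn, assuming the OFED saquery tool, is to list
> the SA's multicast groups while the pings run:)
>
>     # List the multicast groups the SA currently knows about; rerunning
>     # this shows groups being created and torn down on demand.
>     saquery -g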
>
>
>
> Also, I have a warning when I run ibdiagnet - "Suboptimal rate for group.
> Lowest member rate: 20Gbps > group-rate: 10Gbps".
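>
> (Rerunning ibdiagnet with the expected link parameters makes it flag every
> port below spec, e.g.:)
>
>     # Check all links against an expected 4x width and 5.0Gb/s (DDR)
>     # lane speed; anything slower is reported per port.
>     ibdiagnet -lw 4x -ls 5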
>
>
>
> _______________________________________________
> ewg mailing list
> ewg at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


