[ofa-general] RcvSwRelayErrors

Hal Rosenstock hrosenstock at xsigo.com
Thu Mar 20 07:41:40 PDT 2008


On Thu, 2008-03-20 at 15:33 +0100, Bernd Schubert wrote:
> On Thursday 20 March 2008 15:29:35 Hal Rosenstock wrote:
> > On Thu, 2008-03-20 at 15:27 +0100, Bernd Schubert wrote:
> > > On Thursday 20 March 2008 15:12:03 Hal Rosenstock wrote:
> > > > On Thu, 2008-03-20 at 13:54 +0100, Bernd Schubert wrote:
> > > > > On Thursday 20 March 2008 13:27:36 Hal Rosenstock wrote:
> > > > > > On Thu, 2008-03-20 at 12:30 +0100, Bernd Schubert wrote:
> > > > > > > Hello,
> > > > > > >
> > > > > > > on one of our systems we get a rather huge number of
> > > > > > > RcvSwRelayErrors. All I find about RcvSwRelayErrors is
> > > > > > >
> > > > > > > "This counter can increase due to a valid network event"
> > > > > > >
> > > > > > > But what might cause?
> > > > >
> > > > > Ooops. This should have been "But what might cause it?"
> > > > >
> > > > > > Are you running IB multicast (e.g. IPoIB) ? That's the most common
> > > > > > cause.
> > > > >
> > > > > IPoIB is up, but so far it is only used by Lustre for the initial
> > > > > lnet o2ib setup and, AFAIK, not afterwards. I think some MPI
> > > > > stacks/applications also set up their initial connections over IPoIB.
> > > > >
> > > > > But in general, once these connections are established, IPoIB is not
> > > > > much used anymore.
> > > >
> > > > The causes are:
> > > > 1. DLID mapping
> > > > 2. VL mapping
> > > > 3. looping (out port = in port)
> > > >
> > > > Is your subnet unstable in some way ? Are you using QoS ?
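To make the three causes above concrete, here is a minimal toy model of a relay decision. It is only a sketch: ToySwitch, its tables and the DROP_VL rule are hypothetical stand-ins made up for illustration, not real switch firmware or OpenSM code.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

DROP_VL = 15  # toy-model assumption: an SL that maps to VL15 is not forwardable


@dataclass
class ToySwitch:
    """Hypothetical model of a switch relay; not real switch or OpenSM code."""
    lft: Dict[int, int]                      # DLID -> output port
    sl2vl: Dict[Tuple[int, int, int], int]   # (in port, out port, SL) -> VL
    rcv_sw_relay_errors: int = 0

    def _drop(self, reason: str) -> str:
        # Every packet the relay cannot forward bumps the counter once.
        self.rcv_sw_relay_errors += 1
        return f"dropped: {reason}"

    def relay(self, in_port: int, dlid: int, sl: int) -> str:
        out_port = self.lft.get(dlid)
        if out_port is None:
            return self._drop("DLID mapping")   # cause 1: no route for this DLID
        if out_port == in_port:
            return self._drop("looping")        # cause 3: out port = in port
        vl = self.sl2vl.get((in_port, out_port, sl))
        if vl is None or vl == DROP_VL:
            return self._drop("VL mapping")     # cause 2: no usable VL for this SL
        return f"forwarded to port {out_port} on VL{vl}"
```

In this model a single stale forwarding entry is enough to make the counter climb on every affected packet, even when overall traffic is low.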
> > >
> > > We have seen some odd problems with opensm (from OFED 1.2.5) in the past
> > > and once only rebooting the switches helped.
> >
> > You might want to update OpenSM to the OFED 1.3 version.
> 
> I won't manage to build new Debian packages today, but I will do so over Easter.
> I hope to also find the time to clean up the debian rules a bit, to have it
> officially included in Debian.
> 
> But will a new opensm help for these errors?

Perhaps; but not knowing more about the cause it's hard to say. It might
be interesting to see if there are any errors in your OpenSM log.
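As a rough way to skim the log for errors, something like the sketch below could tally error codes. It assumes that error lines contain the token "ERR" followed by a four-character hex code; the function name and the sample lines in the comment are made up for illustration, and the log location depends on your installation.

```python
import re
from collections import Counter

# Assumption: error lines carry "ERR <4 hex digits>", e.g. "... ERR 3113: ..."
ERR_CODE = re.compile(r"\bERR ([0-9A-Fa-f]{4})\b")


def count_error_codes(log_text: str) -> Counter:
    """Return how often each error code appears in the given log text."""
    codes = Counter()
    for line in log_text.splitlines():
        match = ERR_CODE.search(line)
        if match:
            codes[match.group(1).upper()] += 1
    return codes


# Usage sketch (the path is whatever your OpenSM is configured to log to):
#     with open(path_to_opensm_log) as f:
#         for code, n in count_error_codes(f.read()).most_common():
#             print(code, n)
```

A code that keeps repeating while the relay counter climbs would be the first thing to look at.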

-- Hal

> >
> > > Yesterday I started monitoring the fabric and even though there's not
> > > much traffic, I immediately noticed these errors.
> >
> > Were the counters cleared before you started looking ?
> 
> Yes, sure.
> 
> 
> Thanks a lot for your help,
> Bernd
> 



