[ofa-general] RcvSwRelayErrors

Bernd Schubert bs at q-leap.de
Thu Mar 20 07:27:38 PDT 2008


On Thursday 20 March 2008 15:12:03 Hal Rosenstock wrote:
> On Thu, 2008-03-20 at 13:54 +0100, Bernd Schubert wrote:
> > On Thursday 20 March 2008 13:27:36 Hal Rosenstock wrote:
> > > On Thu, 2008-03-20 at 12:30 +0100, Bernd Schubert wrote:
> > > > Hello,
> > > >
> > > > on one of our systems we get a rather huge number of
> > > > RcvSwRelayErrors. All I find about RcvSwRelayErrors is
> > > >
> > > > "This counter can increase due to a valid network event"
> > > >
> > > > But what might cause?
> >
> > Ooops. This should have been "But what might cause it?"
> >
> > > Are you running IB multicast (e.g. IPoIB) ? That's the most common
> > > cause.
> >
> > IPoIB is up, but so far only used initially by lustre for initial lnet
> > o2ib setup, but then AFAIK not any more. I think some MPI
> > stacks/applications also do their initial connection using IPoIB.
> >
> > But in general, once these connections are established, IPoIB is not much
> > used anymore.
>
> The causes are:
> 1. DLID mapping
> 2. VL mapping
> 3. looping (out port = in port)
>
> Is your subnet unstable in some way ? Are you using QoS ?
>
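
To make the three causes listed above concrete, here is a toy model of a switch relay decision. It is only a sketch: the function name, the dictionary-based forwarding table `lft`, and the `sl2vl` lookup keyed by (in port, out port, SL) are illustrative assumptions, not how any real switch or opensm represents this data.

```python
from typing import Dict, Optional, Tuple


def relay_discard_reason(
    in_port: int,
    dlid: int,
    sl: int,
    lft: Dict[int, int],
    sl2vl: Dict[Tuple[int, int, int], int],
) -> Optional[str]:
    """Return why a packet would be dropped by the relay, or None if forwarded.

    lft   : hypothetical forwarding table, DLID -> output port
    sl2vl : hypothetical table, (in port, out port, SL) -> VL
    """
    out_port = lft.get(dlid)
    if out_port is None:
        # Cause 1: the DLID has no entry in the forwarding table.
        return "DLID mapping"
    if out_port == in_port:
        # Cause 3: the packet would leave on the port it arrived on.
        return "looping"
    if (in_port, out_port, sl) not in sl2vl:
        # Cause 2: no usable VL for this port pair and service level.
        return "VL mapping"
    return None


if __name__ == "__main__":
    lft = {10: 2, 11: 1}
    sl2vl = {(1, 2, 0): 0}
    for dlid in (10, 11, 99):
        print(dlid, relay_discard_reason(1, dlid, 0, lft, sl2vl))
```

In this model every non-None result would correspond to one increment of RcvSwRelayErrors on the receiving port.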

We have seen some odd problems with opensm (from ofed-1.2.5) in the past, and
once only rebooting the switches helped.
Yesterday I started monitoring the fabric and, even though there's not much
traffic, I immediately noticed these errors.
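
For anyone wanting to watch the counter in the same way, the following is a minimal sketch of comparing two counter dumps. It assumes the dumps are text in the `Name:....value` layout that perfquery-style tools print; the function names are hypothetical, and collecting the dumps themselves is left out.

```python
import re
from typing import Dict

# Matches lines such as "RcvSwRelayErrors:................12".
# Lines with non-decimal values (e.g. hex fields) are skipped.
_COUNTER_LINE = re.compile(r"^(\w+):\.*(\d+)\s*$")


def parse_counters(text: str) -> Dict[str, int]:
    """Parse 'Name:....value' lines into a name -> integer mapping."""
    counters = {}
    for line in text.splitlines():
        match = _COUNTER_LINE.match(line.strip())
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def counter_increase(before: str, after: str,
                     name: str = "RcvSwRelayErrors") -> int:
    """Return how much one counter grew between two dumps."""
    return parse_counters(after)[name] - parse_counters(before)[name]
```

Sampling the same port at a fixed interval and feeding consecutive dumps to `counter_increase` gives an error rate, which is easier to correlate with traffic than the absolute value.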

We are not using QoS.


Thanks for your help,
Bernd


-- 
Bernd Schubert
Q-Leap Networks GmbH


