[ofa-general] RcvSwRelayErrors

Ira Weiny weiny2 at llnl.gov
Thu Mar 20 10:23:47 PDT 2008


Bernd,

On Thu, 20 Mar 2008 13:54:53 +0100
Bernd Schubert <bs at q-leap.de> wrote:

> On Thursday 20 March 2008 13:27:36 Hal Rosenstock wrote:
> > On Thu, 2008-03-20 at 12:30 +0100, Bernd Schubert wrote:
> > > Hello,
> > >
> > > on one of our systems we get a rather huge number of RcvSwRelayErrors.
> > > All I find about RcvSwRelayErrors is
> > >
> > > "This counter can increase due to a valid network event"
> > >
> > > But what might cause?
> 
> Ooops. This should have been "But what might cause it?"
> 
> >
> > Are you running IB multicast (e.g. IPoIB) ? That's the most common
> > cause.
> 
> IPoIB is up, but so far it is only used by lustre for the initial lnet o2ib
> setup and then, AFAIK, not any more. I think some MPI stacks/applications also
> do their initial connection using IPoIB.
> 
> But in general, once these connections are established, IPoIB is not much used 
> anymore.
> 

Just FYI, we completely ignore these errors here.  That is perhaps not what we
should do, but we run a number of things over IPoIB.

I suggest you try to clear the errors, run for ~1 hour and then check them.
Could you report back an approximate error rate using this method?
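
To make "approximate error rate" concrete, here is a minimal sketch of the
arithmetic, assuming you note the RcvSwRelayErrors value right after clearing
and again at the end of the run.  The function name and its arguments are
made up for illustration; they are not part of any IB tool.

```python
def error_rate_per_hour(count_before, count_after, elapsed_seconds):
    """Return the approximate number of errors per hour between two readings.

    count_before    -- counter value at the start (0 if just cleared)
    count_after     -- counter value at the end of the run
    elapsed_seconds -- wall-clock time between the two readings
    All three names are hypothetical, chosen only for this sketch.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    if count_after < count_before:
        # A smaller second reading means the counter was cleared in between,
        # so the difference would be meaningless.
        raise ValueError("counter went backwards; was it cleared mid-run?")
    # Scale the observed difference to a one-hour window.
    return (count_after - count_before) * 3600.0 / elapsed_seconds


if __name__ == "__main__":
    # Example: cleared to 0, then 1200 errors seen after a one-hour run.
    print(error_rate_per_hour(0, 1200, 3600))
```

If the counter has hit its maximum value by the time of the second reading,
the result is only a lower bound, so a shorter run may be needed.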

Ira


