[ofa-general] RcvSwRelayErrors
Ira Weiny
weiny2 at llnl.gov
Thu Mar 20 10:23:47 PDT 2008
Bernd,
On Thu, 20 Mar 2008 13:54:53 +0100
Bernd Schubert <bs at q-leap.de> wrote:
> On Thursday 20 March 2008 13:27:36 Hal Rosenstock wrote:
> > On Thu, 2008-03-20 at 12:30 +0100, Bernd Schubert wrote:
> > > Hello,
> > >
> > > on one of our systems we get a rather large number of RcvSwRelayErrors.
> > > All I find about RcvSwRelayErrors is
> > >
> > > "This counter can increase due to a valid network event"
> > >
> > > But what might cause?
>
> Ooops. This should have been "But what might cause it?"
>
> >
> > Are you running IB multicast (e.g. IPoIB) ? That's the most common
> > cause.
>
> IPoIB is up, but so far it is only used by Lustre for the initial LNET o2ib
> setup and, AFAIK, not afterwards. I think some MPI stacks/applications also
> do their initial connection setup using IPoIB.
>
> But in general, once these connections are established, IPoIB is not much used
> anymore.
>
Just FYI, we completely ignore these errors here. That is perhaps not what we
should do, but we run a number of things over IPoIB.
I suggest you try to clear the errors, run for ~1 hour and then check them.
Could you report back an approximate error rate using this method?
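For the rate itself, here is a minimal sketch of the arithmetic, assuming you
note the RcvSwRelayErrors value right after clearing and again at the end of
the run. The function name and the sample numbers are hypothetical; how you
read the counter is up to whatever tool you already use.

```python
def relay_error_rate(count_start, count_end, elapsed_seconds):
    """Return errors per second between two readings of the same counter."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    if count_end < count_start:
        # A lower second reading suggests the counter was cleared mid-run.
        raise ValueError("counter went backwards; was it cleared mid-run?")
    return (count_end - count_start) / elapsed_seconds


# Hypothetical readings: 0 right after clearing, 7200 one hour later.
rate = relay_error_rate(0, 7200, 3600)
print(f"{rate:.2f} errors/s ({rate * 3600:.0f} errors/hour)")
```

If the counter is already at its maximum value when you check it, the real
count may be higher, so a shorter interval would give a more useful number.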
Ira