[ofa-general] RcvSwRelayErrors

Hal Rosenstock hrosenstock at xsigo.com
Thu Mar 20 07:12:03 PDT 2008


On Thu, 2008-03-20 at 13:54 +0100, Bernd Schubert wrote:
> On Thursday 20 March 2008 13:27:36 Hal Rosenstock wrote:
> > On Thu, 2008-03-20 at 12:30 +0100, Bernd Schubert wrote:
> > > Hello,
> > >
> > > on one of our systems we get a rather large number of RcvSwRelayErrors.
> > > All I find about RcvSwRelayErrors is
> > >
> > > "This counter can increase due to a valid network event"
> > >
> > > But what might cause?
> 
> Ooops. This should have been "But what might cause it?"
> 
> >
> > Are you running IB multicast (e.g. IPoIB) ? That's the most common
> > cause.
> 
> IPoIB is up, but so far it is only used by Lustre for the initial LNet o2ib
> setup and, AFAIK, not after that. I think some MPI stacks/applications also
> do their initial connection setup using IPoIB.
> 
> But in general, once these connections are established, IPoIB is not much used 
> anymore.

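The counter counts packets received on the port that the switch relay
could not forward. The causes are:
1. DLID mapping (no usable forwarding entry for the packet's DLID)
2. VL mapping (the SL to VL mapping yields no usable VL)
3. looping (out port = in port)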

Is your subnet unstable in some way ? Are you using QoS ?
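
To make the three causes concrete, here is a minimal sketch of a switch
relay decision for one unicast packet. It is a toy model, not how any real
switch or OpenSM is implemented: the function name, the dictionary layout
of the tables, and the check order are assumptions made for illustration.
It also assumes the usual conventions that a forwarding table entry of 255
means "no route" and that an SL to VL entry of 15 means "discard".
Multicast, which is the common real-world trigger, is not modelled.

```python
from enum import Enum

NO_PATH = 255  # assumed convention: forwarding entry meaning "no route"
VL15 = 15      # assumed convention: SL-to-VL entry meaning "discard"


class Relay(Enum):
    FORWARD = "forward"
    DLID_MAPPING = "DLID mapping"
    VL_MAPPING = "VL mapping"
    LOOPING = "looping"


def relay(in_port, dlid, sl, lft, sl2vl, operational_vls):
    """Classify one unicast packet arriving on in_port (hypothetical model).

    lft:   dict mapping DLID -> output port
    sl2vl: dict mapping (in_port, out_port, SL) -> VL
    """
    # Cause 1: the DLID has no usable entry in the forwarding table.
    out_port = lft.get(dlid, NO_PATH)
    if out_port == NO_PATH:
        return Relay.DLID_MAPPING

    # Cause 3: the packet would leave through the port it arrived on.
    if out_port == in_port:
        return Relay.LOOPING

    # Cause 2: the SL maps to VL15 or to a VL the port does not operate.
    vl = sl2vl.get((in_port, out_port, sl), VL15)
    if vl == VL15 or vl >= operational_vls:
        return Relay.VL_MAPPING

    return Relay.FORWARD


if __name__ == "__main__":
    lft = {5: 2, 6: 1}
    sl2vl = {(1, 2, 0): 0, (1, 2, 1): 15, (1, 2, 2): 3}
    for dlid, sl in [(5, 0), (9, 0), (6, 0), (5, 1), (5, 2)]:
        print(dlid, sl, relay(1, dlid, sl, lft, sl2vl, 2).value)
```

Every outcome other than "forward" would bump RcvSwRelayErrors on the
receiving port in this model, which is why a changing forwarding table
(unstable subnet) or a changed SL to VL setup (QoS) is worth checking.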

-- Hal

> 
> Thanks,
> Bernd
> 
> 
> >
> > -- Hal
> >
> > > Thanks in advance for any help,
> > > Bernd
> > >
> > >
> > > [...]
> > >    11: [RcvSwRelayErrors == 189]
> > >    12: [RcvSwRelayErrors == 196]
> > >    16: [RcvSwRelayErrors == 34655]
> > > Errors for 0x000b8cffff002b33 "MT47396 Infiniscale-III Mellanox
> > > Technologies ()"
> > >    1: [RcvSwRelayErrors == 190]
> > >    2: [RcvSwRelayErrors == 188]
> > >    3: [RcvSwRelayErrors == 195]
> > >    4: [RcvSwRelayErrors == 207]
> > >    5: [RcvSwRelayErrors == 194]
> > >    6: [RcvSwRelayErrors == 189]
> > >    8: [RcvSwRelayErrors == 198]
> > >    9: [RcvSwRelayErrors == 197]
> > >    10: [RcvSwRelayErrors == 190]
> > >    11: [RcvSwRelayErrors == 198]
> > >    12: [RcvSwRelayErrors == 190]
> > >    16: [RcvSwRelayErrors == 34711]
> > > Errors for 0x000b8cffff002b43 "MT47396 Infiniscale-III Mellanox
> > > Technologies ()"
> > >    1: [RcvSwRelayErrors == 196]
> > >    3: [RcvSwRelayErrors == 242]
> > > [...]
> 
> 
> 
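For a report like the one quoted above, a small script can pick out the
ports whose counter is far above the rest of the same switch. This is only
a sketch: the function names, the "10 times the median" threshold, and the
assumption that the input looks exactly like the quoted lines are choices
made for illustration, not part of any diagnostic tool.

```python
import re
from collections import defaultdict
from statistics import median

HEADER = re.compile(r"Errors for (0x[0-9a-fA-F]+)")
COUNTER = re.compile(r"(\d+):\s*\[RcvSwRelayErrors == (\d+)\]")

# Excerpt copied from the quoted report, mail quote prefixes included.
SAMPLE = """\
> > > Errors for 0x000b8cffff002b33 "MT47396 Infiniscale-III Mellanox
> > > Technologies ()"
> > >    1: [RcvSwRelayErrors == 190]
> > >    2: [RcvSwRelayErrors == 188]
> > >    3: [RcvSwRelayErrors == 195]
> > >    4: [RcvSwRelayErrors == 207]
> > >    5: [RcvSwRelayErrors == 194]
> > >    6: [RcvSwRelayErrors == 189]
> > >    8: [RcvSwRelayErrors == 198]
> > >    9: [RcvSwRelayErrors == 197]
> > >    10: [RcvSwRelayErrors == 190]
> > >    11: [RcvSwRelayErrors == 198]
> > >    12: [RcvSwRelayErrors == 190]
> > >    16: [RcvSwRelayErrors == 34711]
"""


def parse_report(text):
    """Return {switch GUID: {port number: RcvSwRelayErrors count}}."""
    counts = defaultdict(dict)
    guid = None
    for line in text.splitlines():
        header = HEADER.search(line)
        if header:
            guid = header.group(1)
            continue
        counter = COUNTER.search(line)
        # Counter lines seen before any header are skipped: their switch
        # is unknown (elided in the report).
        if counter and guid is not None:
            counts[guid][int(counter.group(1))] = int(counter.group(2))
    return dict(counts)


def outliers(ports, factor=10):
    """Ports whose count exceeds factor times the switch's median count."""
    typical = median(ports.values())
    return {port: n for port, n in ports.items() if n > factor * typical}


if __name__ == "__main__":
    for guid, ports in parse_report(SAMPLE).items():
        print(guid, outliers(ports))
```

On the excerpt, only port 16 is flagged; the other ports all sit close to
190.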