[ofa-general] RcvSwRelayErrors

Hal Rosenstock hrosenstock at xsigo.com
Thu Mar 20 08:13:02 PDT 2008


On Thu, 2008-03-20 at 16:10 +0100, Bernd Schubert wrote:
> On Thursday 20 March 2008 15:41:40 Hal Rosenstock wrote:
> > On Thu, 2008-03-20 at 15:33 +0100, Bernd Schubert wrote:
> > > On Thursday 20 March 2008 15:29:35 Hal Rosenstock wrote:
> > > > On Thu, 2008-03-20 at 15:27 +0100, Bernd Schubert wrote:
> > > > > On Thursday 20 March 2008 15:12:03 Hal Rosenstock wrote:
> > > > > > On Thu, 2008-03-20 at 13:54 +0100, Bernd Schubert wrote:
> > > > > > > On Thursday 20 March 2008 13:27:36 Hal Rosenstock wrote:
> > > > > > > > On Thu, 2008-03-20 at 12:30 +0100, Bernd Schubert wrote:
> > > > > > > > > Hello,
> > > > > > > > >
> > > > > > > > > > on one of our systems we get a rather large number of
> > > > > > > > > > RcvSwRelayErrors. All I can find about RcvSwRelayErrors is
> > > > > > > > >
> > > > > > > > > "This counter can increase due to a valid network event"
> > > > > > > > >
> > > > > > > > > But what might cause?
> > > > > > >
> > > > > > > Ooops. This should have been "But what might cause it?"
> > > > > > >
> > > > > > > > Are you running IB multicast (e.g. IPoIB) ? That's the most
> > > > > > > > common cause.
> > > > > > >
> > > > > > > > IPoIB is up, but so far it is only used by Lustre for the
> > > > > > > > initial lnet o2ib setup, and AFAIK not after that. I think some
> > > > > > > > MPI stacks/applications also do their initial connection using
> > > > > > > > IPoIB.
> > > > > > >
> > > > > > > But in general, once these connections are established, IPoIB is
> > > > > > > not much used anymore.
> > > > > >
> > > > > > The causes are:
> > > > > > 1. DLID mapping
> > > > > > 2. VL mapping
> > > > > > 3. looping (out port = in port)
> > > > > >
> > > > > > Is your subnet unstable in some way ? Are you using QoS ?
> > > > >
> > > > > > We have seen some odd problems with opensm (from ofed-1.2.5) in the
> > > > > > past, and once only rebooting the switches helped.
> > > >
> > > > You might want to update OpenSM to OFED 1.3 version.
> > >
> > > > I won't manage to build new Debian packages today, but I will do so
> > > > over Easter. I also hope to find the time to clean up the debian rules
> > > > a bit, to have it officially included in Debian.
> > >
> > > But will a new opensm help for these errors?
> >
> > Perhaps; but not knowing more about the cause it's hard to say. It might
> > be interesting to see if there are any errors in your OpenSM log.
> 
> Well, these opensm logs are a big mystery to me. I do not have the slightest
> idea what it is trying to tell me with this:
> 
> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x0020 GID:0xfe80000000000000,0x000b8cffff002b41
> 
> Can I find some documentation somewhere, or does only reading the source help?

AFAIK there's no doc and it takes some combination of looking at the
messages and the IB spec and possibly looking at the source as well.
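
As a rough aid for reading such lines, here is a minimal Python sketch that
pulls the notice type, trap number and source LID out of a log line and
attaches a short description. The two lookup tables are my reading of the
notice/trap tables in the IB spec and cover only the numbers seen in this
thread; treat them as assumptions and check them against your spec revision.
The function name decode_notice is made up for illustration and is not part
of OpenSM.

```python
import re

# Assumed from the IB spec's notice tables; verify against your revision.
NOTICE_TYPES = {0: "fatal", 1: "urgent", 2: "security",
                3: "subnet management", 4: "informational"}

# Only the trap numbers that show up in this thread (also an assumption).
TRAP_NAMES = {
    64: "GID in service (port now reachable)",
    65: "GID out of service (port no longer reachable)",
    128: "link state changed on a switch port",
    144: "port capability mask changed",
}

NOTICE_RE = re.compile(
    r"Generic Notice type:(?P<type>0x[0-9a-fA-F]+|\d+) num:(?P<num>\d+)"
    r".*?from LID:(?P<lid>0x[0-9a-fA-F]+)"
)

def decode_notice(line):
    """Return a dict describing the notice in `line`, or None if no match."""
    m = NOTICE_RE.search(line)
    if m is None:
        return None
    ntype = int(m.group("type"), 0)   # base 0 accepts both "1" and "0x01"
    num = int(m.group("num"))
    return {
        "type": NOTICE_TYPES.get(ntype, "unknown type"),
        "trap": num,
        "lid": int(m.group("lid"), 16),
        "meaning": TRAP_NAMES.get(num, "unknown trap number"),
    }

if __name__ == "__main__":
    sample = ("osm_report_notice: Reporting Generic Notice type:1 num:128 "
              "from LID:0x0020 GID:0xfe80000000000000,0x000b8cffff002b41")
    print(decode_notice(sample))
```

Read this way, the line quoted above would be an urgent notice from the
switch at LID 0x20 saying that the link state of one of its ports changed.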

> Here are the logs from the last day:
> 
> Mar 19 17:20:53 463683 [44007960] -> SUBNET UP
> Mar 20 10:22:27 864281 [44007960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x002E TID:0x000000000000001f
> Mar 20 10:22:27 864533 [44007960] -> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x002E GID:0xfe80000000000000,0x000b8cffff002b50
> Mar 20 10:22:28 153211 [42003960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x001A GID:0xfe80000000000000,0x0002c9020025ae35
> Mar 20 10:22:28 153231 [42003960] -> __osm_drop_mgr_remove_port: Removed port with GUID:0x0002c902002587c6 LID range [0xF7,0xF7] of node:MT25408 ConnectX Mellanox Technologies
> Mar 20 10:22:28 192987 [42003960] -> osm_ucast_mgr_process: null (min-hop) tables configured on all switches
> Mar 20 10:22:28 270978 [44007960] -> SUBNET UP
> Mar 20 10:25:50 333350 [44007960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x002E TID:0x0000000000000020
> Mar 20 10:25:50 333579 [44007960] -> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x002E GID:0xfe80000000000000,0x000b8cffff002b50
> Mar 20 10:25:50 644817 [42003960] -> osm_report_notice: Reporting Generic Notice type:3 num:64 from LID:0x001A GID:0xfe80000000000000,0x0002c9020025ae35
> Mar 20 10:25:50 644840 [42003960] -> __osm_state_mgr_report_new_ports: Discovered new port with GUID:0x0002c902002587c6 LID range [0xF7,0xF7] of node:MT25408 ConnectX Mellanox Technologies
> Mar 20 10:25:50 679661 [42003960] -> osm_ucast_mgr_process: null (min-hop) tables configured on all switches
> Mar 20 10:25:50 755437 [41001960] -> SUBNET UP
> Mar 20 14:24:04 611501 [42003960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x0020 TID:0x0000000000000051
> Mar 20 14:24:04 611713 [42003960] -> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x0020 GID:0xfe80000000000000,0x000b8cffff002b41
> Mar 20 14:24:04 913422 [44808960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x001A GID:0xfe80000000000000,0x0002c9020025ae35
> Mar 20 14:24:04 913444 [44808960] -> __osm_drop_mgr_remove_port: Removed port with GUID:0x0002c9020025871d LID range [0x8,0x8] of node:MT25408 ConnectX Mellanox Technologies
> Mar 20 14:24:04 952959 [44808960] -> osm_ucast_mgr_process: null (min-hop) tables configured on all switches
> Mar 20 14:24:05 027280 [41802960] -> SUBNET UP
> Mar 20 14:26:49 795337 [41802960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x0020 TID:0x0000000000000052
> Mar 20 14:26:49 795578 [41802960] -> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x0020 GID:0xfe80000000000000,0x000b8cffff002b41
> Mar 20 14:26:50 096861 [42804960] -> osm_report_notice: Reporting Generic Notice type:3 num:64 from LID:0x001A GID:0xfe80000000000000,0x0002c9020025ae35
> Mar 20 14:26:50 096874 [42804960] -> __osm_state_mgr_report_new_ports: Discovered new port with GUID:0x0002c9020025871d LID range [0x8,0x8] of node:MT25408 ConnectX Mellanox Technologies
> Mar 20 14:26:50 131620 [42804960] -> osm_ucast_mgr_process: null (min-hop) tables configured on all switches
> Mar 20 14:26:50 207641 [43806960] -> SUBNET UP
> Mar 20 14:28:06 751962 [43806960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0008 TID:0x0000000000000000

Looks like there is something going on with your ConnectX ports, or are
these just booting up?
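
One way to tell the two cases apart is to look at how long each port stays
away. Below is a small Python sketch that pairs every "Removed port" line
with the next "Discovered new port" line for the same GUID and reports the
gap in seconds. It assumes the log line layout shown above; the function
name port_outages and the year default are illustrative only (the log
itself carries no year), and %b parsing assumes an English locale.

```python
import re
from datetime import datetime

EVENT_RE = re.compile(
    r"(?P<stamp>[A-Z][a-z]{2} +\d{1,2} \d\d:\d\d:\d\d) \d+ \[[0-9a-fA-F]+\] -> "
    r"\w+: (?P<what>Removed|Discovered new) port with "
    r"GUID:(?P<guid>0x[0-9a-fA-F]+)"
)

def port_outages(lines, year=2008):
    """Return a list of (guid, seconds_away) for each remove/rediscover pair."""
    removed_at = {}   # guid -> time of the most recent removal
    outages = []
    for line in lines:
        m = EVENT_RE.search(line)
        if m is None:
            continue   # not a port add/remove line
        when = datetime.strptime(f"{year} {m.group('stamp')}",
                                 "%Y %b %d %H:%M:%S")
        guid = m.group("guid")
        if m.group("what") == "Removed":
            removed_at[guid] = when
        elif guid in removed_at:
            gap = when - removed_at.pop(guid)
            outages.append((guid, gap.total_seconds()))
    return outages

if __name__ == "__main__":
    import sys
    for guid, secs in port_outages(sys.stdin):
        print(f"{guid} away for {secs:.0f} s")
```

Run over the log quoted above, this pairs the two ConnectX ports with gaps
of 202 and 166 seconds. A gap of a few minutes would fit a node reboot;
gaps of a second or two would point more towards a flapping link.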

-- Hal

> Thanks again for your help,
> Bernd
> 



