[ofa-general] RcvSwRelayErrors
Bernd Schubert
bs at q-leap.de
Thu Mar 20 08:10:13 PDT 2008
On Thursday 20 March 2008 15:41:40 Hal Rosenstock wrote:
> On Thu, 2008-03-20 at 15:33 +0100, Bernd Schubert wrote:
> > On Thursday 20 March 2008 15:29:35 Hal Rosenstock wrote:
> > > On Thu, 2008-03-20 at 15:27 +0100, Bernd Schubert wrote:
> > > > On Thursday 20 March 2008 15:12:03 Hal Rosenstock wrote:
> > > > > On Thu, 2008-03-20 at 13:54 +0100, Bernd Schubert wrote:
> > > > > > On Thursday 20 March 2008 13:27:36 Hal Rosenstock wrote:
> > > > > > > On Thu, 2008-03-20 at 12:30 +0100, Bernd Schubert wrote:
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > on one of our systems we get a rather huge number of
> > > > > > > > RcvSwRelayErrors. All I find about RcvSwRelayErrors is
> > > > > > > >
> > > > > > > > "This counter can increase due to a valid network event"
> > > > > > > >
> > > > > > > > But what might cause?
> > > > > >
> > > > > > Ooops. This should have been "But what might cause it?"
> > > > > >
> > > > > > > Are you running IB multicast (e.g. IPoIB) ? That's the most
> > > > > > > common cause.
> > > > > >
> > > > > > IPoIB is up, but so far it is only used by lustre for the initial
> > > > > > lnet o2ib setup and, AFAIK, not afterwards. I think some MPI
> > > > > > stacks/applications also do their initial connection using IPoIB.
> > > > > >
> > > > > > But in general, once these connections are established, IPoIB is
> > > > > > not much used anymore.
> > > > >
> > > > > The causes are:
> > > > > 1. DLID mapping
> > > > > 2. VL mapping
> > > > > 3. looping (out port = in port)
> > > > >
> > > > > Is your subnet unstable in some way ? Are you using QoS ?
> > > >
> > > > We have seen some odd problems with opensm (from ofed-1.2.5) in the
> > > > past, and once only rebooting the switches helped.
> > >
> > > You might want to update OpenSM to OFED 1.3 version.
> >
> > I won't manage to build new debian packages today, but I will do so over
> > Easter. I hope to also find the time to clean up the debian rules a bit,
> > to have it officially included in Debian.
> >
> > But will a new opensm help for these errors?
>
> Perhaps; but not knowing more about the cause it's hard to say. It might
> be interesting to see if there are any errors in your OpenSM log.
Well, these opensm logs are a big mystery to me. I haven't the slightest
idea what it is trying to tell me with this:
osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x0020 GID:0xfe80000000000000,0x000b8cffff002b41
Can I find some documentation somewhere, or does only reading the source help?
Here are the logs from the last day:
Mar 19 17:20:53 463683 [44007960] -> SUBNET UP
Mar 20 10:22:27 864281 [44007960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x002E TID:0x000000000000001f
Mar 20 10:22:27 864533 [44007960] -> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x002E GID:0xfe80000000000000,0x000b8cffff002b50
Mar 20 10:22:28 153211 [42003960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x001A GID:0xfe80000000000000,0x0002c9020025ae35
Mar 20 10:22:28 153231 [42003960] -> __osm_drop_mgr_remove_port: Removed port with GUID:0x0002c902002587c6 LID range [0xF7,0xF7] of node:MT25408 ConnectX Mellanox Technologies
Mar 20 10:22:28 192987 [42003960] -> osm_ucast_mgr_process: null (min-hop) tables configured on all switches
Mar 20 10:22:28 270978 [44007960] -> SUBNET UP
Mar 20 10:25:50 333350 [44007960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x002E TID:0x0000000000000020
Mar 20 10:25:50 333579 [44007960] -> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x002E GID:0xfe80000000000000,0x000b8cffff002b50
Mar 20 10:25:50 644817 [42003960] -> osm_report_notice: Reporting Generic Notice type:3 num:64 from LID:0x001A GID:0xfe80000000000000,0x0002c9020025ae35
Mar 20 10:25:50 644840 [42003960] -> __osm_state_mgr_report_new_ports: Discovered new port with GUID:0x0002c902002587c6 LID range [0xF7,0xF7] of node:MT25408 ConnectX Mellanox Technologies
Mar 20 10:25:50 679661 [42003960] -> osm_ucast_mgr_process: null (min-hop) tables configured on all switches
Mar 20 10:25:50 755437 [41001960] -> SUBNET UP
Mar 20 14:24:04 611501 [42003960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x0020 TID:0x0000000000000051
Mar 20 14:24:04 611713 [42003960] -> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x0020 GID:0xfe80000000000000,0x000b8cffff002b41
Mar 20 14:24:04 913422 [44808960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x001A GID:0xfe80000000000000,0x0002c9020025ae35
Mar 20 14:24:04 913444 [44808960] -> __osm_drop_mgr_remove_port: Removed port with GUID:0x0002c9020025871d LID range [0x8,0x8] of node:MT25408 ConnectX Mellanox Technologies
Mar 20 14:24:04 952959 [44808960] -> osm_ucast_mgr_process: null (min-hop) tables configured on all switches
Mar 20 14:24:05 027280 [41802960] -> SUBNET UP
Mar 20 14:26:49 795337 [41802960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x0020 TID:0x0000000000000052
Mar 20 14:26:49 795578 [41802960] -> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x0020 GID:0xfe80000000000000,0x000b8cffff002b41
Mar 20 14:26:50 096861 [42804960] -> osm_report_notice: Reporting Generic Notice type:3 num:64 from LID:0x001A GID:0xfe80000000000000,0x0002c9020025ae35
Mar 20 14:26:50 096874 [42804960] -> __osm_state_mgr_report_new_ports: Discovered new port with GUID:0x0002c9020025871d LID range [0x8,0x8] of node:MT25408 ConnectX Mellanox Technologies
Mar 20 14:26:50 131620 [42804960] -> osm_ucast_mgr_process: null (min-hop) tables configured on all switches
Mar 20 14:26:50 207641 [43806960] -> SUBNET UP
Mar 20 14:28:06 751962 [43806960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0008 TID:0x0000000000000000
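To make the excerpt above easier to read, here is a minimal Python sketch that pairs each "Removed port" line with the following "Discovered new port" line for the same GUID and reports how long the port was gone. It is only a sketch under assumptions: the regular expression is derived from the lines shown here, not from any OpenSM format description; the log carries no year, so the `year` parameter is my assumption; and the numeric field after the timestamp is ignored. The name `port_outages` is made up for illustration.

```python
import re
from datetime import datetime

# Matches only the two event types of interest, in the line layout shown
# in the excerpt above (assumed, not taken from OpenSM documentation).
EVENT_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d\d:\d\d:\d\d) \d+ \[[0-9a-fA-F]+\] -> "
    r"(?P<func>__osm_drop_mgr_remove_port|__osm_state_mgr_report_new_ports): "
    r".*?GUID:(?P<guid>0x[0-9a-fA-F]+)"
)


def port_outages(lines, year=2008):
    """Return (guid, removed_at, rediscovered_at, seconds) per port flap."""
    removed = {}   # GUID -> time the port was removed
    outages = []
    for line in lines:
        m = EVENT_RE.match(line.strip())
        if not m:
            continue  # traps, SUBNET UP, routing messages, etc.
        # The log has no year, so one is supplied by the caller.
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        guid = m["guid"]
        if m["func"] == "__osm_drop_mgr_remove_port":
            removed[guid] = ts
        elif guid in removed:
            start = removed.pop(guid)
            seconds = int((ts - start).total_seconds())
            outages.append((guid, start, ts, seconds))
    return outages
```

Fed the lines above (for example via `port_outages(open(path))`, where `path` is wherever the OpenSM log lives on your system), it should list one outage for each of the two ConnectX ports. Ports that were removed and never came back are not reported.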
Thanks again for your help,
Bernd
--
Bernd Schubert
Q-Leap Networks GmbH