[Users] OpenSM error message rosetta stone?

Ira Weiny weiny2 at llnl.gov
Wed Feb 20 18:29:29 PST 2013


On Wed, 20 Feb 2013 11:30:46 -0600
Narayan Desai <narayan.desai at gmail.com> wrote:

> OK, so after resolving the issues caused by two bad nodes at the end
> of those direct routes, I'm getting a few more messages that I'd like
> to be able to interpret:
> 
> Feb 20 09:46:26 428574 [22C64700] 0x02 -> SUBNET UP
> Feb 20 09:46:29 327451 [25C6A700] 0x02 -> log_notice: Reporting
> Generic Notice type:3 num:66 (New mcast group created) from LID:125
> GID:ff12:601b:ffff::1:ff0b:77bd
> Feb 20 09:46:29 327459 [25C6A700] 0x02 -> is_access_permitted: Cannot
> find destination port with LID:351
> Feb 20 09:46:29 327463 [25C6A700] 0x02 -> is_access_permitted: Cannot
> find destination port with LID:352
> Feb 20 09:46:29 331918 [2646B700] 0x01 -> mcmr_rcv_join_mgrp: ERR
> 1B11: method = SubnAdmSet, scope_state = 0x1, component mask =
> 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID:
> ff12:601b:ffff::16 from port 0x0002c903000b77bd (cm2-p)
> Feb 20 09:46:31 219667 [23C66700] 0x01 -> mcmr_rcv_join_mgrp: ERR
> 1B11: method = SubnAdmSet, scope_state = 0x1, component mask =
> 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID:
> ff12:601b:ffff::2 from port 0x0002c903000b77bd (cm2-p)
> Feb 20 09:46:35 228164 [24C68700] 0x01 -> mcmr_rcv_join_mgrp: ERR
> 1B11: method = SubnAdmSet, scope_state = 0x1, component mask =
> 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID:
> ff12:601b:ffff::2 from port 0x0002c903000b77bd (cm2-p)
> Feb 20 09:46:36 483868 [24467700] 0x01 -> mcmr_rcv_join_mgrp: ERR
> 1B11: method = SubnAdmSet, scope_state = 0x1, component mask =
> 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID:
> ff12:601b:ffff::16 from port 0x0002c903000b77bd (cm2-p)
> Feb 20 09:46:39 235990 [25469700] 0x01 -> mcmr_rcv_join_mgrp: ERR
> 1B11: method = SubnAdmSet, scope_state = 0x1, component mask =
> 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID:
> ff12:601b:ffff::2 from port 0x0002c903000b77bd (cm2-p)
> 
> So the SUBNET UP message means that opensm has successfully programmed
> all of the switches in the network, right? How often should I see
> those?

You should see those whenever a change in the subnet has caused OpenSM to reroute.

However, SUBNET UP is printed and reported to plugins even when there were errors programming the fabric.  I believe this to be a bug.  I have spoken with Hal, the new OpenSM maintainer, and he is looking into it because he believes there may be issues with the patch I submitted to change this behaviour...

> 
> What do the is_access_permitted messages mean?

Those LIDs had subscribed for traps from the SM and are apparently gone now, hence the internal error reported when OpenSM attempts to find them.

saquery IIR

should show you the InformInfoRecords with subscriber information.

> 
> And what do the ERR 1B11 messages mean? That was a node that was
> rebooted this morning and seems to be functioning properly.
> thanks again.

1B11 is caused by a port attempting to join a multicast group that has not been created yet.  The proper sequence for group creation is for someone to join with the following component mask bits set.  (This is the "expected comp mask" in your log; the requests from cm2-p carry only 0x10083.)

#define REQUIRED_MC_CREATE_COMP_MASK (IB_MCR_COMPMASK_MGID | \
					IB_MCR_COMPMASK_PORT_GID | \
					IB_MCR_COMPMASK_JOIN_STATE | \
					IB_MCR_COMPMASK_QKEY | \
					IB_MCR_COMPMASK_TCLASS | \
					IB_MCR_COMPMASK_PKEY | \
					IB_MCR_COMPMASK_FLOW | \
					IB_MCR_COMPMASK_SL)
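
To work out which of those bits a rejected join left out, the two masks from the log line can be compared directly.  The following is only a sketch: the MC_* names are local stand-ins, and the bit positions are my assumption of the host-order equivalents of the IB_MCR_COMPMASK_* values in ib_types.h, so check them against your headers.  They do at least OR together to 0x130c7, which matches the expected mask OpenSM prints.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Host-order stand-ins for the IB_MCR_COMPMASK_* bits named above.
 * The bit positions are an assumption; verify against ib_types.h. */
#define MC_MGID       (UINT64_C(1) << 0)
#define MC_PORT_GID   (UINT64_C(1) << 1)
#define MC_QKEY       (UINT64_C(1) << 2)
#define MC_TCLASS     (UINT64_C(1) << 6)
#define MC_PKEY       (UINT64_C(1) << 7)
#define MC_SL         (UINT64_C(1) << 12)
#define MC_FLOW       (UINT64_C(1) << 13)
#define MC_JOIN_STATE (UINT64_C(1) << 16)

#define MC_REQUIRED_CREATE (MC_MGID | MC_PORT_GID | MC_JOIN_STATE | \
                            MC_QKEY | MC_TCLASS | MC_PKEY | \
                            MC_FLOW | MC_SL)

/* Return the create-time bits that are absent from a request's mask. */
uint64_t missing_create_bits(uint64_t comp_mask)
{
    return MC_REQUIRED_CREATE & ~comp_mask;
}

/* Print the name of every missing create-time bit, one per line. */
void print_missing(uint64_t comp_mask)
{
    static const struct {
        uint64_t bit;
        const char *name;
    } names[] = {
        { MC_MGID, "MGID" },         { MC_PORT_GID, "PORT_GID" },
        { MC_QKEY, "QKEY" },         { MC_TCLASS, "TCLASS" },
        { MC_PKEY, "PKEY" },         { MC_SL, "SL" },
        { MC_FLOW, "FLOW" },         { MC_JOIN_STATE, "JOIN_STATE" },
    };
    uint64_t missing = missing_create_bits(comp_mask);

    for (size_t i = 0; i < sizeof(names) / sizeof(names[0]); i++)
        if (missing & names[i].bit)
            printf("missing: %s\n", names[i].name);
}
```

Under those assumptions, print_missing(0x10083) for the mask sent by cm2-p reports QKEY, TCLASS, SL and FLOW as missing.  In other words the request looks like a plain join to an existing group, not a create.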

I have seen this error before.  Later versions of OpenSM can pre-create multicast groups, which you may need.

See the OpenSM PARTITION CONFIGURATION section.  (Mcast groups are per partition, as specified by the pkey.)
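
If it helps to see which partition a group from the log belongs to, the pkey can be read out of the MGID itself.  This is a rough sketch that assumes the groups are IPoIB groups using the RFC 4391 MGID layout (0xff in the first byte, signature 0x401b or 0x601b in bytes 2-3, P_Key in bytes 4-5).  The function name is made up for illustration and is not part of OpenSM or the diags.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <sys/socket.h>

/* Extract the P_Key from an IPoIB multicast MGID given in text form.
 * Assumes the RFC 4391 layout: byte 0 = 0xff, bytes 2-3 = signature
 * (0x401b for IPv4, 0x601b for IPv6), bytes 4-5 = P_Key.
 * Returns 0 on success, -1 if the text is not such an MGID. */
int ipoib_mgid_pkey(const char *mgid_text, uint16_t *pkey)
{
    unsigned char b[16];
    uint16_t sig;

    /* An MGID has the same textual form as an IPv6 address. */
    if (inet_pton(AF_INET6, mgid_text, b) != 1)
        return -1;
    if (b[0] != 0xff)
        return -1;

    sig = (uint16_t)((b[2] << 8) | b[3]);
    if (sig != 0x401b && sig != 0x601b)
        return -1;

    *pkey = (uint16_t)((b[4] << 8) | b[5]);
    return 0;
}
```

For ff12:601b:ffff::16 from the log this yields 0xffff, so under that assumption the partition entry carrying that pkey is where the group would need to be defined.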

Ira

>  -nld
> 
> On Tue, Feb 19, 2013 at 4:17 PM, Ira Weiny <weiny2 at llnl.gov> wrote:
> > On Tue, 19 Feb 2013 14:38:47 -0600
> > Narayan Desai <narayan.desai at gmail.com> wrote:
> >
> >> On Tue, Feb 19, 2013 at 1:03 PM, Ira Weiny <weiny2 at llnl.gov> wrote:
> >>
> >> >> It looks like some lines are being mixed; is this just a lack of a
> >> >> newline, or are the messages interspersed?
> >> >
> >> > Yes there is a bug here.  I submitted a patch but it was rejected because the newline was added as part of another patch.  So, I believe this is fixed in 3.3.16.
> >>
> >> This is just cosmetic, right?
> >
> > yes.
> > Ira
> >
> >>
> >> >>
> >> >> Does the initial path information identify the remote node having
> >> >> troubles? How can I turn that into usable coordinates?
> >> >
> >> > The DR path in this case is the node which the SM _can_ talk to (0,1,19,13 guid 0x0002c902004158b0).  The remote node which is not responding is on port 6 of that node.  Whatever is connected to port 6 is the problem node.
> >> >
> >> > The easiest way to trace this using the diags would be:
> >> >
> >> > iblinkinfo -D 0,1,19,13
> >> > or
> >> > iblinkinfo -G 0x0002c902004158b0
> >> >
> >> > It too will fail to query port 6 but it should give you a better idea of where in the fabric you are by looking at the other nodes connected to other ports...
> >>
> >> Thanks.
> >>  -nld
> >
> >
> > --
> > Ira Weiny
> > Member of Technical Staff
> > Lawrence Livermore National Lab
> > 925-423-8008
> > weiny2 at llnl.gov


-- 
Ira Weiny
Member of Technical Staff
Lawrence Livermore National Lab
925-423-8008
weiny2 at llnl.gov


