[Users] OpenSM error message rosetta stone?

Narayan Desai narayan.desai at gmail.com
Wed Feb 20 09:30:46 PST 2013


OK, so after resolving the issues caused by two bad nodes at the end
of those direct routes, I'm getting a few more messages that I'd like
to be able to interpret:

Feb 20 09:46:26 428574 [22C64700] 0x02 -> SUBNET UP
Feb 20 09:46:29 327451 [25C6A700] 0x02 -> log_notice: Reporting
Generic Notice type:3 num:66 (New mcast group created) from LID:125
GID:ff12:601b:ffff::1:ff0b:77bd
Feb 20 09:46:29 327459 [25C6A700] 0x02 -> is_access_permitted: Cannot
find destination port with LID:351
Feb 20 09:46:29 327463 [25C6A700] 0x02 -> is_access_permitted: Cannot
find destination port with LID:352
Feb 20 09:46:29 331918 [2646B700] 0x01 -> mcmr_rcv_join_mgrp: ERR
1B11: method = SubnAdmSet, scope_state = 0x1, component mask =
0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID:
ff12:601b:ffff::16 from port 0x0002c903000b77bd (cm2-p)
Feb 20 09:46:31 219667 [23C66700] 0x01 -> mcmr_rcv_join_mgrp: ERR
1B11: method = SubnAdmSet, scope_state = 0x1, component mask =
0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID:
ff12:601b:ffff::2 from port 0x0002c903000b77bd (cm2-p)
Feb 20 09:46:35 228164 [24C68700] 0x01 -> mcmr_rcv_join_mgrp: ERR
1B11: method = SubnAdmSet, scope_state = 0x1, component mask =
0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID:
ff12:601b:ffff::2 from port 0x0002c903000b77bd (cm2-p)
Feb 20 09:46:36 483868 [24467700] 0x01 -> mcmr_rcv_join_mgrp: ERR
1B11: method = SubnAdmSet, scope_state = 0x1, component mask =
0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID:
ff12:601b:ffff::16 from port 0x0002c903000b77bd (cm2-p)
Feb 20 09:46:39 235990 [25469700] 0x01 -> mcmr_rcv_join_mgrp: ERR
1B11: method = SubnAdmSet, scope_state = 0x1, component mask =
0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID:
ff12:601b:ffff::2 from port 0x0002c903000b77bd (cm2-p)

So the SUBNET UP message means that opensm has successfully programmed
all of the switches in the network, right? How often should I see
those?
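
In case it helps anyone else answer the "how often" part from their own logs, here is a minimal sketch that rejoins the wrapped lines and tallies entries by message prefix. The line layout (timestamp, microseconds, thread id, level, "->", message) is only inferred from the excerpt above, and parse() and tally() are names made up for illustration, not part of OpenSM.

```python
import re
from collections import Counter

# Assumed layout, inferred only from the lines quoted above:
# "<Mon> <day> <HH:MM:SS> <usec> [<thread id>] 0x<level> -> <message>"
HEADER = re.compile(
    r"^(?P<stamp>[A-Z][a-z]{2} +\d+ \d{2}:\d{2}:\d{2}) (?P<usec>\d+) "
    r"\[(?P<thread>[0-9A-Fa-f]+)\] 0x(?P<level>[0-9A-Fa-f]+) -> (?P<msg>.*)$"
)

def parse(lines):
    """Rejoin wrapped continuation lines and return one dict per log entry."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        m = HEADER.match(line)
        if m:
            entries.append(m.groupdict())
        elif entries and line.strip():
            # No header: treat as the wrapped tail of the previous entry.
            entries[-1]["msg"] += " " + line.strip()
    return entries

def tally(entries):
    """Count entries by the text before the first ':' in the message."""
    return Counter(e["msg"].split(":", 1)[0] for e in entries)

sample = """\
Feb 20 09:46:26 428574 [22C64700] 0x02 -> SUBNET UP
Feb 20 09:46:29 327459 [25C6A700] 0x02 -> is_access_permitted: Cannot
find destination port with LID:351
Feb 20 09:46:29 331918 [2646B700] 0x01 -> mcmr_rcv_join_mgrp: ERR
1B11: method = SubnAdmSet, scope_state = 0x1, component mask =
0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID:
ff12:601b:ffff::16 from port 0x0002c903000b77bd (cm2-p)
""".splitlines()

entries = parse(sample)
counts = tally(entries)
# Timestamps of every SUBNET UP entry, to see how far apart they are.
subnet_up_times = [e["stamp"] for e in entries if e["msg"] == "SUBNET UP"]
print(counts)
print(subnet_up_times)
```

Feeding it the whole log file instead of the sample (for example, parse(open(path)) with path set to wherever the log lives) would give the SUBNET UP timestamps for a full run.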

What do the is_access_permitted messages mean?

And what do the ERR 1B11 messages mean? That was a node that was
rebooted this morning and seems to be functioning properly.
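
For what it's worth, the two masks in the 1B11 lines can be diffed to see which components the join request left out. This is only a sketch: the bit-to-name table is my assumption, written from memory of the MCMemberRecord component mask layout (the IB_MCR_COMPMASK_* definitions in OpenSM's ib_types.h), so it should be checked against the headers before anyone relies on it.

```python
# Values copied from the ERR 1B11 lines above.
received = 0x0000000000010083
expected = 0x00000000000130C7

# ASSUMPTION: bit positions for the MCMemberRecord component mask, written
# from memory of the IB_MCR_COMPMASK_* definitions; verify against ib_types.h.
COMPONENTS = [
    "MGID", "PortGID", "Q_Key", "MLID", "MTUSelector", "MTU", "TClass",
    "P_Key", "RateSelector", "Rate", "PacketLifeTimeSelector",
    "PacketLifeTime", "SL", "FlowLabel", "HopLimit", "Scope", "JoinState",
]

def names(mask):
    """Return the component names whose bits are set in mask."""
    return [name for bit, name in enumerate(COMPONENTS) if mask >> bit & 1]

# Bits the SM expected but the request did not set.
missing = expected & ~received
print(hex(missing))
print(names(received))
print(names(missing))
```

If that table is right, the request set MGID, PortGID, P_Key and JoinState, and left out Q_Key, TClass, SL and FlowLabel.
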
Thanks again.
 -nld

On Tue, Feb 19, 2013 at 4:17 PM, Ira Weiny <weiny2 at llnl.gov> wrote:
> On Tue, 19 Feb 2013 14:38:47 -0600
> Narayan Desai <narayan.desai at gmail.com> wrote:
>
>> On Tue, Feb 19, 2013 at 1:03 PM, Ira Weiny <weiny2 at llnl.gov> wrote:
>>
>> >> It looks like some lines are being mixed; is this just a lack of a
>> >> newline, or are the messages interspersed?
>> >
>> > Yes, there is a bug here.  I submitted a patch, but it was rejected because the newline had already been added as part of another patch.  So I believe this is fixed in 3.3.16.
>>
>> This is just cosmetic, right?
>
> yes.
> Ira
>
>>
>> >>
>> >> Does the initial path information identify the remote node having
>> >> troubles? How can I turn that into usable coordinates?
>> >
>> > The DR path in this case points to the node that the SM _can_ talk to (0,1,19,13, guid 0x0002c902004158b0).  The remote node that is not responding is on port 6 of that node.  Whatever is connected to port 6 is the problem node.
>> >
>> > The easiest way to trace this using the diags would be:
>> >
>> > iblinkinfo -D 0,1,19,13
>> > or
>> > iblinkinfo -G 0x0002c902004158b0
>> >
>> > It too will fail to query port 6, but it should give you a better idea of where in the fabric you are by showing the nodes connected to the other ports.
>>
>> Thanks.
>>  -nld
>
>
> --
> Ira Weiny
> Member of Technical Staff
> Lawrence Livermore National Lab
> 925-423-8008
> weiny2 at llnl.gov


