[Users] OpenSM error message rosetta stone?

Narayan Desai narayan.desai at gmail.com
Tue Feb 19 09:30:38 PST 2013


We're running 3.3.15-0.2 from my PPA. I haven't had time to build the
latest release yet.

With a little more context, we're seeing:
Feb 19 11:27:55 899340 [22C64700] 0x01 -> state_mgr_light_sweep_start:
ERR 3315: Unknown remote side for node 0x0002c902004158b0
(MF0;mlx1-b:IS5600/L13/U1) port 2. Adding to light sweep sampling list
Feb 19 11:27:55 899360 [22C64700] 0x01 -> Directed Path Dump of 3 hop
path: Path = 0,1,19,13Feb 19 11:27:55 899394 [22C64700] 0x01 ->
state_mgr_light_sweep_start: ERR 3315: Unknown remote side for node
0x0002c902004158d8 (MF0;mlx1-b:IS5600/L07/U1) port 6. Adding to light
sweep sampling list
Feb 19 11:27:55 899400 [22C64700] 0x01 -> Directed Path Dump of 3 hop
path: Path = 0,1,19,7Feb 19 11:27:56 706975 [21C62700] 0x01 ->
log_send_error: ERR 5411: DR SMP Send completed with error
(IB_TIMEOUT) -- dropping
                        Method 0x1, Attr 0x15, TID 0x1bc3b6
Feb 19 11:27:56 706999 [21C62700] 0x01 -> Received SMP on a 4 hop
path: Initial path = 0,1,19,13,2, Return path  = 0,0,0,0,0Feb 19
11:27:56 707008 [21C62700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3113:
MAD completed in error (IB_TIMEOUT): SubnGet(PortInfo), attr_mod 0x0,
TID 0x1bc3b6
Feb 19 11:27:56 707012 [21C62700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR
3120 Timeout while getting attribute 0x15 (PortInfo); Possible mis-set
mkey?
Feb 19 11:27:56 707029 [21C62700] 0x01 -> log_send_error: ERR 5411: DR
SMP Send completed with error (IB_TIMEOUT) -- dropping
                        Method 0x1, Attr 0x15, TID 0x1bc3b7
Feb 19 11:27:56 707035 [21C62700] 0x01 -> Received SMP on a 4 hop
path: Initial path = 0,1,19,7,6, Return path  = 0,0,0,0,0Feb 19
11:27:56 707039 [21C62700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3113:
MAD completed in error (IB_TIMEOUT): SubnGet(PortInfo), attr_mod 0x0,
TID 0x1bc3b7
Feb 19 11:27:56 707042 [21C62700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR
3120 Timeout while getting attribute 0x15 (PortInfo); Possible mis-set
mkey?

It looks like some lines are being mixed; is this just a lack of a
newline, or are the messages interspersed?
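
One way to test the missing-newline theory is to undo the mail wrapping and start a new record wherever a log prefix appears. This is a rough Python sketch, not anything shipped with OpenSM; the prefix pattern is an assumption taken only from the lines pasted above, and split_records is a made-up helper name. If every piece comes out as a complete message, the output is consistent with a dropped newline rather than interleaved writes.

```python
import re

# Assumed record prefix, modelled on the pasted lines only, e.g.
# "Feb 19 11:27:55 899340 [22C64700] 0x01 -> "
PREFIX = re.compile(
    r"[A-Z][a-z]{2} \d{1,2} \d{2}:\d{2}:\d{2} \d{6} "
    r"\[[0-9A-Fa-f]+\] 0x[0-9A-Fa-f]{2} -> "
)

def split_records(text):
    """Split pasted log text into one string per log record.

    All runs of whitespace (including the mail client's line wraps)
    are collapsed to single spaces first, so a prefix that was wrapped
    across lines is still recognised. Text before the first prefix is
    dropped; continuation lines stay attached to the preceding record.
    """
    flat = " ".join(text.split())
    starts = [m.start() for m in PREFIX.finditer(flat)]
    ends = starts[1:] + [len(flat)]
    return [flat[a:b].strip() for a, b in zip(starts, ends)]

if __name__ == "__main__":
    sample = (
        "Feb 19 11:27:55 899360 [22C64700] 0x01 -> Directed Path Dump of 3 hop\n"
        "path: Path = 0,1,19,13Feb 19 11:27:55 899394 [22C64700] 0x01 ->\n"
        "state_mgr_light_sweep_start: ERR 3315: Unknown remote side"
    )
    for record in split_records(sample):
        print(record)
```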

Does the initial path information identify the remote node that is
having trouble? How can I turn that into usable coordinates?
 -nld

On Tue, Feb 19, 2013 at 11:10 AM, Ira Weiny <weiny2 at llnl.gov> wrote:
> On Tue, 19 Feb 2013 10:52:16 -0600
> Narayan Desai <narayan.desai at gmail.com> wrote:
>
>> Is there a good guide to decoding opensm error logs?
>>
>> I'm specifically seeing this:
>> Feb 19 10:50:26 667041 [21C62700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR
>> 3120 Timeout while getting attribute 0x15 (PortInfo); Possible mis-set
>> mkey?
>> Feb 19 10:50:26 667057 [21C62700] 0x01 -> log_send_error: ERR 5411: DR
>> SMP Send completed with error (IB_TIMEOUT) -- dropping
>>                         Method 0x1, Attr 0x15, TID 0x1b684f
>>
>> a lot.
>
> What version of OpenSM is this?  Jim Foraker here at LLNL worked on the mkey support, and we just went through fixing an issue similar to the above, but I can't remember the details off the top of my head.
>
>>
>> Also, the timestamp is clear enough, but what do the next 3 fields
>> (667*, [21C6*, and 0x01) mean?
>
> 667* -- microsecond part of the time stamp
> 21C* -- thread id
> 0x01 -- log level
>
> Ira
>
>> thanks.
>>  -nld
>
>
> --
> Ira Weiny
> Member of Technical Staff
> Lawrence Livermore National Lab
> 925-423-8008
> weiny2 at llnl.gov
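
For anyone who wants to sort or count these messages, the prefix fields described in the quoted reply can be pulled apart mechanically. The following is a minimal Python sketch, assuming the layout seen in the lines quoted in this thread (time stamp, six-digit sub-second field, thread id in brackets, hex log level, then "->" and the message). The names Prefix, parse_line and err_code are hypothetical helpers, not part of OpenSM.

```python
import re
from typing import NamedTuple, Optional

class Prefix(NamedTuple):
    timestamp: str   # e.g. "Feb 19 10:50:26"
    subsec: str      # six-digit sub-second field, e.g. "667041"
    thread_id: str   # hex thread id, e.g. "21C62700"
    level: int       # log level, e.g. 0x01
    message: str     # everything after "-> "

# Layout assumed from the log lines quoted in this thread.
LINE = re.compile(
    r"(\w{3} \d{1,2} \d{2}:\d{2}:\d{2}) (\d{6}) "
    r"\[([0-9A-Fa-f]+)\] (0x[0-9A-Fa-f]+) -> (.*)",
    re.S,
)

def parse_line(line: str) -> Optional[Prefix]:
    """Return the prefix fields of one log record, or None if it does not match."""
    m = LINE.match(line)
    if m is None:
        return None
    ts, sub, tid, lvl, msg = m.groups()
    return Prefix(ts, sub, tid, int(lvl, 16), msg)

def err_code(message: str) -> Optional[str]:
    """Pull the four-digit code out of 'ERR 3120 ...' or 'ERR 5411: ...'."""
    m = re.search(r"ERR (\d{4})", message)
    return m.group(1) if m else None

if __name__ == "__main__":
    rec = parse_line(
        "Feb 19 10:50:26 667057 [21C62700] 0x01 -> log_send_error: "
        "ERR 5411: DR SMP Send completed with error (IB_TIMEOUT) -- dropping"
    )
    print(rec.thread_id, rec.level, err_code(rec.message))
```

Grouping parsed records by thread_id or by err_code makes it easier to see which messages belong together when several threads log at once.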


