[ofa-general] Both opensm's are in SMINFO_STANDBY and none of them claims master

Venkatesh Babu venkatesh.babu at 3leafnetworks.com
Mon May 21 19:23:40 PDT 2007



Hal Rosenstock wrote:

>So there is no link between the 2 switches, right ?
>  
>
 That is right.

>
>Is there anything being done ? Cables pulled and reinserted ? Is
>anything changing or is this a "stable" configuration in terms of the
>topology ?
>  
>
 There were no configuration changes from the cable or switch 
perspective, but nodes were being rebooted.

>Is this the only thing going on on the subnet ?
>  
>
 Only ipoib was loaded, no other standard ULP modules. There was also a 
proprietary ULP module which creates a UD QP, joins the broadcast group, 
discovers nodes, and sets up RC QPs. No traffic was being run.

>So it did finally become master ?
>  
>
 Yes, from /var/log/opensm1.log it looks like it became master, but it 
was not responding to link-local broadcast join operations. The joins 
were failing with -110, Connection timed out.
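
For reference, the -110 above is a negated errno value as returned by the
kernel. A minimal Python sketch to decode such values follows; the function
name is made up for illustration, and the numeric values are
platform-specific (110 is ETIMEDOUT on Linux).

```python
import errno
import os


def describe_kernel_error(ret):
    """Translate a negative kernel-style return value into (name, message)."""
    code = -ret
    # errno.errorcode maps numeric codes to symbolic names on this platform.
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


if __name__ == "__main__":
    # On Linux this is expected to name ETIMEDOUT; other platforms number
    # their errno values differently.
    print(describe_kernel_error(-110))
```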

>I take it LID 6 is local (vortex31-83).
>
>This looks like a pretty old OpenSM. Is it OFED 1.1 or older ? Can you
>try OFED 1.2 ?
>  
>
  It is the released OFED 1.1 stack. I have seen this problem with OFED 
1.0 as well. Trying OFED 1.2 may take much longer, since we would need 
to port our stuff.

>What kernel is being used ? What distro ? What processor architecture ?
>  
>
 Kernel: 2.6.9-22.EL
 Distro: RHEL 4.2
 Processor: Dual Core AMD Opteron(tm) Processor 270 HE

>
>Is this around the time of the error or just an error in the OpenSM log
>? 
>  
>
  The logs were frozen after these error messages; no new entries were 
being written to the log files. After running "sminfo -s3" I saw some 
messages indicating that it moved to the MASTER state, along with other 
messages.

May 21 00:40:28 013290 [41401960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0007 TID:0x0000000000000003
May 21 00:40:28 013431 [41401960] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0007 GID:0xfe80000000000000,0x005045014a2e0001
May 21 00:40:28 818202 [45007960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x100000135b) -- dropping
May 21 00:40:28 819089 [45007960] -> umad_receiver: ERR 5411: DR SMP
May 21 00:40:28 819110 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT)
May 21 00:40:28 819145 [45007960] -> SMP dump:
...
May 21 00:40:28 819247 [41E02960] -> Entering STANDBY state
May 21 14:04:17 204871 [45007960] -> umad_receiver: ERR 5404: recv error on MAD sized umad (Interrupted system call)
May 21 14:06:08 022096 [45007960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x20 trans_id=0x100000264f) -- dropping
May 21 14:06:08 022132 [45007960] -> umad_receiver: ERR 5411: DR SMP
May 21 14:06:08 022145 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT)
May 21 14:06:08 022182 [45007960] -> SMP dump:
...
May 21 14:06:38 035957 [41401960] -> Entering MASTER state
May 21 14:06:38 038818 [42803960] -> osm_subn_set_up_down_min_hop_table: BFS through all port guids in the subnet ]
May 21 14:06:38 038886 [42803960] -> osm_ucast_mgr_process: Min Hop Tables configured on all switches
May 21 14:06:38 046438 [41401960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x000C TID:0x0000000000000ec4
May 21 14:06:38 046565 [41401960] -> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x000C GID:0xfe80000000000000,0x000b8cffff0024f9
May 21 14:06:38 108660 [42803960] -> SUBNET UP
May 21 14:06:38 402900 [41401960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0001 TID:0x0000000000000000
May 21 14:06:38 403007 [41401960] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0001 GID:0xfe80000000000000,0x0002c9020020f5c5
May 21 14:06:38 914806 [45007960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x20 trans_id=0x10000026f0) -- dropping
May 21 14:06:38 914823 [45007960] -> umad_receiver: ERR 5411: DR SMP
May 21 14:06:38 914864 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT)
May 21 14:06:38 914899 [45007960] -> SMP dump:
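
To check a longer log for the same pattern (a long silence followed by a
state change), a minimal Python sketch along these lines may help. It is not
part of OpenSM or OFED. The function names and the one-hour threshold are
made up for illustration, the 6-digit field after the time is assumed to be
microseconds, and the year is taken from the mail date because the log does
not carry one. Only a small subset of the IBA method and SMP attribute IDs
is listed.

```python
import re
from datetime import datetime

# A few lines copied from the excerpt above, with mail wrapping undone.
LOG = """\
May 21 00:40:28 818202 [45007960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x100000135b) -- dropping
May 21 00:40:28 819247 [41E02960] -> Entering STANDBY state
May 21 14:04:17 204871 [45007960] -> umad_receiver: ERR 5404: recv error on MAD sized umad (Interrupted system call)
May 21 14:06:08 022096 [45007960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x20 trans_id=0x100000264f) -- dropping
May 21 14:06:38 035957 [41401960] -> Entering MASTER state
"""

LINE_RE = re.compile(r"^(\w{3} \d{2} \d{2}:\d{2}:\d{2}) (\d{6}) \[(\w+)\] -> (.*)$")
MAD_RE = re.compile(r"method=0x([0-9a-fA-F]+) attr=0x([0-9a-fA-F]+)")

# Subset of IBA management methods and SMP attribute IDs.
METHODS = {0x01: "Get", 0x02: "Set", 0x81: "GetResp"}
ATTRS = {0x11: "NodeInfo", 0x15: "PortInfo", 0x20: "SMInfo"}


def parse(text, year=2007):
    """Return (timestamp, thread_id, message) for every recognised line."""
    entries = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # skip "..." and continuation lines
        stamp, usec, thread, msg = m.groups()
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        entries.append((when.replace(microsecond=int(usec)), thread, msg))
    return entries


def gaps(entries, threshold_s=3600):
    """Return (before, after) timestamp pairs separated by a long silence."""
    found = []
    for (t0, _, _), (t1, _, _) in zip(entries, entries[1:]):
        if (t1 - t0).total_seconds() >= threshold_s:
            found.append((t0, t1))
    return found


def failed_mads(entries):
    """Decode method/attr of MADs reported as 'send completed with error'."""
    decoded = []
    for when, _, msg in entries:
        m = MAD_RE.search(msg)
        if m:
            method, attr = (int(x, 16) for x in m.groups())
            decoded.append((when, METHODS.get(method, hex(method)),
                            ATTRS.get(attr, hex(attr))))
    return decoded


if __name__ == "__main__":
    entries = parse(LOG)
    for before, after in gaps(entries):
        print("no log entries between", before, "and", after)
    for when, method, attr in failed_mads(entries):
        print(when, "timed out:", method, attr)
```

On the sample lines this reports the silence between 00:40:28 and 14:04:17,
and decodes the timed-out sends as Get(NodeInfo) and Get(SMInfo).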

>Did this change from 0 to 1 around the time of the issue with the SM
>mastership ?
>  
>
  Not sure, I just got the snapshot when I saw this problem.

>Also, what are the port counters for the switch ports in use ?
>  
>
[root@vortex3l-83 ~]# ibnetdiscover
ibwarn: [5895] handle_port: NodeInfo on DR path [0][1][9] port 9 failed, skipping port
#
# Topology file: generated on Mon May 21 02:11:34 2007
#
# Max of 2 hops discovered
# Initiated from node 005045014a3a0000 port 005045014a3a0001

vendid=0x2c9
devid=0xb924
sysimgguid=0xb8cffff0024f9
switchguid=0xb8cffff0024f9
Switch  24 "S-000b8cffff0024f9"         # MT47396 Infiniscale-III Mellanox Technologies base port 0 lid 12 lmc 0
[18]    "H-005045014a2e0000"[1]
[11]    "H-0002c902002048b0"[1]
[10]    "H-0002c9020020f584"[1]
[19]    "H-005045014a3a0000"[1]

vendid=0x2c9
devid=0x6282
sysimgguid=0x5045014a2e0003
caguid=0x5045014a2e0000
Ca      2 "H-005045014a2e0000"          # vortex3l-84 HCA-1
[1]     "S-000b8cffff0024f9"[18]                # lid 7 lmc 0

vendid=0x2c9
devid=0x6282
sysimgguid=0x2c902002048b3
caguid=0x2c902002048b0
Ca      2 "H-0002c902002048b0"          # MT25218 InfiniHostEx Mellanox Technologies
[1]     "S-000b8cffff0024f9"[11]                # lid 5 lmc 0

vendid=0x2c9
devid=0x6282
sysimgguid=0x2c9020020f587
caguid=0x2c9020020f584
Ca      2 "H-0002c9020020f584"          # MT25218 InfiniHostEx Mellanox Technologies
[1]     "S-000b8cffff0024f9"[10]                # lid 8 lmc 0

vendid=0x2c9
devid=0x6282
sysimgguid=0x5045014a3a0003
caguid=0x5045014a3a0000
Ca      2 "H-005045014a3a0000"          # vortex3l-83 HCA-1
[1]     "S-000b8cffff0024f9"[19]                # lid 6 lmc 0
[root@vortex3l-83 ~]#
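
To match the LIDs in the OpenSM log against hosts, the topology above can be
turned into a switch-port table. The following is a rough Python sketch under
a few assumptions: the regular expressions were written against this output
only and may not cover other ibnetdiscover versions, and the function names
are made up for illustration.

```python
import re

# Node and port lines copied from the output above (vendid/guid lines omitted).
SAMPLE = '''\
Switch  24 "S-000b8cffff0024f9"         # MT47396 Infiniscale-III Mellanox Technologies base port 0 lid 12 lmc 0
[18]    "H-005045014a2e0000"[1]
[11]    "H-0002c902002048b0"[1]
[10]    "H-0002c9020020f584"[1]
[19]    "H-005045014a3a0000"[1]

Ca      2 "H-005045014a2e0000"          # vortex3l-84 HCA-1
[1]     "S-000b8cffff0024f9"[18]                # lid 7 lmc 0

Ca      2 "H-0002c902002048b0"          # MT25218 InfiniHostEx Mellanox Technologies
[1]     "S-000b8cffff0024f9"[11]                # lid 5 lmc 0

Ca      2 "H-0002c9020020f584"          # MT25218 InfiniHostEx Mellanox Technologies
[1]     "S-000b8cffff0024f9"[10]                # lid 8 lmc 0

Ca      2 "H-005045014a3a0000"          # vortex3l-83 HCA-1
[1]     "S-000b8cffff0024f9"[19]                # lid 6 lmc 0
'''

NODE_RE = re.compile(r'^(Switch|Ca)\s+(\d+)\s+"([^"]+)"\s*(?:#\s*(.*))?$')
PORT_RE = re.compile(r'^\[(\d+)\]\s+"([^"]+)"\[(\d+)\]\s*(?:#\s*(.*))?$')
LID_RE = re.compile(r'\blid (\d+)\b')


def parse_topology(text):
    """Return {node_id: {"kind", "nports", "desc", "links"}}."""
    nodes = {}
    current = None
    for line in text.splitlines():
        m = NODE_RE.match(line)
        if m:
            kind, nports, node_id, desc = m.groups()
            current = {"kind": kind, "nports": int(nports),
                       "desc": (desc or "").strip(), "links": []}
            nodes[node_id] = current
            continue
        m = PORT_RE.match(line)
        if m and current is not None:
            port, peer, peer_port, comment = m.groups()
            lid = LID_RE.search(comment or "")
            current["links"].append({
                "port": int(port),
                "peer": peer,
                "peer_port": int(peer_port),
                # On Ca port lines the trailing comment carries the port LID.
                "lid": int(lid.group(1)) if lid else None,
            })
    return nodes


def hosts_by_switch_port(nodes):
    """Map (switch_id, switch_port) -> (lid, ca_id, description)."""
    table = {}
    for node_id, node in nodes.items():
        if node["kind"] != "Ca":
            continue
        for link in node["links"]:
            table[(link["peer"], link["peer_port"])] = (
                link["lid"], node_id, node["desc"])
    return table


if __name__ == "__main__":
    for (switch, port), (lid, ca, desc) in sorted(
            hosts_by_switch_port(parse_topology(SAMPLE)).items()):
        print(f"{switch} port {port:2d}: lid {lid} {ca} ({desc})")
```

With this output, LID 6 maps to vortex3l-83 on switch port 19 and LID 7 to
vortex3l-84 on switch port 18.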

>Perhaps later; not just yet.
>  
>
>Are they all the same ?
>  
>
  They are more or less the same. All of them have 9 threads, and each 
thread is blocked waiting for some event.

 VBabu


