[ofa-general] Both opensm's are in SMINFO_STANDBY and none of them claims master

Mon May 21 20:45:57 PDT 2007

On Mon, 2007-05-21 at 22:23, Venkatesh Babu wrote:
> Hal Rosenstock wrote:
> 
> >So there is no link between the 2 switches, right ?
> >  
> >
>  That is right.
> 
> >
> >Is there anything being done ? Cables pulled and reinserted ? Is
> >anything changing or is this a "stable" configuration in terms of the
> >topology ?
> >  
> >
>  There were no configuration changes from the cable or switch 
> perspective, but nodes were being rebooted.
> 
> >Is this the only thing going on on the subnet ?
> >  
> >
>  There was IPoIB but no other standard ULP modules. There was also a 
> proprietary ULP module, which creates a UD QP, joins the broadcast
> group, discovers nodes, and sets up RC QPs. No traffic was being run.
> 
> >So it did finally become master ?
> >  
> >
>  Yes, from the /var/log/opensm1.log it looks like it became master. But 
> it was not responding to
> link local broadcast join operations. It was failing with -110, 
> Connection timed out.
> 
> >I take it LID 6 is local (vortex3l-83).
> >
> >This looks like a pretty old OpenSM. Is it OFED 1.1 or older ? Can you
> >try OFED 1.2 ?
> >  
> >
>   It is the released OFED 1.1 stack. I have seen this problem with 
> OFED 1.0 also.
> Trying OFED 1.2 may take much longer, since we need to port 
> our stuff.

Can you at least use OFED 1.2 management (OpenSM and management
libraries) with the rest being OFED 1.1 ?

There are a number of bugs which have been fixed which might affect
this. The one I can think of off the top of my head is a fix to atomics
in OpenSM's complib. I think that was found and fixed post OFED 1.1.
I'll confirm this tomorrow.

There may also be some important kernel differences (in user_mad.c or
mad.c) which might be relevant.

> >What kernel is being used ? What distro ? What processor architecture ?
> >  
> >
>  2.6.9-22.EL     RHEL 4.2           Dual Core AMD Opteron(tm) Processor 
> 270 HE
> 
> >
> >Is this around the time of the error or just an error in the OpenSM log?
> >  
> >
>   The logs were frozen after these error messages. No new entries were 
> being written to the log files.
> After running "sminfo -s3" I saw some messages indicating that it 
> moved to MASTER state, along with other messages.
> 
> May 21 00:40:28 013290 [41401960] -> __osm_trap_rcv_process_request: 
> Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0007 
> TID:0x0000000000000003
> May 21 00:40:28 013431 [41401960] -> osm_report_notice: Reporting 
> Generic Notice type:4 num:144 from LID:0x0007 
> GID:0xfe80000000000000,0x005045014a2e0001
> May 21 00:40:28 818202 [45007960] -> umad_receiver: ERR 5409: send 
> completed with error (method=0x1 attr=0x11 trans_id=0x100000135b) -- 
> dropping
> May 21 00:40:28 819089 [45007960] -> umad_receiver: ERR 5411: DR SMP
> May 21 00:40:28 819110 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 
> 3113: MAD completed in error (IB_TIMEOUT)
> May 21 00:40:28 819145 [45007960] -> SMP dump:
> ...
> May 21 00:40:28 819247 [41E02960] -> Entering STANDBY state
> May 21 14:04:17 204871 [45007960] -> umad_receiver: ERR 5404: recv error 
> on MAD sized umad (Interrupted system call)
> May 21 14:06:08 022096 [45007960] -> umad_receiver: ERR 5409: send 
> completed with error (method=0x1 attr=0x20 trans_id=0x100000264f) -- 
> dropping
> May 21 14:06:08 022132 [45007960] -> umad_receiver: ERR 5411: DR SMP
> May 21 14:06:08 022145 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 
> 3113: MAD completed in error (IB_TIMEOUT)
> May 21 14:06:08 022182 [45007960] -> SMP dump:
> ...
> May 21 14:06:38 035957 [41401960] -> Entering MASTER state
> May 21 14:06:38 038818 [42803960] -> osm_subn_set_up_down_min_hop_table: 
> BFS through all port guids in the subnet ]
> May 21 14:06:38 038886 [42803960] -> osm_ucast_mgr_process: Min Hop 
> Tables configured on all switches
> May 21 14:06:38 046438 [41401960] -> __osm_trap_rcv_process_request: 
> Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x000C 
> TID:0x0000000000000ec4
> May 21 14:06:38 046565 [41401960] -> osm_report_notice: Reporting 
> Generic Notice type:1 num:128 from LID:0x000C 
> GID:0xfe80000000000000,0x000b8cffff0024f9
> May 21 14:06:38 108660 [42803960] -> SUBNET UP
> May 21 14:06:38 402900 [41401960] -> __osm_trap_rcv_process_request: 
> Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0001 
> TID:0x0000000000000000
> May 21 14:06:38 403007 [41401960] -> osm_report_notice: Reporting 
> Generic Notice type:4 num:144 from LID:0x0001 
> GID:0xfe80000000000000,0x0002c9020020f5c5
> May 21 14:06:38 914806 [45007960] -> umad_receiver: ERR 5409: send 
> completed with error (method=0x1 attr=0x20 trans_id=0x10000026f0) -- 
> dropping
> May 21 14:06:38 914823 [45007960] -> umad_receiver: ERR 5411: DR SMP
> May 21 14:06:38 914864 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 
> 3113: MAD completed in error (IB_TIMEOUT)
> May 21 14:06:38 914899 [45007960] -> SMP dump:
> 
> >Did this change from 0 to 1 around the time of the issue with the SM
> >mastership ?
> >  
> >
>   Not sure, I just got the snapshot when I saw this problem.
> 
> >Also, what are the port counters for the switch ports in use ?
> >  
> >
> [root@vortex3l-83 ~]# ibnetdiscover

I was referring to using perfquery, not ibnetdiscover.

> ibwarn: [5895] handle_port: NodeInfo on DR path [0][1][9] port 9 failed, skipping port

Was this node rebooting while you did this, or is there some other
issue?

> #
> # Topology file: generated on Mon May 21 02:11:34 2007
> #
> # Max of 2 hops discovered
> # Initiated from node 005045014a3a0000 port 005045014a3a0001
> 
> vendid=0x2c9
> devid=0xb924
> sysimgguid=0xb8cffff0024f9
> switchguid=0xb8cffff0024f9
> Switch  24 "S-000b8cffff0024f9"         # MT47396 Infiniscale-III Mellanox Technologies base port 0 lid 12 lmc 0
> [18]    "H-005045014a2e0000"[1]
> [11]    "H-0002c902002048b0"[1]
> [10]    "H-0002c9020020f584"[1]
> [19]    "H-005045014a3a0000"[1]

So run these (before and after):
perfquery 12 18
perfquery 12 11
perfquery 12 10
perfquery 12 19

and

perfquery 12 9
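
A minimal sketch of the before/after comparison, in case it helps. It
assumes the output of each perfquery run has been saved to a file (for
example "perfquery 12 18 > before.txt") and that counters are printed one
per line as "Name:.....value". That line format is an assumption about
this perfquery version, and parse_counters/changed_counters are
hypothetical helper names, not part of any OFED tool.

```python
import re

# Assumed line shape: "CounterName:.........value" (dot padding is optional).
_COUNTER_LINE = re.compile(r"^(\w+):\.*(\S+)\s*$")


def parse_counters(text):
    """Return {counter name: integer value} for one saved perfquery output."""
    counters = {}
    for line in text.splitlines():
        match = _COUNTER_LINE.match(line.strip())
        if not match:
            continue  # header, comment, or blank line
        name, raw = match.groups()
        try:
            counters[name] = int(raw, 0)  # accepts decimal and 0x-prefixed hex
        except ValueError:
            continue  # value is not a plain number; ignore it
    return counters


def changed_counters(before_text, after_text):
    """Return {name: (before, after)} for counters whose value differs."""
    before = parse_counters(before_text)
    after = parse_counters(after_text)
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    }
```

Reading the two saved files for a port and passing their contents to
changed_counters() lists only the counters that moved between the two
runs, which should make any error counter that is incrementing easy to
spot.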

-- Hal

> vendid=0x2c9
> devid=0x6282
> sysimgguid=0x5045014a2e0003
> caguid=0x5045014a2e0000
> Ca      2 "H-005045014a2e0000"          # vortex3l-84 HCA-1
> [1]     "S-000b8cffff0024f9"[18]                # lid 7 lmc 0
> 
> vendid=0x2c9
> devid=0x6282
> sysimgguid=0x2c902002048b3
> caguid=0x2c902002048b0
> Ca      2 "H-0002c902002048b0"          # MT25218 InfiniHostEx Mellanox Technologies
> [1]     "S-000b8cffff0024f9"[11]                # lid 5 lmc 0
> 
> vendid=0x2c9
> devid=0x6282
> sysimgguid=0x2c9020020f587
> caguid=0x2c9020020f584
> Ca      2 "H-0002c9020020f584"          # MT25218 InfiniHostEx Mellanox Technologies
> [1]     "S-000b8cffff0024f9"[10]                # lid 8 lmc 0
> 
> vendid=0x2c9
> devid=0x6282
> sysimgguid=0x5045014a3a0003
> caguid=0x5045014a3a0000
> Ca      2 "H-005045014a3a0000"          # vortex3l-83 HCA-1
> [1]     "S-000b8cffff0024f9"[19]                # lid 6 lmc 0
> [root@vortex3l-83 ~]#
> 
> >Perhaps later; not just yet.
> >  
> >
> >Are they all the same ?
> >  
> >
>   They are more or less the same. All of them have 9 threads, and each 
> thread is blocking for some event.
> 
>  VBabu