[ofa-general] Both opensm's are in SMINFO_STANDBY and none of them claims master
Hal Rosenstock
halr at voltaire.com
Mon May 21 20:45:57 PDT 2007
On Mon, 2007-05-21 at 22:23, Venkatesh Babu wrote:
> Hal Rosenstock wrote:
>
> >So there is no link between the 2 switches, right ?
> >
> >
> That is right.
>
> >
> >Is there anything being done ? Cables pulled and reinserted ? Is
> >anything changing or is this a "stable" configuration in terms of the
> >topology ?
> >
> >
> There were no configuration changes from the cable or switch
> perspective, but nodes were being rebooted.
>
> >Is this the only thing going on on the subnet ?
> >
> >
> There was IPoIB but no other standard ULP modules. There was also a
> proprietary ULP module which creates a UD QP, joins the broadcast
> group, discovers nodes, and sets up RC QPs. There was no traffic being run.
>
> >So it did finally become master ?
> >
> >
> Yes, from the /var/log/opensm1.log it looks like it became master. But
> it was not responding to
> link local broadcast join operations. It was failing with -110,
> Connection timed out.
>
> >I take it LID 6 is local (vortex31-83).
> >
> >This looks like a pretty old OpenSM. Is it OFED 1.1 or older ? Can you
> >try OFED 1.2 ?
> >
> >
> It is the released OFED 1.1 stack. I have seen this problem with OFED 1.0
> also.
> Trying OFED 1.2 may take much longer, since we need to port
> our stuff.
Can you at least use OFED 1.2 management (OpenSM and management
libraries) with the rest being OFED 1.1 ?
There are a number of bugs which have been fixed which might affect
this. The one I can think of off the top of my head is a fix to atomics
in OpenSM's complib. I think that was found and fixed post OFED 1.1.
I'll confirm this tomorrow.
There may also be some important kernel differences (in user_mad.c or
mad.c) which might be relevant.
> >What kernel is being used ? What distro ? What processor architecture ?
> >
> >
> 2.6.9-22.EL RHEL 4.2 Dual Core AMD Opteron(tm) Processor
> 270 HE
>
> >
> >Is this around the time of the error or just an error in the OpenSM log
> >?
> >
> >
> The logs were frozen after these error messages. No new entries were
> being written to the log files.
> After doing "sminfo -s3" I saw some messages indicating that it
> moved to MASTER state, along with other messages.
>
> May 21 00:40:28 013290 [41401960] -> __osm_trap_rcv_process_request:
> Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0007
> TID:0x0000000000000003
> May 21 00:40:28 013431 [41401960] -> osm_report_notice: Reporting
> Generic Notice type:4 num:144 from LID:0x0007
> GID:0xfe80000000000000,0x005045014a2e0001
> May 21 00:40:28 818202 [45007960] -> umad_receiver: ERR 5409: send
> completed with error (method=0x1 attr=0x11 trans_id=0x100000135b) --
> dropping
> May 21 00:40:28 819089 [45007960] -> umad_receiver: ERR 5411: DR SMP
> May 21 00:40:28 819110 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR
> 3113: MAD completed in error (IB_TIMEOUT)
> May 21 00:40:28 819145 [45007960] -> SMP dump:
> ...
> May 21 00:40:28 819247 [41E02960] -> Entering STANDBY state
> May 21 14:04:17 204871 [45007960] -> umad_receiver: ERR 5404: recv error
> on MAD sized umad (Interrupted system call)
> May 21 14:06:08 022096 [45007960] -> umad_receiver: ERR 5409: send
> completed with error (method=0x1 attr=0x20 trans_id=0x100000264f) --
> dropping
> May 21 14:06:08 022132 [45007960] -> umad_receiver: ERR 5411: DR SMP
> May 21 14:06:08 022145 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR
> 3113: MAD completed in error (IB_TIMEOUT)
> May 21 14:06:08 022182 [45007960] -> SMP dump:
> ...
> May 21 14:06:38 035957 [41401960] -> Entering MASTER state
> May 21 14:06:38 038818 [42803960] -> osm_subn_set_up_down_min_hop_table:
> BFS through all port guids in the subnet ]
> May 21 14:06:38 038886 [42803960] -> osm_ucast_mgr_process: Min Hop
> Tables configured on all switches
> May 21 14:06:38 046438 [41401960] -> __osm_trap_rcv_process_request:
> Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x000C
> TID:0x0000000000000ec4
> May 21 14:06:38 046565 [41401960] -> osm_report_notice: Reporting
> Generic Notice type:1 num:128 from LID:0x000C
> GID:0xfe80000000000000,0x000b8cffff0024f9
> May 21 14:06:38 108660 [42803960] -> SUBNET UP
> May 21 14:06:38 402900 [41401960] -> __osm_trap_rcv_process_request:
> Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0001
> TID:0x0000000000000000
> May 21 14:06:38 403007 [41401960] -> osm_report_notice: Reporting
> Generic Notice type:4 num:144 from LID:0x0001
> GID:0xfe80000000000000,0x0002c9020020f5c5
> May 21 14:06:38 914806 [45007960] -> umad_receiver: ERR 5409: send
> completed with error (method=0x1 attr=0x20 trans_id=0x10000026f0) --
> dropping
> May 21 14:06:38 914823 [45007960] -> umad_receiver: ERR 5411: DR SMP
> May 21 14:06:38 914864 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR
> 3113: MAD completed in error (IB_TIMEOUT)
> May 21 14:06:38 914899 [45007960] -> SMP dump:
>
> >Did this change from 0 to 1 around the time of the issue with the SM
> >mastership ?
> >
> >
> Not sure, I just got the snapshot when I saw this problem.
>
> >Also, what are the port counters for the switch ports in use ?
> >
> >
> [root@vortex3l-83 ~]# ibnetdiscover
I was referring to using perfquery, not ibnetdiscover.
> ibwarn: [5895] handle_port: NodeInfo on DR path [0][1][9] port 9 failed,
> skipping port
Was this node rebooting while you did this, or is there some other
issue?
> #
> # Topology file: generated on Mon May 21 02:11:34 2007
> #
> # Max of 2 hops discovered
> # Initiated from node 005045014a3a0000 port 005045014a3a0001
>
> vendid=0x2c9
> devid=0xb924
> sysimgguid=0xb8cffff0024f9
> switchguid=0xb8cffff0024f9
> Switch 24 "S-000b8cffff0024f9" # MT47396 Infiniscale-III Mellanox
> Technologies base port 0 lid 12 lmc 0
> [18] "H-005045014a2e0000"[1]
> [11] "H-0002c902002048b0"[1]
> [10] "H-0002c9020020f584"[1]
> [19] "H-005045014a3a0000"[1]
So run these (before and after):
perfquery 12 18
perfquery 12 11
perfquery 12 10
perfquery 12 19
and
perfquery 12 9
-- Hal
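Editorial sketch of the before/after comparison suggested above: a minimal Python helper that takes two saved perfquery outputs for the same port and reports which counters changed. It assumes the output consists of "Name:....value" lines with dotted padding; the counter names and values in the sample text are illustrative only, not taken from this subnet, and the function names are hypothetical.

```python
import re

# Assumed line shape: "CounterName:.........value" (dotted padding).
LINE_RE = re.compile(r"^(\w+):\.*(\S+)$")


def parse_counters(text):
    """Return {counter_name: int_value} for every line of the assumed shape."""
    counters = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skips headers such as "# Port counters: ..."
        name, raw = m.groups()
        try:
            # Accept both decimal and 0x-prefixed hex values.
            value = int(raw, 16) if raw.lower().startswith("0x") else int(raw)
        except ValueError:
            continue  # non-numeric field, ignore it
        counters[name] = value
    return counters


def changed_counters(before_text, after_text):
    """Return {counter_name: delta} for counters whose value differs."""
    before = parse_counters(before_text)
    after = parse_counters(after_text)
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }


# Illustrative sample text only (hypothetical values).
BEFORE = """# Port counters: Lid 12 port 18
PortSelect:......................18
SymbolErrors:....................0
LinkDowned:......................1
"""

AFTER = """# Port counters: Lid 12 port 18
PortSelect:......................18
SymbolErrors:....................4
LinkDowned:......................2
"""

if __name__ == "__main__":
    for name, delta in sorted(changed_counters(BEFORE, AFTER).items()):
        print(f"{name}: {delta:+d}")
```

In practice the two text blobs would be the saved output of each "perfquery 12 <port>" run, captured once before and once after reproducing the problem; only counters that moved between the two snapshots are reported.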
> vendid=0x2c9
> devid=0x6282
> sysimgguid=0x5045014a2e0003
> caguid=0x5045014a2e0000
> Ca 2 "H-005045014a2e0000" # vortex3l-84 HCA-1
> [1] "S-000b8cffff0024f9"[18] # lid 7 lmc 0
>
> vendid=0x2c9
> devid=0x6282
> sysimgguid=0x2c902002048b3
> caguid=0x2c902002048b0
> Ca 2 "H-0002c902002048b0" # MT25218 InfiniHostEx Mellanox
> Technologies
> [1] "S-000b8cffff0024f9"[11] # lid 5 lmc 0
>
> vendid=0x2c9
> devid=0x6282
> sysimgguid=0x2c9020020f587
> caguid=0x2c9020020f584
> Ca 2 "H-0002c9020020f584" # MT25218 InfiniHostEx Mellanox
> Technologies
> [1] "S-000b8cffff0024f9"[10] # lid 8 lmc 0
>
> vendid=0x2c9
> devid=0x6282
> sysimgguid=0x5045014a3a0003
> caguid=0x5045014a3a0000
> Ca 2 "H-005045014a3a0000" # vortex3l-83 HCA-1
> [1] "S-000b8cffff0024f9"[19] # lid 6 lmc 0
> [root@vortex3l-83 ~]#
>
> >Perhaps later; not just yet.
> >
> >
> >Are they all the same ?
> >
> >
> They are more or less the same. All of them have 9 threads, and each
> thread is blocked waiting for some event.
>
> VBabu