[ofa-general] RE: running opensm 3.0.3 on 4000+ node system

Hal Rosenstock hrosenstock at xsigo.com
Wed Apr 9 13:56:04 PDT 2008


On Wed, 2008-04-09 at 13:39 -0700, Hal Rosenstock wrote:
> Hi Christopher,
> 
> On Wed, 2008-04-09 at 13:14 -0600, Maestas, Christopher Daniel wrote:
> > Hello Hal,
> > 
> > -----Original Message-----
> > From: Hal Rosenstock [mailto:hrosenstock at xsigo.com]
> > Sent: Wednesday, April 09, 2008 12:38 PM
> > To: Maestas, Christopher Daniel
> > Cc: general at lists.openfabrics.org
> > Subject: Re: running opensm 3.0.3 on 4000+ node system
> > 
> > On Wed, 2008-04-09 at 12:26 -0600, Maestas, Christopher Daniel wrote:
> > > I'm trying to run opensm on a 4000+ node system,
> > 
> > Which version ? Do you mean 3.0.3 (or 3.0.13) ?
> > 
> > cdm> Version 3.0.13 ... you're right on that
> > # rpm -q opensm
> > opensm-3.0.3-6.el5_1.1
> > ---
> > Apr  9 12:49:53 HOST OpenSM[3295]: /var/log/osm.log log file opened
> > Apr  9 12:49:53 HOST OpenSM[3295]: OpenSM Rev:openib-3.0.13
> > Apr  9 12:49:53 HOST kernel: user_mad: process opensm did not enable P_Key index support.
> > Apr  9 12:49:53 HOST kernel: user_mad:   Documentation/infiniband/user_mad.txt has info on the new ABI.
> > Apr  9 12:49:59 HOST OpenSM[3295]: Entering MASTER state
> > Apr  9 12:50:02 HOST OpenSM[3295]: Errors during initialization
> 
> Your subnet has errors :-(
> 
> > Apr  9 12:50:16 HOST OpenSM[3295]: SUBNET UP
> > Apr  9 12:50:22 HOST kernel: ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
> > Apr  9 12:50:30 HOST OpenSM[3295]: Errors during initialization
> > Apr  9 12:51:05 HOST last message repeated 2 times
> > Apr  9 12:52:17 HOST last message repeated 3 times
> > Apr  9 12:53:27 HOST last message repeated 3 times
> > ...
> > 
> > >  and seem to be having difficulties in keeping the opensm around.
> > > When I attach to the process w/ strace it does:
> > > ---
> > > # strace -p 5921
> > > Process 5921 attached - interrupt to quit restart_syscall(<... resuming interrupted call ...>) = 0
> > > nanosleep({10, 0}, NULL)                = 0
> > > nanosleep({10, 0}, NULL)                = 0
> > > nanosleep({10, 0}, NULL)                = 0
> > > nanosleep({10, 0}, NULL)                = 0
> > > ...
> > > nanosleep({10, 0}, NULL)                = 0
> > > nanosleep({10, 0}, NULL)                = 0
> > > nanosleep({10, 0}, NULL)                = 0
> > > nanosleep({10, 0}, NULL)                = 0
> > > nanosleep({10, 0}, NULL)                = 0
> > > nanosleep({10, 0}, NULL)                = 0
> > > nanosleep({10, 0}, NULL)                = 0
> > > nanosleep({10, 0}, NULL)                = 0
> > > nanosleep({10, 0}, NULL)                = 0
> > > nanosleep({10, 0}, NULL)                = 0
> > > nanosleep({10, 0}, NULL)                = 0
> > > nanosleep({10, 0}, NULL)                = 0
> > > nanosleep({10, 0}, NULL)                = 0
> > > nanosleep({10, 0}, NULL)                = 0
> > > nanosleep({10, 0}, NULL)                = 0
> > > nanosleep({10, 0},  <unfinished ...>
> > > +++ killed by SIGSEGV +++
> > > ---
> > >
> > > I have ofed 1.1 and 1.2 drivers loaded on the system.  I've done this in the past using opensm 3.0.0 svn tag 10188 from ofed 1.0 clients and had no issues before.  Here's how opensm is running:
> > > ---
> > >  6079 pts/0    Sl     0:08 /usr/sbin/opensm -d 3 -maxsmps 0 -s 300 -t 1000 -f /var/log/osm.log -V -g 0
> > > ---
> > >
> > > I have lots of data in the osm.log as you can imagine ... I don't know offhand what I should be looking at/for.
> > 
> > What's towards the end of the log ?
> > 
> > cdm>
> > I rebooted the node ... then brought up ib0, then restarted opensmd ... It died when the log file got this big:
> > # ls -l osm.log -h
> > -rw-r--r-- 1 root root 3.2G Apr  9 13:12 osm.log
> > # tail osm.log
> > Apr 09 13:12:31 439877 [43204940] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x0089 Port 12 TID:0x00000000000032d3
> > Apr 09 13:12:31 440370 [41E02940] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x00D0 Port 3 TID:0x0000000000007480
> > Apr 09 13:12:31 440669 [43204940] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x00B3 Port 7 TID:0x00000000000058dd
> > Apr 09 13:12:31 440987 [41E02940] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x0082 Port 21 TID:0x000000000000285a
> > Apr 09 13:12:31 441228 [43204940] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x00E8 Port 10 TID:0x00000000000095a2
> > Apr 09 13:12:31 441579 [41E02940] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x004A Port 1 TID:0x0000000000010d29
> > Apr 09 13:12:31 441847 [43204940] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x0063 Port 24 TID:0x000000000000e40c
> > Apr 09 13:12:31 442130 [41E02940] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x000A Port 23 TID:0x000000000006fca2
> > Apr 09 13:12:31 442469 [43204940] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x0009 Port 18 TID:0x0000000000059fc4
> > Apr 09 13:12:31 442710 [41E02940] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x0009 Port 17 TID:0x0000000000059fc5
> 
> Those are flow control watchdog errors.

One possible explanation for this: the SM could be configuring
mismatched OperVLs at the two ends of these links. I'm not sure why
it would.
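
To see which links are producing these traps, it may help to count
trap 131 notices per LID and port rather than reading a 3.2G log by
eye. The following is only a rough sketch: it assumes the log lines
look exactly like the tail output quoted above, and the names
count_traps and top_offenders are made up for this example (they are
not part of opensm).

```python
import re
from collections import Counter

# Matches the "Received Generic Notice" lines as they appear in the
# quoted tail of osm.log. The format is assumed from that sample only.
TRAP_RE = re.compile(
    r"Received Generic Notice type:0x[0-9A-Fa-f]+ num:(\d+) "
    r"Producer:\d+ from LID:0x([0-9A-Fa-f]+) Port (\d+)"
)


def count_traps(lines, trap_num=131):
    """Count notices with the given trap number per (LID, port) pair."""
    counts = Counter()
    for line in lines:
        m = TRAP_RE.search(line)
        if m and int(m.group(1)) == trap_num:
            lid = int(m.group(2), 16)   # LID is logged in hex
            port = int(m.group(3))      # port number is logged in decimal
            counts[(lid, port)] += 1
    return counts


def top_offenders(path, limit=20, trap_num=131):
    """Return the noisiest (LID, port) pairs found in the log at path."""
    # Iterating over the file object reads one line at a time, so a
    # multi-gigabyte log is never loaded into memory all at once.
    with open(path, errors="replace") as log:
        return count_traps(log, trap_num).most_common(limit)
```

Calling top_offenders("/var/log/osm.log") should return a list of
((lid, port), count) tuples, busiest first. The ports at the top of
that list are the ones whose OperVLs I would compare against their
link peers first.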

-- Hal

>  Any special opensm options set
> in the option file or are you running with the defaults ?
> 
> -- Hal
