[openib-general] Re: OpenSM unable to bring up subnet

Hal Rosenstock halr at voltaire.com
Mon Nov 7 18:30:53 PST 2005


On Mon, 2005-11-07 at 21:05, Sayantan Sur wrote:
> Hi,
>
> I am using OpenSM (svn rev 3984 and with 3882). It is unable to bring up
> the subnet and "hangs". This behavior is observed when machines are
> connected back-to-back as well as with any switch. My kernel version is
> 2.6.13.1, machines are Opteron (on Tyan S295 motherboard). I have
> included the log file. Maybe someone can tell if I am doing anything wrong?

Is the InfiniBand support the one shipped with 2.6.13.1, or has it been
replaced with OpenIB svn at the revs indicated (or do those revs apply
only to OpenSM)? If only OpenSM is from svn, I would recommend updating
at least user_mad.c, as a number of problems have been fixed in it.
There will be some backport issues with 2.6.13.1 to deal with, but they
have all been discussed on the list.

> [surs at ro0:~] lsmod | grep ^ib
> ib_ucm                 22280  0
> ib_cm                  37616  1 ib_ucm
> ib_uverbs              40984  0
> ib_umad                17824  2
> ib_mthca              124320  0
> ib_mad                 42660  3 ib_cm,ib_umad,ib_mthca
> ib_core                56320  6 ib_ucm,ib_cm,ib_uverbs,ib_umad,ib_mthca,ib_mad
>
> [surs at ro0:tmp] ls -l /dev/infiniband/
> total 0
> crw-rw----  1 root root 231,  64 2005-11-08 02:23 issm0
> crw-rw----  1 root root 231,  65 2005-11-08 02:23 issm1
> crw-rw-rw-  1 root root 231, 224 2005-11-08 02:23 ucm0
> crw-rw----  1 root root 231,   0 2005-11-08 02:23 umad0
> crw-rw----  1 root root 231,   1 2005-11-08 02:23 umad1
> crw-rw-rw-  1 root root 231, 192 2005-11-08 02:23 uverbs0
>
>
> <====

Was opensm started with -V?

> Nov 08 02:59:33 576837 [AB454D00] -> OpenSM Rev:openib-1.1.0
> Nov 08 02:59:33 576979 [0000] -> OpenSM Rev:openib-1.1.0
>
> Nov 08 02:59:33 577953 [AB454D00] -> osm_report_notice: Reporting
> Generic Notice type:3 num:66 from LID:0x0000
> GID:0xfe80000000000000,0x0000000000000000
> Nov 08 02:59:33 578017 [AB454D00] -> osm_report_notice: Reporting
> Generic Notice type:3 num:66 from LID:0x0000
> GID:0xfe80000000000000,0x0000000000000000
> Nov 08 02:59:33 581289 [AB454D00] -> osm_vendor_get_all_port_attr:
> assign CA mthca0 port 1 guid (0x2c902004002e9) as the default port.
> Nov 08 02:59:33 581326 [AB454D00] -> osm_vendor_bind: Binding to port
> 0x2c902004002e9.
> Nov 08 02:59:33 583680 [AB454D00] -> osm_vendor_bind: Binding to port
> 0x2c902004002e9.
> Nov 08 02:59:33 987191 [40C05960] -> umad_receiver: ERR 5409: send
> completed with error (method=0x1 attr=0x11 trans_id=0x1234) -- dropping.
> Nov 08 02:59:33 987227 [40C05960] -> umad_receiver: ERR 5411: DR SMP hop
> ptr 0 hop count 0 DR SLID 0x0 DR DLID 0x0
> Nov 08 02:59:33 987243 [40C05960] -> __osm_sm_mad_ctrl_send_err_cb: ERR
> 3113: MAD completed in error (IB_TIMEOUT).
> Nov 08 02:59:33 987303 [40C05960] -> SMP dump:
>                                 base_ver................0x1
>                                 mgmt_class..............0x81
>                                 class_ver...............0x1
>                                 method..................0x1 (SubnGet)
>                                 D bit...................0x0
>                                 status..................0x0
>                                 hop_ptr.................0x0
>                                 hop_count...............0x0
>                                 trans_id................0x1234
>                                 attr_id.................0x11 (NodeInfo)
>                                 resv....................0x0
>                                 attr_mod................0x0
>                                 m_key...................0x0000000000000000
>                                 dr_slid.................0xFFFF
>                                 dr_dlid.................0xFFFF
>
>                                 Initial path: [0]
>                                 Return path:  [0]
>                                 Reserved:     [0][0][0][0][0][0][0]
>
>                                 00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
>
>                                 00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
>
>                                 00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
>
>                                 00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
>
> Nov 08 02:59:33 987391 [40401960] -> __osm_state_mgr_is_sm_port_down:
> ERR 3308: SM port GUID unknown.

Since the Gets are timing out, there is no response to the SubnGet of
NodeInfo for the local node, and that response is what sets the SM port
GUID. That is why ERR 3308 (SM port GUID unknown) follows.
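
For anyone reading the SMP dump, here is a rough sketch of how the
header fields decode. The lookup tables hold only a handful of values
from the InfiniBand management spec, and describe_smp is a made-up
helper name for illustration, not part of OpenSM.

```python
# Hypothetical helper, not OpenSM code: names a few well-known SMP
# header values taken from the InfiniBand management specification.
MGMT_CLASSES = {
    0x01: "LID-routed subnet management",
    0x81: "directed-route subnet management",
}
METHODS = {0x01: "Get", 0x02: "Set", 0x81: "GetResp"}
ATTRIBUTES = {
    0x0010: "NodeDescription",
    0x0011: "NodeInfo",
    0x0012: "SwitchInfo",
    0x0015: "PortInfo",
    0x0020: "SMInfo",
}


def describe_smp(mgmt_class, method, attr_id, hop_count):
    """Return readable names for a few SMP header fields."""
    directed = mgmt_class == 0x81
    return {
        "mgmt_class": MGMT_CLASSES.get(mgmt_class, "unknown"),
        "method": METHODS.get(method, "unknown"),
        "attribute": ATTRIBUTES.get(attr_id, "unknown"),
        # A directed-route SMP with hop_count 0 is addressed to the
        # local port and never goes out on the wire.
        "local_only": directed and hop_count == 0,
    }


# Values copied from the SMP dump in the quoted log.
info = describe_smp(mgmt_class=0x81, method=0x1, attr_id=0x11, hop_count=0x0)
print(info)
```

With the values from the dump, this reports a directed-route Get of
NodeInfo that stays on the local port, so a timeout here points at the
local MAD path rather than at the cabling or the switch.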

Anything relevant in dmesg?

-- Hal

> Nov 08 02:59:33 987408 [0000] -> SM port is down.
>
> Nov 08 02:59:33 987485 [40401960] -> __osm_sm_state_mgr_signal_error:
> ERR 3207: Invalid signal OSM_SM_SIGNAL_DISCOVER in state
> IB_SMINFO_STATE_DISCOVERING
> ===>
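
As a convenience when going through a longer log, the ERR lines can be
pulled out with a short script. This is only a sketch: the regular
expression is an assumption based on the line layout seen in the log
above, and the sample lines are the ones quoted there with the email
line wrapping undone.

```python
import re

# Sample lines copied from the quoted log, unwrapped.
LOG = """\
Nov 08 02:59:33 987191 [40C05960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x1234) -- dropping.
Nov 08 02:59:33 987243 [40C05960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT).
Nov 08 02:59:33 987391 [40401960] -> __osm_state_mgr_is_sm_port_down: ERR 3308: SM port GUID unknown.
Nov 08 02:59:33 987408 [0000] -> SM port is down.
"""

# Assumed layout: "[thread id] -> function: ERR code: message"
ERR_LINE = re.compile(r"\[[0-9A-Fa-f]+\] -> (\w+): ERR (\d+): (.*)")

errors = []
for line in LOG.splitlines():
    match = ERR_LINE.search(line)
    if match:
        function, code, message = match.groups()
        errors.append((function, int(code), message))

for function, code, message in errors:
    print(f"{code}  {function}: {message}")
```

Lines without an ERR code, such as "SM port is down.", are skipped.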





