[openib-general] OpenSM doesn't start on p5 570

Hal Rosenstock halr at voltaire.com
Mon Jan 16 03:31:46 PST 2006


Hi,

On Mon, 2006-01-16 at 05:48, Andrey Slepuhin wrote:
> Dear folks,
> 
> I have a problem starting opensm on a p5 570 machine.

Is this the first time you have tried this on a p5 machine?

>  The following messages
> appear in the opensm log file:
> 
> ******************************************************************
> ******************** INITIATING HEAVY SWEEP **********************
> ******************************************************************
> 
> 
> Jan 16 13:30:55 737114 [40018DC0] -> osm_req_get: [
> Jan 16 13:30:55 737130 [40018DC0] -> osm_mad_pool_get: [
> Jan 16 13:30:55 737147 [40018DC0] -> osm_vendor_get: [
> Jan 16 13:30:55 737161 [40018DC0] -> osm_vendor_get: Acquiring UMAD for p_madw = 0x100747dc, size = 256
> Jan 16 13:30:55 737176 [40018DC0] -> osm_vendor_get: Acquired UMAD 0x1008ee40, size = 256
> Jan 16 13:30:55 737192 [40018DC0] -> osm_vendor_get: ]
> Jan 16 13:30:55 737208 [40018DC0] -> osm_mad_pool_get: Acquired p_madw = 0x100747d0, p_mad = 0x1008ee78, size = 256
> Jan 16 13:30:55 737223 [40018DC0] -> osm_mad_pool_get: ]
> Jan 16 13:30:55 737238 [40018DC0] -> osm_req_get: Getting NodeInfo (0x11), modifier = 0x0, TID = 0x1234
> Jan 16 13:30:55 737255 [40018DC0] -> osm_vl15_post: [
> Jan 16 13:30:55 737269 [40018DC0] -> osm_vl15_post: Posting p_madw = 0x0x100747d0
> Jan 16 13:30:55 737284 [40018DC0] -> osm_vl15_post: 0 QP0 MADs on wire, 1 QP0 MADs outstanding
> Jan 16 13:30:55 737299 [40018DC0] -> osm_vl15_poll: [
> Jan 16 13:30:55 737313 [40018DC0] -> osm_vl15_poll: Signalling poller thread
> Jan 16 13:30:55 737334 [40018DC0] -> osm_vl15_poll: ]
> Jan 16 13:30:55 737338 [42827B20] -> __osm_vl15_poller: Servicing p_madw = 0x100747d0
> Jan 16 13:30:55 737352 [40018DC0] -> osm_vl15_post: ]
> Jan 16 13:30:55 737388 [40018DC0] -> osm_req_get: ]
> Jan 16 13:30:55 737404 [40018DC0] -> __osm_state_mgr_sweep_hop_0: ]
> Jan 16 13:30:55 737420 [40018DC0] -> osm_state_mgr_process: ]
> Jan 16 13:30:55 737436 [40018DC0] -> osm_sm_sweep: ]
> Jan 16 13:30:55 737464 [42827B20] -> SMP dump:
>                                 base_ver................0x1
>                                 mgmt_class..............0x81
>                                 class_ver...............0x1
>                                 method..................0x1 (SubnGet)
>                                 D bit...................0x0
>                                 status..................0x0
>                                 hop_ptr.................0x0
>                                 hop_count...............0x0
>                                 trans_id................0x1234
>                                 attr_id.................0x11 (NodeInfo)
>                                 resv....................0x0
>                                 attr_mod................0x0
>                                 m_key...................0x0000000000000000
>                                 dr_slid.................0xFFFF
>                                 dr_dlid.................0xFFFF
> 
>                                 Initial path: [0]
>                                 Return path:  [0]
>                                 Reserved:     [0][0][0][0][0][0][0]
> 
>                                 00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
> 
>                                 00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
> 
>                                 00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
> 
>                                 00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
> 
> Jan 16 13:30:55 737604 [42827B20] -> osm_vendor_send: [
> Jan 16 13:30:55 737742 [42827B20] -> osm_vendor_send: Completed Sending Request p_madw = 0x100747d0
> Jan 16 13:30:55 737761 [42827B20] -> osm_vendor_send: ]
> Jan 16 13:30:55 737768 [43027B20] -> osm_mad_pool_get: [
> Jan 16 13:30:55 737784 [42827B20] -> __osm_vl15_poller: 1 QP0 MADs on wire, 1 outstanding, 0 unicasts sent, 1 total sent
> Jan 16 13:30:55 737812 [43027B20] -> osm_vendor_get: [
> Jan 16 13:30:55 737848 [43027B20] -> osm_vendor_get: Acquiring UMAD for p_madw = 0x10074724, size = 256
> Jan 16 13:30:55 737866 [43027B20] -> osm_vendor_get: Acquired UMAD 0x1008ef80, size = 256
> Jan 16 13:30:55 737883 [43027B20] -> osm_vendor_get: ]
> Jan 16 13:30:55 737897 [43027B20] -> osm_mad_pool_get: Acquired p_madw = 0x10074718, p_mad = 0x1008efb8, size = 256
> Jan 16 13:30:55 737915 [43027B20] -> osm_mad_pool_get: ]
> Jan 16 13:30:55 737939 [43027B20] -> umad_receiver: ERR 5413: Failed to obtain request madw for received MAD(method=0x81 attr=0x11) -- dropping

This means that no matching transaction was found in the transaction
match table. This may be an endian problem with the tid.

Can you validate the tids (print them out) in both get_madw and put_madw
in osm_vendor_ibumad.c? Since this happens early on, there shouldn't be
too many of them. Thanks.
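
To make the suspicion concrete, here is a minimal sketch. It is not taken
from OpenSM: the helper names are hypothetical, and the truncation of the
tid to a 32-bit key is only an assumed example of how a comparison can
work on a little-endian host and fail on a big-endian one such as the p5.
The tid value 0x1234 is the one from the SMP dump above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Non-portable (hypothetical): use the first 4 bytes in memory of a
 * host-order 64-bit tid as the lookup key. On a little-endian host this
 * yields the low 32 bits; on a big-endian host it yields the high 32 bits.
 */
static uint32_t tid_key_first_bytes(uint64_t tid)
{
    uint32_t key;
    memcpy(&key, &tid, sizeof key);
    return key;
}

/* Portable (hypothetical): always use the low 32 bits of the value. */
static uint32_t tid_key_low32(uint64_t tid)
{
    return (uint32_t)(tid & 0xffffffffu);
}

/*
 * Debug print of the kind requested above: show the full tid and both
 * keys so that the values seen on the send and receive paths can be
 * compared by eye.
 */
static void tid_dump(const char *where, uint64_t tid)
{
    assert(where != NULL);
    printf("%s: tid=0x%016llx first_bytes_key=0x%08x low32_key=0x%08x\n",
           where, (unsigned long long)tid,
           (unsigned)tid_key_first_bytes(tid),
           (unsigned)tid_key_low32(tid));
}
```

Calling tid_dump("get_madw", 0x1234) and tid_dump("put_madw", 0x1234) on
both kinds of host would show whether the two keys agree. If they differ
on the p5 only, a mismatch of this sort is a plausible cause, though the
real code may fail in a different way.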

> Jan 16 13:30:55 737960 [43027B20] -> osm_mad_pool_put: [
> Jan 16 13:30:55 737975 [43027B20] -> osm_mad_pool_put: Releasing p_madw = 0x10074718, p_mad = 0x1008ed00
> Jan 16 13:30:55 737993 [43027B20] -> osm_vendor_put: [
> Jan 16 13:30:55 738008 [43027B20] -> osm_vendor_put: Retiring UMAD 0x1008ecc8
> Jan 16 13:30:55 738026 [43027B20] -> osm_vendor_put: ]
> Jan 16 13:30:55 738041 [43027B20] -> osm_mad_pool_put: ]
> 
> 
> My configuration consists of two 23108 HCAs directly connected without a switch,
> firmware is 3.3.3, kernel is 2.6.15-4 from OpenSUSE repository, userspace
> revision is 4978.

Are the two HCAs on separate machines?

-- Hal

> Any help will be much appreciated
> 
> Best regards,
> Andrey
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



