[openib-general] opensm errors with ehca

Hal Rosenstock halr at voltaire.com
Tue Nov 1 18:26:56 PST 2005


On Sun, 2005-10-30 at 18:55, Troy Benjegerdes wrote:
> The firmware on the IBM eHCA causes opensm to spit out these kinds of
> errors all the time..
> 
> Is there a way we can either not send P_KeyTable requests to any eHCA
> guids, or figure out what (if anything) is broken in their firmware?
> 
> Is this a spec violation, or just ambiguities in implementation?
> 
> Oct 30 17:49:46 053820 [43005960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x16 trans_id=0x158c) -- dropping.
> Oct 30 17:49:46 053830 [43005960] -> umad_receiver: ERR 5411: DR SMP hop ptr 0 hop count 2 DR SLID 0x0 DR DLID 0x0
> Oct 30 17:49:46 053839 [43005960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT).
> Oct 30 17:49:46 053861 [43005960] -> SMP dump:
>                                 base_ver................0x1
>                                 mgmt_class..............0x81
>                                 class_ver...............0x1
>                                 method..................0x1 (SubnGet)
>                                 D bit...................0x0
>                                 status..................0x0
>                                 hop_ptr.................0x0
>                                 hop_count...............0x2
>                                 trans_id................0x158c
>                                 attr_id.................0x16 (P_KeyTable)
>                                 resv....................0x0
>                                 attr_mod................0x260000
>                                 m_key...................0x0000000000000000
>                                 dr_slid.................0xFFFF
>                                 dr_dlid.................0xFFFF
> 
>                                 Initial path: [0][1][16]
>                                 Return path:  [0][0][0]
>                                 Reserved:     [0][0][0][0][0][0][0]

Can you try the following opensm patch and see whether it eliminates
those timeout messages?

When requesting the P_KeyTable from a node that is not a switch, this
patch clears the high part of the attribute modifier (the port number
in the upper 16 bits), so that only the block number is sent.
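
To illustrate the encoding the patch changes, here is a minimal standalone
sketch. The helper names are hypothetical and not part of opensm; the
layout (block number in the low 16 bits, port number shifted left by 16)
is taken from the expression in the patch. Values are in host byte order;
the patch itself still applies cl_hton32() before sending.

```c
#include <stdint.h>

/* Hypothetical helper (not an opensm function): build the P_KeyTable
 * attribute modifier in host byte order. For switches the port number
 * goes in the upper 16 bits; for any other node type those bits stay 0. */
static uint32_t pkey_attr_mod(int is_switch, uint8_t port_num,
                              uint16_t block_num)
{
    uint32_t attr_mod_ho = block_num;

    if (is_switch)
        attr_mod_ho |= (uint32_t)port_num << 16;

    return attr_mod_ho;
}

/* Hypothetical helpers: split a host-order attribute modifier back into
 * its upper 16 bits (port field) and lower 16 bits (block number). */
static uint16_t pkey_attr_mod_port(uint32_t attr_mod_ho)
{
    return (uint16_t)(attr_mod_ho >> 16);
}

static uint16_t pkey_attr_mod_block(uint32_t attr_mod_ho)
{
    return (uint16_t)(attr_mod_ho & 0xFFFF);
}
```

Under this layout, the attr_mod of 0x260000 in the SMP dump above splits
into a port field of 0x26 and block 0. With the patch, the same request to
a non-switch node would carry an attr_mod of 0x0.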

-- Hal

Index: osm_port_info_rcv.c
===================================================================
--- osm_port_info_rcv.c	(revision 3906)
+++ osm_port_info_rcv.c	(working copy)
@@ -430,6 +430,7 @@ void osm_pkey_get_tables(
   osm_dr_path_t path;
   uint8_t  port_num;
   uint16_t block_num, max_blocks;
+  uint32_t attr_mod_ho;
   osm_switch_t* p_switch;
 
   OSM_LOG_ENTER( p_log, osm_physp_has_pkey );
@@ -455,7 +456,7 @@ void osm_pkey_get_tables(
   else
   {
     /* This is a switch, and not a management port. The maximum blocks is defined
-       on the switch info partition enforcement cap. */
+       in the switch info partition enforcement cap. */
     p_switch = osm_get_switch_by_guid(p_subn, p_node->node_info.node_guid);
 
     if (! p_switch)
@@ -472,10 +473,14 @@ void osm_pkey_get_tables(
 
   for (block_num = 0 ; block_num < max_blocks  ; block_num++)
   {
+    if (osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH)
+      attr_mod_ho = block_num;
+    else
+      attr_mod_ho = block_num | (port_num << 16);
     status = osm_req_get( p_req,
                           &path,
                           IB_MAD_ATTR_P_KEY_TABLE,
-                          cl_hton32(block_num | (port_num << 16) ),
+                          cl_hton32(attr_mod_ho),
                           CL_DISP_MSGID_NONE,
                           &context );
 





