[ofa-general] Re: [PATCH] opensm: enforce routing paths rebalancing on switch reconnection

Al Chu chu11 at llnl.gov
Wed Mar 5 09:10:08 PST 2008


On Wed, 2008-03-05 at 10:43 +0000, Sasha Khapyorsky wrote:
> Hi Al,
> 
> On 07:46 Sun 02 Mar     , Albert Chu wrote:
> > 
> > In order to make things work, I also had to add this patch.  Seems like a
> > corner case that needs to be handled since we never fall into
> > __osm_pi_rcv_process_switch_port().
> 
> Hmm, it is strange. After this light sweep cycle OpenSM should continue
> with heavy sweep where __osm_pi_rcv_process_switch_port() should be
> reissued. Do you see any errors during discovery?

I can't restart opensm on that cluster at this time.  I don't recall any
port errors.  However, I do recall seeing this output from
__osm_state_mgr_light_sweep_start():

OSM_LOG(sm->p_log, OSM_LOG_ERROR,
        "ERR 0108: "
        "Unknown remote side for node 0x%016"
        PRIx64
        "(%s) port %u. Adding to light sweep sampling list\n",
        cl_ntoh64(osm_node_get_node_guid
                  (p_node)),
        p_node->print_desc, port_num);

leading to a call to __osm_state_mgr_get_remote_port_info(), leading to
what I fixed in osm_pi_rcv_process().
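
To make that path concrete, here is a toy sketch of the condition I believe triggers the message. It is not the OpenSM code: the toy_* names and types are made up for illustration, and it assumes a port goes on the sampling list when its link is not down but no remote side has been recorded for it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; these are not OpenSM types. */
#define TOY_MAX_PORTS 8

enum toy_link_state { TOY_LINK_DOWN, TOY_LINK_ACTIVE };

struct toy_port {
	enum toy_link_state state;	/* state reported by the local port */
	int remote_known;		/* nonzero once a peer has been recorded */
};

/*
 * Collect the port numbers a light sweep would have to query again:
 * ports whose link is up but whose remote side is not recorded.
 * 'sample' must have room for n_ports entries.
 * Returns the number of entries written to 'sample'.
 */
size_t toy_light_sweep_collect(const struct toy_port *ports, size_t n_ports,
			       unsigned *sample)
{
	size_t n = 0;

	assert(n_ports <= TOY_MAX_PORTS);
	for (size_t i = 0; i < n_ports; i++) {
		/* Assumed condition: link up, remote side unknown. */
		if (ports[i].state == TOY_LINK_ACTIVE && !ports[i].remote_known)
			sample[n++] = (unsigned)i;
	}
	return n;
}
```

In this model, a port that is down is never sampled, so only a port that is up with no recorded peer would produce the "Unknown remote side" case.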

My original assumption was that the remote side for some ports wasn't
known because the remote side ports were down.  Is it possible for opensm
to not know about a remote side even if that remote side port is
up/active?

Al

> Sasha
-- 
Albert Chu
chu11 at llnl.gov
925-422-5311
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory


