[ofa-general] Re: [PATCH] opensm: enforce routing paths rebalancing on switch reconnection
Al Chu
chu11 at llnl.gov
Wed Mar 5 09:10:08 PST 2008
On Wed, 2008-03-05 at 10:43 +0000, Sasha Khapyorsky wrote:
> Hi Al,
>
> On 07:46 Sun 02 Mar , Albert Chu wrote:
> >
> > In order to make things work, I also had to add this patch. Seems like a
> > corner case that needs to be handled since we never fall into
> > __osm_pi_rcv_process_switch_port().
>
> Hmm, it is strange. After this light sweep cycle OpenSM should continue
> with heavy sweep where __osm_pi_rcv_process_switch_port() should be
> reissued. Do you see any errors during discovery?
I can't restart opensm on that cluster at this time. I don't recall any
port errors. However, I do recall seeing this output from
__osm_state_mgr_light_sweep_start():
    OSM_LOG(sm->p_log, OSM_LOG_ERROR,
            "ERR 0108: "
            "Unknown remote side for node 0x%016"
            PRIx64
            "(%s) port %u. Adding to light sweep sampling list\n",
            cl_ntoh64(osm_node_get_node_guid(p_node)),
            p_node->print_desc, port_num);
This leads to a call to __osm_state_mgr_get_remote_port_info(), which in
turn reaches the code path I fixed in osm_pi_rcv_process().
My original assumption was that the remote side for some ports wasn't
known because the remote side ports were down. Is it possible for opensm
not to know about a remote side even if that remote port is up/active?
Al
> Sasha
--
Albert Chu
chu11 at llnl.gov
925-422-5311
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory