[ofa-general] Re: [PATCHv2] osm: Clearing lid matrices before rebuilding them

Yevgeny Kliteynik kliteyn at dev.mellanox.co.il
Mon Mar 19 07:50:56 PDT 2007


Sasha Khapyorsky wrote:
> On Mon, 2007-03-19 at 13:42 +0200, Yevgeny Kliteynik wrote:
>> Hi Hal,
>>
>> [V2 of the patch]
>>
>> This patch fixes a bug in the lid matrices creation:
>>
>> The lid matrices were not cleared, which caused OSM routing
>> to crash when routing nonexisting (disconnected) lids.
> 
> Where the crash happens? Could you clarify?

In __osm_ucast_mgr_process_neighbor(), there is the following assertion:

        CL_ASSERT( hops <= osm_switch_get_hop_count( p_sw, lid_ho,
                                                     port_num ) );

This assertion fails, since the hop count becomes inconsistent.


> Anyway I agree that there is no point to try to generate routes to
> disconnected lids, so cleanup is ok. Now about the patch itself - I
> would prefer to place min hop tables cleanup in
> osm_switch_prepare_path_rebuild(), just something like this:
> 
> diff --git a/osm/opensm/osm_switch.c b/osm/opensm/osm_switch.c
> index 7c36a54..9273459 100644
> --- a/osm/opensm/osm_switch.c
> +++ b/osm/opensm/osm_switch.c
> @@ -516,6 +516,9 @@ osm_switch_prepare_path_rebuild(
>  
>    for ( i = 0; i < p_sw->num_ports; i++ )
>      osm_port_prof_construct( &p_sw->p_prof[i] );
> +
> +  osm_switch_clear_hops(p_sw);
> +
>    if (!p_sw->hops)
>    {
>      hops = malloc((max_lids + 1)*sizeof(hops[0]));

No problem.
 
>> Please apply to ofed_1_2.
>>
>> I'm not sure about the trunk though.
>> Sasha,
>> Can you please check that you latest improvements to the
>> routing don't have this problem?
> 
> With disconnecting switches should be similar behavior I guess.

Right, I checked it - same problem.

I'll issue new patch.

-- Yevgeny.

> Sasha
> 
>> Thanks. 
>>
>> -- Yevgeny
>>
>> Signed-off-by:  Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
>> ---
>>  osm/opensm/osm_ucast_mgr.c |   20 ++++++++++++++------
>>  1 files changed, 14 insertions(+), 6 deletions(-)
>>
>> diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c
>> index ee6b3f9..8643754 100644
>> --- a/osm/opensm/osm_ucast_mgr.c
>> +++ b/osm/opensm/osm_ucast_mgr.c
>> @@ -1189,6 +1189,7 @@ ucast_mgr_setup_all_switches(osm_subn_t
>>  {
>>    osm_switch_t *p_sw;
>>    uint16_t lids;
>> +  uint16_t i;
>>  
>>    lids = (uint16_t)cl_ptr_vector_get_size(&p_subn->port_lid_tbl);
>>    lids = lids ? lids - 1 : 0;
>> @@ -1196,12 +1197,19 @@ ucast_mgr_setup_all_switches(osm_subn_t
>>    for (p_sw = (osm_switch_t*)cl_qmap_head(&p_subn->sw_guid_tbl);
>>         p_sw != (osm_switch_t*)cl_qmap_end(&p_subn->sw_guid_tbl);
>>         p_sw = (osm_switch_t*)cl_qmap_next(&p_sw->map_item))
>> -  if (osm_switch_prepare_path_rebuild(p_sw, lids)) {
>> -    osm_log(&p_subn->p_osm->log, OSM_LOG_ERROR,
>> -            "ucast_mgr_setup_all_switches: ERR 3A0B: "
>> -            "cannot setup switch 0x%016" PRIx64 "\n",
>> -            cl_ntoh64(osm_node_get_node_guid(p_sw->p_node)));
>> -    return -1;
>> +  {
>> +    if (osm_switch_prepare_path_rebuild(p_sw, lids)) {
>> +      osm_log(&p_subn->p_osm->log, OSM_LOG_ERROR,
>> +              "ucast_mgr_setup_all_switches: ERR 3A0B: "
>> +              "cannot setup switch 0x%016" PRIx64 "\n",
>> +              cl_ntoh64(osm_node_get_node_guid(p_sw->p_node)));
>> +      return -1;
>> +    }
>> +
>> +    /* Clear the LID matrix of the switch */
>> +    for ( i = 0; i < p_sw->num_hops; i++ )
>> +      if (p_sw->hops[i])
>> +        memset(p_sw->hops[i], OSM_NO_PATH, p_sw->num_ports);
>>    }
>>  
>>    return 0;
> 




More information about the general mailing list