[openib-general] Re: [PATCH] osm: osm_vendor_umad osm_vendor_get_all_port_attr bug

Hal Rosenstock halr at voltaire.com
Tue Aug 30 07:46:29 PDT 2005


Hi Eitan,

On Sun, 2005-08-28 at 10:43, Eitan Zahavi wrote: 
> I agree that the index 0 of the guid,lids and the new linkstates arrays
> should be reserved for the default port. In the loop the index j is used
> to loop over all ports 0 .. N of the HCA's. It is clear that for HCA's
> port 0 will be skipped. However, since the current code does not advance
> the lid and linkstate accordingly the place for the port 0 will not be
> kept empty for the port 0.
> 
> Current code:
> for (j = 0; j <= ca.numports; j++) {
>    if (ca.ports[j]) {
>       *p_lid = ca.ports[j]->base_lid;
>       *p_linkstates = ca.ports[j]->state;
>       p_lid++;
>       p_linkstates++;
>    }
> }
> Should be:
> for (j = 0; j <= ca.numports; j++) {
>    if (ca.ports[j]) {
>       *p_lid = ca.ports[j]->base_lid;
>       *p_linkstates = ca.ports[j]->state;
>    }
>    /* as j advance even if the port is not valid, so should the
>       lid and state pointer */
>    p_lid++;
>    p_linkstates++;
> }
> 
> As I could not convince you with the above explanations in my previous
> mail I have written the following simple program to test the pre-and
> post patch effect:
> 
> /*
>    test program for dumping osm_vendor_get_all_port_attr results
> */
> 
> #include "stdio.h"
> #include <stdlib.h>
> #include <unistd.h>
> #include <vendor/osm_vendor_api.h>
> #include <opensm/osm_opensm.h>
> 
> #include <common.h>
> #define GUID_ARRAY_SIZE 64
> int
> main() {
>    osm_vendor_t vendor;
>    osm_log_t osm_log;
>    ib_api_status_t status;
>    uint32_t num_ports = GUID_ARRAY_SIZE;
>    ib_port_attr_t attr_array[GUID_ARRAY_SIZE];
>    int i;
> 
>    osm_log_construct(&osm_log);
>    osm_log_init(&osm_log, TRUE, 0xff, "/tmp/test_vendor.log");
> 
>    osm_vendor_init(&vendor, &osm_log, 1000);
> 
>    status = osm_vendor_get_all_port_attr(&vendor, attr_array, &num_ports );
>    if ( status != IB_SUCCESS )
>    {
>      printf( "\nError from osm_vendor_get_all_port_attr (%x)\n", status);
>      return;
>    }
> 
>    printf("\nListing GUIDs:\n");
>    for (i = 0; i < num_ports; i++) {
>      printf("Port %i:0x%"PRIx64" lid:0x%04x state:%x\n",
>             i,
>             cl_hton64(attr_array[i].port_guid),
>             cl_ntoh16(attr_array[i].lid),
>             attr_array[i].link_state
>             );
>    }
> 
>    exit(0);
> }
> 
> Without the above change I get:
> Listing GUIDs:
> Port 0:0xd9dffffff3d55 lid:0x0300 state:4
> Port 1:0xd9dffffff3d55 lid:0x0400 state:4
> Port 2:0xd9dffffff3d56 lid:0x0000 state:0
> 
> After the simple change I get:
> Listing GUIDs:
> Port 0:0xd9dffffff3d55 lid:0x0300 state:4
> Port 1:0xd9dffffff3d55 lid:0x0300 state:4
> Port 2:0xd9dffffff3d56 lid:0x0400 state:4
> 
> So as you can see - without the fix the lid of port 2 is presented as 
> the lid of port 1...

I understand the difference in the code and think the difference
perhaps relates to either a lack of clarity or confusion with the API as
follows: I don't see where it is defined what the index into the port
array means. I think we have 2 different interpretations and this
relates to how opensm/main.c  handles the results of calling this
routine. 

So the patch is incomplete although perhaps correct depending on the
interpretation. I'm not adverse to changing this as you indicate. I
would like to resolve this before embarking on the 1.8.0 merge.

Also, since we are in this area, I don't think switch port 0 would be
handled correctly by this code either.

> I guess you use ibstatus in your mail. Well ibstatus uses its own code
> so it shows the correct info anyway. 

That was just to show that the port states corresponded to the ones
shown by osm_vendor_get_all_port_attr with the print statements. Nothing
else.

-- Hal

> In my case that is:
> swlab223:/tmp/bld/libvendor>ibstatus
> Infiniband device 'mthca0' port 1 status:
>          default gid:     fe80:0000:0000:0000:000d:9dff:ffff:3d55
>          base lid:        0x3
>          sm lid:          0x1
>          state:           4: ACTIVE
>          phys state:      5: LinkUp
>          rate:            10 Gb/sec (4X)
> 
> Infiniband device 'mthca0' port 2 status:
>          default gid:     fe80:0000:0000:0000:000d:9dff:ffff:3d56
>          base lid:        0x4
>          sm lid:          0x1
>          state:           4: ACTIVE
>          phys state:      5: LinkUp
>          rate:            10 Gb/sec (4X)
> 
> 
> 
> Hal Rosenstock wrote:
> > Hi Eitan,
> > 
> > On Sun, 2005-08-21 at 03:32, Eitan Zahavi wrote:
> > 
> >>osm_vendor_get_all_port_attr returns incorrect LID and state for 
> >>device ports. This bug was caused by the fact that if a device port
> >>was skipped due to that fact it does not exist (HCA port 0). The 
> >>lid and state pointers used as indexes into their corresponding
> >>return value arrays were not advancing to the next port index.
> >>
> >>So the return for a single HCA was mixing LID and state for the first
> >>port and displayed non initialized memory for the second port.
> > 
> > 
> > The array is not filled in as you claim. Port 0 does not take a slot on
> > an HCA. This looks fine to me as is (I added some print statements in
> > that loop as follows):
> > 
> > osm_vendor_get_all_port_attr: port 0
> > osm_vendor_get_all_port_attr: port 1
> > osm_vendor_get_all_port_attr: port 1 lid 1 state 4
> > osm_vendor_get_all_port_attr: port 2
> > osm_vendor_get_all_port_attr: port 2 lid 0 state 1
> > 
> > Port 0 is skipped; port 1 is LID 1 and active; port 2 is not plugged in
> > and is down:
> > 
> >         Port 1:
> >                 State: Active
> >                 Physical state: LinkUp
> >                 Rate: 2
> >                 Base lid: 1
> >                 LMC: 0
> >                 SM lid: 1
> >                 Capability mask: 0x00500a68
> >                 Port GUID: 0x0008f10403960559
> >         Port 2:
> >                 State: Down
> >                 Physical state: Polling
> >                 Rate: 2
> >                 Base lid: 0
> >                 LMC: 0
> >                 SM lid: 0
> >                 Capability mask: 0x00500a68
> >                 Port GUID: 0x0008f1040396055a
> > 
> > -- Hal
> 




More information about the general mailing list