[openib-general] OpenSM Issues of the last couple days

Hal Rosenstock halr at voltaire.com
Fri Dec 8 14:33:18 PST 2006


On Fri, 2006-12-08 at 11:42, Eitan Zahavi wrote: 
> Hal Rosenstock wrote:
> > Hi Eitan,
> >
> > Just wanted to close the loop on the OpenSM issues of the last couple
> > days.
> >
> > 1. When can you supply an OpenSM verbose log for the InformInfo
> > subscribe problem you reported earlier today ? Failing that, I don't
> > know how to reproduce this.
> >   
> Attached

Hmmm....

Unless I am mistaken, osmtest fails much earlier than OpenSM does.
OpenSM sees the final InformInfo unsubscribe (cleanup) and fails on
that.

In a number of places in osm.log, I see:
Dec 08 18:17:02 266690 [B2562BB0] -> __osmv_dispatch_rmpp_mad: [
Dec 08 18:17:02 266707 [B2562BB0] -> __osmv_dispatch_rmpp_snd: [
Dec 08 18:17:02 266723 [B2562BB0] -> Not supposed to receive DATA packets --> dropping the MAD
Dec 08 18:17:02 266739 [B2562BB0] -> __osmv_dispatch_rmpp_snd: ]
Dec 08 18:17:02 266755 [B2562BB0] -> __osmv_dispatch_rmpp_mad: ]
Is that supposed to happen ? What does that mean ? Does that mess things up ?

SA GetTable InformInfoRecord
Dec 08 18:17:02 265333 [B6B69BB0] -> osm_infr_rcv_process_get_method: Query Subscriber GID:0x0000000000000000 : 0x0000000000000000(00) Enum:0x0(01)
Dec 08 18:17:02 265370 [B6B69BB0] -> __osm_sa_inform_info_rec_by_comp_mask: [
Dec 08 18:17:02 265388 [B2562BB0] -> osmv_dispatch_mad: ]
Dec 08 18:17:02 265406 [B6B69BB0] -> osm_infr_get_by_enum: [
Dec 08 18:17:02 265424 [B2562BB0] -> __osmv_ibms_receiver_callback: ]
Dec 08 18:17:02 265443 [B6B69BB0] -> osm_infr_get_by_enum: ]
Dec 08 18:17:02 265482 [B6B69BB0] -> __osm_sa_inform_info_rec_by_comp_mask: ]
Dec 08 18:17:02 265499 [B6B69BB0] -> osm_infr_rcv_process_get_method: Returning 1 records

SA Set InformInfo 
Dec 08 18:17:02 269386 [B756ABB0] -> osm_infr_rcv_process_set_method: UnSubscribe Request with QPN: 0x000001
Dec 08 18:17:02 269421 [B756ABB0] -> osm_infr_get_by_rec: [
Dec 08 18:17:02 269439 [B2562BB0] -> <-- Released lock 0x8d79c20 on bind handle 0x8d79c10
Dec 08 18:17:02 269457 [B756ABB0] -> __dump_all_informs: [
Dec 08 18:17:02 269476 [B2562BB0] -> osmv_dispatch_mad: ]
Dec 08 18:17:02 269496 [B756ABB0] -> InformInfo dump:
                                gid.....................0x0000000000000000 : 0x0000000000000000
                                lid_range_begin.........0x0
                                lid_range_end...........0x0
                                is_generic..............0x0
                                subscribe...............0x1
                                trap_type...............0x0
                                dev_id..................0x0
                                qpn.....................0x000001
                                resp_time_val...........0x0
                                vendor_id...............0x000000
Dec 08 18:17:02 269513 [B2562BB0] -> __osmv_ibms_receiver_callback: ]
Dec 08 18:17:02 269532 [B756ABB0] -> __dump_all_informs: ]
Dec 08 18:17:02 269566 [B756ABB0] -> osm_infr_get_by_rec: Looking for Inform Record
Dec 08 18:17:02 269582 [B756ABB0] -> InformInfo dump:
                                gid.....................0x0000000000000000 : 0x0000000000000000
                                lid_range_begin.........0x0
                                lid_range_end...........0x0
                                is_generic..............0x0
                                subscribe...............0x0
                                trap_type...............0x0
                                dev_id..................0x0
                                qpn.....................0x000001
                                resp_time_val...........0x0
                                vendor_id...............0x000000
Dec 08 18:17:02 269625 [B756ABB0] -> osm_infr_get_by_rec: InformInfo list size 1
Dec 08 18:17:02 269650 [B756ABB0] -> __match_inf_rec: [
Dec 08 18:17:02 269673 [B756ABB0] -> __match_inf_rec: Differ by Address
Dec 08 18:17:02 269698 [B756ABB0] -> __match_inf_rec: ]
Dec 08 18:17:02 269724 [B756ABB0] -> osm_infr_get_by_rec: ]
Dec 08 18:17:02 269751 [B756ABB0] -> osm_infr_rcv_process_set_method: ERR 4307: Failed to UnSubscribe to non existing inform object

Dec 08 18:17:02 269914 [B756ABB0] -> SA MAD dump:
                                base_ver................0x1
                                mgmt_class..............0x3
                                class_ver...............0x2
                                method..................0x81 (SubnAdmGetResp)
                                status..................0x200
                                resv....................0x0
                                trans_id................0x360600000033
                                attr_id.................0x3 (InformInfo)

It looks like the OpenSM side fails on the following:
  if ( memcmp(&p_infr->report_addr,
              &p_infr_rec->report_addr,
              sizeof(p_infr_rec->report_addr)) )
  {
     osm_log( p_log, OSM_LOG_DEBUG,
              "__match_inf_rec: "
              "Differ by Address\n" );
     goto Exit;
  }

Not sure why that is. Guess it needs to be debugged...
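
One way to narrow this down might be to log where the two report_addr
buffers first differ, rather than only that they differ. A minimal sketch
in plain C follows; first_diff_offset is a hypothetical helper, not an
existing OpenSM function, and it assumes nothing about the layout of
report_addr:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical debugging helper (not part of OpenSM): return the offset of
 * the first byte at which two buffers of length n differ, or -1 if they
 * are identical over all n bytes. */
static long first_diff_offset(const void *a, const void *b, size_t n)
{
    const unsigned char *pa = a;
    const unsigned char *pb = b;
    size_t i;

    /* Both buffers must be valid; this is a debugging aid only. */
    assert(a != NULL && b != NULL);

    for (i = 0; i < n; i++) {
        if (pa[i] != pb[i])
            return (long)i;
    }
    return -1;
}
```

Calling it on &p_infr->report_addr and &p_infr_rec->report_addr with
sizeof(p_infr_rec->report_addr), just before the "Differ by Address"
message, would show which field (or padding byte) is responsible.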

> > 2. With the latest tree, do your simulation tests now work ? The
> > osm.fdbs UNREACHABLE was only a problem with the file and not with the
> > LFTs in the network.
> >   
> Yes they do.

Good.

> > 3. In terms of file format changes, the lack of any file versioning
> > makes it difficult to move these forward when the need arises. (The
> > format change to osm.mcfdbs was unintentional (not by design)).
> >   
> The issues until now were not that a file format change was required;
> the changes were unintentional.
> When we have a real need to change a file format, I am sure we can
> agree on adding a version and changing all parsers.

We will have a real need at some point. It is more likely the config
files but there may be more info to add to other files as well.
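
A version line at the top of each file would be cheap to check in the
parsers. A rough sketch, assuming a hypothetical "# format-version: N"
first line (the files carry no version today, and parse_format_version
is an invented name); files without the line are treated as version 0:

```c
#include <stdio.h>

/* Hypothetical helper: return N from a first line of the form
 * "# format-version: N", or 0 when the line is missing, malformed,
 * or not positive (i.e. a legacy, unversioned file). */
static int parse_format_version(const char *first_line)
{
    int version = 0;

    if (first_line != NULL &&
        sscanf(first_line, "# format-version: %d", &version) == 1 &&
        version > 0)
        return version;

    return 0;
}
```

A parser could read the first line, call this, and fall back to the
current format when it returns 0, so existing files keep working.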

> > 4. I encourage you to look at and comment on the OpenSM patches rather
> > than waiting for them to be in the tree.
> >   
> I am sure you did not mean to, but now I have to admit my limited skills 
> in catching bugs by reading patches :-( .

Not just read, but they are there to try out as well.

> Instead of relying on bug reading I use automatic regression. I wish we 
> could agree on some regression that
> each developer will have to run before patches are committed to the trunk.

> On my side I would love to have an automatic way to include all the 
> patches posted (one at a time) run "dead or alive" check
> and provide feedback. Currently my automation is limited to testing the 
> trunk. So I will always be complaining after the patches are
> committed. I think this is the way most other components testing works.

You could try out the patches and do the same thing before they are
committed.

> What kind of regression suite do you and Sasha use?

Haven't we been over this before ? I might ask the same of you and
Yevgeny. There are similar occurrences.

I use osmtest for most of my testing as well as a subnet on which I
perform directed tests on the functionality being changed.

Sasha does testing on both live and simulated subnets.

> Can we agree on minimal pre-commit testing?

I think we do a reasonable level of pre commit testing and have been
responsive to breakages not necessarily of our own making.

> Can we have a branch for that sake where all patches will first have to 
> go into for 2 days? (it will allow for pre-trunk testing).

That's why the patches go out first. The patches in question were out
there for over a week.

This seems like another level of overhead to me. Is there real gain
here ?

-- Hal

> > Thanks for your help in finding the bugs sooner.
> >
> > -- Hal