[ofa-general] SubnAdmGet (6777)

Bob Ciotti Bob.Ciotti at nasa.gov
Thu May 28 16:41:33 PDT 2009


On Thu, May 28, 2009 at 02:06:38PM -0500, Hal Rosenstock wrote:
> On Thu, May 28, 2009 at 1:57 PM, Bob Ciotti <Bob.Ciotti at nasa.gov> wrote:
> >
> > Sorry to bounce this off the list - should it be too remedial. I promise
> > that I've been consuming a lot of the spec and OFA code. Maybe you consider
> > that a promise or a warning we will be more active :|
> >
> > Our configuration is >6000 CA in a mix of infinihostIII/connectx and
> > longbow extenders and >800 24 port switches on a single subnet. (SGI ICE
> > with lots of other stuff plugged in). It's DDR everywhere except across the
> > longbows. Hosts range from a few different generations of x86 xeon, x86
> > opteron and itanium. We use lustre but have the srp traffic on a separate
> > subnet.
> >
> > A few weeks ago connection setup times were mentioned on this list along
> > with ARP and path record lookups not being scalable. We experience these
> > problems as well and need to address these scalability issues. I have quite
> > a bit of test data and a few different ideas to bounce off the list RE path
> > records, once I am a little more versed in the spec. There has already been
> > some work done to limit ARP traffic.
> >
> > Today's question has to do with SM errors.
> > We have been seeing lots of these - sometimes more than others. Digging
> > around some, it appears that the 6777 represents the number of duplicates?
> > This value fluctuates some, but not a lot. Comments in the code
> > indicate that any value >1 is a problem. The question is: is it OK for
> > this to be happening, and how does it occur?
> 
> It's an error (and error status of too many records is returned to the
> SA client in the end node).
> 
> Gets are only allowed to return 1 record (GetTable requests can deal
> with more than 1 record in the response) yet many were found by the SA
> that satisfied the request in responding to the Get. Any idea on what
> the specific get is that causes this to occur ?
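To make the Get/GetTable distinction above concrete, here is a toy model in Python. It is only a sketch of the rule as described, not OpenSM code: the names Record, sa_get, sa_get_table and TooManyRecords are hypothetical, and the record fields are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    # Placeholder fields; a real SA record has many more components.
    slid: int
    dlid: int


class TooManyRecords(Exception):
    """Illustrative stand-in for the 'too many records' error status."""


def matches(record, template):
    # A template is a dict of field -> required value; absent fields are wildcards.
    return all(getattr(record, field) == value for field, value in template.items())


def sa_get_table(db, template):
    # GetTable-style query: any number of matching records is acceptable.
    return [r for r in db if matches(r, template)]


def sa_get(db, template):
    # Get-style query: at most one record may be returned.
    found = sa_get_table(db, template)
    if len(found) > 1:
        # The count of matches is what a message like "(6777)" would report.
        raise TooManyRecords(len(found))
    return found[0] if found else None


db = [Record(slid=1, dlid=2), Record(slid=1, dlid=3), Record(slid=4, dlid=2)]
print(sa_get(db, {"slid": 4}))        # fully specified enough: one match
print(sa_get_table(db, {"slid": 1}))  # two matches are fine for GetTable
```

In this model an under-specified template (too many wildcard fields) is what turns a Get into an error, which is one way a very large match count could arise.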

That's the problem. At the debug level we are running at, I can't pin down
the source. Is there a state I can go look for on the clients to see what
they are trying to do?
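As a starting point for narrowing this down from the SM side, a small Python sketch like the following could tally the ERR 4C05 lines quoted below by the bracketed tag and by the reported record count. The regular expression is an assumption based only on the line format shown in this message, and tally_4c05 is a hypothetical helper name; the log lines themselves do not identify the requester.

```python
import re
from collections import Counter

# Assumed line shape, taken from the excerpt in this message:
#   May 28 00:07:13 324705 [649EF940] 0x01 -> osm_sa_respond: ERR 4C05: ... (6777)
ERR_4C05 = re.compile(
    r"\[(?P<tag>[0-9A-Fa-f]+)\]\s+0x[0-9A-Fa-f]+\s+->\s+osm_sa_respond:\s+"
    r"ERR 4C05:.*\((?P<count>\d+)\)\s*$"
)


def tally_4c05(lines):
    """Count ERR 4C05 lines per bracketed tag and per reported record count."""
    by_tag = Counter()
    by_count = Counter()
    for line in lines:
        m = ERR_4C05.search(line)
        if m is None:
            continue  # not an ERR 4C05 line
        by_tag[m.group("tag").upper()] += 1
        by_count[int(m.group("count"))] += 1
    return by_tag, by_count


sample = [
    "May 28 00:07:13 324705 [649EF940] 0x01 -> osm_sa_respond: ERR 4C05: "
    "Got more than one record for SubnAdmGet (6777)",
    "May 28 00:07:18 118151 [649EF940] 0x01 -> osm_sa_respond: ERR 4C05: "
    "Got more than one record for SubnAdmGet (6777)",
]
by_tag, by_count = tally_4c05(sample)
print(by_tag.most_common())
print(by_count.most_common())
```

In the excerpt the same bracketed tags recur roughly five seconds apart, so a tally like this at least shows how many distinct tags are involved and whether the record count stays constant.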

bob


> -- Hal
> 
> > We will probably do an update to the 1.4 or 1.4.1 SM in the next few days.
> > We are currently running a pre 1.4 top of tree pull from back in dec. bob
> >
> >
> > May 28 00:07:13 324705 [649EF940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:13 336912 [76EE7940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:13 338067 [CFF50940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:13 570497 [E2448940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:14 417242 [BDA58940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:14 899329 [19330940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:15 245929 [F4940940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:17 076558 [6E38940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:18 118151 [649EF940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:18 328783 [76EE7940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:18 341440 [CFF50940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:18 578154 [E2448940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:19 425249 [BDA58940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:19 907407 [19330940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:20 267818 [F4940940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> >
> > ....
> >
> >
> >
> > -------------------------------------------------------------------------
> > Robert B. Ciotti                              Supercomputing Systems Lead
> > NASA Advanced Supercomputing (NAS) Division   TEL (650) 604-4408
> > NASA Ames Research Center                     FAX (650) 604-4377
> > Moffett Field, CA 94035-1000                  Bob.Ciotti at NASA.gov
> > -------------------------------------------------------------------------
> >
> > _______________________________________________
> > general mailing list
> > general at lists.openfabrics.org
> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >
> > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> >


