[ofa-general] SubnAdmGet (6777)

Hal Rosenstock hal.rosenstock at gmail.com
Thu May 28 12:06:38 PDT 2009


On Thu, May 28, 2009 at 1:57 PM, Bob Ciotti <Bob.Ciotti at nasa.gov> wrote:
>
> Sorry to bounce this off the list if it is too remedial. I promise
> that I've been consuming a lot of the spec and OFA code. Consider that
> either a promise or a warning that we will be more active :|
>
> Our configuration is >6000 CAs in a mix of InfiniHost III/ConnectX and
> Longbow extenders and >800 24-port switches on a single subnet (SGI ICE
> with lots of other stuff plugged in). It's DDR everywhere except across the
> Longbows. Hosts range across a few different generations of x86 Xeon, x86
> Opteron, and Itanium. We use Lustre but have the SRP traffic on a separate
> subnet.
>
> A few weeks ago, connection setup times were mentioned on this list, along
> with ARP and path record lookups not being scalable. We experience these
> problems as well and need to address these scalability issues. I have quite
> a bit of test data and a few different ideas to bounce off the list regarding
> path records, once I am a little more versed in the spec. There has already
> been some work done to limit ARP traffic.
>
> Today's question has to do with SM errors.
> We have been seeing lots of these, sometimes more than others. After some
> digging, it appears that the 6777 represents the number of duplicates?
> This value fluctuates somewhat, but not a lot. Comments in the code
> indicate that any value >1 is a problem. The question is: is it OK for
> this to be happening, and how does it occur?

It's an error (and an error status of "too many records" is returned to
the SA client in the end node).

A Get is only allowed to return 1 record (GetTable requests can deal
with more than 1 record in the response), yet the SA found many records
that satisfied the request when responding to the Get. Any idea which
specific Get causes this to occur?
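
To make the Get vs. GetTable distinction concrete, here is a minimal
sketch in Python. It is only a model of the rule described above, not
OpenSM code: the PathRecord fields, the dictionary-style query, and the
function and status names are all hypothetical stand-ins. It illustrates
one way the error can arise, namely a Get whose query is loose enough to
match several records.

```python
from dataclasses import dataclass
from enum import Enum


class SaStatus(Enum):
    # Hypothetical status labels; these are not the SA's wire values.
    OK = "ok"
    NO_RECORDS = "no records"
    TOO_MANY_RECORDS = "too many records"


@dataclass(frozen=True)
class PathRecord:
    # Hypothetical, heavily simplified record with three fields.
    sgid: str
    dgid: str
    pkey: int


def matches(record, query):
    # A record matches when every field named in the query is equal.
    return all(getattr(record, name) == value for name, value in query.items())


def subn_adm_get_table(db, query):
    # GetTable may carry any number of matching records in the response.
    return [r for r in db if matches(r, query)]


def subn_adm_get(db, query):
    # Get may return exactly one record; more than one match is an error.
    found = [r for r in db if matches(r, query)]
    if not found:
        return SaStatus.NO_RECORDS, None
    if len(found) > 1:
        return SaStatus.TOO_MANY_RECORDS, None
    return SaStatus.OK, found[0]


if __name__ == "__main__":
    db = [
        PathRecord("a", "b", 0xFFFF),
        PathRecord("a", "b", 0x8001),
        PathRecord("a", "c", 0xFFFF),
    ]
    loose = {"sgid": "a", "dgid": "b"}  # two records match this query
    print(subn_adm_get(db, loose)[0])
    print(len(subn_adm_get_table(db, loose)))
```

In this model, the client can avoid the error either by specifying enough
fields that only one record matches, or by issuing a GetTable and choosing
among the returned records itself.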

-- Hal

> We will probably update to the 1.4 or 1.4.1 SM in the next few days.
> We are currently running a pre-1.4 top-of-tree pull from back in December. bob
>
>
> May 28 00:07:13 324705 [649EF940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:13 336912 [76EE7940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:13 338067 [CFF50940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:13 570497 [E2448940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:14 417242 [BDA58940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:14 899329 [19330940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:15 245929 [F4940940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:17 076558 [6E38940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:18 118151 [649EF940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:18 328783 [76EE7940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:18 341440 [CFF50940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:18 578154 [E2448940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:19 425249 [BDA58940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:19 907407 [19330940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:20 267818 [F4940940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>
> ....
>
>
>
> -------------------------------------------------------------------------
> Robert B. Ciotti                              Supercomputing Systems Lead
> NASA Advanced Supercomputing (NAS) Division            TEL (650) 604-4408
> NASA Ames Research Center                              FAX (650) 604-4377
> Moffett Field, CA 94035-1000                          Bob.Ciotti at NASA.gov
> -------------------------------------------------------------------------
>
