[ofa-general] SubnAdmGet (6777)

Bob Ciotti Bob.Ciotti at nasa.gov
Thu May 28 10:57:57 PDT 2009


Sorry to bounce this off the list if it is too remedial. I promise
that I've been consuming a lot of the spec and OFA code. Consider that
either a promise or a warning that we will be more active :|

Our configuration is >6000 CAs (a mix of InfiniHost III and ConnectX), plus
Longbow extenders and >800 24-port switches, all on a single subnet (SGI ICE
with lots of other stuff plugged in). It's DDR everywhere except across the
Longbows. Hosts span a few different generations of x86 Xeon, x86
Opteron and Itanium. We use Lustre but have the SRP traffic on a separate
subnet.

A few weeks ago connection setup times were mentioned on this list along
with ARP and path record lookups not being scalable. We experience these
problems as well and need to address these scalability issues. I have quite
a bit of test data and a few different ideas to bounce off the list RE path
records, once I am a little more versed in the spec. There has already been 
some work done to limit ARP traffic.


Today's question has to do with SM errors.
We have been seeing lots of these, sometimes more than others. From digging
around, it appears that the 6777 represents the number of duplicates?
This value fluctuates somewhat, but not a lot. Comments in the code
indicate that any value >1 is a problem. The question is: is it OK for this
to be happening, and how does it occur?

We will probably update to the 1.4 or 1.4.1 SM in the next few days.
We are currently running a pre-1.4 top-of-tree pull from back in December.

bob


May 28 00:07:13 324705 [649EF940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:13 336912 [76EE7940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:13 338067 [CFF50940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:13 570497 [E2448940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:14 417242 [BDA58940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:14 899329 [19330940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:15 245929 [F4940940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:17 076558 [6E38940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:18 118151 [649EF940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:18 328783 [76EE7940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:18 341440 [CFF50940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:18 578154 [E2448940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:19 425249 [BDA58940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:19 907407 [19330940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:20 267818 [F4940940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)

....
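
To check how much the number in parentheses really fluctuates, and how often
each bracketed tag repeats, the log can be tallied with a short script. This is
only a sketch: the field layout is inferred from the excerpt above rather than
from any OpenSM documentation, and the names summarize_4c05, "tag" and "count"
are made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the log excerpt only (an assumption, not a documented
# format): "[<hex tag>] 0x01 -> osm_sa_respond: ERR 4C05: ... SubnAdmGet (<n>)"
ERR_4C05 = re.compile(
    r"\[(?P<tag>[0-9A-Fa-f]+)\]\s+0x[0-9A-Fa-f]+\s+->\s+osm_sa_respond:\s+"
    r"ERR 4C05:.*SubnAdmGet \((?P<count>\d+)\)"
)


def summarize_4c05(lines):
    """Tally ERR 4C05 lines by bracketed tag and by the value in parentheses."""
    by_tag = Counter()
    by_count = Counter()
    for line in lines:
        match = ERR_4C05.search(line)
        if match is None:
            continue  # not an ERR 4C05 line; skip it
        by_tag[match.group("tag")] += 1
        by_count[int(match.group("count"))] += 1
    return by_tag, by_count


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python tally_4c05.py < opensm.log
    tags, counts = summarize_4c05(sys.stdin)
    print("distinct values in parentheses:", sorted(counts))
    for tag, n in tags.most_common():
        print(tag, n)
```

In the excerpt above, most of the bracketed tags show up again about five
seconds later with the same value, so a per-tag tally may help show whether a
fixed set of requests is simply being repeated.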



-------------------------------------------------------------------------
Robert B. Ciotti                              Supercomputing Systems Lead
NASA Advanced Supercomputing (NAS) Division            TEL (650) 604-4408
NASA Ames Research Center                              FAX (650) 604-4377
Moffett Field, CA 94035-1000                          Bob.Ciotti at NASA.gov
-------------------------------------------------------------------------



