[ofa-general] SubnAdmGet (6777)

Hal Rosenstock hal.rosenstock at gmail.com
Fri May 29 06:09:49 PDT 2009


On Thu, May 28, 2009 at 7:41 PM, Bob Ciotti <Bob.Ciotti at nasa.gov> wrote:
> On Thu, May 28, 2009 at 02:06:38PM -0500, Hal Rosenstock wrote:
>> On Thu, May 28, 2009 at 1:57 PM, Bob Ciotti <Bob.Ciotti at nasa.gov> wrote:
>> >
>> > Sorry to bounce this off the list - should it be too remedial. I promise
>> > that I've been consuming a lot of the spec and OFA code. Maybe you consider
>> > that a promise or a warning we will be more active :|
>> >
>> > Our configuration is >6000 CA in a mix of infinihostIII/connectx and
>> > longbow extenders and >800 24 port switches on a single subnet. (SGI ICE
>> > with lots of other stuff plugged in). It's DDR everywhere except across the
>> > longbows. Hosts range from a few different generations of x86 xeon, x86
>> > opteron and itanium. We use lustre but have the srp traffic on a separate
>> > subnet.
>> >
>> > A few weeks ago connection setup times were mentioned on this list along
>> > with ARP and path record lookups not being scalable. We experience these
>> > problems as well and need to address these scalability issues. I have quite a
>> > a bit of test data and a few different ideas to bounce off the list RE path
>> > records, once I am a little more versed in the spec. There has already been
>> > some work done to limit ARP traffic.
>> >
>> > Today's question has to do with SM errors.
>> > We have been seeing lots of these - sometimes more than others. Digging
>> > around some, it appears that the 6777 represents the number of duplicates?
>> > This value fluctuates somewhat, but not a lot. Comments in the code
>> > indicate that any value >1 is a problem. The question is: is it OK for
>> > this to be happening, and how does it occur?
>>
>> It's an error (and an error status of too many records is returned to the
>> SA client in the end node).
>>
>> Gets are only allowed to return 1 record (GetTable requests can deal
>> with more than 1 record in the response) yet many were found by the SA
>> that satisfied the request in responding to the Get. Any idea on what
>> the specific get is that causes this to occur ?
>
> That's the problem. At the debug level we are running at, I can't pin down
> the source.

Can you change the debug level ? If not, can you instrument OpenSM
(add some debug info into osm_sa_path_record.c) ?
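
To make the rule quoted above concrete, here is a minimal Python model of it, not OpenSM code. The names SaRequest and respond are hypothetical, the component mask value in the usage is arbitrary, and the numeric method and status values are what I believe the SA class uses, so treat them as assumptions. The log line it emits shows the kind of per-request detail (attribute, component mask, requester) that extra debug output would need in order to identify the offending Get.

```python
from dataclasses import dataclass

# Assumed SA class values; verify against the spec before relying on them.
METHOD_GET = 0x01
METHOD_GETTABLE = 0x12
STATUS_OK = 0x0000
STATUS_NO_RECORDS = 0x0300
STATUS_TOO_MANY_RECORDS = 0x0400


@dataclass
class SaRequest:
    """Hypothetical stand-in for the fields of an incoming SA query."""
    method: int
    attr_id: int
    comp_mask: int
    requester_lid: int


def respond(req, matches, log):
    """Return (status, records) for a query, given the records that matched."""
    if req.method == METHOD_GET:
        if not matches:
            return STATUS_NO_RECORDS, []
        if len(matches) > 1:
            # A Get may carry exactly one record, so this is an error.
            # Logging the request details is what identifies the culprit.
            log.append(
                "too many records for Get (%d): attr=0x%04x comp_mask=0x%x lid=%d"
                % (len(matches), req.attr_id, req.comp_mask, req.requester_lid)
            )
            return STATUS_TOO_MANY_RECORDS, []
        return STATUS_OK, list(matches)
    # GetTable (and anything else in this toy model) may return many records.
    return STATUS_OK, list(matches)


if __name__ == "__main__":
    log = []
    req = SaRequest(METHOD_GET, 0x0035, 0x1000, 42)  # 0x0035: PathRecord (assumed)
    print(respond(req, ["rec-a", "rec-b"], log))
    print(log)
```

A loosely specified Get (few bits set in the component mask) is the sort of request that would match many records here, which is why the mask is worth logging.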

> Is there a state I can go look for on the clients to see what
> its trying to do?

Perhaps use madeye.
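
Before capturing MADs, the ERR 4C05 lines already in the log can be grouped to see whether the same source keeps repeating. The sketch below is a hypothetical helper, not part of OpenSM. It assumes the log line layout shown in the excerpt quoted further down and that the bracketed hex field identifies the thread that logged the error.

```python
import re
from collections import defaultdict

QUOTE_PREFIX = re.compile(r"^[>\s]+")  # strip mail quoting such as ">> > "
ERR_LINE = re.compile(
    r"^\w{3} +\d+ (\d\d):(\d\d):(\d\d) (\d{6}) \[([0-9A-Fa-f]+)\] "
    r".*ERR 4C05: .*\((\d+)\)\s*$"
)

SAMPLE = [
    ">> > May 28 00:07:13 324705 [649EF940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)",
    ">> > May 28 00:07:13 336912 [76EE7940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)",
    ">> > May 28 00:07:18 118151 [649EF940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)",
    ">> > May 28 00:07:18 328783 [76EE7940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)",
]


def parse_4c05(lines):
    """Map bracketed tag -> list of (seconds since midnight, record count)."""
    by_tag = defaultdict(list)
    for raw in lines:
        m = ERR_LINE.match(QUOTE_PREFIX.sub("", raw))
        if m is None:
            continue  # not an ERR 4C05 line
        hh, mm, ss, usec, tag, count = m.groups()
        t = int(hh) * 3600 + int(mm) * 60 + int(ss) + int(usec) / 1_000_000
        by_tag[tag.upper()].append((t, int(count)))
    return dict(by_tag)


def repeat_intervals(by_tag):
    """Seconds between consecutive errors per tag (ignores midnight wrap)."""
    out = {}
    for tag, events in by_tag.items():
        times = sorted(t for t, _ in events)
        out[tag] = [round(b - a, 6) for a, b in zip(times, times[1:])]
    return out


if __name__ == "__main__":
    for tag, gaps in repeat_intervals(parse_4c05(SAMPLE)).items():
        print(tag, gaps)
```

On the sample lines each tag recurs roughly five seconds after its first appearance. That may hint at a periodic or retried query, though the log alone cannot confirm it.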

-- Hal

> bob
>
>
>> -- Hal
>>
>> > We will probably do an update to the 1.4 or 1.4.1 SM in the next few days.
>> > We are currently running a pre 1.4 top of tree pull from back in dec. bob
>> >
>> >
>> > May 28 00:07:13 324705 [649EF940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:13 336912 [76EE7940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:13 338067 [CFF50940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:13 570497 [E2448940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:14 417242 [BDA58940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:14 899329 [19330940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:15 245929 [F4940940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:17 076558 [6E38940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:18 118151 [649EF940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:18 328783 [76EE7940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:18 341440 [CFF50940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:18 578154 [E2448940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:19 425249 [BDA58940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:19 907407 [19330940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:20 267818 [F4940940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> >
>> > ....
>> >
>> >
>> >
>> > -------------------------------------------------------------------------
>> > Robert B. Ciotti                               Supercomputing Systems Lead
>> > NASA Advanced Supercomputing (NAS) Division    TEL (650) 604-4408
>> > NASA Ames Research Center                      FAX (650) 604-4377
>> > Moffett Field, CA 94035-1000                   Bob.Ciotti at NASA.gov
>> > -------------------------------------------------------------------------
>