[Fwd: Re: [openib-general] Re: IBM eHCA testing..]

Hal Rosenstock halr at voltaire.com
Fri Oct 14 13:30:32 PDT 2005


On Fri, 2005-10-14 at 16:22, Brett Bode wrote:
> On Oct 14, 2005, at 2:27 PM, Troy Benjegerdes wrote:
> 
> >
> >
> > From: Hal Rosenstock <halr at voltaire.com>
> > Date: October 14, 2005 12:41:13 PM CDT
> > To: Troy Benjegerdes <troy at scl.ameslab.gov>
> > Cc: IBMEHCA DD <IBMEHCAD at de.ibm.com>, openib-general at openib.org
> > Subject: Re: [openib-general] Re: IBM eHCA testing..
> >
> >
> > On Fri, 2005-10-14 at 12:08, Troy Benjegerdes wrote:
> >> Hal Rosenstock wrote:
> >>
> >>> On Thu, 2005-10-13 at 18:46, Troy Benjegerdes wrote:
> >>>
> >>>
> >>>> I'm also attaching part of an opensm log file.
> >>>>
> >>>> (the full copy is at http://scl.ameslab.gov/~troy/osm-ehca.log )
> >>>>
> >>>> The IBM galaxy adapters are at:
> >>>> 	Initial path: [0][1][16]
> >>>> 	Initial path: [0][1][13]
> >>>>
> >>>>
> >>>>
> >>>
> >>> The OpenSM is just saying that a SMP transaction it issued (in this
> >>> case, SM Get P_KeyTable) is timing out (no response made it back to
> >>> OpenSM).
> >>>
> >>> BTW, what svn rev is OpenSM up to ?
> >>>
> >>> -- Hal
> >>>
> >>>
> >> So, how about a patch to opensm to report what svn rev it was built from ;)
> >
> > Can you do svn info in the userspace/management/osm directory ?
> 
> Path: .
> URL: https://openib.org/svn/gen2/trunk/src/linux-kernel/infiniband
> Repository UUID: 21a7a0b7-18d7-0310-8e21-e8b31bdbf5cd
> Revision: 3493
> Node Kind: directory
> Schedule: normal
> Last Changed Author: roland
> Last Changed Rev: 3487
> Last Changed Date: 2005-09-19 17:59:27 -0500 (Mon, 19 Sep 2005)
> Properties Last Updated: 2005-02-15 16:24:20 -0600 (Tue, 15 Feb 2005)

If you update and rebuild OpenSM, you will get rid of messages like:

Oct 13 10:35:38 366848 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x8:0x0] for guid:0x0002c90108ccc571.
Oct 13 10:35:38 366866 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x13:0x1] for guid:0x0002550000039e80.
Oct 13 10:35:38 366880 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x5:0x0] for guid:0x00066a00a0000441.
Oct 13 10:35:38 366894 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x10:0x1] for guid:0x0002c90108cd0b71.
Oct 13 10:35:38 366907 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x11:0x1] for guid:0x00066a00a000044e.
Oct 13 10:35:38 366921 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x14:0x1] for guid:0x0002550000038500.
Oct 13 10:35:38 366934 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x9:0x0] for guid:0x0002c90200402782.
Oct 13 10:35:38 366948 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0xa:0x0] for guid:0x0002c90108cd98c1.
Oct 13 10:35:38 366961 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0xd:0x0] for guid:0x0002c90108cd84a1.
Oct 13 10:35:38 366975 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0xe:0x0] for guid:0x0002c90200402917.
Oct 13 10:35:38 366988 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x1:0x0] for guid:0x0002c90200402781.
Oct 13 10:35:38 367001 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0xb:0x0] for guid:0x0002c90108cd9bd1.
Oct 13 10:35:38 367015 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x15:0x1] for guid:0x0002550000038580.
Oct 13 10:35:38 367028 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x2:0x0] for guid:0x0002c90200402915.
Oct 13 10:35:38 367042 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x6:0x0] for guid:0x00066a00a0000444.
Oct 13 10:35:38 367055 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x4:0x0] for guid:0x00066a00a000043c.
Oct 13 10:35:38 367068 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0xc:0x0] for guid:0x0002c90108cd85f1.
Oct 13 10:35:38 367082 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x7:0x0] for guid:0x00066a00a0000458.
Oct 13 10:35:38 367095 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x3:0x0] for guid:0x0002c900001cee10.

I don't think these messages cause any harm, though. Updating will also
pick up some other fixes.
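
To see which GUIDs are affected before and after an update, the ERR 0312
lines can be pulled out of the log with a small script. This is only a
rough sketch: the regular expression is inferred from the lines quoted
above, not from the OpenSM source, and the function name is made up for
illustration. It does not interpret the two values in the brackets; it
just reports them.

```python
import re

# Pattern inferred from the ERR 0312 lines quoted in this thread
# (an assumption, not taken from the OpenSM source).
ERR_0312 = re.compile(
    r"ERR 0312: Illegal LID range \[(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+)\] "
    r"for guid:(0x[0-9a-fA-F]+)"
)


def parse_illegal_lid_ranges(lines):
    """Return (guid, first, second) for every ERR 0312 line.

    `first` and `second` are the two bracketed hex values, as integers,
    in the order they appear in the log line.
    """
    results = []
    for line in lines:
        match = ERR_0312.search(line)
        if match:
            first, second, guid = match.groups()
            results.append((guid, int(first, 16), int(second, 16)))
    return results


if __name__ == "__main__":
    sample = [
        "Oct 13 10:35:38 366848 [AB448A30] -> __osm_lid_mgr_validate_db: "
        "ERR 0312: Illegal LID range [0x8:0x0] for guid:0x0002c90108ccc571.",
        "Oct 13 10:35:38 366866 [AB448A30] -> __osm_lid_mgr_validate_db: "
        "ERR 0312: Illegal LID range [0x13:0x1] for guid:0x0002550000039e80.",
    ]
    for guid, first, second in parse_illegal_lid_ranges(sample):
        print(f"{guid}: [{first:#x}:{second:#x}]")
```

If the list comes back empty after the rebuild, the messages are gone.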

> >> I just discovered another problem. We have been running PVFS2 over
> >> IPoIB on the same subnet, and while debugging this I restarted opensm
> >> several times, and somewhere in the stack a PVFS2 write failed. I
> >> wouldn't think that a short downtime of the SM from restarting it
> >> would cause any IPoIB TCP sessions to fall over.
> >
> > As Fab indicated, there are a number of places where the SM/SA is
> > needed:
> > 1. SA PathRecords (used when a path to a new IP end node is needed or
> > an existing one times out)
> > 2. SA MCMemberRecord joins, queries, and leaves (used when an interface
> > is brought up, taken down, etc.)
> >
> > Is this on an existing TCP session ? Is it OpenIB IPoIB clients at each
> > end ? What svn version is being used for this ?
> >
> > -- Hal
> >
> It looks like each client node maintains an open TCP stream to each of 
> the servers. pvfs2 appears to not be very robust to failure. However 
> the pvfs2 folks just released a new version which changes their network 
> protocol somewhat. I plan to get the new version installed next week 
> and will see if it handles things a bit more robustly.

Is this running on top of OpenIB IPoIB ? If so, what svn version for
IPoIB ? Is it the same as OpenSM (3487) ? If so, that should be recent
enough and contains the SA reregistration fix for IPoIB.

cd linux-kernel/infiniband/ulp/ipoib/
svn info
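
If it is easier to compare the two trees by script, something like the
following could pick the revision fields out of the svn info output. It is
a minimal sketch that only parses the "Key: value" text format shown
earlier in this thread; the function name is hypothetical, and running svn
itself is left to the caller.

```python
def parse_svn_info(text):
    """Turn the 'Key: value' lines printed by `svn info` into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first ": " only, so URLs and dates stay intact.
        key, sep, value = line.partition(": ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


if __name__ == "__main__":
    # Output quoted earlier in this thread.
    sample = """Path: .
URL: https://openib.org/svn/gen2/trunk/src/linux-kernel/infiniband
Revision: 3493
Node Kind: directory
Last Changed Author: roland
Last Changed Rev: 3487
"""
    info = parse_svn_info(sample)
    print(info["Revision"], info["Last Changed Rev"])
```

Running it on the output from both directories and comparing the
"Last Changed Rev" values shows whether they were built from the same
revision.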

-- Hal



