[ofa-general] Re: local QP operation error after long run

Michael S. Tsirkin mst at dev.mellanox.co.il
Thu Aug 30 08:10:59 PDT 2007


It seems the inline flag was set in the WR.
That's all I know.
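
The workaround mentioned in the quoted message below (clearing the inline flag on 0-size WRs) could be sketched roughly as follows. This is only an illustration under stated assumptions: `demo_sge`, `demo_send_wr` and `DEMO_SEND_INLINE` are hypothetical stand-ins for `struct ibv_sge`, `struct ibv_send_wr` and `IBV_SEND_INLINE` from <infiniband/verbs.h>, reduced to the fields the check needs so the snippet builds without the library. The helper names are made up too.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct ibv_sge. */
struct demo_sge {
    uint64_t addr;
    uint32_t length;
    uint32_t lkey;
};

/* Hypothetical stand-in for struct ibv_send_wr (only the fields used here). */
struct demo_send_wr {
    struct demo_sge *sg_list;
    int              num_sge;
    unsigned int     send_flags;
};

/* Hypothetical stand-in for IBV_SEND_INLINE; value chosen for this sketch. */
#define DEMO_SEND_INLINE (1u << 3)

/* Sum the byte counts of all gather entries in the work request. */
static size_t demo_payload_len(const struct demo_send_wr *wr)
{
    size_t total = 0;
    for (int i = 0; i < wr->num_sge; i++)
        total += wr->sg_list[i].length;
    return total;
}

/*
 * If the WR is marked inline but carries no data (no gather entries,
 * or only zero-length ones), clear the inline flag before posting.
 * Returns 1 if the flag was cleared, 0 if the WR was left untouched.
 */
static int demo_fix_empty_inline(struct demo_send_wr *wr)
{
    assert(wr != NULL);
    if ((wr->send_flags & DEMO_SEND_INLINE) && demo_payload_len(wr) == 0) {
        wr->send_flags &= ~DEMO_SEND_INLINE;
        return 1;
    }
    return 0;
}
```

In a real application the same check would run on the actual WR just before it is posted. Counting how often it fires would also show whether 0-size inline WRs are being posted at all.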

Quoting Tang, Changqing <changquing.tang at hp.com>:
Subject: RE: local QP operation error after long run


From our code, num_sge=1 all the time.

But from the error message, can you tell whether num_sge is actually 0,
and whether the inline flag is set?


--CQ
 

> -----Original Message-----
> From: Michael S. Tsirkin [mailto:mst at dev.mellanox.co.il] 
> Sent: Thursday, August 30, 2007 8:59 AM
> To: Tang, Changqing
> Cc: Roland Dreier; Michael S. Tsirkin; general at lists.openfabrics.org
> Subject: Re: local QP operation error after long run
> 
> 
> Apparently, an inline work request is malformed.
> Yes, this could indicate memory corruption.
> OTOH, I see this in commit history:
> commit c2623102f3e38e7684e435b77403d16dc6ddb585
> Author: Roland Dreier <rolandd at cisco.com>
> Date:   Mon Nov 28 21:21:08 2005 +0000
> 
>     Fix inline sends with no gather entries
> 
>     Fix bug in handling send requests that have the inline flag set
>     but do not include any gather entries.
> 
>     Signed-off-by: Roland Dreier <rolandd at cisco.com>
> 
> is there a chance you are posting some 0-size WRs?
> If yes, just clearing the inline flag will fix it.
> 
> 
> 
> Quoting Tang, Changqing <changquing.tang at hp.com>:
> Subject: local QP operation error after long run
> 
> 
> Hi,
> 	I have an ISV application that ran for nearly three hours and then
> got the following error from libibverbs.so:
> 
> local QP operation err (QPN 440446, WQE @ 00000103, CQN 10008c, index 236192)
>   [ 0] 00440446
>   [ 4] 00000000
>   [ 8] 00000000
>   [ c] 00000000
>   [10] 026f0000
>   [14] 00000000
>   [18] 00000103
>   [1c] ff100000
> 
> local QP operation err (QPN 440442, WQE @ 00000103, CQN 10008c, index 236193)
>   [ 0] 00440442
>   [ 4] 00000000
>   [ 8] 00000000
>   [ c] 00000000
>   [10] 026f0000
>   [14] 00000000
>   [18] 00000103
>   [1c] ff100000 
> 
> Can you indicate what the possible cause is? This is an OFED 1.1
> system. Could it be memory corruption?
> 
> Thanks
> --CQ, HP-MPI
> 
> 
> 
> > -----Original Message-----
> > From: general-bounces at lists.openfabrics.org
> > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Roland 
> > Dreier
> > Sent: Wednesday, August 29, 2007 9:50 PM
> > To: Sasha Khapyorsky
> > Cc: general at lists.openfabrics.org
> > Subject: Re: [ofa-general] ib_umad method mask problems on big-endian 64-bit archs
> > 
> >  > It looks that using uint32_t for addr in set_bit() function is
> >  > sufficient fix. But for ppc64 this means that new OpenSM will break
> >  > with old kernels, probably we will need to put some ugly #ifdef in
> >  > osm_vendor_ibumad.c...
> > 
> > Yes, that's a pain.  Another possibility is to declare that the 
> > declaration of the registration request should have been
> > 
> > 	long	method_mask[16 / sizeof (long)];
> > 
> > and just add a compat_ioctl method to the ib_umad module to handle the
> > broken case of 32-bit big endian userspace on a 64-bit kernel.
> > However that breaks 64-bit big endian userspace that followed the old
> > ib_user_mad.h file correctly so overall I'm leaning towards the patch
> > I already posted.
> > 
> > What do you think?
> > 
> >  - R.
> > _______________________________________________
> > general mailing list
> > general at lists.openfabrics.org
> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> > 
> > To unsubscribe, please visit
> > http://openib.org/mailman/listinfo/openib-general
> > 
> --
> MST
> 

-- 
MST


