[ofa-general] RE: local QP operation error after long run
Tang, Changqing
changquing.tang at hp.com
Thu Aug 30 07:56:09 PDT 2007
Judging from our code, num_sge is always 1.
But from the error message, can you tell whether num_sge is actually 0,
and whether the inline flag is set?
--CQ
> -----Original Message-----
> From: Michael S. Tsirkin [mailto:mst at dev.mellanox.co.il]
> Sent: Thursday, August 30, 2007 8:59 AM
> To: Tang, Changqing
> Cc: Roland Dreier; Michael S. Tsirkin; general at lists.openfabrics.org
> Subject: Re: local QP operation error after long run
>
>
> Apparently, an inline work request is malformed.
> Yes, this could indicate memory corruption.
> OTOH, I see this in commit history:
> commit c2623102f3e38e7684e435b77403d16dc6ddb585
> Author: Roland Dreier <rolandd at cisco.com>
> Date: Mon Nov 28 21:21:08 2005 +0000
>
> Fix inline sends with no gather entries
>
> Fix bug in handling send requests that have the inline flag set
> but do not include any gather entries.
>
> Signed-off-by: Roland Dreier <rolandd at cisco.com>
>
> Is there a chance you are posting some 0-size WRs?
> If so, just clearing the inline flag will fix it.
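
A minimal sketch of the workaround suggested above, assuming "0-size" means either no gather entries or gather entries whose lengths sum to zero. The demo_* types and names are hypothetical stand-ins, reduced to the fields this check needs so that the sketch compiles without libibverbs; in real code they would be struct ibv_sge, struct ibv_send_wr and IBV_SEND_INLINE from <infiniband/verbs.h>.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct ibv_sge (hypothetical, for illustration only). */
struct demo_sge {
    uint64_t addr;
    uint32_t length;
    uint32_t lkey;
};

/* Plays the role of IBV_SEND_INLINE; the value here is illustrative. */
enum { DEMO_SEND_INLINE = 1 << 3 };

/* Stand-in for struct ibv_send_wr, reduced to the fields used below. */
struct demo_send_wr {
    struct demo_sge *sg_list;
    int              num_sge;
    unsigned int     send_flags;
};

/* Total payload bytes described by the gather list. */
static size_t demo_wr_payload_len(const struct demo_send_wr *wr)
{
    size_t total = 0;
    for (int i = 0; i < wr->num_sge; i++)
        total += wr->sg_list[i].length;
    return total;
}

/* Clear the inline flag on a work request that carries no data.
 * Returns 1 if the flag was cleared, 0 if the request was left alone. */
static int demo_clear_inline_if_empty(struct demo_send_wr *wr)
{
    if ((wr->send_flags & DEMO_SEND_INLINE) && demo_wr_payload_len(wr) == 0) {
        wr->send_flags &= ~(unsigned int)DEMO_SEND_INLINE;
        return 1;
    }
    return 0;
}
```

Calling such a check just before posting each send request would also be a cheap place to log whether a zero-length inline request ever occurs, which bears on the num_sge question at the top of this message.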
>
>
>
> Quoting Tang, Changqing <changquing.tang at hp.com>:
> Subject: local QP operation error after long run
>
>
> Hi,
> I have an ISV application that ran for nearly three hours and then
> hit the following error from libibverbs.so:
>
> local QP operation err (QPN 440446, WQE @ 00000103, CQN 10008c, index 236192)
> [ 0] 00440446
> [ 4] 00000000
> [ 8] 00000000
> [ c] 00000000
> [10] 026f0000
> [14] 00000000
> [18] 00000103
> [1c] ff100000
>
> local QP operation err (QPN 440442, WQE @ 00000103, CQN 10008c, index 236193)
> [ 0] 00440442
> [ 4] 00000000
> [ 8] 00000000
> [ c] 00000000
> [10] 026f0000
> [14] 00000000
> [18] 00000103
> [1c] ff100000
>
> Can you indicate what the possible cause is? This is an OFED 1.1
> system. Could it be memory corruption?
>
> Thanks
> --CQ, HP-MPI
>
>
>
> > -----Original Message-----
> > From: general-bounces at lists.openfabrics.org
> > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Roland
> > Dreier
> > Sent: Wednesday, August 29, 2007 9:50 PM
> > To: Sasha Khapyorsky
> > Cc: general at lists.openfabrics.org
> > Subject: Re: [ofa-general] ib_umad method mask problems on big-endian 64-bit archs
> >
> > > It looks that using uint32_t for addr in set_bit() function is sufficient
> > > fix. But for ppc64 this means that new OpenSM will break with old
> > > kernels, probably we will need to put some ugly #ifdef in
> > > osm_vendor_ibumad.c...
> >
> > Yes, that's a pain. Another possibility is to declare that the
> > declaration of the registration request should have been
> >
> > long method_mask[16 / sizeof (long)];
> >
> > and just add a compat_ioctl method to the ib_umad module to handle the
> > broken case of 32-bit big endian userspace on a 64-bit kernel.
> > However that breaks 64-bit big endian userspace that followed the old
> > ib_user_mad.h file correctly so overall I'm leaning towards the patch
> > I already posted.
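
To illustrate why the word size matters here, a small sketch that computes which byte of a bitmap ends up holding a given bit, assuming bit 0 is the least significant bit of word 0 (as with the kernel's set_bit()). The function name demo_bit_byte_offset is hypothetical and is not part of ib_umad or OpenSM.

```c
#include <stddef.h>

/* Byte offset, within a bitmap stored as an array of word_bytes-sized
 * words, of the byte that holds bit number `bit`. */
static size_t demo_bit_byte_offset(unsigned bit, size_t word_bytes, int big_endian)
{
    size_t bits_per_word = word_bytes * 8;
    size_t word = bit / bits_per_word;
    /* 0 = least significant byte of the word */
    size_t byte_in_word = (bit % bits_per_word) / 8;

    /* On big-endian, the least significant byte is stored last. */
    if (big_endian)
        byte_in_word = word_bytes - 1 - byte_in_word;

    return word * word_bytes + byte_in_word;
}
```

On little-endian the offset is the same for 4-byte and 8-byte words, so the two layouts agree. On big-endian, bit 1 lands in byte 3 with 4-byte words but in byte 7 with 8-byte words, which is the kind of mismatch between a uint32_t-based userspace and a long-based 64-bit kernel discussed above.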
> >
> > What do you think?
> >
> > - R.
> > _______________________________________________
> > general mailing list
> > general at lists.openfabrics.org
> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >
> > To unsubscribe, please visit
> > http://openib.org/mailman/listinfo/openib-general
> >
>
> --
> MST
>