[Users] RDMA issues in ib_qib [was:IPoIB on CentOS 6.5]

Peter Kjellström cap at nsc.liu.se
Tue Mar 24 06:05:55 PDT 2015


On Mon, 23 Mar 2015 22:01:42 +0000
"Foraker, Jim" <foraker1 at llnl.gov> wrote:

> On 3/23/15, 2:24 AM, "Peter Kjellström" <cap at nsc.liu.se> wrote:
> 
> >On Thu, 19 Mar 2015 16:17:09 +0000
> >"Foraker, Jim" <foraker1 at llnl.gov> wrote:
> >
> >> Peter,
> >>      Thanks.  I've told our RedHat folks that the IPoIB issue is a
> >> high priority for us.  Our bug for the qib kernel RDMA issue is
> >> 1188417, which was closed as a duplicate of
> >> https://bugzilla.redhat.com/show_bug.cgi?id=1171803.
> >
> >And (no surprise) both those are non-public. Can you give a short
> >summary (root cause, work-around, fix planned for, ..)?
>      Sorry, I prodded one of our RH folks to make our BZ public
> again.  It should be visible now.  In case it's not, the problem
> boils down to a snippet of qib_map_sg() in qib_dma.c:
> 
#ifdef CONFIG_NEED_SG_DMA_LENGTH
		sg->dma_length = sg->length;
#endif
> 
>      CONFIG_NEED_SG_DMA_LENGTH is a config option from more recent
> kernels not present in RHEL6.  It appears to have been inadvertently
> brought in while backporting an upstream patch.  (struct
> scatterlist).dma_length should be getting set in RHEL6, but because
> of the ifdef, it's not. Remove the ifdef and kernel RDMA works fine
> again.

Many thanks, we'll probably include that patch in our local rebuilds.
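
For anyone following along, here is a minimal userspace sketch of what the
fix amounts to as I read the description above: the dma_length assignment
is made unconditionally instead of sitting behind the ifdef. This is not
the driver source. struct mock_sg, its page_addr field and mock_map_sg()
are hypothetical stand-ins for struct scatterlist and qib_map_sg(), and
the surrounding loop is an assumption about how the mapping is done.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the kernel's struct scatterlist. */
struct mock_sg {
    uint64_t dma_address;
    unsigned int offset;
    unsigned int length;
    unsigned int dma_length;
    void *page_addr;   /* assumed stand-in for the page's virtual address */
};

/*
 * Sketch of the patched mapping logic: dma_length is always copied from
 * length, with no #ifdef CONFIG_NEED_SG_DMA_LENGTH around the assignment.
 * Returns the number of mapped entries, or 0 if an entry has no address.
 */
int mock_map_sg(struct mock_sg *sgl, int nents)
{
    int i;

    assert(sgl != NULL && nents >= 0);

    for (i = 0; i < nents; i++) {
        uint64_t addr = (uint64_t)(uintptr_t)sgl[i].page_addr;

        if (!addr)
            return 0;

        sgl[i].dma_address = addr + sgl[i].offset;
        sgl[i].dma_length = sgl[i].length;   /* the line the ifdef hid */
    }
    return nents;
}
```

With the assignment guarded by a macro that is never defined, dma_length
would simply keep whatever value it had before the call, which matches the
symptom described.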

Out of curiosity, which infinipath-psm do you match the ib_qib from the
rhel-6.6 kernel with?

We noticed the ipath_userinit complaining about version mismatch
(kernel too new) and rebuilt a newer upstream version. This caused the
warning to go away but MPI behavior to degrade (hangs during MPI
teardown mostly). Currently we run with a locally patched version that
simply demoted that version mismatch warning to a debug print...

/Peter


