[openib-general] RHEL 4 U3 - lost completions

Bill Hartner bhartner at austin.rr.com
Mon Oct 2 11:11:02 PDT 2006


Roland Dreier wrote:
> 
>     Bill> I am testing an app in development on RHEL 4 U3 using uDAPL.
>     Bill> The app runs OK on gen1 stacks, but cannot run on any OFED
>     Bill> based stack I have tried on RHEL 4 U3.  The symptom is RDMAs
>     Bill> not getting completion.  A completion notification is sent,
>     Bill> but mthca_poll_cq() finds no completion.  I debugged the
>     Bill> problem to this: the memory for the completion queue is not
>     Bill> pinned and at some point the page struct changes *after* the
>     Bill> HCA has been handed the address of the completion queue, so
>     Bill> subsequent completions are written elsewhere in memory and
>     Bill> the app hangs waiting for completion.
> 
> The memory should be pinned by the call to __mthca_reg_mr() in
> mthca_create_cq(), since the kernel will do get_user_pages() on the
> memory.
> 
> By any chance, does your app do fork() or system() or something like that?

At first I thought that was the case, a fork. However, I do not think
get_user_pages(), and the ref count increment it does, guarantees that
the page struct does not change on RHEL 4 U3. I still need to verify
that, though.
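
For reference, the fork() case Roland asks about can be illustrated from
user space. This is only a minimal sketch, not code from the app, uDAPL,
or OFED: parent_write_hidden_from_child() is a hypothetical helper name,
and all it shows is that a write after fork() leaves parent and child on
separate copies of a page. It does not involve an HCA or
get_user_pages() at all.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS on glibc */
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only.
 * Returns 1 if a write made by the parent after fork() is NOT visible
 * to the child (the two processes ended up on separate copies of the
 * page), 0 if the child saw the write, and -1 on error.
 */
int parent_write_hidden_from_child(void)
{
    long page = sysconf(_SC_PAGESIZE);
    int go[2];
    int status;
    pid_t pid;
    unsigned char *buf;

    if (page <= 0 || pipe(go) != 0)
        return -1;

    buf = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    buf[0] = 0xAA;              /* touch the page before fork() */

    pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        char c;
        close(go[1]);
        /* Block until the parent has done its write. */
        if (read(go[0], &c, 1) != 1)
            _exit(2);
        /* Exit status 1 means the child still sees the old value. */
        _exit(buf[0] == 0xAA ? 1 : 0);
    }

    close(go[0]);
    buf[0] = 0x55;              /* write after fork(): private mapping,
                                   so the child must not observe it */
    if (write(go[1], "x", 1) != 1) {
        /* Child will see EOF and exit with status 2. */
    }
    close(go[1]);

    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        status = -1;
    else
        status = WEXITSTATUS(status);
    munmap(buf, (size_t)page);

    return status == 1 ? 1 : (status == 0 ? 0 : -1);
}
```

A return value of 1 means the child never saw the parent's write, so the
same virtual address is backed by different copies in the two processes.
Whether that split can also move a page that get_user_pages() has
already referenced is exactly the part I have not verified.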

I dumped the page struct in ib_umem_get() when the completion queue
memory was first registered.  Then my DTO event thread, on a 10 second
timeout, would go ahead and create another EVD (not used) so I could
then dump the page struct of the first completion queue again in
ib_umem_get(), and sure enough, the page struct had changed.  If I wrote
some code that mapped an address to the original page struct, I would
probably see the completions there.

-Bill



