[openib-general] HP ZX1 and HP IB cards...

Mon Dec 6 12:16:12 PST 2004

On Mon, Dec 06, 2004 at 10:59:51AM -0800, Roland Dreier wrote:
>     Grant> Yes - that works. Please commit.
> 
> I rewrote things in a way that seems cleaner to me -- what I actually
> committed is below.  Please try one more time and make sure this still
> fixes the problem.

will do in a bit.

...
>     Grant> this wmb() isn't necessary since set_eq_ci() calls
>     Grant> mthca_write64() and the latter *must* enforce wmb() for the
>     Grant> MMIO write.
> 
> I'm not sure about that... mthca_write64() turns into __raw_writeq,
> which may not be ordered.  Even if it were writeq(), the barrier is
> after the write (eg on ppc64, the "sync" if after the store
> operation).  We want to make sure that the updates to the ownership
> bits in the EQ in host memory are complete before doing the MMIO write
> to update the HCA's consumer pointer; this avoids the (pretty much
> impossible) race where the HCA writes to the EQ entry and then our
> ownership update happens later and overwrites the HCA's value.

Yeah - ia64 has slight differences in that the "release" semantics
used for writeX() (and wmb()) force prior write mem ops to complete
before this one does. I had forgotten exactly what the linux
API semantics where and re-read Documentation/DocBook/deviceiobook.tmpl.
It doesn't seem to address this issue unfortunately.
The best description I have of memory ordering is still
"IA64 Linux Kernel" by David Mosberger and Stephane Eranian.
And that supports exactly what you say above.

>     Grant> I'd also like to see doorbell[2] go away and just pass the
>     Grant> two u32 values to mthca_write64().  Perhaps combine them
>     Grant> explicitly instead depending on load/store model of the
>     Grant> stack.
> 
> That makes sense.  I'll try to come up with an API that avoids shifts
> on a 64-bit arch...

I'm not sure there is one.  And it still might be better to use the
shift op.  This feels like one of those perf vs portability
issue where someone has to decide if it's worth trading off some perf
for clearer code.

Thinking about it more, I just realized the two 32-bit stores are to a
cacheline already resident in L1.  Ditto for the successive load to
recover the 64-bit value.
The mem ops are nearly "free" since it's to a resident, private cacheline.

But completely avoiding those stores/loads would be good too and
I expect the resulting code will be better to read/understand.
And given the use, I don't think the shift unit will limit
parallelism on ia64 or matter on other arches. But I might be wrong
on that since it's just a guess.

thanks,
grant