[openib-general] How about ib_send_page() ?

Grant Grundler iod00d at hp.com
Tue May 17 19:08:16 PDT 2005


On Tue, May 17, 2005 at 06:32:38PM -0700, Jeff Carr wrote:
> >>>But IPoIB can't really implement NAPI since it's sending work to
> >>>a shared HCA. 
> 
> Hmm. I'm not knowledgeable enough to know why; I'll have to take your 
> word for it. I'm not sure yet of all the conditions under which the HCA 
> can generate interrupts.

Wellll.. looks like I'm wrong. Previous email on this thread, from
people who know a lot more about it than I do, suggests it is possible.
But I'm still concerned it's going to affect latency.

> 
> But if I sit back and look at the logic of this argument, it seems 
> like:
> 
> Hey, is there a way to not generate so many interrupts?
> That's handled by NAPI.
> OK. That looks interesting.

Right - but that's the generic "this is how Linux deals with this" argument.

> But, we can't do NAPI because we can't just disable interrupts.

Sorry - seems like I'm assuming too much about the capabilities
of the HCAs.

> Darn.
> But wait, why can't we just not generate interrupts in the first place then?
> 
> Isn't that what the Midas touch of netdev->poll() really is? e1000 has:
> quit_polling:   netif_rx_complete(netdev);
>                 e1000_irq_enable(adapter);

I'm more familiar with the tg3 driver. The disadvantage to NAPI in the
tg3 implementation is that it *always* disables interrupts on the card
before calling netif_rx_schedule(). Then it lets the OS decide when to
actually process those packets *already* received, in a safer
context... then once tg3 decides it's done all the work,
it re-enables the interrupts. Just like e1000 does above.
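
Roughly, the pattern looks like this - just a sketch of the old 2.6-era
poll(dev, budget) API, and the my_hw_*() helpers below are placeholders
for a driver's own mask/unmask and RX-ring routines, not real tg3 or
e1000 functions:

    #include <linux/kernel.h>
    #include <linux/netdevice.h>
    #include <linux/interrupt.h>

    /* placeholders for driver-specific hardware routines */
    static void my_hw_irq_disable(struct net_device *dev);
    static void my_hw_irq_enable(struct net_device *dev);
    static int  my_hw_rx_drain(struct net_device *dev, int limit);

    static irqreturn_t my_interrupt(int irq, void *dev_id, struct pt_regs *regs)
    {
            struct net_device *dev = dev_id;

            my_hw_irq_disable(dev);     /* mask RX interrupts on the card */
            netif_rx_schedule(dev);     /* ask the stack to call dev->poll() */
            return IRQ_HANDLED;
    }

    static int my_poll(struct net_device *dev, int *budget)
    {
            int limit = min(*budget, dev->quota);
            int done  = my_hw_rx_drain(dev, limit);  /* packets already DMAed */

            *budget    -= done;
            dev->quota -= done;

            if (done < limit) {                 /* ring drained */
                    netif_rx_complete(dev);     /* leave polling mode */
                    my_hw_irq_enable(dev);      /* like e1000_irq_enable() above */
                    return 0;
            }
            return 1;                           /* more work; stay on poll list */
    }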

There are some workloads where the PCI bus utilization is "suboptimal"
because the enable/disable of interrupts interferes with the DMA flows
and costs excessive overhead.

> Maybe IB can mimic the concept here by acting intelligently for us? 
> Have disable_rx_and_rxnobuff_ints() only disable interrupts for the 
> IPoIB ULP? Anyway, my knowledge here still sucks, so I'm probably so far 
> off base I'm not even on the field. Either way it's fun digging around here.

Based on previous comments, I'm hoping that's the case.
But I don't know either.


> >One can.  Using SDP, netperf TCP_STREAM measured 650 MB/s using the
> >regular PCI-X card. 
> 
> Yes, I have the same speed results using perf_main().
> 
> The perf_main() test isn't that interesting, I think, though. It really 
> just transfers the exact same memory window across two nodes (at least 
> as far as I can tell, that is what it does).
> 
> Anyway, I'm just noticing that this simple dd test from memory doesn't 
> go much over 1GB/sec. So this is an interesting non-IB problem.
>
> root at jcarr:/# dd if=/dev/shm/test of=/dev/null bs=4K
> 196608+0 records in
> 196608+0 records out
> 805306368 bytes transferred in 0.628504 seconds (1281306571 bytes/sec)

Yeah, sounds like there is. You should be able to do several GB/s like that.

I suppose it's possibly an issue with the memory controller too.
ZX1 interleaves accesses across 4 DIMMs to get its memory bandwidth.
Might be worth checking that your box is "optimally" configured as well.

grundler at iota$ dd if=/dev/shm/test of=/dev/null bs=4K
dd: opening `/dev/shm/test': No such file or directory

Sorry - what do I need to do to create /dev/shm/test?

I should probably "cheat" and use a 16KB block size since that
is the native page size on ia64.
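
(My guess at recreating your test file, inferred only from the byte
count in your output above - /dev/shm is normally just tmpfs, so
something like:

    # create a ~768MB file in tmpfs (196608 4KB blocks, matching your run)
    dd if=/dev/zero of=/dev/shm/test bs=4k count=196608

    # then re-read it with the 16KB block size mentioned above
    dd if=/dev/shm/test of=/dev/null bs=16k

should do it, assuming there's enough free RAM for the tmpfs file.)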

thanks,
grant


