[openib-general] Getting rid of pinned memory requirement

Caitlin Bestler caitlinb at siliquent.com
Mon Mar 14 17:35:31 PST 2005


 

> -----Original Message-----
> From: Troy Benjegerdes [mailto:hozer at hozed.org] 
> Sent: Monday, March 14, 2005 5:06 PM
> To: Caitlin Bestler
> Cc: openib-general at openib.org
> Subject: Re: [openib-general] Getting rid of pinned memory requirement
> 
> > 
> > The key is that the entire operation either has to be fast 
> > enough so that no connection or application session layer
> > time-outs occur, or an end-to-end agreement to suspend the
> > connection is a requirement. The first option seems more
> > plausible to me; the second essentially
> > requires extending the CM protocol. That's a tall order even for 
> > InfiniBand, and it's even worse for iWARP where the CM 
> > functionality typically ends when the connection is established.
>  
> I'll buy the good network design argument.
> 
> I suppose if the kernel wants to revoke a card's pinned 
> memory, we should be able to guarantee that it gets new 
> pinned memory within a bounded time. What sort of timing do 
> we need? Milliseconds?
> Microseconds?
>
> In the case of iWarp, isn't this just TCP underneath? If so, 
> can't we just drop any packets in the pipe on the floor and 
> let them get retransmitted? (I suppose the same argument goes 
> for infiniband..
> what sort of a time window do we have for retransmission?)
> 
> What are the limits on end-to-end flow control in IB and iWarp?
> 

From the RDMA Provider's perspective, the short answer is
"quick enough so that I don't have to do anything heroic
to keep the connection alive."

With TCP you also have to add "and healthy". If you've ever
had a long download that was effectively stalled by a burst
of noise and just hit the 'reload' button on your browser,
then you know what I'm talking about.

But in transport-neutral terms I would think that
one RTT is definitely safe -- that much data could have
been dropped by a single switch failure or one nasty spike
in inbound noise.

> > 
> > Yes, there are limits on how much memory you can mlock, or even 
> > allocate. Applications are required to register memory precisely 
> > because the required guarantees are not there by default. 
> > Eliminating 
> > those guarantees *is* effectively rewriting every RDMA application 
> > without even letting them know.
> 
> Some of this argument is a policy issue, which I would argue 
> shouldn't be hard-coded in the code or in the network hardware.
> 
> At least in my view, the guarantees are only there to make 
> applications go fast. We are getting low latency and high 
> performance with infiniband by making memory registration go 
> really really slow. If, to make big HPC simulation 
> applications work, we wind up doing memcpy() to put the data 
> into a registered buffer because we can't register half of 
> physical memory, the application isn't going very fast.
>

What you are looking for is a distinction between registering
memory to *enable* the RNIC to optimize local access and 
registering memory to enable its being advertised to the
remote end.

Early implementations of RDMA, both IB and iWARP, have not
distinguished between the two. But in principle *applications*
do not need memory regions that are never enabled for remote
access to be pinned; that is an RNIC requirement that could
evolve. Applications *do*, however, need remotely accessible
memory regions, portions of which they intend to advertise
with RKeys, to be truly available (i.e., pinned).
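
In verbs terms (a minimal sketch only; the protection domain and
buffer are assumed to exist, and error handling is omitted), the
distinction is just the access flags passed at registration time:

/* Sketch only: the same ibv_reg_mr() call, with access flags making
 * the local-vs-remote distinction explicit.  pd and buf are assumed
 * to have been set up elsewhere.                                     */
#include <stddef.h>
#include <infiniband/verbs.h>

struct ibv_mr *reg_local_only(struct ibv_pd *pd, void *buf, size_t len)
{
    /* Local DMA only -- never advertised with an RKey.  This is the
     * class of registration that would not, in principle, have to
     * stay pinned forever.                                           */
    return ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
}

struct ibv_mr *reg_remote(struct ibv_pd *pd, void *buf, size_t len)
{
    /* Remotely accessible -- its rkey will be advertised to the peer,
     * so the application really does depend on it staying resident.  */
    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE);
}

The rkey from the second call is what gets handed to the remote
end; nothing from the first ever leaves the local node, which is
exactly why its pinning could be an RNIC implementation detail
rather than an application-visible guarantee.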

You are also making a policy assumption that an application
that actually needs half of physical memory should be using
paged memory. Memory is cheap, and if performance is critical
why should this memory be swapped out to disk?

Is the inability to register half of physical memory based
on an assumption that swapping is a requirement? Or is it a
limit on memory region size? If it's the latter, you need to
get the OS to support larger page sizes.
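
For example (a sketch, assuming a hugetlbfs mount at /mnt/huge;
the mount point and sizes are illustrative), a large region can
be backed by huge pages before it is ever registered:

/* Sketch, assuming hugetlbfs is mounted at /mnt/huge (mount point,
 * page size, and region size are illustrative).                     */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define HUGE_PAGE  (2UL * 1024 * 1024)     /* assumed 2 MB huge pages */
#define REGION_LEN (1024UL * HUGE_PAGE)    /* 2 GB example region     */

void *alloc_huge_region(void)
{
    int fd = open("/mnt/huge/rdma_buf", O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return MAP_FAILED;

    void *buf = mmap(NULL, REGION_LEN, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    close(fd);          /* the mapping stays valid after close */
    return buf;         /* hand this to the memory registration call */
}

With 2 MB pages instead of 4 KB ones the RNIC has 512 times fewer
translation entries to track for the same region, which is the
kind of relief larger page sizes are meant to provide.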


