[openib-general] Lustre over OpenIB Gen2
Roland Dreier
rolandd at cisco.com
Thu Nov 10 11:50:07 PST 2005
Hi Eric... writing YAN (yet another NAL) I see :)
Eric> 2. I'd like to scale to >= 10,000 peer nodes; 1 RC QP per
Eric> peer. Is this going to get me into trouble?
Eric> For example, I currently create a single PD and CQ for
Eric> everything, however the example I've seen (cmatose.c)
Eric> appears to create these separately for each peer. Is that
Eric> what I should be doing too?
I don't think you want 10K PDs. But having a single CQ big enough to
handle 10K QPs might be a problem.
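For what it's worth, a rough userspace-verbs sketch of the "one PD, one CQ,
many RC QPs" arrangement looks like the following (the kernel API is very
close -- see point 6 below); the device-open boilerplate, NPEERS and
WQE_PER_QP are placeholders made up for illustration, not tuning advice:

    #include <stdio.h>
    #include <infiniband/verbs.h>

    #define NPEERS     16    /* stand-in for the real peer count */
    #define WQE_PER_QP 64    /* send + recv completions per peer */

    int main(void)
    {
            struct ibv_device **devs = ibv_get_device_list(NULL);
            struct ibv_context *ctx;
            struct ibv_pd *pd;
            struct ibv_cq *cq;
            struct ibv_qp *qp[NPEERS];
            int i;

            if (!devs || !devs[0] || !(ctx = ibv_open_device(devs[0])))
                    return 1;

            /* One PD for everything... */
            pd = ibv_alloc_pd(ctx);

            /* ...and one CQ sized for every peer's completions.  With
             * 10K QPs this cqe count (and the fan-in of completion
             * handling) is where the single-CQ scheme starts to hurt. */
            cq = ibv_create_cq(ctx, NPEERS * WQE_PER_QP, NULL, NULL, 0);
            if (!pd || !cq)
                    return 1;

            for (i = 0; i < NPEERS; ++i) {
                    struct ibv_qp_init_attr attr = {
                            .send_cq = cq,           /* shared CQ */
                            .recv_cq = cq,
                            .qp_type = IBV_QPT_RC,   /* 1 RC QP per peer */
                            .cap = {
                                    .max_send_wr  = WQE_PER_QP / 2,
                                    .max_recv_wr  = WQE_PER_QP / 2,
                                    .max_send_sge = 1,
                                    .max_recv_sge = 1,
                            },
                    };
                    qp[i] = ibv_create_qp(pd, &attr);
                    if (!qp[i])
                            return 1;
            }

            printf("%d RC QPs sharing one PD and one CQ\n", NPEERS);
            return 0;
    }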
Eric> 3. Is contiguous memory allocation an issue in Gen2? Since
Eric> this is such a scarce resource in the kernel (and particular
Eric> CQ usage with one vendor's stack relied heavily on it) what
Eric> red flags should I be aware of?
There are still a few places where you can get in trouble (for
example, with the mthca driver, extremely large QP work queues might
be a problem, because the driver allocates contiguous memory for the
array used to track work request IDs -- not the work queues themselves
though). But CQs in particular should be fine.
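If you want to stay well clear of that, one option is to size the work
queues against what the device actually reports instead of hard-coding
something huge.  A userspace-verbs sketch (ib_query_device() in the kernel
reports the same limits); dividing max_qp_wr by 4 is just an arbitrary
safety margin for the example:

    #include <infiniband/verbs.h>

    /* Arbitrary example ceiling -- the point is the clamp against the
     * device limit, not this particular value. */
    #define WANTED_WR 4096

    static int pick_wq_depth(struct ibv_context *ctx)
    {
            struct ibv_device_attr attr;

            if (ibv_query_device(ctx, &attr))
                    return -1;

            /* max_qp_wr is the per-queue limit the HCA/driver accepts;
             * with mthca, asking for anything near it also means a big
             * contiguous work-request-ID array in the kernel, so back
             * off well below the limit. */
            if (WANTED_WR > attr.max_qp_wr / 4)
                    return attr.max_qp_wr / 4;
            return WANTED_WR;
    }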
Eric> 4. Are RDMA reads still deprecated? Which resources hit the
Eric> spotlight if I chose to use them?
I don't think RDMA reads were ever really deprecated. But RDMA writes
probably pipeline better.
Eric> 5. Should I pre-map all physical memory and do RDMA in
Eric> page-sized fragments? This avoids any mapping overhead at
Eric> the expense of having much larger numbers of queued RDMAs.
Eric> Since I try to keep up to 8 (by default) 1MByte RDMAs active
Eric> concurrently to any individual peer, with 4k pages I can
Eric> have up to 2048 RDMA work items queued at a time per peer.
Eric> And if I pre-map, can I be guaranteed that if I put the
Eric> CQ into the error state, all remote access to my memory is
Eric> revoked (e.g. could a CQ I create after I destroy the one I
Eric> just shut down somehow alias with it such that a
Eric> pathologically delayed RDMA could write my memory)?
s/CQ/QP/ ... anyway, if you choose your receive queue sequence numbers
randomly, then the probability of a QP number/sequence number
collision allowing a stray RDMA is astronomically low (effectively 0).
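The "random sequence numbers" part is just a matter of what you feed
ibv_modify_qp() (ib_modify_qp() in the kernel, same attribute names) when
you bring the QP to RTR/RTS.  Roughly, with everything except the PSN
handling elided and the MTU/RNR values picked arbitrarily for the sketch:

    #include <stdint.h>
    #include <stdlib.h>
    #include <infiniband/verbs.h>

    /* Each side picks a random 24-bit starting PSN and exchanges it
     * during connection setup; my rq_psn is the peer's chosen PSN and
     * vice versa.  A stale packet aimed at a dead QP then has to guess
     * both the QP number and the PSN to be accepted. */
    static uint32_t random_psn(void)
    {
            return rand() & 0xffffff;
    }

    static int move_to_rtr(struct ibv_qp *qp, uint32_t dest_qpn,
                           uint32_t remote_psn, struct ibv_ah_attr *ah)
    {
            struct ibv_qp_attr attr = {
                    .qp_state           = IBV_QPS_RTR,
                    .path_mtu           = IBV_MTU_1024,
                    .dest_qp_num        = dest_qpn,
                    .rq_psn             = remote_psn,  /* peer's random PSN */
                    .max_dest_rd_atomic = 1,
                    .min_rnr_timer      = 12,
                    .ah_attr            = *ah,
            };

            return ibv_modify_qp(qp, &attr,
                                 IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                                 IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                                 IBV_QP_MAX_DEST_RD_ATOMIC |
                                 IBV_QP_MIN_RNR_TIMER);
    }

(Your own random_psn() value is what you advertise to the peer and later
plug into .sq_psn when you move the QP to RTS.)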
Eric> Or is it better to use FMR pools and take the map/unmap
Eric> overhead? If so, is there a way to know when the unmap
Eric> actually hits the hardware and my memory is safe?
FMRs are only supported on Mellanox HCAs at the moment. But they do
have some advantages, like allowing you to convert a bunch of pages
into a single virtually contiguous region. You can use the
ib_flush_fmr_pool() function to make sure that all unmapped FMRs are
really and truly flushed, but that is a slow operation (since it
incurs the penalty of flushing all in-flight operations in the HCA).
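A rough kernel-side sketch of the unmap-then-flush part, to make the
tradeoff concrete; ib_fmr_pool_unmap() and ib_flush_fmr_pool() are the
real fmr_pool calls, but the helper name and the need_sync policy are
made up for illustration:

    #include <rdma/ib_verbs.h>
    #include <rdma/ib_fmr_pool.h>

    /*
     * Hypothetical helper: return a pool FMR and, only when remote
     * access really must be revoked right now, pay for a full flush
     * so the unmap has actually hit the hardware.
     */
    static int fmr_release(struct ib_fmr_pool *pool,
                           struct ib_pool_fmr *pfmr,
                           int need_sync)
    {
            int rc;

            /* Hands the FMR back to the pool; the mapping can linger
             * until the pool's dirty watermark triggers a flush. */
            rc = ib_fmr_pool_unmap(pfmr);
            if (rc)
                    return rc;

            /* Expensive: flushes all dirty FMRs in the pool (and the
             * in-flight work in the HCA), so do it sparingly. */
            if (need_sync)
                    rc = ib_flush_fmr_pool(pool);

            return rc;
    }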
Eric> 6. Does Gen2 present substantially the same APIs as the
Eric> kernel in userspace? So if I wrote a userspace equivalent
Eric> of my kernel driver, could I have pure userspace clients
Eric> talk to kernel servers?
Pretty much so, except of course userspace doesn't have access to
physical memory or FMRs.
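So a userspace client ends up registering each buffer explicitly with
ibv_reg_mr() and handing out the resulting rkey, rather than leaning on a
pre-mapped all-of-physical-memory region the way the kernel side can with
ib_get_dma_mr() or FMRs.  Something like this, with the buffer size and
access flags just being an example:

    #include <stdlib.h>
    #include <infiniband/verbs.h>

    #define BULK_BUF_SIZE (1 << 20)   /* example: one 1MB bulk buffer */

    static struct ibv_mr *register_bulk_buffer(struct ibv_pd *pd, void **bufp)
    {
            void *buf;

            /* Page-aligned allocation keeps the registration tidy. */
            if (posix_memalign(&buf, 4096, BULK_BUF_SIZE))
                    return NULL;

            *bufp = buf;

            /* The returned MR's lkey/rkey are what go into work
             * requests and what the peer uses for RDMA. */
            return ibv_reg_mr(pd, buf, BULK_BUF_SIZE,
                              IBV_ACCESS_LOCAL_WRITE |
                              IBV_ACCESS_REMOTE_WRITE |
                              IBV_ACCESS_REMOTE_READ);
    }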
- R.