[openib-general] Lustre over OpenIB Gen2
Roland Dreier
rolandd at cisco.com
Thu Nov 10 11:50:07 PST 2005
Hi Eric... writing YAN (yet another NAL) I see :)
Eric> 2. I'd like to scale to >= 10,000 peer nodes; 1 RC QP per
Eric> peer. Is this going to get me into trouble?
Eric> For example, I currently create a single PD and CQ for
Eric> everything, however the example I've seen (cmatose.c)
Eric> appears to create these separately for each peer. Is that
Eric> what I should be doing too?
I don't think you want 10K PDs. But having a single CQ big enough to
handle 10K QPs might be a problem.
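For what it's worth, a rough userspace-verbs sketch of the "one PD, one CQ,
many RC QPs" arrangement looks like the following (the kernel API is very
close -- see point 6 below); the device-open boilerplate, NPEERS and
WQE_PER_QP are placeholders made up for illustration, not tuning advice:

    #include <stdio.h>
    #include <infiniband/verbs.h>

    #define NPEERS     16    /* stand-in for the real peer count */
    #define WQE_PER_QP 64    /* send + recv completions per peer */

    int main(void)
    {
            struct ibv_device **devs = ibv_get_device_list(NULL);
            struct ibv_context *ctx;
            struct ibv_pd *pd;
            struct ibv_cq *cq;
            struct ibv_qp *qp[NPEERS];
            int i;

            if (!devs || !devs[0] || !(ctx = ibv_open_device(devs[0])))
                    return 1;

            /* One PD for everything... */
            pd = ibv_alloc_pd(ctx);

            /* ...and one CQ sized for every peer's completions.  With
             * 10K QPs this cqe count (and the fan-in of completion
             * handling) is where the single-CQ scheme starts to hurt. */
            cq = ibv_create_cq(ctx, NPEERS * WQE_PER_QP, NULL, NULL, 0);
            if (!pd || !cq)
                    return 1;

            for (i = 0; i < NPEERS; ++i) {
                    struct ibv_qp_init_attr attr = {
                            .send_cq = cq,           /* shared CQ */
                            .recv_cq = cq,
                            .qp_type = IBV_QPT_RC,   /* 1 RC QP per peer */
                            .cap = {
                                    .max_send_wr  = WQE_PER_QP / 2,
                                    .max_recv_wr  = WQE_PER_QP / 2,
                                    .max_send_sge = 1,
                                    .max_recv_sge = 1,
                            },
                    };
                    qp[i] = ibv_create_qp(pd, &attr);
                    if (!qp[i])
                            return 1;
            }

            printf("%d RC QPs sharing one PD and one CQ\n", NPEERS);
            return 0;
    }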
Eric> 3. Is contiguous memory allocation an issue in Gen2? Since
Eric> this is such a scarce resource in the kernel (and particular
Eric> CQ usage with one vendor's stack relied heavily on it) what
Eric> red flags should I be aware of?
There are still a few places where you can get in trouble (for
example, with the mthca driver, extremely large QP work queues might
be a problem, because the driver allocates contiguous memory for the
array used to track work request IDs -- not the work queues themselves
though). But CQs in particular should be fine.
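If you want to stay well clear of that, one option is to size the work
queues against what the device actually reports instead of hard-coding
something huge.  A userspace-verbs sketch (ib_query_device() in the kernel
reports the same limits); dividing max_qp_wr by 4 is just an arbitrary
safety margin for the example:

    #include <infiniband/verbs.h>

    /* Arbitrary example ceiling -- the point is the clamp against the
     * device limit, not this particular value. */
    #define WANTED_WR 4096

    static int pick_wq_depth(struct ibv_context *ctx)
    {
            struct ibv_device_attr attr;

            if (ibv_query_device(ctx, &attr))
                    return -1;

            /* max_qp_wr is the per-queue limit the HCA/driver accepts;
             * with mthca, asking for anything near it also means a big
             * contiguous work-request-ID array in the kernel, so back
             * off well below the limit. */
            if (WANTED_WR > attr.max_qp_wr / 4)
                    return attr.max_qp_wr / 4;
            return WANTED_WR;
    }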
Eric> 4. Are RDMA reads still deprecated? Which resources hit the
Eric> spotlight if I chose to use them?
I don't think RDMA reads were ever really deprecated. But RDMA writes
probably pipeline better.
Eric> 5. Should I pre-map all physical memory and do RDMA in
Eric> page-sized fragments? This avoids any mapping overhead at
Eric> the expense of having much larger numbers of queued RDMAs.
Eric> Since I try to keep up to 8 (by default) 1MByte RDMAs active
Eric> concurrently to any individual peer, with 4k pages I can
Eric> have up to 2048 RDMA work items queued at a time per peer.
Eric> And if I pre-map, can I be guaranteed that if I put the
Eric> CQ into the error state, all remote access to my memory is
Eric> revoked (e.g. could a CQ I create after I destroy the one I
Eric> just shut down somehow alias with it such that a
Eric> pathologically delayed RDMA could write my memory)?
s/CQ/QP/ ... anyway, if you choose your receive queue sequence numbers
randomly, then the probability of a QP number/sequence number
collision allowing a stray RDMA is astronomically low (effectively 0).
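The "random sequence numbers" part is just a matter of what you feed
ibv_modify_qp() (ib_modify_qp() in the kernel, same attribute names) when
you bring the QP to RTR/RTS.  Roughly, with everything except the PSN
handling elided and the MTU/RNR values picked arbitrarily for the sketch:

    #include <stdint.h>
    #include <stdlib.h>
    #include <infiniband/verbs.h>

    /* Each side picks a random 24-bit starting PSN and exchanges it
     * during connection setup; my rq_psn is the peer's chosen PSN and
     * vice versa.  A stale packet aimed at a dead QP then has to guess
     * both the QP number and the PSN to be accepted. */
    static uint32_t random_psn(void)
    {
            return rand() & 0xffffff;
    }

    static int move_to_rtr(struct ibv_qp *qp, uint32_t dest_qpn,
                           uint32_t remote_psn, struct ibv_ah_attr *ah)
    {
            struct ibv_qp_attr attr = {
                    .qp_state           = IBV_QPS_RTR,
                    .path_mtu           = IBV_MTU_1024,
                    .dest_qp_num        = dest_qpn,
                    .rq_psn             = remote_psn,  /* peer's random PSN */
                    .max_dest_rd_atomic = 1,
                    .min_rnr_timer      = 12,
                    .ah_attr            = *ah,
            };

            return ibv_modify_qp(qp, &attr,
                                 IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                                 IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                                 IBV_QP_MAX_DEST_RD_ATOMIC |
                                 IBV_QP_MIN_RNR_TIMER);
    }

(Your own random_psn() value is what you advertise to the peer and later
plug into .sq_psn when you move the QP to RTS.)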
Eric> Or is it better to use FMR pools and take the map/unmap
Eric> overhead? If so, is there a way to know when the unmap
Eric> actually hits the hardware and my memory is safe?
FMRs are only supported on Mellanox HCAs at the moment. But they do
have some advantages, like allowing you to convert a bunch of pages
into a single virtually contiguous region. You can use the
ib_flush_fmr_pool() function to make sure that all unmapped FMRs are
really and truly flushed, but that is a slow operation (since it
incurs the penalty of flushing all in-flight operations in the HCA).
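A rough kernel-side sketch of the unmap-then-flush part, to make the
tradeoff concrete; ib_fmr_pool_unmap() and ib_flush_fmr_pool() are the
real fmr_pool calls, but the helper name and the need_sync policy are
made up for illustration:

    #include <rdma/ib_verbs.h>
    #include <rdma/ib_fmr_pool.h>

    /*
     * Hypothetical helper: return a pool FMR and, only when remote
     * access really must be revoked right now, pay for a full flush
     * so the unmap has actually hit the hardware.
     */
    static int fmr_release(struct ib_fmr_pool *pool,
                           struct ib_pool_fmr *pfmr,
                           int need_sync)
    {
            int rc;

            /* Hands the FMR back to the pool; the mapping can linger
             * until the pool's dirty watermark triggers a flush. */
            rc = ib_fmr_pool_unmap(pfmr);
            if (rc)
                    return rc;

            /* Expensive: flushes all dirty FMRs in the pool (and the
             * in-flight work in the HCA), so do it sparingly. */
            if (need_sync)
                    rc = ib_flush_fmr_pool(pool);

            return rc;
    }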
Eric> 6. Does Gen2 present substantially the same APIs as the
Eric> kernel in userspace? So if I wrote a userspace equivalent
Eric> of my kernel driver, could I have pure userspace clients
Eric> talk to kernel servers?
Pretty much so, except of course userspace doesn't have access to
physical memory or FMRs.
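So a userspace client ends up registering each buffer explicitly with
ibv_reg_mr() and handing out the resulting rkey, rather than leaning on a
pre-mapped all-of-physical-memory region the way the kernel side can with
ib_get_dma_mr() or FMRs.  Something like this, with the buffer size and
access flags just being an example:

    #include <stdlib.h>
    #include <infiniband/verbs.h>

    #define BULK_BUF_SIZE (1 << 20)   /* example: one 1MB bulk buffer */

    static struct ibv_mr *register_bulk_buffer(struct ibv_pd *pd, void **bufp)
    {
            void *buf;

            /* Page-aligned allocation keeps the registration tidy. */
            if (posix_memalign(&buf, 4096, BULK_BUF_SIZE))
                    return NULL;

            *bufp = buf;

            /* The returned MR's lkey/rkey are what go into work
             * requests and what the peer uses for RDMA. */
            return ibv_reg_mr(pd, buf, BULK_BUF_SIZE,
                              IBV_ACCESS_LOCAL_WRITE |
                              IBV_ACCESS_REMOTE_WRITE |
                              IBV_ACCESS_REMOTE_READ);
    }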
- R.