[openib-general] max_send_sge < max_sge

Talpey, Thomas Thomas.Talpey at netapp.com
Wed Jun 28 05:36:51 PDT 2006


Yep, you're confirming my comment that the SGE count depends on the
memory registration strategy (and not on the protocol itself). Because
you use a pool approach, you potentially have a lot of discontiguous
regions, and therefore you need more SGEs. (You could have the same
issue with large preregistrations, etc.)
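
For illustration, here's roughly what I mean: when each pool buffer was
registered on its own, every discontiguous piece costs one SGE. This is
a hypothetical sketch against libibverbs (not the pvfs2 code; the pool
structure and the names are made up):

#include <stdint.h>
#include <stddef.h>
#include <infiniband/verbs.h>

/* One entry in a hypothetical buffer pool; each buffer was registered
 * separately with ibv_reg_mr(), so each needs its own SGE. */
struct pool_buf {
    void          *addr;
    size_t         len;
    struct ibv_mr *mr;
};

/* Build a gather list from n discontiguous pool buffers.  The caller
 * must keep n within the QP's cap.max_send_sge. */
static void build_sge_list(struct ibv_sge *sge,
                           const struct pool_buf *buf, int n)
{
    int i;

    for (i = 0; i < n; i++) {
        sge[i].addr   = (uintptr_t) buf[i].addr;
        sge[i].length = (uint32_t) buf[i].len;
        sge[i].lkey   = buf[i].mr->lkey;
    }
}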

If it's just for RDMA Write, the penalty really isn't that high - you can
easily break the I/O up into separate RDMA Write ops and pump them
out in sequence. The HCA streams them, and using unsignalled
completion on the WRs means the host overhead can be low.
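
To make that concrete, something along these lines works (a sketch
only: it assumes libibverbs, a QP created with sq_sig_all = 0 so
unsignalled WRs are allowed, and one registration covering the whole
local buffer; the chunking policy and the names are mine):

#include <stdint.h>
#include <stddef.h>
#include <infiniband/verbs.h>

/* Break one large RDMA Write into chunk-sized pieces, post them back
 * to back, and ask for a completion only on the last one. */
static int post_write_chain(struct ibv_qp *qp, struct ibv_mr *mr,
                            char *buf, size_t total, size_t chunk,
                            uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_send_wr *bad_wr;
    size_t off = 0;

    while (off < total) {
        size_t len = (total - off < chunk) ? total - off : chunk;
        struct ibv_sge sge = {
            .addr   = (uintptr_t) (buf + off),
            .length = (uint32_t) len,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr = {
            .wr_id      = off,
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_RDMA_WRITE,
            /* signal only the final write in the sequence */
            .send_flags = (off + len == total) ? IBV_SEND_SIGNALED : 0,
        };
        wr.wr.rdma.remote_addr = remote_addr + off;
        wr.wr.rdma.rkey        = rkey;

        if (ibv_post_send(qp, &wr, &bad_wr))
            return -1;
        off += len;
    }
    return 0;
}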

For sends, it's more painful: you have to "pull them up" into a single
contiguous buffer. Do you really need send inlines to be that big? I
guess if you're supporting a writev() API over inline you don't have
much control, but even writev() has a maximum iovec count (IOV_MAX).
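
Roughly, the pull-up looks like this (illustrative only; send_buf
stands for a pre-registered bounce buffer, and none of this is code
from any of the stacks mentioned):

#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Copy a discontiguous iovec into one contiguous, pre-registered send
 * buffer so the send needs only a single SGE.  Returns the number of
 * bytes pulled up, or -1 if the data doesn't fit. */
static ssize_t pull_up(char *send_buf, size_t buflen,
                       const struct iovec *iov, int iovcnt)
{
    size_t off = 0;
    int i;

    for (i = 0; i < iovcnt; i++) {
        if (off + iov[i].iov_len > buflen)
            return -1;    /* too big to inline; describe it for RDMA instead */
        memcpy(send_buf + off, iov[i].iov_base, iov[i].iov_len);
        off += iov[i].iov_len;
    }
    return (ssize_t) off;
}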

The approach the NFS/RDMA client takes is basically to have a pool
of dedicated buffers for headers, with a certain amount of space for
"small" sends. The maximum inline size is typically 1K or maybe 4K
(it's configurable), and the client copies send data into those
buffers when it fits. All other operations are posted as "chunks",
which are explicit protocol objects corresponding to { mr, offset,
length } triplets. The protocol supports an arbitrary number of them,
but typically 8 is plenty. Each chunk results in an RDMA op from the
server. If the server is coded well, the RDMA streams beautifully and
there is no bandwidth issue.
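
The chunk representation looks roughly like this (my own sketch of the
idea, not the actual NFS/RDMA client code; the caller falls back to
the inline copy when the data fits the small-send buffer):

#include <stdint.h>
#include <stddef.h>
#include <sys/uio.h>
#include <infiniband/verbs.h>

/* An explicit protocol object: the { mr, offset, length } triplet. */
struct chunk {
    struct ibv_mr *mr;
    uint64_t       offset;    /* here: the buffer's virtual address */
    uint32_t       length;
};

/* Describe a discontiguous buffer as chunks, one per iovec element;
 * mrs[i] is the registration covering iov[i].  Each chunk becomes an
 * RDMA op on the server side.  Returns the chunk count, or -1 if the
 * buffer has more pieces than the chosen chunk limit. */
static int describe_as_chunks(struct chunk *out, int max_chunks,
                              struct ibv_mr *const *mrs,
                              const struct iovec *iov, int iovcnt)
{
    int i;

    if (iovcnt > max_chunks)
        return -1;

    for (i = 0; i < iovcnt; i++) {
        out[i].mr     = mrs[i];
        out[i].offset = (uintptr_t) iov[i].iov_base;
        out[i].length = (uint32_t) iov[i].iov_len;
    }
    return iovcnt;
}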

Just some ideas. I feel your pain.

Tom.

At 04:34 PM 6/27/2006, Pete Wyckoff wrote:
>Thomas.Talpey at netapp.com wrote on Tue, 27 Jun 2006 09:06 -0400:
>> At 02:42 AM 6/27/2006, Michael S. Tsirkin wrote:
>> >Unless you use it, passing the absolute maximum value supported by
>> >hardware does not seem, to me, to make sense - it will just slow you
>> >down, and waste resources.  Is there a protocol out there that
>> >actually has a use for 30 sge?
>> 
>> It's not a protocol thing, it's a memory registration thing. But I agree,
>> that's a huge number of segments for send and receive. 2-4 is more
>> typical. I'd be interested to know what wants 30 as well...
>
>This is the OpenIB port of pvfs2: http://www.pvfs.org/pvfs2/download.html
>See pvfs2/src/io/bmi/bmi_ib/openib.c for the bottom of the transport
>stack.  The max_sge-1 aspect I'm complaining about isn't checked in yet.
>
>It's a file system application.  The MPI-IO interface provides
>datatypes and file views that let a client write complex subsets of
>the in-memory data to a file with a single call.  One case that
>happens is contiguous-in-file but discontiguous-in-memory, where the
>file system client writes data from multiple addresses to a single
>region in a file.  The application calls MPI_File_write or a
>variant, and this complex buffer description filters all the way
>down to the OpenIB transport, which then has to figure out how to
>get the data to the server.
>
>These separate data regions may have been allocated all at once
>using MPI_Alloc_mem (rarely), or may have been used previously for
>file system operations so are already pinned in the registration
>cache.  Are you implying there is more memory registration work that
>has to happen beyond making sure each of the SGE buffers is pinned
>and has a valid lkey?
>
>It would not be a major problem to avoid using more than a couple of
>SGEs; however, I didn't see any reason to avoid them.  Please let me
>know if you see a problem with this approach.
>
>		-- Pete




