[openib-general] max_send_sge < max_sge

Pete Wyckoff pw at osc.edu
Tue Jun 27 13:34:33 PDT 2006


Thomas.Talpey at netapp.com wrote on Tue, 27 Jun 2006 09:06 -0400:
> At 02:42 AM 6/27/2006, Michael S. Tsirkin wrote:
> >Unless you use it, passing the absolute maximum value supported by
> >hardware does not seem, to me, to make sense - it will just slow you
> >down, and waste resources.  Is there a protocol out there that
> >actually has a use for 30 sge?
> 
> It's not a protocol thing, it's a memory registration thing. But I agree,
> that's a huge number of segments for send and receive. 2-4 is more
> typical. I'd be interested to know what wants 30 as well...

This is the OpenIB port of pvfs2: http://www.pvfs.org/pvfs2/download.html
See pvfs2/src/io/bmi/bmi_ib/openib.c for the bottom of the transport
stack.  The max_sge-1 workaround I'm complaining about isn't checked in yet.
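
For the curious, the shape of the problem is roughly this, as a minimal
sketch against libibverbs (my names, not the actual openib.c code; the
fallback branch is the max_sge-1 workaround I mean):

#include <string.h>
#include <infiniband/verbs.h>

/*
 * The device reports max_sge via ibv_query_device(), but some HCAs
 * reject a QP whose cap asks for that full value on the send side,
 * hence the max_sge-1 retry.  ctx, pd, and cq are assumed already
 * set up; the wr depths are arbitrary.
 */
struct ibv_qp *create_qp_max_sge(struct ibv_context *ctx,
                                 struct ibv_pd *pd, struct ibv_cq *cq)
{
    struct ibv_device_attr dev_attr;
    struct ibv_qp_init_attr attr;
    struct ibv_qp *qp;

    if (ibv_query_device(ctx, &dev_attr))
        return NULL;

    memset(&attr, 0, sizeof(attr));
    attr.send_cq = cq;
    attr.recv_cq = cq;
    attr.qp_type = IBV_QPT_RC;
    attr.cap.max_send_wr  = 64;
    attr.cap.max_recv_wr  = 64;
    attr.cap.max_send_sge = dev_attr.max_sge;
    attr.cap.max_recv_sge = dev_attr.max_sge;

    qp = ibv_create_qp(pd, &attr);
    if (!qp) {
        /* some HCAs only accept max_sge - 1 here */
        attr.cap.max_send_sge = dev_attr.max_sge - 1;
        attr.cap.max_recv_sge = dev_attr.max_sge - 1;
        qp = ibv_create_qp(pd, &attr);
    }
    return qp;
}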

It's a file system application.  The MPI-IO interface provides
datatypes and file views that let a client write complex subsets of
the in-memory data to a file with a single call.  One case that
happens is contiguous-in-file but discontiguous-in-memory, where the
file system client writes data from multiple addresses to a single
region in a file.  The application calls MPI_File_write or a
variant, and this complex buffer description filters all the way
down to the OpenIB transport, which then has to figure out how to
get the data to the server.
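
Concretely, each such write turns into a single send work request whose
gather list names the separate memory regions.  A sketch, assuming the
buffers are already registered (buf, len, and mr are stand-ins for what
the registration cache hands back):

#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/*
 * Post one send that gathers from n discontiguous, already-pinned
 * buffers.  The HCA walks the SGE list and streams the regions out
 * as one contiguous message, which lands in one region of the file.
 */
int post_gather_send(struct ibv_qp *qp, int n, void *buf[],
                     uint32_t len[], struct ibv_mr *mr[])
{
    struct ibv_sge sge[30];       /* bounded by cap.max_send_sge */
    struct ibv_send_wr wr, *bad_wr;
    int i;

    if (n > 30)
        return -1;                /* caller must split the request */

    for (i = 0; i < n; i++) {
        sge[i].addr   = (uintptr_t) buf[i];
        sge[i].length = len[i];
        sge[i].lkey   = mr[i]->lkey;
    }

    memset(&wr, 0, sizeof(wr));
    wr.sg_list    = sge;
    wr.num_sge    = n;
    wr.opcode     = IBV_WR_SEND;
    wr.send_flags = IBV_SEND_SIGNALED;

    return ibv_post_send(qp, &wr, &bad_wr);
}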

These separate data regions may have been allocated all at once
using MPI_Alloc_mem (rarely), or may have been used previously for
file system operations and so are already pinned in the registration
cache.  Are you implying there is more memory registration work that
has to happen beyond making sure each of the SGE buffers is pinned
and has a valid lkey?
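
To be concrete about what I mean by pinned with a valid lkey, all the
per-buffer work is the usual pin-and-cache, roughly (a sketch;
cache_lookup and cache_insert are stand-ins for our registration cache,
not real functions):

#include <stddef.h>
#include <infiniband/verbs.h>

/* hypothetical registration-cache hooks */
static struct ibv_mr *cache_lookup(void *buf, size_t len);
static void cache_insert(void *buf, size_t len, struct ibv_mr *mr);

/*
 * Return a pinned region covering buf; on a cache miss, pin it with
 * ibv_reg_mr() and remember it so later I/Os on the same buffer are
 * free.  mr->lkey is what goes into each SGE.
 */
struct ibv_mr *pin_buffer(struct ibv_pd *pd, void *buf, size_t len)
{
    struct ibv_mr *mr = cache_lookup(buf, len);

    if (!mr) {
        mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
        if (mr)
            cache_insert(buf, len, mr);
    }
    return mr;
}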

It would not be a major problem to avoid using more than a couple of
SGEs; however, I didn't see any reason to avoid them.  Please let me
know if you see a problem with this approach.

		-- Pete
