FW: [ofa-general] NFS-RDMA (OFED1.4) with standard distributions?

Joe Landman landman at scalableinformatics.com
Tue Nov 11 12:17:56 PST 2008



Ciesielski, Frederic (EMEA HPC&OSLO CC) wrote:
> Well, I did not plan to test all the possible versions of the kernel;
> improvements are surely on their way, which just confirms the
> assumption that this 'technology' is not mature yet.
> 
> With IPoIB an NFS server can easily export (for instance) up to
> 1.2GB/s (at least this is what I can measure), with the data in the
> page cache. No problem up to that point at least. I clearly

True ... but not so relevant to the actual read/write case, where the 
data has to get back to spinning disk.
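
(A quick sanity check before blaming the network: measure the backend 
directly with O_DIRECT so the page cache stays out of the picture. 
The paths and sizes below are placeholders, adjust to taste:

  # raw write speed of the exported filesystem, bypassing the page cache
  dd if=/dev/zero of=/data/export/ddtest bs=1M count=16384 oflag=direct
  # read it back, again bypassing the cache
  dd if=/data/export/ddtest of=/dev/null bs=1M iflag=direct

If those numbers are below what you see over the wire, the disks are 
the story, not the transport.)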

> understand the theoretical benefits of RDMA, and it is a clear
> improvement over TCP for MPI. However, the drastic change for MPI is
> even more on the latency side, though the peak message bandwidth is
> also improved, which is what one would expect to matter for NFS.

Again, true, though NFS has to walk through transport protocol layers as 
well as NFS application layers.  This additional effort reduces 
performance considerably.

Add to this that you need (sadly) a copy of a buffer between the network 
stack and the disk stack.  RDMA removes one of these copies, but as 
far as I know, it doesn't talk directly to the disks (you can do 
something like this with SCST in the block modes if you don't mind iSCSI).

> Registration/deregistration issues are also well-known to the MPI
> developers, and all this is certainly not that easy to manage in
> other areas.
> 
> Still, NFS-RDMA remains NFS. If the bottleneck is not in the
> transport, nothing will be improved by RDMA from the performance
> point of view. Even worse, what I saw with the 2.6.27 kernel +
> OFED1.4-rc3 is the inability of NFS-RDMA to match the performance of
> NFS-TCP for some patterns of IOzone, with a filesystem able to

Hmmm.... Most of the (default) IOzone measurements we have done (and 
seen published) are bound almost entirely by the system RAM cache. 
Indeed, we have had to go into the code and alter some of the constants 
to allow us to test records greater than 16 MB and files greater than 
16 GB.  Otherwise all we measure is cache speed.
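
As a sketch of what keeps IOzone honest on, say, a 16 GB box (the 
file size, record size, and mount point are just examples):

  # sequential write (-i 0) and read (-i 1), 1 MB records, 64 GB file;
  # -e and -c fold fsync() and close() into the timing so the cache
  # can't hide the flush
  iozone -i 0 -i 1 -r 1m -s 64g -e -c -f /mnt/nfs/iozone.tmp

Adding -I (O_DIRECT, where the platform supports it) takes the client 
cache out of the loop entirely.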

Could you elaborate on system parameters, and what measurements weren't 
up to par, as well as what options you used?

We see NFS over RDMA on SDR achieving about 400 MB/s, while NFS over 
IPoIB on identical hardware is about 200 MB/s on reads. 
With DDR IB, we ran a test between a pair of our JackRabbit machines, 
and found a sustained ~500-550 MB/s read, and about 400 MB/s or so 
write.  The underlying file system could handle well over 1 GB/s.

NFS over IPoIB wasn't close.
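
For reference, the two mounts we compared, roughly as described in the 
kernel's Documentation/filesystems/nfs-rdma.txt (the server address and 
export path are placeholders, and the client needs a mount.nfs new 
enough to understand the rdma option):

  # server side: tell knfsd to listen on the RDMA transport
  echo rdma 20049 > /proc/fs/nfsd/portlist

  # client side: RDMA transport on the default NFS/RDMA port
  mount -o rdma,port=20049 192.168.0.1:/export /mnt/rdma
  # same export over IPoIB/TCP for the comparison run
  mount -o proto=tcp 192.168.0.1:/export /mnt/ipoib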

> sustain several hundred MB/s (using exactly the same
> hardware and software in both cases). We are far from a pure IB
> bandwidth issue here; we are probably facing an issue in how the
> requests are handled, perhaps when paging occurs, I can't tell. I

I don't think this is the limitation.  I think it is more along the 
lines of copying buffers between different stacks ... kernel buffer to 
user space program and then back to kernel for net->ram->disk and 
vice-versa.

There are other issues as well that could be degrading performance, 
specifically around payload size.

FWIW:  This is a 2.6.27.5 kernel.

> could not find any tuning to solve the more obvious problem, i.e. the
> low bandwidth for reading, except mounting with '-o rsize=4096';
> probably not what people expect, as this will have other effects.
> Anyway this does improve only the sequential read bandwidth. But of
> course I will repeat my tests with the latest release of everything
> when I have time, still making sure I compare apples to apples... 
> Again, I'm sure improvements are on their way !
> 
> Fred.
> 
> 
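(On the rsize point: it is worth confirming what the client actually 
negotiated rather than trusting the mount line, since the server can 
clamp the values. Something like this, with illustrative sizes:

  mount -o rdma,port=20049,rsize=32768,wsize=32768 192.168.0.1:/export /mnt
  # see what was actually agreed on
  grep ' /mnt ' /proc/mounts

The rsize=/wsize= shown in /proc/mounts are the negotiated values.)
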
> -----Original Message-----
> From: Talpey, Thomas [mailto:Thomas.Talpey at netapp.com]
> Sent: Tuesday, 11 November, 2008 17:02
> To: Ciesielski, Frederic (EMEA HPC&OSLO CC)
> Cc: Jeff Becker; general at lists.openfabrics.org
> Subject: RE: [ofa-general] NFS-RDMA (OFED1.4) with standard distributions?
> 
> At 11:27 AM 11/10/2008, Ciesielski, Frederic (EMEA HPC&OSLO CC)
> wrote:
>> That's great, thanks.
>> 
>> I ran some tests with the 2.6.27 kernel as server and client, and 
>> basically it works fine.
>> 
>> I could not yet find any situation where NFS-RDMA would outperform 
>> NFS/IPoIB, at least when you compare apples to apples (same
>> clients, same server, same protocol, and not just writing to/reading
>> from the caches), and it even seems to have severe performance
>> issues when reading files larger than the memory size of the
>> client and the server. Hopefully this will improve as more users
>> are able to give valuable feedback...
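
(Aside: for an apples-to-apples run with files near or above RAM size, 
it helps to flush the page cache on both client and server between 
runs; on 2.6.16 and later, as root:

  sync
  echo 3 > /proc/sys/vm/drop_caches

Otherwise the second read is served from memory and the transports 
look identical.)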
> 
> I have a couple of questions, and perhaps suggestions as well. First
> the questions...
> 
> - Have you tried with a 2.6.28-rc4 client and server at all? There
> are a number of significant NFS/RDMA improvements queued in
> kernel.org, especially around RDMA memory registration as well as
> RDMA operation scheduling. We've seen some significant throughput
> improvement even for basic tunings.
> 
> - What type of storage are you using at the server, and have you
> attempted to tune the server at all? For example, if you are storage 
> (spindle) limited, no network tuning is likely to help and you should
> address that first. Also, there are tunings such as nfsd thread
> count, export options, and adapter choice that can make a large
> difference.
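
(A sketch of those server-side knobs; the thread count and export 
options below are starting points, not recommendations:

  # raise the nfsd thread count (distros often default to 8)
  echo 64 > /proc/fs/nfsd/threads

  # in /etc/exports -- async trades crash safety for throughput:
  #   /data/export  192.168.0.0/24(rw,async,no_subtree_check)
  exportfs -ra

On spindle-limited servers none of this will help, per Tom's point.)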
> 
> Bottom line, you should be able to reach multi-hundred-MB/sec of
> read/write throughput with NFS/RDMA, but there may be issues on
> specific systems, or perhaps with the OFED1.4 code, that need to be
> accounted for. If possible, you may want to set expectations based on
> mainline, then try to duplicate them in the OFED backport. The
> current OFED NFS/RDMA support is still evolving, while we consider
> the mainline kernel.org version to be rather solid.
> 
> Tom.
> 
>> Fred.
>> 
>> -----Original Message-----
>> From: Jeff Becker [mailto:Jeffrey.C.Becker at nasa.gov]
>> Sent: Saturday, 08 November, 2008 22:35
>> To: Ciesielski, Frederic (EMEA HPC&OSLO CC)
>> Cc: general at lists.openfabrics.org
>> Subject: Re: [ofa-general] NFS-RDMA (OFED1.4) with standard distributions?
>> 
>> Ciesielski, Frederic (EMEA HPC&OSLO CC) wrote:
>>> Is there any chance that the new NFS-RDMA features coming with
>>> OFED 1.4 work with standard and current distributions, like
>>> RHEL5 and SLES10?
>> Not yet, but I'm working on it. I intend for NFS-RDMA to work on
>> 2.6.27 and 2.6.26 for OFED 1.4. The RHEL5 and SLES10 backports will
>> likely be done for OFED 1.4.1. Thanks.
>> 
>> -jeff
>> 
>>> Did anybody test this, or would anybody claim it is supposed to work?
>>> 
>>> I mean without building a 2.6.27 or equivalent kernel on top of
>>> the distribution, so as to keep nearly full vendor support.
>>> 
>>> Enhanced kernel modules may not be sufficient to work around the 
>>> limitations of old kernels...


-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615


