[ewg] Infiniband and NFS speed tuning - any ideas?

Ross Smith myxiplx at googlemail.com
Wed Dec 16 00:19:26 PST 2009


Hi Jon,

I'm running NFS over TCP.  I will be looking at RDMA, but I've seen
900MB/s before just using TCP, and if I can stick with TCP it makes
the implementation a lot simpler.

I know I'm going to have to evaluate RDMA for the lower CPU usage, but
for our needs TCP makes enough sense that I'm not discounting it
entirely.
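
For when I do get around to evaluating it, my understanding is that
NFS/RDMA is mostly a matter of loading the RDMA transport modules and
mounting with the RDMA transport on port 20049 - roughly the following,
which I haven't verified on this setup yet:

On server:
# modprobe svcrdma
# echo rdma 20049 > /proc/fs/nfsd/portlist

On client:
# modprobe xprtrdma
# mount -o rdma,port=20049 192.168.2.5:/home/ross/ramdisk ./remote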

I've also done some more tests using qperf, which show a TCP bandwidth of
1.64GB/s and a latency of 36.5us, so NFS should be capable of much better
results than it's getting at the moment.
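
For reference, those numbers are from the usual tcp_bw / tcp_lat tests,
i.e. qperf run with no arguments on the server and then something along
these lines on the client:

# qperf 192.168.2.5 tcp_bw tcp_lat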

I'm going to try some more testing today, and will have a look at
iozone too so I can see if the bandwidth scales as I add more clients.
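
For the iozone runs I'm planning to start with something along these
lines in throughput mode (untested here; record size matching the 32k I
want to tune for), and then move to the -+m client-file option once I'm
testing from more than one machine:

# iozone -i 0 -i 1 -r 32k -s 512m -t 4 -F ./remote/f1 ./remote/f2 ./remote/f3 ./remote/f4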

Ross



On Tue, Dec 15, 2009 at 4:21 PM, Jon Mason <jon at opengridcomputing.com> wrote:
> Are you running NFS RDMA or NFS TCP?  Have you tweaked the read/write
> size mount properties?
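>
> For example (the usable maximums depend on the kernel), something along
> the lines of this on the client:
>
> # mount -o rsize=32768,wsize=32768,tcp 192.168.2.5:/home/ross/ramdisk ./remote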
>
> It's a little stale, but you might want to read:
> http://nfs.sourceforge.net/nfs-howto/ar01s05.html
>
> On Tue, Dec 15, 2009 at 10:11:58AM +0000, Ross Smith wrote:
>> Hi folks,
>>
>> Can anybody give me advice on how to tune NFS to improve performance
>> on this system?  I've got a pair of 40Gb/s QDR ConnectX cards attached
>> to a 20Gb/s DDR switch.  Infiniband diagnostics show that I can
>> consistently achieve a bandwidth of 1866MB/s, but the best I've gotten
>> out of NFS in testing is 440MB/s and in actual use I'm hitting nearer
>> 290MB/s.
>>
>> To test performance I'm creating a ramdisk, mounting it over NFS and
>> doing a simple write of a 100MB file.
>>
>> The full setup is:
>>
>> NFS Server:  192.168.2.5
>> NFS Client:  192.168.2.2
>>
>> On server:
>> # mkdir ramdisk
>> # mount -t ramfs -o size=512m ramfs ./ramdisk
>> # chmod 777 ramdisk
>> # /usr/sbin/exportfs -o rw,insecure,async,fsid=0 :/home/ross/ramdisk
>> # /sbin/service nfs start
>> # ./fw.stop
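>>
>> I haven't touched the nfsd thread count on the server - it's whatever the
>> distro default is.  For reference, my understanding is it can be checked
>> and raised with something like:
>>
>> # grep th /proc/net/rpc/nfsd
>> # rpc.nfsd 16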
>>
>> On client:
>> # opensm -B
>> # mkdir remote
>> # ./fw.stop
>> # mount 192.168.2.5:/home/ross/ramdisk ./remote
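>>
>> The mount above just takes the defaults; the options that actually get
>> negotiated can be checked with:
>>
>> # nfsstat -m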
>>
>> The script I'm using to temporarily disable the firewall is:
>> # cat fw.stop
>> echo "stopping firewall"
>> iptables -F
>> iptables -X
>> iptables -t nat -F
>> iptables -t nat -X
>> iptables -t mangle -F
>> iptables -t mangle -X
>> iptables -P INPUT ACCEPT
>> iptables -P FORWARD ACCEPT
>> iptables -P OUTPUT ACCEPT
>>
>>
>> The bandwidth test:
>> # ib_send_bw 192.168.2.5
>> ------------------------------------------------------------------
>>                    Send BW Test
>> Connection type : RC
>> Inline data is used up to 1 bytes message
>>  local address:  LID 0x01, QPN 0x10004b, PSN 0x5812a
>>  remote address: LID 0x04, QPN 0x80049, PSN 0x7e71c2
>> Mtu : 2048
>> ------------------------------------------------------------------
>>  #bytes #iterations    BW peak[MB/sec]    BW average[MB/sec]
>>  65536        1000            1866.57               1866.22
>> ------------------------------------------------------------------
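>>
>> (That's the default 64KB message size; the full sweep over message sizes
>> can be run with something like "ib_send_bw -a 192.168.2.5", but 64KB is
>> the interesting point here.)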
>>
>>
>> Ramdisk speed test results (before exporting the folder):
>> dd if=/dev/zero of=./100mb bs=1024k count=100
>> 100+0 records in
>> 100+0 records out
>> 104857600 bytes (105 MB) copied, 0.049329 seconds, 2.1 GB/s
>>
>> And the ramdisk results after exporting the folder (I'm not sure why this
>> should be so much slower, but it appears to be consistently reproducible):
>> # dd if=/dev/zero of=./100mb bs=1024k count=200
>> 200+0 records in
>> 200+0 records out
>> 209715200 bytes (210 MB) copied, 0.235899 seconds, 889 MB/s
>>
>>
>> I've checked that the client can cope with the speeds too, creating a
>> ramdisk there for testing:
>> # dd if=/dev/zero of=./100mb bs=1024k count=100
>> 100+0 records in
>> 100+0 records out
>> 104857600 bytes (105 MB) copied, 0.060491 seconds, 1.7 GB/s
>>
>>
>> So I have an interconnect that can push 1.8GB/s, a server that can do
>> 2.1GB/s, and a client that can cope with 1.7GB/s.  I'm aiming for
>> 900MB/s+ over NFS, and in theory I have the infrastructure to cope
>> with that.
>>
>> However, NFS speed test results are about a third of the level I'm
>> after, no matter how I try to tweak the settings:
>>
>> dd if=/dev/zero of=./100mb bs=1024k count=100
>> 100+0 records in
>> 100+0 records out
>> 104857600 bytes (105 MB) copied, 0.313448 seconds, 335 MB/s
>>
>> Sync NFS results are truly horrible (even though this is to a ramdisk):
>> # dd if=/dev/zero of=./100mb bs=1024k count=100
>> 100+0 records in
>> 100+0 records out
>> 104857600 bytes (105 MB) copied, 4.84575 seconds, 21.6 MB/s
>>
>> # dd if=/dev/zero of=./100mb2 bs=1024k count=100
>> 100+0 records in
>> 100+0 records out
>> 104857600 bytes (105 MB) copied, 1.38643 seconds, 75.6 MB/s
>>
>> Going back to async and tweaking the block size helps:
>> # dd if=/dev/zero of=./100mb3 bs=32k count=3200
>> 3200+0 records in
>> 3200+0 records out
>> 104857600 bytes (105 MB) copied, 0.358189 seconds, 293 MB/s
>>
>> [root at xenserver1 remote]# dd if=/dev/zero of=./100mb3 bs=32k count=3200
>> 3200+0 records in
>> 3200+0 records out
>> 104857600 bytes (105 MB) copied, 0.461682 seconds, 227 MB/s
>>
>> [root at xenserver1 remote]# dd if=/dev/zero of=./100mb3 bs=64k count=3200
>> 3200+0 records in
>> 3200+0 records out
>> 209715200 bytes (210 MB) copied, 3.23562 seconds, 64.8 MB/s
>>
>> [root at xenserver1 remote]# dd if=/dev/zero of=./100mb3 bs=16k count=3200
>> 3200+0 records in
>> 3200+0 records out
>> 52428800 bytes (52 MB) copied, 0.119123 seconds, 440 MB/s
>>
>> [root at xenserver1 remote]# dd if=/dev/zero of=./100mb3 bs=8k count=3200
>> 3200+0 records in
>> 3200+0 records out
>> 26214400 bytes (26 MB) copied, 0.069093 seconds, 379 MB/s
>>
>> It seems I'm getting the best performance from 16k blocks, but I
>> actually want to tune this for 32k blocks, and neither is really at a
>> level I'm happy with.
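>>
>> One thing I haven't tried yet is tuning the IPoIB interface itself.  My
>> understanding is that connected mode with a large MTU can make a big
>> difference to TCP throughput over IPoIB - roughly this on both ends,
>> assuming ib0 is the IPoIB interface:
>>
>> # echo connected > /sys/class/net/ib0/mode
>> # /sbin/ifconfig ib0 mtu 65520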
>>
>> Can anybody offer any suggestions on how to tune NFS and IPoIB to
>> improve these figures?
>>
>> thanks,
>>
>> Ross
>>
>> PS.  I should mention that I've seen 900MB/s over straight NFS before,
>> although that was on a smaller test network with just a couple of
>> 10Gb/s SDR Infiniband cards and an 8 port SDR switch.
>> _______________________________________________
>> ewg mailing list
>> ewg at lists.openfabrics.org
>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
>


