[openib-general] [PATCH] FMR support in mthca

Fri Apr 1 14:56:00 PST 2005

On Thu, 2005-03-31 at 16:41, Libor Michalek wrote: 
> On Thu, Mar 31, 2005 at 04:25:28PM -0500, Hal Rosenstock wrote:
> > On Wed, 2005-03-30 at 19:43, Libor Michalek wrote:
> > > The program has a decent help for available parameters, but here are
> > > some reasonable defaults:
> > > 
> > >   server:
> > > 
> > >     ./ttcp.aio.x -r -l 65536 -a 20
> > > 
> > >   client:
> > > 
> > >     ./ttcp.aio.x -t -l 65536 -n 100000 -a 20 192.168.0.100
> > 
> > Are these the parameters used to achieve the throughput numbers you
> > published ?
> > 
> > Sounds like you tweaked the numbers in sdp_dev.h. Anywhere else ?
> > 
> > Can you provide the tuning numbers used and where they were found so these
> > results can be reproduced ?
> 
>   No tweaking or changes to the SDP code itself. The parameters above 
> should give similar results, but here are the exact parameters I used
> for the two async tests I mentioned in the original results I posted.
> 
> > > For async socket I kept 20 96K buffers in flight. For the FMR pool cache 
> > > hit async results I used only 20 different buffers. 
> 
>      ./ttcp.aio.x -r -l 98304 -a 20 -f M
>      ./ttcp.aio.x -t -l 98304 -n 200000 -a 20 -f M 192.168.0.100
> 
> > > For the FMR pool cache miss async results I used 1000 different
> > > buffers, of which only 20 were in flight at a time.
> 
>      ./ttcp.aio.x -r -l 98304 -a 20 -x 1000 -f M
>      ./ttcp.aio.x -t -l 98304 -n 200000 -a 20 -x 1000 -f M 192.168.0.100
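
For reference, here is a rough sketch of the memory footprint these
parameters imply. The mapping of flags to meanings (-l buffer length in
bytes, -a buffers in flight, -x distinct buffers, -n buffers sent) is
inferred from the commands and descriptions quoted above, not from the
ttcp.aio.x source, so treat it as an assumption.

```python
# Rough arithmetic for the two async ttcp.aio.x runs quoted above.
# Assumption: -l = buffer length (bytes), -a = buffers in flight,
# -x = number of distinct buffers, -n = number of buffers sent.

MIB = 1024 * 1024


def footprint(buflen, in_flight, distinct, nbuf):
    """Return the byte counts implied by one set of test parameters."""
    return {
        "in_flight_bytes": buflen * in_flight,
        "distinct_bytes": buflen * distinct,
        "total_bytes": buflen * nbuf,
    }


# Cache-hit run: only 20 different 96K buffers, all in flight.
cache_hit = footprint(98304, 20, 20, 200000)
# Cache-miss run: 1000 different 96K buffers, 20 in flight at a time.
cache_miss = footprint(98304, 20, 1000, 200000)

for name, fp in (("cache hit", cache_hit), ("cache miss", cache_miss)):
    print(f"{name}: {fp['in_flight_bytes'] / MIB:.3f} MiB in flight, "
          f"{fp['distinct_bytes'] / MIB:.2f} MiB of distinct buffers, "
          f"{fp['total_bytes'] / MIB:.0f} MiB transferred")
```

If that reading is right, both runs keep under 2 MiB in flight, and the
cache-miss run touches just under 94 MiB of distinct buffer memory.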

We are seeing problems that depend on both the buffer size and the
number of iterations. The test gets back -ENOMEM (-12), and dmesg shows
VMA lock errors with the same error code. Are the two related ? Should
we turn on SDP debug to see what specifically can't be allocated ? If
so, what could be done about it ?

When using the default parameters, we see the following:

On the server:

[root@openib1 ~]# ./ttcp.aio.x -r -l 65536 -a 20
ttcp-r: buflen = 65536 nbuf = 0 align = 16384/0 port = 5001
ttcp-r: socket
ttcp-r: accept from 192.168.1.4
ttcp-r: Event error <-12> <5275648>
ttcp-r: 0 bytes in 0.00 real seconds = 0.00 Mbit/sec +++
ttcp-r: 2 I/O calls, usec/call = 112.00, calls/sec = 8928.57
ttcp-r: user: 0 sys: 0 total: 0 real: 224 (microseconds)
[root@openib1 ~]#

On the client:

[root@openib2 ~]# ./ttcp.aio.x -t -l 65536 -n 100000 -a 20 192.168.1.3
ttcp-t: buflen = 65536 nbuf = 100000 align = 16384/0 port = 5001
192.168.1.3
ttcp-t: socket
ttcp-t: connect
ttcp-t: Event error <-12> <5275648>
ttcp-t: 0 bytes in 0.00 real seconds = 0.00 Mbit/sec +++
ttcp-t: 2 I/O calls, usec/call = 83.00, calls/sec = 12048.19
ttcp-t: user: 0 sys: 0 total: 0 real: 166 (microseconds)
[root@openib2 ~]#

Here's the output from the dmesg on the server:

 ERR: : VMA lock <620000:65536> error <-12> <16:0:8>
 ERR: : VMA lock <634000:65536> error <-12> <16:0:8>
 ERR: : VMA lock <648000:65536> error <-12> <16:0:8>
...<repeats>...

Here's the output from the dmesg on the client:

 ERR: : VMA lock <580000:65536> error <-12> <16:0:8>
 ERR: : VMA lock <594000:65536> error <-12> <16:0:8>
 ERR: : VMA lock <5a8000:65536> error <-12> <16:0:8>
...<repeats>...
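
To make the dmesg lines easier to compare across runs, here is a
minimal sketch that pulls out the address, length, and error code. The
field layout (hex start address, decimal length, negative errno) is
only inferred from the lines above; I have not checked it against the
SDP source. The trailing <16:0:8> field is left uninterpreted.

```python
import errno
import re

# Assumed layout, inferred from the log lines only:
#   VMA lock <hex start address:length in bytes> error <negative errno> <...>
VMA_LOCK_RE = re.compile(
    r"VMA lock <([0-9a-fA-F]+):(\d+)> error <(-?\d+)>"
)


def parse_vma_lock(line):
    """Return (address, length, errno_name), or None if the line differs."""
    m = VMA_LOCK_RE.search(line)
    if m is None:
        return None
    addr = int(m.group(1), 16)
    length = int(m.group(2))
    code = abs(int(m.group(3)))
    return addr, length, errno.errorcode.get(code, "unknown")


sample = """\
 ERR: : VMA lock <620000:65536> error <-12> <16:0:8>
 ERR: : VMA lock <634000:65536> error <-12> <16:0:8>
 ERR: : VMA lock <648000:65536> error <-12> <16:0:8>
"""

records = [r for r in map(parse_vma_lock, sample.splitlines()) if r]
# Distance between consecutive failing buffers, in bytes.
strides = [b[0] - a[0] for a, b in zip(records, records[1:])]

for addr, length, name in records:
    print(f"addr=0x{addr:x} len={length} errno={name}")
print("strides:", strides)
```

On Linux, errno 12 is ENOMEM, so the VMA lock failures carry the same
error code as the "Event error <-12>" that ttcp reports. The server-side
addresses are 0x14000 (81920) bytes apart, which is the 65536-byte
buffer plus the 16384 alignment shown in the ttcp banner.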

If the value of -l (length of network read/write buffers) is reduced,
the test runs (up to a buffer size of 4K). However, there is still dmesg
output on the server side:

 ERR: : VMA lock <550000:1024> error <-12> <1:8:8>
 ERR: : VMA lock <554000:1024> error <-12> <1:8:8>
 ERR: : VMA lock <558000:1024> error <-12> <1:8:8>
WARN: : Cancel read with no IOCB. <2:0:00000005>
WARN: : Cancel read with no IOCB. <2:0:00000005>
 ERR: : VMA lock <528000:1024> error <-12> <1:8:8>
 ERR: : VMA lock <52c000:1024> error <-12> <1:8:8>
...<repeats>...

Is this related to system configuration somehow ? How much system memory
do your machines have ? Is this a factor ?
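
For comparing configurations, here is a small sketch that reports total
RAM and the locked-memory limit of the process running it. Whether the
SDP VMA lock path is actually subject to RLIMIT_MEMLOCK is an assumption
on my part and not something I have confirmed; the script only collects
numbers that might be worth comparing between our machines.

```python
import os
import resource


def memory_report():
    """Collect page size, total RAM, and the locked-memory limits."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    phys_pages = os.sysconf("SC_PHYS_PAGES")
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return {
        "page_size": page_size,
        "total_ram_bytes": page_size * phys_pages,
        "memlock_soft": soft,
        "memlock_hard": hard,
    }


def fmt_limit(value):
    """Render an rlimit value, which may be RLIM_INFINITY."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value} bytes"


report = memory_report()
print(f"page size:       {report['page_size']} bytes")
print(f"total RAM:       {report['total_ram_bytes'] // (1024 * 1024)} MiB")
print(f"memlock (soft):  {fmt_limit(report['memlock_soft'])}")
print(f"memlock (hard):  {fmt_limit(report['memlock_hard'])}")
```

Run as the same user that runs ttcp.aio.x, since the limit is
per-process and inherited from the shell.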

Thanks.

-- Hal