[openib-general] Re: Re: Userspace testing results (many kernels, many svn trees)

Wed Jan 25 10:00:45 PST 2006

On 25.01.2006 [08:17:29 +0200], Michael S. Tsirkin wrote:
> Quoting r. Nishanth Aravamudan <nacc at us.ibm.com>:
> > On 24.01.2006 [23:19:52 +0200], Michael S. Tsirkin wrote:
> > > Quoting r. Nishanth Aravamudan <nacc at us.ibm.com>:
> > > > On 24.01.2006 [21:39:23 +0200], Michael S. Tsirkin wrote:
> > > > > Quoting r. Roland Dreier <rdreier at cisco.com>:
> > > > > >     Michael> 1 sec = 5.37731e+14 usec
> > > > > > 
> > > > > >     Michael> which seems to indicate something's still wrong.
> > > > > > 
> > > > > > BTW this number is pretty close to 2^32 times bigger than 1e6, so the
> > > > > > problem is probably still using long long to return the result of
> > > > > > mftb (which will result in shifting the result by 32 bits, i.e.
> > > > > > multiplying by 2^32).
> > > > > 
> > > > > Hmm.
> > > > > Maybe make clean wasn't run after updating?
> > > > > Could it be run on rev 5174?
> > > > 
> > > > Heh, here's what happens with 5174:
> > > > 
> > > > Correlation coefficient r^2: 0.773428 < 0.9
> > > > 1 sec = inf usec
> > > > 1 sec = inf usec
> > > > 1 sec = inf usec
> > > > 1 sec = inf usec
> > > > 1 sec = inf usec
> > > > 1 sec = inf usec
> > > > 1 sec = inf usec
> > > > 1 sec = inf usec
> > > > 1 sec = inf usec
> > > > 1 sec = inf usec
> > > > 1 sec = inf usec
> > > > 1 sec = inf usec
> > > > 1 sec = inf usec
> > > > 1 sec = inf usec
> > > > 1 sec = inf usec
> > > > 1 sec = inf usec
> > > > 1 sec = inf usec
> > > > 
> > > > And so forth...
> > > > 
> > > > Thanks,
> > > > Nish
> > > 
> > > Hmm. Looks like mftb is returning wrong data.
> > > Could you uncomment lines setting DEBUG and DEBUG_DATA at the top?
> > > This will print all mftb values out.
> > 
> > Here you go:
> > 

<snip>

> > x=1990 y=397692
> > x=2000 y=399776
> > x=2010 y=401853
> > x=2020 y=403711
> > x=2030 y=405478
> > x=2040 y=407577
> > x=2050 y=409618
> > x=2060 y=411603
> > x=2070 y=413642
> > x=2080 y=415601
> > x=2090 y=417823
> > a = -8.02523
> > b = 199.818
> > a / b = -0.0401626
> > r^2 = 0.999999
> > Warning: measured timestamp frequency 199.818 differs from nominal 1600 MHz
> > 1 sec = 1.00195e+06 usec
> > 1 sec = 1.00198e+06 usec
> > 1 sec = 1.00207e+06 usec
> > 1 sec = 1.00207e+06 usec
> > 1 sec = 1.00207e+06 usec
> > 1 sec = 1.00207e+06 usec
> > 1 sec = 1.00207e+06 usec
> > 1 sec = 1.00207e+06 usec
> > 1 sec = 1.00207e+06 usec
> > 1 sec = 1.00207e+06 usec
> 
> Seems to work fine now ... what changed?
> Time to try rdma_lat/rdma_bw I guess.

I think rdma_lat and rdma_bw are fixed now, magically. The first job of
the day hasn't finished, but I checked the unformatted logs and it seems
to give the following:

rdma_lat:
Warning: measured timestamp frequency 199.838 differs from nominal 1600 MHz
loading libehca   local address: LID 0x0d QPN 0x140406 PSN 0xee1d06 RKey 0x2340032 VAddr 0x0000001001a001
  remote address: LID 0x08 QPN 0x140406 PSN 0x790ae8 RKey 0x2340032 VAddr 0x0000001001a001
<snip all the values>
Latency typical: 6.10244 usec
Latency best   : 6.00736 usec
Latency worst  : 71.9282 usec

rdma_bw:

Warning: measured timestamp frequency 199.82 differs from nominal 1600 MHz
loading libehca  local address:  LID 0x0d, QPN 0x150406, PSN 0x7cca90 RKey 0x23a0032 VAddr 0x000000f7fce000
  remote address: LID 0x08, QPN 0x150406, PSN 0x35668f, RKey 0x23a0032   VAddr 0x000000f7fb8000
  Bandwidth peak (#0 to #963): 233.043 MB/sec
  Bandwidth average: 233.041 MB/sec
  Service Demand peak (#0 to #963): 837 cycles/KB
  Service Demand Avg  : 50 cycles/KB

Thanks for the debugging,

Nish
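
For readers following the debug output above: the x=/y= pairs and the
a, b and r^2 lines look like a straight-line fit of timestamp-counter
readings against elapsed time, with the slope b reported as the measured
frequency in MHz. Below is a minimal Python sketch of such a
least-squares fit, not the tool's actual code. It assumes x is elapsed
microseconds and y is the counter value in ticks (the output does not
say so), it uses only the 11 samples that survived the <snip>, and the
name fit_line is made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x.

    Returns (a, b, r2): intercept, slope, and squared correlation.
    """
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two or more (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary for a meaningful fit")
    b = sxy / sxx                    # slope: ticks per unit of x
    a = mean_y - b * mean_x          # intercept
    r2 = (sxy * sxy) / (sxx * syy)   # squared correlation coefficient
    return a, b, r2


# The 11 samples visible in the quoted debug output (the rest were snipped).
# Assumption: x is elapsed microseconds, y is timestamp-counter ticks.
XS = [1990, 2000, 2010, 2020, 2030, 2040, 2050, 2060, 2070, 2080, 2090]
YS = [397692, 399776, 401853, 403711, 405478, 407577,
      409618, 411603, 413642, 415601, 417823]

a, b, r2 = fit_line(XS, YS)
# If x is in microseconds, ticks per microsecond is the frequency in MHz.
print("a = %g" % a)
print("b = %g" % b)
print("r^2 = %g" % r2)
```

Because most of the samples were snipped, this will not reproduce the
reported a and b exactly. The slope from these 11 points comes out near
199, in line with the reported b = 199.818.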