[openib-general] Re: Re: [PATCH] rdma_lat-09 and results

Michael S. Tsirkin mst at mellanox.co.il
Thu Jun 2 00:12:03 PDT 2005


Quoting r. Gleb Natapov <glebn at voltaire.com>:
> Subject: Re: Re: [PATCH] rdma_lat-09 and results
> 
> On Thu, Jun 02, 2005 at 08:29:34AM +0300, Michael S. Tsirkin wrote:
> > Quoting r. Grant Grundler <iod00d at hp.com>:
> > > Subject: [PATCH] rdma_lat-09 and results
> > > 
> > > Michael,
> > > 
> > > Good news:
> > > 	My next cleanup of rdma_lat.c is working and patch is appended.
> > > 	Summary of changes below.
> > > 
> > > Bad News:
> > > 	perf is about ~15 cycles slower since the last time I tested.
> > > 	(Hrm...maybe it's time to cycle power on the TS90 switch again.)
> > > 
> > > 
> > > Here's with the new rdma_lat.c:
> > > grundler at gsyprf3:/usr/src/openib_gen2/src/userspace/perftest$ ./rdma_lat  -C
> > >    local address: LID 0x27 QPN 0x80406 PSN 0x9188f7 RKey 0x300434 VAddr 0x6000000000014001
> > >   remote address: LID 0x25 QPN 0x70406 PSN 0x5d4824 RKey 0x2a0434 VAddr 0x6000000000014001
> > > Latency typical: 7140 cycles
> > > Latency best   : 6915 cycles
> > > Latency worst  : 52915.5 cycles
> > > grundler at gsyprf3:/usr/src/openib_gen2/src/userspace/perftest$ 
> > > 
> > > And the "client" side:
> > > grundler at iota:/usr/src/openib_gen2/src/userspace/perftest$ ./rdma_lat -C 10.0.0.51
> > >    local address: LID 0x25 QPN 0x70406 PSN 0x5d4824 RKey 0x2a0434 VAddr 0x6000000000014001
> > >   remote address: LID 0x27 QPN 0x80406 PSN 0x9188f7 RKey 0x300434 VAddr 0x6000000000014001
> > > Latency typical: 7140 cycles
> > > Latency best   : 6907 cycles
> > > Latency worst  : 94920 cycles
> > > 
> > > 
> > > The previous set of rdma_lat results are here:
> > >     http://openib.org/pipermail/openib-general/2005-May/006721.html
> > > 
> > > I'll guess the previous SVN version was no older than r2229.
> > > 
> > > 
> > > I get 7140 to 7151 for the original rdma_lat.   Usually 7147.5.
> > > I get 7132 to 7155 with my version of rdma_lat. Usually 7140.
> > > No statistically significant differences.
> > > Both essentially agree on the higher result.
> > > Using "-n 10000" gave more consistent results *
> > 
> > I changed the timestamping strategy. I used to:
> > 
> > post
> > tstamp
> > poll
> > post
> > tstamp
> > poll
> > post
> > tstamp
> > poll
> > post
> > tstamp
> > poll
> > 
> > This meant that the tstamp instruction was out of the data path:
> > its cost was hidden by the polling.
> > On the negative side, although the average (and likely the median)
> > delta between tstamps was a reliable measure of the round-trip time
> > (there was exactly one tstamp per round trip),
> > the min/max values were not measuring anything reliably: if I start
> > polling late, two tstamps can end up closer together than the wire allows.
> > 
> > So I changed that to:
> > 
> > 
> > tstamp
> > post
> > poll
> > tstamp
> > post
> > poll
> > tstamp
> > post
> > poll
> > tstamp
> > post
> > poll
> > 
> > And now, on the plus side, the min/max deltas are genuinely pessimistic
> > bounds on the round-trip time; on the minus side, we are calling tstamp
> > on the data path, slowing it down. ~15 cycles seems a bit high: of course
> > tstamp needs to prevent instructions from being reordered across it, so
> > it should take on the order of the pipeline depth to perform, but beyond
> > that maybe it's a microcode thing.
> >
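In code, the two orderings look roughly like this. This is only a sketch,
not the actual rdma_lat.c loop: post_send(), poll_cq() and get_cycles()
here are stand-ins for the real verbs calls and the cycle counter.

#include <stdint.h>

extern void post_send(void);          /* stand-in: post the RDMA op */
extern void poll_cq(void);            /* stand-in: poll for completion */
extern uint64_t get_cycles(void);     /* stand-in: read the cycle counter */

void measure_old(uint64_t *tstamp, int iters)
{
	int i;
	for (i = 0; i < iters; ++i) {
		post_send();              /* kick off the round trip */
		tstamp[i] = get_cycles(); /* read while data is on the wire */
		poll_cq();                /* wait for the completion */
	}
}

void measure_new(uint64_t *tstamp, int iters)
{
	int i;
	for (i = 0; i < iters; ++i) {
		tstamp[i] = get_cycles(); /* now on the data path */
		post_send();
		poll_cq();
	}
}

With the new ordering every delta tstamp[i+1] - tstamp[i] spans a complete
post+poll round trip, so no delta can come out shorter than the wire allows.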
> This is what I recently saw on linux-kernel:
> 
> ### begin quote of Andi Kleen
> 
> > RDTSC on older Intel CPUs takes something like 6 cycles. On P4's it
> > takes much more, since it's decoded to a microcode MSR access.
> 
> It actually seems to flush the trace cache, because Intel figured
> out that out-of-order RDTSC is probably not too useful (which is right),
> and the only way to ensure that on Netburst seems to be to stop the
> trace cache in its tracks. That can be pretty slow; we're talking
> 1000+ cycles here.
> ### end quote

OK, but I imagine that's probably the worst case.
I certainly see nothing like 1 usec here, and my systems
are P4-based Xeons.
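
For reference, the usual way to keep RDTSC from being reordered on x86
is to put a serializing instruction (classically CPUID) in front of it.
A minimal sketch with GCC inline assembly follows; it is x86-only and
not necessarily what our get_cycles() actually does:

#include <stdint.h>

/* CPUID serializes: it drains the pipeline, so RDTSC cannot be pulled
 * ahead of earlier instructions.  The serialization itself is what
 * costs on the order of the pipeline depth (and, per the quote above,
 * apparently far more on Netburst). */
static inline uint64_t tstamp_serialized(void)
{
	uint32_t lo, hi;
	asm volatile ("cpuid\n\t"
	              "rdtsc"
	              : "=a" (lo), "=d" (hi)
	              : "a" (0)             /* CPUID leaf 0 */
	              : "%ebx", "%ecx");
	return ((uint64_t)hi << 32) | lo;
}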

> > I'm not against going back to the previous measurement, but we'd have to
> > give up the min/max reporting since it's an artefact.
> > What do you say?
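
For completeness, the typical/best/worst lines are just order statistics
over those deltas, something like the sketch below. It assumes the
tstamp[] array filled in by the loops earlier; the real rdma_lat
presumably also halves the round trip, hence the fractional cycle
counts in the output above.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static int cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a;
	uint64_t y = *(const uint64_t *)b;
	return (x > y) - (x < y);
}

/* Report the median ("typical"), min ("best") and max ("worst") of the
 * per-iteration deltas.  Under the old ordering only the median is
 * trustworthy; the min/max are artefacts of when polling started. */
void report(uint64_t *tstamp, int iters)
{
	int i, n = iters - 1;
	uint64_t *delta = malloc(n * sizeof(*delta));

	for (i = 0; i < n; ++i)
		delta[i] = tstamp[i + 1] - tstamp[i];
	qsort(delta, n, sizeof(*delta), cmp_u64);

	printf("Latency typical: %llu cycles\n",
	       (unsigned long long)delta[n / 2]);
	printf("Latency best   : %llu cycles\n",
	       (unsigned long long)delta[0]);
	printf("Latency worst  : %llu cycles\n",
	       (unsigned long long)delta[n - 1]);
	free(delta);
}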

-- 
MST


