[openib-general] user-mode verbs on Itanium

Fri May 6 22:37:06 PDT 2005

On Fri, May 06, 2005 at 12:09:47PM -0700, Grant Grundler wrote:
> Since the diff is essentially the whole file for everything but
> rdma_lat.c, I've parked the entire mess here:
> 	http://gsyprf3.external.hp.com/openib/perftest-01.tgz

perftest-02.tgz is now available.
And rdma_lat.c has substantial changes:
	o -c option to display output in "raw" CPU cycles
	o -U option to dump unsorted data (MORE ON THIS BELOW)
	o fixed bugs and typos noted by Bernhard Fischer <blist at aon.at>.
	o fixed a few more bugs I found.

I'm consistently getting 4.74 usec median latency with a 2.6.11 kernel +
svn r2229 on an HP ZX1 platform (PCI-X) + 1.5GHz/6M Madison processors.

This seems kinda low so I'm not 100% comfortable with the
measurements yet. Maybe adding MPI on top of this will add
the 2-3 usec that I think is missing.

[ If it's correct, I guess I can go back to working on some tg3
  driver tunes :^) Just kidding! It's late here...  ]

> BTW, Can I "leverage" code from ibv_pingpong to replace the issues
> the following comment in rdma_lat.c refers to?
> /*
>  * pp_get_local_lid() uses a pretty bogus method for finding the LID
>  * of a local port.  Please don't copy this into your app (or if you
>  * do, please rip it out soon).
>  */

This is still outstanding. But I'd like to first see perftest-02
land in a sane place in the openib.org Subversion tree.
I can then submit patches against some stuff:
	o update the README with notes on how to use/interpret the data
	o stop replicating code and make subroutines
	o split up main() into bite-sized chunks so people
	  know which part is "initialization" and "run time".
	o stop pretending there is no global state and get major
	  variables off the stack and into .bss or respective
	  subroutines.

(I still want to hack a bit on mthca_cq.c:cqe_sw() too)

> Couldn't post send: scnt=1

Another clue about this failure: I'm only seeing this if I specify "-s".
1,2,4,8,16,24,25,26,27,28 all worked.  All the values >= 29 failed.
(I tried 29, 30, 31, 32, 33, 64, 65, 4096, 8192).
I'm pretty busy next week...not sure I'll be able to
track it down then.

Why I like -U
-------------
(Sorry, bad pun. ggg hides :^)

"unsorted" output is useful to:
	o correlate client/server data (hiccups in algorithm/fabric)
	o recognize startup vs "steady state" behaviors.
	o recognize cyclical data (e.g. long runs where itimer interferes)

The following is a discussion about the first two points.
Extra credit question buried in the middle.

The unsorted output in CPU cycles of "server":
grundler at gsyprf3:/usr/src/trunk/contrib/mellanox/perftest$ ./rdma_lat -cU | head
  local address:  LID 0x0b, QPN 0x320406, PSN 0xeb1f3e RKey 0x1f60032 VAddr 0x6000000000014001
  remote address: LID 0x10, QPN 0x1a0406, PSN 0x6d8a50, RKey 0x1040436 VAddr 0x6000000000014001
#, cycles
1, 97779
2, 52050
3, 7028
4, 7318
5, 7201
...

And the same from the "client":
grundler at iota:/usr/src/trunk/contrib/mellanox/perftest$ ./rdma_lat -cU 10.0.0.51 | head
  local address:  LID 0x10, QPN 0x1a0406, PSN 0x6d8a50 RKey 0x1040436 VAddr 0x6000000000014001
  remote address: LID 0x0b, QPN 0x320406, PSN 0xeb1f3e, RKey 0x1f60032 VAddr 0x6000000000014001
#, cycles
1, 611
2, 93491
3, 27129
4, 7142
5, 7260
...
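
For anyone who wants to post-process this output, here is a minimal
sketch of a parser. It assumes the format is exactly what is shown
above: a "#, cycles" header line followed by "index, value" lines.
The function name parse_unsorted is made up for illustration and is
not part of perftest.

```python
def parse_unsorted(text):
    """Return the cycle counts from '-cU' style output as a list of ints."""
    samples = []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        # Everything before the "#, cycles" header (address lines) is skipped.
        if line.startswith("#,"):
            in_data = True
            continue
        if not in_data or "," not in line:
            continue
        _index, value = line.split(",", 1)
        try:
            samples.append(int(value))
        except ValueError:
            # Ignore anything that is not "index, integer".
            continue
    return samples


# The server output quoted above, truncated the same way "head" did.
SERVER_OUTPUT = """\
  local address:  LID 0x0b, QPN 0x320406, PSN 0xeb1f3e RKey 0x1f60032 VAddr 0x6000000000014001
#, cycles
1, 97779
2, 52050
3, 7028
4, 7318
5, 7201
...
"""

if __name__ == "__main__":
    print(parse_unsorted(SERVER_OUTPUT))
```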

The first sample (611) on the client is improbably small for a 1.5GHz
system (i.e. ~0.4 usec).  That is a clue that delta[0] means
something different than delta[{N>0}] on the client.
It might be accurate if the client sends first and the next cycle count
is taken right after telling the card data is ready to send.
This could be useful data too.
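
The cycles-to-usec arithmetic used here can be sketched as below.
It assumes the cycle counter ticks at the nominal 1.5 GHz CPU clock;
CPU_HZ and cycles_to_usec are names made up for this example.

```python
# Assumption: the cycle counter runs at the nominal 1.5 GHz CPU clock.
CPU_HZ = 1.5e9


def cycles_to_usec(cycles, cpu_hz=CPU_HZ):
    """Convert a raw cycle count to microseconds."""
    return cycles / cpu_hz * 1e6


if __name__ == "__main__":
    # Client samples 1, 2 and 4 from the output above.
    for cycles in (611, 93491, 7142):
        print(f"{cycles:6d} cycles = {cycles_to_usec(cycles):7.3f} usec")
```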

Michael (mst), is it obvious to you if I've understood that correctly?
Either way, I'd like to leave the test as is and add the explanation
to the README.

The 2nd and 3rd samples sort-of match the first two samples from the
server and are plausible (~63 usec).  The fact that client/server
closely agree on the ~63 usec (93k cycles) sample is a good clue
I can trust the measurement.
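
A rough sketch of that comparison, lining up server sample N with
client sample N+1. The offset of one is an assumption based on the
reading of delta[0] above, and the function name compare is
hypothetical.

```python
# First five samples from each side, as shown in the output above.
SERVER = [97779, 52050, 7028, 7318, 7201]
CLIENT = [611, 93491, 27129, 7142, 7260]


def compare(server, client, offset=1):
    """Pair server[i] with client[i + offset].

    Returns (server_index, client_index, server_cycles, client_cycles,
    relative_difference) tuples, using 1-based sample numbers.
    """
    rows = []
    for i, s in enumerate(server):
        j = i + offset
        if j >= len(client):
            break
        c = client[j]
        rows.append((i + 1, j + 1, s, c, (c - s) / s))
    return rows


if __name__ == "__main__":
    for si, ci, s, c, rel in compare(SERVER, CLIENT):
        print(f"server #{si} {s:6d}  client #{ci} {c:6d}  diff {rel:+.1%}")
```

With the samples shown, the first pair differs by under 5%, while
the second pair is much further apart, hence "sort-of".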

That first sample on the server represents a tuning opportunity.
A ~10x difference between startup and runtime is significant
for short lived connections/regions.
It might be in the switch (TS90), HCA, or host SW stack.
I don't know. I'd need PCI-X bus traces and HW stats from
the HCA or switch to determine how much each contributes
to the latency.
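
One way to put a number on startup vs. steady state from the unsorted
data is sketched below. Treating the first two samples as warmup is
an assumption taken from the server output above, and startup_ratio
is a made-up name.

```python
from statistics import median

# First five server samples, as shown in the output above.
SERVER = [97779, 52050, 7028, 7318, 7201]


def startup_ratio(samples, warmup=2):
    """Ratio of the first sample to the median of the post-warmup samples."""
    steady = samples[warmup:]
    if not steady:
        raise ValueError("need at least one sample after warmup")
    return samples[0] / median(steady)


if __name__ == "__main__":
    print(f"startup/steady-state ratio: {startup_ratio(SERVER):.1f}x")
```

With only five samples this is just a rough indicator, on the order
of 10x; a real run would use the full unsorted output.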

I hope it's obvious now why -U is interesting.

thanks,
grant