[openib-general] user-mode verbs on Itanium

Michael S. Tsirkin mst at mellanox.co.il
Sun May 8 05:49:16 PDT 2005


Quoting r. Grant Grundler <iod00d at hp.com>:
> Subject: Re: [openib-general] user-mode verbs on Itanium
> 
> On Fri, May 06, 2005 at 12:09:47PM -0700, Grant Grundler wrote:
> > Since the diff is essentially the whole file for everything but
> > rdma_lat.c, I've parked the entire mess here:
> > 	http://gsyprf3.external.hp.com/openib/perftest-01.tgz
> 
> perftest-02.tgz is now available.
>
> And rdma_lat.c has substantial changes:
> 	o -c option to display output in "raw" CPU cycles
> 	o -U option to dump unsorted data (MORE ON THIS BELOW)

I'll try to merge these two changes.

> 	o fixed bugs and typos noted by Bernhard Fischer <blist at aon.at>.
> 	o fixed a few more bugs I found.

Besides the new flags, I noticed the following changes:
- asm/timex.h usage to get cycles.
  Unfortunately not portable to all platforms:
  on x86_64 asm/timex.h includes linux/config, so it's not legal
  for userspace to include. Please just implement get_cycles instead.

- get_cpu_mhz() instead of get_cpu_khz() - is that important?
  Actually I planned to change get_cpu_khz() to cpu_khz to match
  the Linux in-kernel interface. Does this make sense to you?
 
> I'm consistently getting 4.74 usec median latency with 2.6.11 kernel +
> svn r2229 on HP ZX1 platform (PCI-X) + 1.5Ghz/6M Madison processors.

That's a bit higher latency than what I see with Intel, but at least
in the right ballpark.
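
For reference, converting between the raw cycle counts printed with -c
and microseconds is just a division by the clock rate. A minimal sketch,
assuming the clock rate is given in MHz (cycles per microsecond); the
name cycles_to_usec() is made up for illustration and is not taken from
rdma_lat.c:

```c
#include <stdint.h>

/*
 * Convert a raw cycle count to microseconds.
 * mhz is the CPU clock in MHz, i.e. cycles per microsecond.
 * Hypothetical helper, not part of perftest.
 */
static double cycles_to_usec(uint64_t cycles, double mhz)
{
    return (double)cycles / mhz;
}
```

At 1500 MHz, 7110 cycles comes out as 4.74 usec.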

> This seems kinda low so I'm not 100% comfortable with the
> measurements yet. Maybe adding MPI on top of this will add
> the 2-3 usec that I think is missing.
> 
> [ If it's correct, I guess I can go back to working on some tg3
>   driver tunes :^) Just kidding! It's late here...  ]
> 
> > BTW, Can I "leverage" code from ibv_pingpong to replace the issues
> > the following comment in rdma_lat.c refers to?
> > /*
> >  * pp_get_local_lid() uses a pretty bogus method for finding the LID
> >  * of a local port.  Please don't copy this into your app (or if you
> >  * do, please rip it out soon).
> >  */

ibv_pingpong used to have the same comment too.
I'll go back and look at ibv_pingpong.

> This is still outstanding. But I'd like to first see perftest-02
> land in a sane place in openib.org Subversion tree.

I could just copy stuff to, say, userspace/perftests.
Is that OK with everyone?


> I can then submit patches against some stuff:
> 	o update the README with notes on how to use/interpret the data
> 	o stop replicating code and make subroutines

I considered this but I'm afraid that adding another layer on top of
libibverbs would detract from readability, and cause people
to copy it wholesale, something which I would like to avoid.

> 	o split up main() into bite-sized chunks so people
> 	  know which part is "initialization" and "run time".

I guess adding more comments like /* Initialization */ is the way to go.

> 	o stop pretending there is no global state and get major
> 	  variables off the stack and into .bss or respective
> 	  subroutines.

Why is that good? Passing flags in global parameters would
make the code less readable.
 
> (I still want to hack a bit on mthca_cq.c:cqe_sw() too)
> 
> 
> > Couldn't post send: scnt=1
> 
> Another clue about this failure: I'm only seeing this if I specify "-s".
> 1,2,4,8,16,24,25,26,27,28 all worked.  All the values > 29 failed.
> (I tried 29, 30, 31, 32, 33, 64, 65, 4096, 8192).
> I'm pretty busy next week...not sure I'll be able to
> track it down then.

I think that's because when I wrote the test, the qp attribute max_inline
didn't exist. I'll update the code.

> 
> Why I like -U
> -------------
> (Sorry, bad pun. ggg hides :^)

:)
I'll merge the -U code, I see how it can be useful.
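
A rough sketch of how the two kinds of output can coexist: keep the
samples in arrival order for the -U dump and sort only a copy to get the
median. The names median_cycles() and cmp_cycles(), and the cycles_t
typedef, are assumptions for illustration, not the actual rdma_lat.c
code:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint64_t cycles_t;      /* assumed sample type */

/* qsort() comparator for cycle counts, ascending. */
static int cmp_cycles(const void *a, const void *b)
{
    cycles_t x = *(const cycles_t *)a;
    cycles_t y = *(const cycles_t *)b;

    return (x > y) - (x < y);
}

/*
 * Return the median of samples[0..n-1] without reordering samples[],
 * so the caller can still dump them unsorted afterwards.
 * For even n this picks the upper of the two middle values.
 * Returns 0 if n is 0 or the temporary copy cannot be allocated.
 */
static cycles_t median_cycles(const cycles_t *samples, size_t n)
{
    cycles_t *sorted;
    cycles_t median;

    if (n == 0)
        return 0;

    sorted = malloc(n * sizeof *sorted);
    if (!sorted)
        return 0;

    memcpy(sorted, samples, n * sizeof *sorted);
    qsort(sorted, n, sizeof *sorted, cmp_cycles);
    median = sorted[n / 2];
    free(sorted);

    return median;
}
```

With the five server samples quoted later in this mail (97779, 52050,
7028, 7318, 7201) this returns 7318, and the array itself still starts
with the 97779 startup sample.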

> "unsorted" output is useful to:
> 	o correlate client/server data (hiccups in algorithm/fabric)
> 	o recognize startup vs "steady state" behaviors.
> 	o recognize cyclical data (e.g. long runs where itimer interferes)
> 
> The following is a discussion about the first two points.
> Extra credit question buried in the middle.
> 
> The unsorted output in CPU cycles of "server":
> grundler at gsyprf3:/usr/src/trunk/contrib/mellanox/perftest$ ./rdma_lat -cU | head
>   local address:  LID 0x0b, QPN 0x320406, PSN 0xeb1f3e RKey 0x1f60032 VAddr 0x6000000000014001
>   remote address: LID 0x10, QPN 0x1a0406, PSN 0x6d8a50, RKey 0x1040436 VAddr 0x6000000000014001
> #, cycles
> 1, 97779
> 2, 52050
> 3, 7028
> 4, 7318
> 5, 7201
> ...
> 
> And the same from the "client":
> grundler at iota:/usr/src/trunk/contrib/mellanox/perftest$ ./rdma_lat -cU 10.0.0.51 | head
>   local address:  LID 0x10, QPN 0x1a0406, PSN 0x6d8a50 RKey 0x1040436 VAddr 0x6000000000014001
>   remote address: LID 0x0b, QPN 0x320406, PSN 0xeb1f3e, RKey 0x1f60032 VAddr 0x6000000000014001
> #, cycles
> 1, 611
> 2, 93491
> 3, 27129
> 4, 7142
> 5, 7260
> ...
> 
> The first sample (611) on the client is improbably small for a 1.5GHz
> system (i.e. ~0.4 usecs).  That is a clue that delta[0] means
> something different than delta[{N>0}] on the client.
> It might be accurate if the client sends first and the next cycle count
> is taken right after telling the card data is ready to send.
> This could be useful data too.



> Michael (mst), is it obvious to you if I've understood that correctly?
> Either way, I'd like to leave the test as is and add the explanation
> to the README.

I'll look into this.

> The 2nd and 3rd sample sort-of match the first two samples from the
> server and are plausible (~63 usecs).  The fact that client/server
> closely agree on the ~63 usec (93k cycles) sample is a good clue
> I can trust the measurement.
> 
> That first sample on the server represents a tuning opportunity.
> A ~10x difference between startup and runtime is significant
> for short lived connections/regions.

On the other hand, it could be negligible compared to connection
setup time.

> It might be in the switch (TS90), HCA, or host SW stack.
> I don't know. I'd need PCI-X bus traces and HW stats from
> the HCA or Switch to determine how much each contributes
> to the latency.
> 
> I hope it's obvious now why -U is interesting.
> 
> thanks,
> grant
> 

OK, so I see the following changes there:

1. support for -U and -c
   I'll merge that

2. get_clock in assembly replaced with asm/timex.h:
   Unfortunately asm/timex.h was never intended for userspace,
   so this trick doesn't work on all platforms: specifically it's
   broken on ppc, i386 and x86_64, so ppc64 and ia64, where it does
   work, are kind of in the minority.

   And given that even these may be broken on some distros,
   I suggest we simply add an assembly implementation for now,
   rather than add this dependency.
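
   A minimal sketch of what a userspace get_cycles() along these lines
   could look like. The x86 and ia64 branches use the usual
   cycle-counter reads (rdtsc and ar.itc); the cycles_t typedef and the
   clock_gettime() fallback, which returns nanoseconds rather than
   cycles, are assumptions for illustration and not code from perftest:

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime() in the fallback */
#include <stdint.h>
#include <time.h>

typedef uint64_t cycles_t;       /* assumed type name */

#if defined(__x86_64__) || defined(__i386__)
/* Read the time stamp counter: low half in eax, high half in edx. */
static inline cycles_t get_cycles(void)
{
    uint32_t lo, hi;

    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((cycles_t)hi << 32) | lo;
}
#elif defined(__ia64__)
/* Read the interval time counter application register. */
static inline cycles_t get_cycles(void)
{
    cycles_t ret;

    __asm__ __volatile__ ("mov %0=ar.itc" : "=r" (ret));
    return ret;
}
#else
/*
 * Fallback for other platforms (an assumption, not a cycle counter):
 * monotonic time in nanoseconds.
 */
static inline cycles_t get_cycles(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (cycles_t)ts.tv_sec * 1000000000ULL + (cycles_t)ts.tv_nsec;
}
#endif
```

   This keeps the test free of any kernel header; other architectures
   would each need their own branch.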

Grant, you mentioned some other fixes - what are they?
Maybe I'll notice them when I do the actual merge.

MST

-- 
MST - Michael S. Tsirkin


