[openib-general] user-mode verbs on Itanium
Michael S. Tsirkin
mst at mellanox.co.il
Sun May 8 05:49:16 PDT 2005
Quoting r. Grant Grundler <iod00d at hp.com>:
> Subject: Re: [openib-general] user-mode verbs on Itanium
>
> On Fri, May 06, 2005 at 12:09:47PM -0700, Grant Grundler wrote:
> > Since the diff is essentially the whole file for everything but
> > rdma_lat.c, I've parked the entire mess here:
> > http://gsyprf3.external.hp.com/openib/perftest-01.tgz
>
> perftest-02.tgz is now available.
>
> And rdma_lat.c has substantial changes:
> o -c option to display output in "raw" CPU cycles
> o -U option to dump unsorted data (MORE ON THIS BELOW)
I'll try to merge these two changes.
> o fixed bugs and typos noted by Bernhard Fischer <blist at aon.at>.
> o fixed a few more bugs I found.
Besides the new flags, I noticed the following changes:
- asm/timex.h usage to get cycles.
Unfortunately this is not portable to all platforms:
on x86_64 asm/timex.h includes linux/config.h, so it's not legal
for userspace to include. Please just implement get_cycles instead.
- get_cpu_mhz() instead of get_cpu_khz() - is that important?
Actually I planned to change get_cpu_khz() to cpu_khz to match
the Linux in-kernel interface. Does this make sense to you?
> I'm consistently getting 4.74 usec median latency with 2.6.11 kernel +
> svn r2229 on HP ZX1 platform (PCI-X) + 1.5GHz/6M Madison processors.
That's a bit higher latency than what I see with Intel, but at least
in the right ballpark.
> This seems kinda low so I'm not 100% comfortable with the
> measurements yet. Maybe adding MPI on top of this will add
> the 2-3 usec that I think is missing.
>
> [ If it's correct, I guess I can go back to working on some tg3
> driver tunes :^) Just kidding! It's late here... ]
>
> > BTW, Can I "leverage" code from ibv_pingpong to replace the issues
> > the following comment in rdma_lat.c refers to?
> > /*
> > * pp_get_local_lid() uses a pretty bogus method for finding the LID
> > * of a local port. Please don't copy this into your app (or if you
> > * do, please rip it out soon).
> > */
ibv_pingpong used to have the same comment too.
I'll go back and look at ibv_pingpong.
> This is still outstanding. But I'd like to first see perftest-02
> land in a sane place in openib.org Subversion tree.
I could just copy the files to, say, userspace/perftests.
Is that OK with everyone?
> I can then submit patches against some stuff:
> o update the README with notes on how to use/interpret the data
> o stop replicating code and make subroutines
I considered this, but I'm afraid that adding another layer on top of
libibverbs would detract from readability, and cause people
to copy it wholesale, something which I would like to avoid.
> o split up main() into bite-sized chunks so people
> know which part is "initialization" and "run time".
I guess adding more comments like /* Initialization */ is the way to go.
> o stop pretending there is no global state and get major
> variables off the stack and into .bss or respective
> subroutines.
Why is that good? Passing flags in global parameters would
make the code less readable.
> (I still want to hack a bit on mthca_cq.c:cqe_sw() too)
>
>
> > Couldn't post send: scnt=1
>
> Another clue about this failure: I'm only seeing this if I specify "-s".
> 1,2,4,8,16,24,25,26,27,28 all worked. All the values > 29 failed.
> (I tried 29, 30, 31, 32, 33, 64, 65, 4096, 8192).
> I'm pretty busy next week...not sure I'll be able to
> track it down then.
I think that's because when I wrote the test, the qp attribute max_inline
didn't exist. I'll update the code.
>
> Why I like -U
> -------------
> (Sorry, bad pun. ggg hides :^)
:)
I'll merge the -U code, I see how it can be useful.
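To make the idea concrete, here is a minimal sketch (not the actual rdma_lat.c code) of keeping the samples in arrival order for a -U style dump while still computing the median from a sorted copy. The names dump_unsorted() and median_cycles() and the cycles_t typedef are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sample type; the real test may use a different one. */
typedef unsigned long long cycles_t;

static int cmp_cycles(const void *a, const void *b)
{
	cycles_t x = *(const cycles_t *)a;
	cycles_t y = *(const cycles_t *)b;

	return (x > y) - (x < y);
}

/* Print the samples in the order they were taken ("-U" style output). */
void dump_unsorted(FILE *out, const cycles_t *delta, size_t n)
{
	size_t i;

	fprintf(out, "#, cycles\n");
	for (i = 0; i < n; ++i)
		fprintf(out, "%zu, %llu\n", i + 1, delta[i]);
}

/*
 * Compute the median on a sorted copy, so the caller's array keeps
 * its arrival order. Returns 0 on success, -1 if allocation fails.
 */
int median_cycles(const cycles_t *delta, size_t n, cycles_t *median)
{
	cycles_t *sorted;

	assert(n > 0);
	sorted = malloc(n * sizeof *sorted);
	if (!sorted)
		return -1;

	memcpy(sorted, delta, n * sizeof *sorted);
	qsort(sorted, n, sizeof *sorted, cmp_cycles);
	*median = sorted[n / 2];
	free(sorted);
	return 0;
}
```

Sorting a copy costs one extra allocation, but it means the same array can feed both the summary statistics and the unsorted dump.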
> "unsorted" output is useful to:
> o correlate client/server data (hiccups in algorithm/fabric)
> o recognize startup vs "steady state" behaviors.
> o recognize cyclical data (e.g. long runs where itimer interferes)
>
> The following is a discussion about the first two points.
> Extra credit question buried in the middle.
>
> The unsorted output in CPU cycles of "server":
> grundler at gsyprf3:/usr/src/trunk/contrib/mellanox/perftest$ ./rdma_lat -cU | head
> local address: LID 0x0b, QPN 0x320406, PSN 0xeb1f3e RKey 0x1f60032 VAddr 0x6000000000014001
> remote address: LID 0x10, QPN 0x1a0406, PSN 0x6d8a50, RKey 0x1040436 VAddr 0x6000000000014001
> #, cycles
> 1, 97779
> 2, 52050
> 3, 7028
> 4, 7318
> 5, 7201
> ...
>
> And the same from the "client":
> grundler at iota:/usr/src/trunk/contrib/mellanox/perftest$ ./rdma_lat -cU 10.0.0.51 | head
> local address: LID 0x10, QPN 0x1a0406, PSN 0x6d8a50 RKey 0x1040436 VAddr 0x6000000000014001
> remote address: LID 0x0b, QPN 0x320406, PSN 0xeb1f3e, RKey 0x1f60032 VAddr 0x6000000000014001
> #, cycles
> 1, 611
> 2, 93491
> 3, 27129
> 4, 7142
> 5, 7260
> ...
>
> The first sample (611) on the client is improbably small for a 1.5GHz
> system (i.e. ~0.4 usecs). That is a clue that delta[0] means
> something different than delta[{N>0}] on the client.
> It might be accurate if the client sends first and the next cycle count
> is taken right after telling the card data is ready to send.
> This could be useful data too.
> Michael (mst), is it obvious to you if I've understood that correctly?
> Either way, I'd like to leave the test as is and add the explanation
> to the README.
I'll look into this.
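For reference, the cycles-to-usec arithmetic used in this thread can be sketched as below. This assumes the counter ticks at the CPU clock rate and that the rate is the 1.5GHz (1500 MHz) quoted above; cycles_to_usec() is a made-up helper name, not something in rdma_lat.c.

```c
#include <assert.h>

/*
 * Convert a cycle count to microseconds.
 * cpu_mhz is the counter frequency in MHz, i.e. cycles per usec.
 * Assumption: the cycle counter runs at the CPU clock rate.
 */
double cycles_to_usec(unsigned long long cycles, double cpu_mhz)
{
	assert(cpu_mhz > 0.0);
	return (double)cycles / cpu_mhz;
}
```

With cpu_mhz = 1500, the 611-cycle sample comes out at roughly 0.41 usec, the 93491-cycle sample at roughly 62 usec, and a steady-state sample such as 7142 cycles at roughly 4.8 usec.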
> The 2nd and 3rd sample sort-of match the first two samples from the
> server and are plausible (~63 usecs). The fact that client/server
> closely agree on the ~63 usec (93k cycles) sample is a good clue
> I can trust the measurement.
>
> That first sample on the server represents a tuning opportunity.
> A ~10x difference between startup and runtime is significant
> for short lived connections/regions.
On the other hand, it could be negligible compared to connection
setup time.
> It might be in the switch (TS90), HCA, or host SW stack.
> I don't know. I'd need PCI-X bus traces, HW stats from
> the HCA or Switch to determine how much each contributes
> to the latency.
>
> I hope it's obvious now why -U is interesting.
>
> thanks,
> grant
>
OK, so I see 3 changes there:
1. support for -U and -c
I'll merge that
2. get_clock in assembly replaced with asm/timex.h:
Unfortunately asm/timex.h was never intended for userspace,
so this trick doesn't work on all platforms: specifically it's
broken on ppc, i386 and x86_64, so ppc64 and ia64, where it does
work, are kind of in the minority.
And given that even these may be broken on some distros,
I suggest we simply add an assembly implementation for now,
rather than add this dependency.
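Roughly what I have in mind is something like the following. It is a sketch only, assuming gcc-style inline assembly; the name get_cycles_sketch() and the clock_gettime() fallback are illustrative and not what is in the tree. Note that on ppc64 the timebase register does not tick at the CPU clock rate, so the value there is "ticks" rather than true cycles.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

typedef unsigned long long cycles_t;

/* Read a free-running counter without including any kernel headers. */
static inline cycles_t get_cycles_sketch(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int lo, hi;

	/* rdtsc returns the time stamp counter in edx:eax */
	__asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
	return ((cycles_t)hi << 32) | lo;
#elif defined(__ia64__)
	unsigned long ret;

	/* ar.itc is the interval time counter */
	__asm__ __volatile__("mov %0=ar.itc" : "=r" (ret));
	return ret;
#elif defined(__powerpc64__)
	unsigned long ret;

	/* timebase register; not CPU cycles, see note above */
	__asm__ __volatile__("mftb %0" : "=r" (ret));
	return ret;
#else
	/* Fallback for other platforms: nanoseconds, not cycles. */
	struct timespec ts;
	int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

	assert(rc == 0);
	(void)rc;
	return (cycles_t)ts.tv_sec * 1000000000ULL + (cycles_t)ts.tv_nsec;
#endif
}
```

32-bit ppc is deliberately left to the fallback here, since reading the 64-bit timebase there needs an upper/lower read loop.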
Grant, you mentioned some other fixes - what are they?
Maybe I'll notice them when I do the actual merge.
MST
--
MST - Michael S. Tsirkin