[ofa-general] IB performance stats (revisited)
Hal Rosenstock
halr at voltaire.com
Wed Jul 11 09:21:51 PDT 2007
On Wed, 2007-07-11 at 11:00, Mark Seger wrote:
> Hal Rosenstock wrote:
>
> >On Wed, 2007-07-11 at 10:15, Mark Seger wrote:
> >
> >
> >>My basic philosophy, and I suspect there are those who might disagree,
> >>is that you can't use the network to monitor the network, at least not
> >>in times of trouble.
> >>
> >>
> >
> >Right, in times of certain troubles.
> >
> >
> And that is the key. Since you can't know a priori when you're about to
> have troubles, you need to be collecting the data locally before they occur.
>
> >>That's why I insist on having to query the HCAs
> >>directly since I can't always be sure the network is there and/or
> >>reliable. If you are willing to concede that this can indeed happen
> >>then the question becomes one of how do you reliably get data from an
> >>HCA and that's the basis for my (re)starting this discussion.
> >>
> >>
> >
> >The reliability comes from timeout/retry mechanisms. If performance data
> >cannot be obtained on an IB network, it needs to be troubleshot at a
> >lower level (by SMPs).
> >
> >In any case, a rearchitecture of the PMA was proposed and seems
> >reasonable to me in that it can accommodate either approach. All that is
> >needed now is for someone to step up and champion an implementation of
> >this. Unfortunately, I do not have time to do so.
> >
> >
> I don't know if what I've been proposing requires any rearchitecting, as
> I see it as something local to each node. Specifically, the idea (and there
> is already an implementation of this in an earlier Voltaire stack) is to
> export wrapping HCA counters to /proc. The module that does this
> reads and clears the counters on every access, but since no local applications
> are accessing the counters directly, clearing them doesn't hurt anyone.
> Alas, anyone else who wants to query the counters will find them reset.
No local application but perhaps a remote one. This is the reason for
the proposed rearchitecture (along with synthesizing the wider
counters).
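
As a rough illustration of what "synthesizing the wider counters" could
look like, here is a minimal sketch. It assumes a 32-bit hardware counter
that wraps, sampled often enough that it wraps at most once between
samples; the class and method names are hypothetical and not part of any
existing PMA or stack code.

```python
class WideCounter:
    """Accumulate a wide running total from a narrow, wrapping counter.

    Hypothetical helper for illustration only; not an existing API.
    """

    def __init__(self, width_bits=32):
        self.modulus = 1 << width_bits  # value at which the raw counter wraps
        self.last_raw = None            # previous raw sample, if any
        self.total = 0                  # synthesized wide total

    def update(self, raw):
        """Feed one raw sample and return the accumulated total."""
        if self.last_raw is not None:
            # Modular subtraction gives the correct delta across one wrap.
            self.total += (raw - self.last_raw) % self.modulus
        # The first sample only establishes the baseline.
        self.last_raw = raw
        return self.total
```

Because the total lives in software and the hardware counter is never
cleared, a local reader and a remote one could both consume the same
values without resetting each other's view.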
-- Hal
> The other side benefit of exporting these counters in such a way is that now
> lots of others can collect/report this info. In other words, if someone
> chose to add IB stats to sar, it would become very easy to do!
>
> If this is the type of thing people are interested in, I might be able
> to supply some code to do it.
>
> >>As for querying the switch for counters, what do you do on a very large
> >>network, say 10s of thousands of nodes if you want to get performance
> >>data every second? I also realize this is an extreme situation today
> >>(the node count not the frequency of monitoring) but I'm sure everyone
> >>would agree systems of these sizes are not that far off.
> >>
> >>
> >
> >You have a distributed performance manager to handle this. A hierarchy
> >of performance managers has been discussed on the list before.
> >
> >
> ahh, I see.
> -mark
>
>