[ofa-general] IB performance stats (revisited)
Hal Rosenstock
halr at voltaire.com
Wed Jul 11 08:16:26 PDT 2007
On Wed, 2007-07-11 at 11:00, Mark Seger wrote:
> Hal Rosenstock wrote:
>
> >On Wed, 2007-07-11 at 10:15, Mark Seger wrote:
> >
> >
> >>My basic philosophy, and I suspect there are those who might disagree,
> >>is that you can't use the network to monitor the network, at least not
> >>in times of trouble.
> >>
> >>
> >
> >Right, in times of certain troubles.
> >
> >
> And that is the key. Since you can't know a priori when you're about to
> have troubles, you need to be collecting the data locally before they occur.
>
> >>That's why I insist on having to query the HCAs
> >>directly since I can't always be sure the network is there and/or
> >>reliable. If you are willing to concede that this can indeed happen
> >>then the question becomes one of how do you reliably get data from an
> >>HCA and that's the basis for my (re)starting this discussion.
> >>
> >>
> >
> >The reliability comes from timeout/retry mechanisms. If performance data
> >cannot be obtained on an IB network, it needs to be troubleshot at a
> >lower level (by SMPs).
> >
> >In any case, a rearchitecture of the PMA was proposed and seems
> >reasonable to me in that it can accommodate either approach. All that is
> >needed now is for someone to step up and champion an implementation of
> >this. Unfortunately, I do not have time to do so.
> >
> >
> I don't know if what I've been proposing requires any rearchitecting as
> I see it as something local to each node.
There was some rearchitecting to make it meet the needs of what you have
proposed in addition to those of the IB performance manager. I think
Jason had a good proposal for this.
-- Hal
> Specifically, the idea (and there is already an implementation of this
> in an earlier Voltaire stack) is to export wrapping HCA counters to
> /proc. The module that does this reads and clears the counters on every
> access, but since no local applications are accessing the counters
> directly, clearing them doesn't hurt anyone. Alas, anyone else who wants
> to query the counters will find them reset.
>
> The other side benefit of exporting these counters in such a way is that
> lots of others can now collect/report this info. In other words, if someone
> chose to add IB stats to sar, it would become very easy to do!
>
> If this is the type of thing people are interested in, I might be able
> to supply some code to do it.
>
> >>As for querying the switch for counters, what do you do on a very large
> >>network, say 10s of thousands of nodes if you want to get performance
> >>data every second? I also realize this is an extreme situation today
> >>(the node count not the frequency of monitoring) but I'm sure everyone
> >>would agree systems of these sizes are not that far off.
> >>
> >>
> >
> >You have a distributed performance manager to handle this. A hierarchy
> >of performance managers has been discussed on the list before.
> >
> >
> ahh, I see.
> -mark
>
>
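For illustration, here is a minimal sketch (in Python, and not the Voltaire
module itself) of the read/clear scheme Mark describes above: each value read
and cleared from the hardware is folded into a software counter that wraps,
and a local reader such as a sar-like tool takes differences between
successive reads. The 32-bit width, the class name WrappingCounter and the
helpers delta and rate are assumptions made for this example; nothing here
talks to a real HCA.

```python
MASK32 = 0xFFFFFFFF  # assumed width of the exported wrapping counter


class WrappingCounter:
    """Software counter fed by read-and-clear hardware samples (hypothetical)."""

    def __init__(self, mask=MASK32):
        self.mask = mask
        self.value = 0

    def accumulate(self, sample):
        # 'sample' is what the hardware counted since the previous read/clear.
        # Adding it modulo the counter width makes the exported value wrap
        # instead of sticking at its maximum.
        self.value = (self.value + sample) & self.mask
        return self.value


def delta(prev, curr, mask=MASK32):
    """Difference between two reads of a wrapping counter.

    Correct as long as the counter wrapped at most once between the reads.
    """
    return (curr - prev) & mask


def rate(prev, curr, seconds, mask=MASK32):
    """Average per-second rate between two reads taken 'seconds' apart."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return delta(prev, curr, mask) / seconds
```

Because a reader only ever looks at differences, several local tools can poll
the exported value at their own intervals without disturbing one another,
which is the side benefit mentioned above. Anything that queries the hardware
counters directly would still find them reset.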