[openib-general] scaling issues, was: uDAPL cma: add support for address and route retries, call disconnect when recving dreq

Todd Rimmer todd.rimmer at qlogic.com
Thu Nov 2 16:15:28 PST 2006


> From: Michael S. Tsirkin
> Sent: Thursday, November 02, 2006 6:15 PM
> To: Hal Rosenstock
> Cc: Or Gerlitz; openib-general; Arlin R Davis
> Subject: Re: [openib-general] scaling issues, was: uDAPL cma: add support
> for address and route retries, call disconnect when recving dreq
> 
> Quoting r. Hal Rosenstock <halr at voltaire.com>:
> > Subject: Re: scaling issues, was: uDAPL cma: add support for address and
> > route retries, call disconnect when recving dreq
> >
> > On Thu, 2006-11-02 at 17:54, Michael S. Tsirkin wrote:
> > > Quoting r. Arlin Davis <ardavis at ichips.intel.com>:
> > > > Subject: Re: [openib-general] scaling issues, was: uDAPL cma: add
> > > > support for address and route retries, call disconnect when recving dreq
> > > >
> > > > Sean Hefty wrote:
> > > >
> > > > >One option is having the SA (or ib_umad?) return a busy status in response to a
> > > > >MAD, but we'd still have to be able to send this response as quickly as requests
> > > > >are being received.  We could then limit the number of requests that would be
> > > > >queued in the kernel for a user.
> > > > >
> > > > >
> > > >
> > > > Another great option would be to have path record caching.  Unfortunately
> > > > OFED 1.1 did not include ib_local_sa in the release.
> > > >
> > >
> > > This won't help you much.
> > > With 256 nodes all to all already gives you 65000 requests
> > > which is the same order of magnitude as the reported 130000.
> >
> > The requests might occur at a different time so they could be spread out
> > rather than synchronized.
> 
> I don't see how caching does this.
> 
If all the queries are made at app startup, there will be one huge batch
of queries to the SA, especially for a many-process MPI job.

In contrast, if the SA cache builds its own replica of the relevant
subset of the SA, the pace can be more controlled.  It can even be
purposely randomized by the SA cache code itself (e.g. don't just
refresh every 10 minutes, refresh every 10 minutes +/- a random offset).
This way, if all nodes power on at a similar time, you won't have a
pattern of everyone querying the SM at the same time.
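
A rough sketch of the randomized refresh idea, in Python for brevity. This
is only an illustration, not the ib_local_sa code: the function names, the
60 second jitter window and the seed parameter are all assumptions made up
for the example. Only the 10 minute base interval comes from the text above.

```python
import random

BASE_INTERVAL_S = 600.0   # "every 10 minutes" from the text above
JITTER_S = 60.0           # assumed jitter window; not specified in the thread


def next_refresh_delay(base_s=BASE_INTERVAL_S, jitter_s=JITTER_S, rng=random):
    """Return the delay until the next cache refresh.

    The delay is the base interval plus a uniform random offset in
    [-jitter_s, +jitter_s], so nodes that booted together drift apart.
    """
    return base_s + rng.uniform(-jitter_s, jitter_s)


def simulate_first_refresh(n_nodes, seed=None):
    """Hypothetical helper: first refresh delay for each of n_nodes
    nodes that all powered on at time zero."""
    rng = random.Random(seed)
    return [next_refresh_delay(rng=rng) for _ in range(n_nodes)]


if __name__ == "__main__":
    delays = simulate_first_refresh(256, seed=1)
    print("earliest refresh: %.1f s" % min(delays))
    print("latest refresh:   %.1f s" % max(delays))
```

With a fixed interval, all 256 nodes would query at the same instant; with
the jitter, the same queries are spread over a window of up to two minutes
around the 10 minute mark.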

Todd Rimmer



