[openib-general] scaling issues, was: uDAPL cma: add support for address and route retries, call disconnect when recving dreq

Todd Rimmer todd.rimmer at qlogic.com
Mon Nov 6 09:11:21 PST 2006


> From: Todd Rimmer
> Sent: Thursday, November 02, 2006 7:15 PM
> To: 'Michael S. Tsirkin'; Hal Rosenstock
> Cc: Or Gerlitz; openib-general; Arlin R Davis
> Subject: RE: [openib-general] scaling issues, was: uDAPL cma: add
> support for address and route retries, call disconnect when recving dreq
> 
> > From: Michael S. Tsirkin
> > Sent: Thursday, November 02, 2006 6:15 PM
> > To: Hal Rosenstock
> > Cc: Or Gerlitz; openib-general; Arlin R Davis
> > Subject: Re: [openib-general] scaling issues, was: uDAPL cma: add
> > support for address and route retries, call disconnect when recving dreq
> >
> > Quoting r. Hal Rosenstock <halr at voltaire.com>:
> > > Subject: Re: scaling issues, was: uDAPL cma: add support for
> > > address and route retries, call disconnect when recving dreq
> > >
> > > On Thu, 2006-11-02 at 17:54, Michael S. Tsirkin wrote:
> > > > Quoting r. Arlin Davis <ardavis at ichips.intel.com>:
> > > > > Subject: Re: [openib-general] scaling issues, was: uDAPL cma: add
> > > > > support for address and route retries, call disconnect when recving dreq
> > > > >
> > > > > Sean Hefty wrote:
> > > > >
> > > > > >One option is having the SA (or ib_umad?) return a busy status
> > > > > >in response to a MAD, but we'd still have to be able to send
> > > > > >this response as quickly as requests are being received.  We
> > > > > >could then limit the number of requests that would be queued in
> > > > > >the kernel for a user.
> > > > > >
> > > > > >
> > > > >
> > > > > Another great option would be to have path record caching.
> > > > > Unfortunately, OFED 1.1 did not include ib_local_sa in the release.
> > > > >
> > > >
> > > > This won't help you much.
> > > > With 256 nodes, all-to-all alone already gives you about 65,000
> > > > requests, which is the same order of magnitude as the reported 130,000.
> > >
> > > The requests might occur at different times, so they could be
> > > spread out rather than synchronized.
> >
> > I don't see how caching does this.
> >
> If all the queries are made at app startup, there will be one huge
> batch of queries to the SA, especially for a many-process MPI job.
> 
> In contrast, if SA caching builds its own replica of the relevant
> subset of the SA, the pace can be more controlled.  It can even be
> purposely randomized by the SA cache code itself (e.g. don't just
> refresh every 10 minutes; refresh every 10 minutes +/- a random
> number, etc.).  This way, if all nodes power on at a similar time,
> you won't have a pattern of everyone asking the SM at the same time.
> 
> Todd Rimmer

Resending; the original bounced due to an email address change.



