[Users] Finding amount of pinned memory and regions

Sasso, John (GE Power & Water, Non-GE) John1.Sasso at ge.com
Tue Oct 27 12:08:05 PDT 2015


Pardon me if this has been addressed already; I could not find an answer through Google searches.

We are analyzing and troubleshooting MPI jobs of increasingly large scale (Open MPI 1.6.5) that communicate over a Mellanox-based IB fabric. At a sufficiently large scale (number of cores), a job fails with errors similar to:

[yyyyy][[56933,1],1904][connect/btl_openib_connect_oob.c:867:rml_recv_cb] error in endpoint reply start connect
[xxxxx:29318] 853 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / ibv_create_qp failed

My educated guess is that we are running into a memory limitation when queue pairs are created to support such a huge mesh. We are now investigating the XRC transport to decrease memory consumption.
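
One quick check related to this guess is to read the locked-memory limit (RLIMIT_MEMLOCK) from inside the same environment the MPI ranks run in, since memory registered with the IB stack is counted against it. The following is only a sketch: the helper names memlock_limit and format_limit are made up for illustration, and a low limit is just one possible cause of an ibv_create_qp failure (HCA resource limits are another).

```python
import resource


def format_limit(value):
    """Render an rlimit value, showing RLIM_INFINITY as 'unlimited'."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return "%d KiB" % (value // 1024)


def memlock_limit():
    """Return the (soft, hard) RLIMIT_MEMLOCK values in bytes for this process."""
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)


if __name__ == "__main__":
    soft, hard = memlock_limit()
    print("max locked memory: soft=%s hard=%s"
          % (format_limit(soft), format_limit(hard)))
```

The limit seen in an interactive shell can differ from the one the ranks inherit from the job launcher or resource manager, so this would need to be run under the same launcher as the job itself.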

Anyway, my questions are:


1. How do we determine HOW MUCH memory is being pinned by an MPI job on a node?  (If pmap, what exactly are we looking for?)

2. How do we determine WHERE these pinned memory regions are?
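
A possible starting point for question 1 is a sketch like the following. It assumes the kernel accounts pinned pages in the VmLck field of /proc/<pid>/status, or in VmPin on newer kernels (VmPin may be absent on older kernels such as those in RedHat 6.x). Whether these counters capture everything the IB stack pins is an assumption, not something confirmed here, and the function names are hypothetical. It does not address question 2.

```python
def parse_locked_kib(status_text):
    """Extract VmLck and VmPin (in KiB) from the text of /proc/<pid>/status."""
    fields = {}
    for line in status_text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key in ("VmLck", "VmPin"):
            # Lines look like "VmLck:\t     128 kB"; keep the numeric part.
            fields[key] = int(rest.split()[0])
    return fields


def read_locked_kib(pid="self"):
    """Read the counters for one process; pid is a PID string or 'self'."""
    with open("/proc/%s/status" % pid) as f:
        return parse_locked_kib(f.read())


if __name__ == "__main__":
    import sys
    pid = sys.argv[1] if len(sys.argv) > 1 else "self"
    for name, kib in sorted(read_locked_kib(pid).items()):
        print("%s: %d kB" % (name, kib))
```

Summing these values over all ranks on a node would give a rough per-node figure, under the same assumptions.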

We are running RedHat 6.x. I posed this question on the Open MPI mailing list but got no response, perhaps because it was considered more of an IB question.

--john
