[Users] Finding amount of pinned memory and regions
Sasso, John (GE Power & Water, Non-GE)
John1.Sasso at ge.com
Tue Oct 27 12:08:05 PDT 2015
Pardon if this has been addressed already, but I could not find the answer after doing Google searches.
We are analyzing and troubleshooting MPI jobs of increasingly large scale (OpenMPI 1.6.5) that communicate over a Mellanox-based IB fabric. At a sufficiently large scale (number of cores), a job ends up failing with errors similar to:
[yyyyy][[56933,1],1904][connect/btl_openib_connect_oob.c:867:rml_recv_cb] error in endpoint reply start connect
[xxxxx:29318] 853 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / ibv_create_qp failed
My educated guess is that we are running into a memory limitation when the queue pairs needed to support such a large mesh are created. We are now investigating the XRC transport as a way to decrease memory consumption.
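To put rough numbers on the mesh, here is a back-of-the-envelope sketch. It assumes a fully connected job, that on-node peers do not use IB queue pairs, that a reliable-connected transport needs queue pairs for every off-node peer process, and that XRC needs them only per remote node. The function name, the `qps_per_connection` parameter, and the 2048-process / 16-per-node figures are hypothetical illustrations, not values taken from our jobs or from the Open MPI source.

```python
def estimate_qps_per_process(total_procs, procs_per_node, qps_per_connection=1):
    """Rough upper bound on queue pairs one process creates in a full mesh.

    Assumptions (not verified against Open MPI internals):
      - every process eventually talks to every other process
      - peers on the same node do not use IB queue pairs
      - RC-style transports need QPs per remote *process*
      - XRC needs QPs per remote *node*
    """
    nodes = total_procs // procs_per_node
    off_node_peers = total_procs - procs_per_node
    rc_qps = off_node_peers * qps_per_connection
    xrc_qps = (nodes - 1) * qps_per_connection
    return rc_qps, xrc_qps


if __name__ == "__main__":
    # Hypothetical job size, chosen only for illustration.
    rc, xrc = estimate_qps_per_process(total_procs=2048, procs_per_node=16)
    print("per process: RC ~%d QPs, XRC ~%d QPs" % (rc, xrc))
    print("per node:    RC ~%d QPs, XRC ~%d QPs" % (rc * 16, xrc * 16))
```

Under these assumptions the per-process count falls by roughly a factor of the number of processes per node, which is why XRC looks attractive here.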
Anyway, my questions are:
1. How do we determine HOW MUCH memory is being pinned by an MPI job on a node? (If pmap, what exactly are we looking for?)
2. How do we determine WHERE these pinned memory regions are?
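As a possible starting point for question 1, here is a minimal sketch that reads the `VmLck` and `VmPin` fields from `/proc/<pid>/status`. It is an assumption on my part that memory registered through the verbs layer is counted in one of these fields: which one depends on the kernel version, and `VmPin` only exists on newer kernels, so it may be absent on RedHat 6.x. The helper names are hypothetical. This does not address question 2, since these counters give totals and not addresses.

```python
import sys


def locked_and_pinned_kb(status_text):
    """Extract VmLck/VmPin (in kB) from the text of /proc/<pid>/status.

    Fields that the running kernel does not report are simply omitted.
    """
    fields = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in ("VmLck", "VmPin"):
            # The value looks like "    1024 kB"; keep the number only.
            fields[key] = int(value.split()[0])
    return fields


def read_status(pid):
    """Return the raw contents of /proc/<pid>/status for one process."""
    with open("/proc/%d/status" % pid) as f:
        return f.read()


if __name__ == "__main__":
    # Usage: python script.py PID [PID ...], one PID per MPI rank on the node.
    for pid in map(int, sys.argv[1:]):
        print(pid, locked_and_pinned_kb(read_status(pid)))
```

Summing the reported values over all ranks on a node would give a per-node figure to compare against the locked-memory limit (`ulimit -l`).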
We are running RedHat 6.x. I tried posing this question on the OpenMPI mailing list but got no response, perhaps because it was considered more central to IB.
--john