[Ofa_remotepm] Additional notes from remote PM thinktank

Paul Grun grun at cray.com
Thu Apr 12 01:08:33 PDT 2018


Responses to John's thoughts:

"Storage Model: The local case had file and volume models. These are generally understood. What are the correct models for the remote case and what manages them?"

I think the NVMP TWG is assuming that the same set of models applies in the remote case, and that the file and block models are generally well understood.

"Power: I mentioned this Monday night, but it doesn't seem to be in the notes. For massive amounts of PM this becomes important for both TCO, density, and power-budget reasons. For example, my understanding is that Exascale RFPs have a desired power target and a hard cap. So for the checkpoint case, RPM helps with both BW and meeting the overall power budget. Not unique to the remote case, but it seems to be more likely in RPM scale-out solutions."

If I understand you, the point is that using RPM (vs. e.g. DRAM) yields a significant system power savings.  If that's your point, I've tried to include it in the latest version of the slides (coming to an email reflector near you in a few minutes).

"Wear-leveling and error handling: is this different in the remote case?  Depends a lot on how it is done, which was out of scope in the local case, if I read correctly. However, there may be a requirement that a HCA informs both the local and remote sides of an operation on error."

I don't see wear leveling as a use case or API issue, but as an implementation issue.  Do you agree?  As for error handling, I think everyone agrees that 90% of the work is going to go into covering the error cases.  This is a big big big deal.  Did I miss your point?

-Paul




From: Ofa_remotepm [mailto:ofa_remotepm-bounces at lists.openfabrics.org] On Behalf Of Byrne, John (Labs)
Sent: Wednesday, April 11, 2018 6:12 PM
To: Voigt, Doug <doug.voigt at hpe.com>; ofa_remotepm at lists.openfabrics.org
Subject: Re: [Ofa_remotepm] Additional notes from remote PM thinktank

After looking at Doug's version of the slides, I have a couple of random thoughts:

Storage Model: The local case had file and volume models. These are generally understood. What are the correct models for the remote case and what manages them?

Power: I mentioned this Monday night, but it doesn't seem to be in the notes. For massive amounts of PM this becomes important for TCO, density, and power-budget reasons. For example, my understanding is that Exascale RFPs have a desired power target and a hard cap. So for the checkpoint case, RPM helps with both BW and meeting the overall power budget. Not unique to the remote case, but it seems to be more likely in RPM scale-out solutions.

Wear-leveling and error handling: is this different in the remote case?  Depends a lot on how it is done, which was out of scope in the local case, if I read correctly. However, there may be a requirement that an HCA informs both the local and remote sides of an operation on error.

That's all for now.

John Byrne

From: Ofa_remotepm [mailto:ofa_remotepm-bounces at lists.openfabrics.org] On Behalf Of Voigt, Doug
Sent: Tuesday, April 10, 2018 1:09 PM
To: ofa_remotepm at lists.openfabrics.org
Subject: [Ofa_remotepm] Additional notes from remote PM thinktank


I added slides 12 - 16 to the prior slide deck.  My notes focused on use case and gap enumeration.  There is some overlap with the other slides.

Doug
