[openib-general] openib gen2 architecture

shaharf shaharf at voltaire.com
Thu Nov 18 08:14:47 PST 2004


Hi all,

 

            I know I am new to this project and I may be naïve, but I want to understand a few things concerning the openib architecture. In the course of learning the openib gen2 stack and preparing to port the opensm to it (which is my current task), I have encountered a few areas that seem problematic to me, and I would like to understand the reasoning behind them, if not to offer alternatives. I am sorry that I raise these issues so late, but I was not involved in this project earlier. I hope it is better late than never.

 

            It seems to me that the major design approach is to do everything in the kernel, but to give user-mode software access to the lower levels so that performance-sensitive applications can bypass all the kernel layers. Am I right?

 

            It also seems that, within the kernel, the ib interface/verbs (ib_*) is very close to the mthca verbs, which in turn are very close to VAPI. I know that this is the way most of the industry has been working, but I wonder: is this the correct model? Will this not pollute the kernel with a lot of IB-specific stuff? Personally, I think that the IB verbs (VAPI) are so complicated that another level of abstraction is required. PDs, MRs, QPs, the QP state machine, PKEYs, MLIDs and other "curses" - why should a module such as IPoIB know about them? If the answer is performance, then I have to disagree. In the same fashion you could say that, in order to achieve efficient disk IO, applications should know the disk's geometry and be able to do direct IO to the disk firmware, or that applications should talk SCSI verbs to optimize their data transfers.
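
To make the point concrete, here is a deliberately simplified, purely hypothetical sketch (this is not the actual gen2 API; every name here is invented for illustration) of the kind of ceremony a ULP has to go through at the verbs level just to get one packet on the wire:

```python
# Illustrative stubs only: each call stands in for a real verbs-level
# operation that the ULP itself must issue under a verbs-style interface.
def open_hca(name):            return {"hca": name}
def alloc_pd(hca):             return {"pd": hca}
def reg_mr(pd, buf):           return {"mr": buf, "lkey": 0x1234}
def create_cq(hca, entries):   return {"cq": entries}
def create_qp(pd, send_cq, recv_cq, qp_type):
    return {"qp": qp_type, "state": "RESET"}
def modify_qp(qp, state, **attrs):
    qp["state"] = state        # RESET -> INIT -> RTR -> RTS, one call per step
    return qp

def send_one_packet(payload):
    """Every one of these steps is the ULP's responsibility at the verbs
    level, and none of them is about the packet itself."""
    hca = open_hca("mthca0")
    pd = alloc_pd(hca)
    mr = reg_mr(pd, payload)                   # pin and register the buffer
    cq = create_cq(hca, entries=16)
    qp = create_qp(pd, cq, cq, qp_type="UD")
    modify_qp(qp, "INIT", pkey_index=0, port=1, qkey=0x80010000)
    modify_qp(qp, "RTR")                       # walk the QP state machine
    modify_qp(qp, "RTS", sq_psn=0)
    # ...and only now build a WQE referencing mr["lkey"] and post it.
    return qp["state"]
```

This is the boilerplate I would rather see hidden inside the core, not repeated in every ULP.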

 

            It seems to me that the current interfaces evolved to what they are today mainly because of the way IB itself evolved - with a lot of uncertainty and a lot of design holes (not to say "craters"). This forced most of the industry to stick with very straightforward interfaces that were based on the Mellanox VAPI.

 

            I wonder if this is not the right time to come up with a much better abstraction - for user mode and for kernel mode. For example, it seems that the abstraction layer should abstract the IB networking objects and not the IB HCA interface. In other words - why not build the abstraction around the IB networking types - UD, RC, RD, MADs? Why do we have to expose the memory management model of the driver/HCA to the upper layers? Do we really want to expose IB-related structures such as CQs, QPs, and WQEs? Why? Not only is this bad for abstraction (changes in the drivers will require modifications in the upper layers), it is also very problematic for security and stability reasons.
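
As a thought experiment only (again, all names here are invented, not a proposal for concrete signatures), an interface organized around the IB transport types might look something like this, with the QPs, CQs, PDs and MRs living entirely behind it inside the core:

```python
# Hypothetical interface organized around IB transport types rather than
# HCA objects; the verbs machinery would be hidden inside the core.
class RCConnection:
    """A reliable-connected endpoint exposed as a simple message channel."""
    def __init__(self, remote):
        self.remote = remote
        self.inbox = []          # stands in for completions delivered upward

    def send(self, msg):
        # The core would register memory, build the WQE, post it and reap
        # the completion; the ULP just hands over bytes.
        self.remote.inbox.append(msg)

    def recv(self):
        return self.inbox.pop(0) if self.inbox else None

def rc_connect():
    """Hypothetical connection setup: the CM exchange, QP creation and the
    state-machine transitions would all happen behind this one call.
    (Modeled here as an in-process loopback pair, for illustration.)"""
    a = RCConnection(remote=None)
    b = RCConnection(remote=a)
    a.remote = b
    return a, b
```

A ULP would then do `a, b = rc_connect(); a.send(data)` and never see a QP. Whether this exact shape is right is beside the point; the point is which side of the interface the IB objects live on.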

 

            I think that choosing the correct abstraction is critical for real acceptance in the Linux/open source world. A good abstraction will also enable us to provide good, secure kernel-mode and user-mode interfaces and access.

 

            Once we have such interfaces, I think we should reconsider the user/kernel division. As a general rule, I think it is commonly agreed that the kernel should include only things that must be in the kernel, meaning hardware-aware software and very performance-sensitive software. Other software modules may be moved into the kernel once they are mature and robust. For example, RPC, NFSD and SMBFS (SAMBA) were developed in user mode, served for many years in user mode, and only after they had matured did they start to "sink" into the kernel. I think that IB, and especially the IB management modules, are far from being mature. Even the IB standard itself is not really stable. Specifically, there is a requirement (in the SOW) to make the IB management distributed, due to scalability and other (redundancy, etc.) requirements. I do not know whether this requirement will actually materialize, but if it does, the SM, and maybe also the SMI/GSI agents and the CM, will have to change significantly. If this is likely to happen, I would suggest keeping as much as possible in user mode - it is much easier to develop and to update. We should have kernel-based agents and mechanisms to assist with performance, but I think that most of the work should be done in user mode, where it can do less harm. Specifically, things such as a MAD transaction manager (retries, timeouts, matching), RMPP and others should be developed in user mode and packaged as libraries, again, at least until they stabilize and mature. Why should we develop complicated functionality such as RMPP in the kernel when only a few kernel-based queries (if any at all) will use it?
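
The transaction-manager part, in particular, is plain bookkeeping with no reason to live in the kernel. A minimal sketch of the idea (hypothetical names and shapes, just to show how little of it is hardware-aware):

```python
import time

class MadTransactionManager:
    """Hypothetical user-mode MAD transaction manager: matches responses
    to requests by transaction ID and resends timed-out requests a
    bounded number of times. Purely illustrative, not a proposed API."""

    def __init__(self, send_fn, max_retries=3, timeout=0.2):
        self.send_fn = send_fn          # callable(tid, payload): puts the MAD on the wire
        self.max_retries = max_retries
        self.timeout = timeout
        self.pending = {}               # tid -> [payload, retries_left, deadline]
        self.next_tid = 0

    def submit(self, payload):
        tid = self.next_tid
        self.next_tid += 1
        self.pending[tid] = [payload, self.max_retries,
                             time.monotonic() + self.timeout]
        self.send_fn(tid, payload)
        return tid

    def on_response(self, tid, response):
        # Match the response to its outstanding request;
        # drop unmatched or duplicate responses.
        if tid in self.pending:
            del self.pending[tid]
            return response
        return None

    def poll_timeouts(self, now=None):
        """Resend expired requests; return the tids that exhausted all retries."""
        now = time.monotonic() if now is None else now
        failed = []
        for tid, entry in list(self.pending.items()):
            payload, retries_left, deadline = entry
            if now < deadline:
                continue
            if retries_left == 0:
                del self.pending[tid]
                failed.append(tid)
            else:
                entry[1] = retries_left - 1
                entry[2] = now + self.timeout
                self.send_fn(tid, payload)
        return failed
```

Nothing here touches hardware; a thin kernel channel to send and receive raw MADs is all it needs underneath, which is exactly why I think it belongs in a user-mode library until it matures.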

 

            If I am not mistaken, one of the IB design goals was to enable efficient *user mode* networking (zero-copy, low latency). This is also the major advantage IB has over several alternatives - most remarkably 10G Ethernet. If we do not emphasize such advantages, we will reduce IB's chances of survival once 10GE is widely used. If potential users get the impression that, compared to 10GE, IB is cheaper, faster and more efficient, but requires tons of special kernel-based modules and very complicated interfaces, and is therefore much less stable and much more exposed to bugs, they will use 10GE. I have no doubt. Yes, it is true that this project is meant to supply an HPC code base, but eventually, IB will not survive as an HPC interconnect only. Furthermore, all HPC applications are user-mode based. Good user-mode interfaces are critical for HPC, no less than for any other high-end networking application.

 

            I would really like to know whether I am shooting in the dark, or whether the issues I mentioned were discussed and there are good reasons for doing things the way they are. Or maybe I don't get the picture, and the state of things is completely different from what I am painting. Either way, I would like to know what you think.

 

Thanks,

            Shahar


