[openib-general][PATCH 1 of 3] repost: Client Reregister support for kernel space

Sean Hefty mshefty at ichips.intel.com
Wed May 31 14:15:24 PDT 2006


Eitan Zahavi wrote:
> [EZ] The race happens when the SM received the request and
> responded, but the other SMs or the file system did not fully store that
> registration and the SM crashed.

If the client received a response that the join was successful, then I consider 
that an SM issue.

The problem is that the SM lost *its* state information.  Requiring end nodes to 
maintain the SM's state for it still doesn't make sense to me.  You're converting 
an SM issue into a requirement that all end nodes must support for proper operation.

Why can't the local system store the same data in another process?  (E.g. record 
all join MADs that have been processed by the SM.)  Why can't that data be saved 
to disk?  Why can't some other arbitrary system in the fabric save that data?
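
To make the save-to-disk option concrete, here is a minimal sketch, assuming the
SM can append each processed join to a local file before it sends the response.
This is not OpenSM code: JoinJournal, the JSON-lines file format, and the GID
strings are all hypothetical and only illustrate the idea.

```python
import json
import os


class JoinJournal:
    """Hypothetical append-only log of multicast joins processed by an SM."""

    def __init__(self, path):
        self.path = path

    def record(self, port_gid, mgid, op="join"):
        # Persist the request before the SM sends its response, so that a
        # successful reply implies the state has been written to disk.
        entry = {"op": op, "port_gid": port_gid, "mgid": mgid}
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(entry) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def replay(self):
        # Rebuild the membership table {mgid: set of port GIDs} after a restart.
        members = {}
        if not os.path.exists(self.path):
            return members
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                try:
                    entry = json.loads(line)
                except json.JSONDecodeError:
                    # A torn final write was never acknowledged; stop here.
                    break
                ports = members.setdefault(entry["mgid"], set())
                if entry["op"] == "join":
                    ports.add(entry["port_gid"])
                else:
                    ports.discard(entry["port_gid"])
        return members
```

A restarted (or standby) SM would call replay() on the same file to recover the
joins it had already acknowledged, without asking end nodes to re-register.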

I still believe that there are a lot of potential solutions to this problem other 
than requiring end nodes to maintain the SM's state.

- Sean


