[ofw] Problem with "Avoid the SM" patch
Fab Tillier
ftillier at windows.microsoft.com
Fri Sep 12 08:59:45 PDT 2008
> Hi Fab,
>
> Here is some more information about the issue and one question.
> There are currently two problems that we see. Both problems start after
> we restart opensm.
>
> 1) After we restart opensm, ARP messages don't pass. The main reason we
> have seen so far is that they are sent with the wrong addresses. We
> haven't yet found exactly why that is, but we expect to find and fix it
> soon.
Is it just a problem with the ARP responses, or with the requests too? The requests should be getting sent to the broadcast group, so they should work. The response is a unicast packet, so it could be getting lost due to the dlid == 0 issue.
> 2) This is the more problematic issue: after we restart opensm,
> __endpt_mgr_reset_all is called. As a result, our entire endpoint
> cache is cleared. Please note that Windows is not aware of what happened
> and therefore doesn't generate ARPs but rather sends unicast packets.
> For these packets we don't have enough information in the endpoint and
> therefore can't send them correctly. In the past we used to query the
> SM for these packets, but we don't want to do that anymore. So
> my question is this: how do we want to solve this issue?
> 1) Wait for the Windows ARP table to flush? Probably too long.
> 2) Send queries to the SM? We wanted to avoid that.
> 3) Don't clear the endpoints when opensm is being restarted?
> Seems that we might use old data.
> 4) Send arps by ourselves? Probably the best solution but requires
> some more work.
>
> What do you think?
I think the key here might be to keep *some* of the SM interaction - effectively put a path record cache in IPoIB. If we kept the existing path record query logic in IPoIB, the issues with SM restart would go away. We would then need to change how the MAC_TO_PATH IOCTL behaves, allowing requests to be queued and completed asynchronously. The IOCTL handler would look up the endpoint and, if no path had been resolved, would issue a path query unless one was already in progress. This would require queueing the IRPs and tracking them so that a path query completion would complete any pending IRPs.
Probably the simplest way to handle this would be to queue the IRPs in the IBAT layer when they come in, and then try to flush as many IRPs from the queue as possible (checking whether the endpoints have valid paths). Any endpoint that needs a path would have a query issued, and a path query completion would again try to flush as many IRPs from the IBAT queue as possible.
The main advantage of this is that real path records would be used for unicast traffic as well as for IBAT clients, so that the packet rate, MTU, and so forth are set optimally, while the cache would still be updated whenever an ARP response is received, remaining in sync with the network stack.
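To make the queue-and-flush idea concrete, here is a toy model in Python. It is only a sketch of the control flow, not driver code: the names PathCache, mac_to_path, issue_query, on_query_complete and reset_all are hypothetical, plain callbacks stand in for IRPs, and opaque values stand in for path records. Locking, IRP cancellation and failed queries are all left out.

```python
from collections import deque


class PathCache:
    """Toy model of a path record cache with a queue of pending requests.

    All names here are hypothetical; callbacks stand in for IRPs.
    """

    def __init__(self, issue_query):
        self.paths = {}         # MAC -> resolved path record
        self.in_flight = set()  # MACs with a path query outstanding
        self.pending = deque()  # queued (mac, complete) requests
        self.issue_query = issue_query  # assumed hook that starts an SM path query

    def mac_to_path(self, mac, complete):
        """Queue the request, then try to flush the queue."""
        self.pending.append((mac, complete))
        self.flush()

    def flush(self):
        """Complete every request whose path is known; query for the rest."""
        waiting = deque()
        to_query = []
        while self.pending:
            mac, complete = self.pending.popleft()
            path = self.paths.get(mac)
            if path is not None:
                complete(mac, path)
                continue
            waiting.append((mac, complete))
            if mac not in self.in_flight:
                # Only one query per endpoint, however many requests wait on it.
                self.in_flight.add(mac)
                to_query.append(mac)
        self.pending = waiting
        # Issue queries last so that a query completing synchronously
        # sees a consistent queue when it calls back into flush().
        for mac in to_query:
            self.issue_query(mac)

    def on_query_complete(self, mac, path):
        """A path query finished: record the path and flush again."""
        self.in_flight.discard(mac)
        self.paths[mac] = path
        self.flush()

    def reset_all(self):
        """Model an SM restart: resolved paths are dropped, requests stay queued."""
        self.paths.clear()
```

In this model, two requests for the same unresolved MAC cause a single query, and both complete when on_query_complete runs. After reset_all, the next request for that MAC simply triggers a fresh query instead of failing.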
I hope the SM would not have a problem with path queries like this - the query load would grow as the square of the number of nodes, rather than the square of the number of cores.
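As a rough illustration of that scaling, the sketch below counts worst-case path queries under one reading of the above: with the cache in IPoIB each node resolves a path to every other node once, whereas without it every core (say, one process per core) would resolve its own paths. The cluster size and the helper name pair_queries are made up for the example.

```python
def pair_queries(endpoints):
    """Worst case: every endpoint queries a path to every other endpoint once."""
    return endpoints * (endpoints - 1)


# Hypothetical cluster size, chosen only for illustration.
nodes = 64
cores_per_node = 8

per_node_load = pair_queries(nodes)                   # cache shared per node
per_core_load = pair_queries(nodes * cores_per_node)  # every core queries for itself

print(per_node_load, per_core_load)
```

For these numbers the per-node count is 4032 queries against 261632 for the per-core case, roughly a factor of cores_per_node squared.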
-Fab