[ofw] Problem with "Avoid the SM" patch

Tzachi Dar tzachid at mellanox.co.il
Thu Sep 11 14:36:26 PDT 2008


Hi Fab,

Here is some more information about the issue and one question.
There are currently two problems that we see. Both problems start after
we restart opensm.

1) After we restart opensm, ARP messages don't pass. The main reason we
have seen so far is that they are sent with the wrong addresses. We
haven't yet found exactly why that happens, but we expect to find the
cause and fix it soon.

2) This is the more problematic issue: after we restart opensm,
__endpt_mgr_reset_all is called. As a result, our entire endpoint
cache is cleared. Please note that Windows is not aware of what happened,
so it doesn't generate ARPs but rather keeps sending unicast packets.
For these packets we don't have enough information in the endpoint, and
therefore we can't send them correctly. In the past we used to query the
SM for these packets, but we don't want to do that anymore.
So my question is this: how do we want to solve this issue?
1) Wait for the Windows ARP table to flush? Probably takes too long.
2) Send queries to the SM? We wanted to avoid that.
3) Don't clear the endpoints when opensm is restarted? It seems that
we might then use stale data.
4) Send ARPs ourselves? Probably the best solution, but it requires some
more work.
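
To make option 4 a bit more concrete, here is a rough sketch in C of what
"resolve on miss" could look like: when a unicast packet has no endpoint,
we queue it, send one ARP request ourselves, and flush the queue when the
reply arrives. Everything here is an assumption for illustration only:
port_t, endpt_t, pending_t, port_send_unicast, port_on_arp_reply and
port_reset_endpts are hypothetical names, not the real IPoIB code, and
actually posting the ARP request or the unicast send is modeled by simple
counters. I also assume the LID and QPN can be learned when the reply is
received.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_ENDPTS  8
#define MAX_PENDING 8

/* Hypothetical, simplified types; the real driver keeps far more state. */
typedef struct { uint8_t mac[6]; uint16_t lid; uint32_t qpn; int valid; } endpt_t;
typedef struct { uint8_t mac[6]; int pkt_id; int used; } pending_t;

typedef struct {
    endpt_t   endpts[MAX_ENDPTS];
    pending_t pending[MAX_PENDING];
    int       arps_sent;  /* stands in for posting a real ARP request */
    int       pkts_sent;  /* stands in for posting a real unicast send */
} port_t;

enum send_result { SENT, QUEUED_ARP_SENT, QUEUED, DROPPED };

/* Models the effect of the endpoint cache reset after an SM restart. */
void port_reset_endpts(port_t *p)
{
    memset(p->endpts, 0, sizeof p->endpts);
}

static endpt_t *find_endpt(port_t *p, const uint8_t *mac)
{
    for (size_t i = 0; i < MAX_ENDPTS; i++)
        if (p->endpts[i].valid && memcmp(p->endpts[i].mac, mac, 6) == 0)
            return &p->endpts[i];
    return NULL;
}

/* Send a unicast packet; on a cache miss, queue it and ARP for the target. */
enum send_result port_send_unicast(port_t *p, const uint8_t *mac, int pkt_id)
{
    pending_t *slot = NULL;
    int arp_outstanding = 0;

    assert(p != NULL && mac != NULL);
    if (find_endpt(p, mac) != NULL) {
        p->pkts_sent++;
        return SENT;
    }
    for (size_t i = 0; i < MAX_PENDING; i++) {
        if (!p->pending[i].used) {
            if (slot == NULL)
                slot = &p->pending[i];
        } else if (memcmp(p->pending[i].mac, mac, 6) == 0) {
            arp_outstanding = 1;  /* an earlier miss already triggered an ARP */
        }
    }
    if (slot == NULL)
        return DROPPED;           /* pending queue is full */
    memcpy(slot->mac, mac, 6);
    slot->pkt_id = pkt_id;
    slot->used = 1;
    if (arp_outstanding)
        return QUEUED;
    p->arps_sent++;               /* driver-generated ARP request goes out here */
    return QUEUED_ARP_SENT;
}

/* On an ARP reply: (re)create the endpoint, then flush queued packets.
 * Returns the number of packets flushed, or -1 if the cache is full. */
int port_on_arp_reply(port_t *p, const uint8_t *mac, uint16_t lid, uint32_t qpn)
{
    endpt_t *e = find_endpt(p, mac);
    int flushed = 0;

    for (size_t i = 0; e == NULL && i < MAX_ENDPTS; i++)
        if (!p->endpts[i].valid)
            e = &p->endpts[i];
    if (e == NULL)
        return -1;
    memcpy(e->mac, mac, 6);
    e->lid = lid;                 /* assumed to be learned from the reply */
    e->qpn = qpn;
    e->valid = 1;
    for (size_t i = 0; i < MAX_PENDING; i++) {
        if (p->pending[i].used && memcmp(p->pending[i].mac, mac, 6) == 0) {
            p->pending[i].used = 0;
            p->pkts_sent++;
            flushed++;
        }
    }
    return flushed;
}
```

The point of the sketch is that only the first miss per destination
triggers an ARP; later packets to the same destination just wait in the
queue, so a burst of unicast traffic right after the reset does not turn
into a burst of ARP requests. A real implementation would also need a
timeout for ARPs that are never answered.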

What do you think?

Thanks
Tzachi

> -----Original Message-----
> From: ofw-bounces at lists.openfabrics.org 
> [mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of Fab Tillier
> Sent: Wednesday, September 10, 2008 8:08 PM
> To: Alex Naslednikov; Reuven Amitai; Leonid Keller
> Cc: ofw at lists.openfabrics.org
> Subject: RE: [ofw] Problem with "Avoid the SM" patch
> 
> Hi Xalex,
> 
> >From: Alex Naslednikov [mailto:xalex at mellanox.co.il]
> >Sent: Wednesday, September 10, 2008 9:56 AM
> >
> >Hello Fab,
> >Finally, we found the problem and continue in order to fix it.
> >Here is the description.
> >
> >2. Today we found, that this is not the original problem.
> >The problem occurs when one restarts (kills and reruns) opensm. The new
> >instance of opensm initializes some fields in the AV to zero.
> 
> I'm confused about this - how does the SM initialize AV fields?
> 
> >With an IB analyser we found sends with REMOTE_LID==0 right after
> >restarting opensm, which caused PING to fail.
> 
> Did the ARP request get sent properly?  What about the ARP 
> response?  Were the contents of these packets 'sane'?
> 
> >Also, it's not related to the kind of connection, we tried it on 
> >back-to-back connection as well as on switch connection
> 
> Ok, that's good I suppose.
> 
> >3. We continue to debug in order to provide the solution. Please let
> >us know if you have some proposal to resolve this issue
> 
> Did you see this running on top of the 1486 revision, or 
> against the head?  I worry that with all the recent changes 
> to IPoIB that a change might have been lost.
> 
> -Fab


