[Openib-windows] RE: Errors handeling on Winsock direct

Tzachi Dar tzachid at mellanox.co.il
Fri Oct 28 04:36:11 PDT 2005


Hi Fab,

Error handling is probably one of the most complicated areas in programming
(or should I say computer science?). I don't want to start an entire
discussion about how it should be handled, as there are many approaches to
this problem, but I do want to give a few guidelines:

1) The first thing to understand is that there are two types of operations:
the first is a "must succeed" one, while the other is a "might succeed" one.
Examples of the first type are (the list is very short):
    int a;
    a = 5;

or

    SpinLock Lock1;
    Lock1.lock();

Examples of the second type are almost all other functions. Please note that
every function that depends on allocating memory might fail. Therefore
every function that calls it might also fail, and so on. Depending on your
hardware model, you might have to assume that operations on your hardware
might also fail. I believe that this is the case with InfiniBand
operations, which might fail due to errors in the firmware, the cable, or
the HCA.

Generally speaking, never-fail functions are very hard to design, but
writing a program without them is impossible. On the other hand, if a
function might fail, the failure has to be reported.
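
To make the distinction concrete, here is a minimal C sketch. All names
(status_t, buffer_create, session_open and so on) are hypothetical and are
not part of WSD or IBAL. It only shows that a function which allocates
memory might fail, and that this property is inherited by every caller.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical status codes, for illustration only. */
typedef enum { STATUS_OK = 0, STATUS_NO_MEMORY = 1 } status_t;

typedef struct { char *data; size_t size; } buffer_t;
typedef struct { buffer_t rx; int counter; } session_t;

/* Must-succeed: a plain assignment cannot fail, so nothing is reported. */
void session_reset_counter(session_t *s)
{
    s->counter = 0;
}

/* Might-fail: depends on malloc(), so it has to return a status. */
status_t buffer_create(buffer_t *b, size_t size)
{
    b->data = malloc(size);
    if (b->data == NULL) {
        b->size = 0;
        return STATUS_NO_MEMORY;
    }
    b->size = size;
    return STATUS_OK;
}

/* Might-fail as well, only because it calls buffer_create(). */
status_t session_open(session_t *s, size_t rx_size)
{
    status_t st = buffer_create(&s->rx, rx_size);
    if (st != STATUS_OK)
        return st;              /* propagate, do not swallow */
    session_reset_counter(s);
    return STATUS_OK;
}
```

A caller of session_open() is in the same position: it either handles the
status or returns it to its own caller.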

In general, a component like WSD should pass any error that it detects to
the upper layers. These errors should later be propagated as Winsock errors.
Applications have some freedom in what to do, but they too can't ignore
errors.
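
As a rough sketch of what such a translation could look like, the C
fragment below maps a provider-level status to a Winsock error code. The
prov_status_t type and the choice of which status maps to which code are
assumptions made for illustration. The WSA* values are the ones defined by
the Windows headers; they are repeated here only so that the fragment
compiles without them.

```c
/* On Windows these come from the Winsock headers. */
#ifndef WSAENETDOWN
#define WSAENETDOWN   10050
#endif
#ifndef WSAECONNRESET
#define WSAECONNRESET 10054
#endif
#ifndef WSAENOBUFS
#define WSAENOBUFS    10055
#endif

/* Hypothetical provider-level status, not a real WSD or IBAL type. */
typedef enum {
    PROV_OK = 0,
    PROV_NO_MEMORY,
    PROV_LINK_DOWN,
    PROV_QP_ERROR
} prov_status_t;

/* Translate a provider status into a Winsock error code (0 = success).
 * The mapping itself is an assumption, not the provider's actual policy. */
int prov_status_to_wsa(prov_status_t st)
{
    switch (st) {
    case PROV_OK:        return 0;
    case PROV_NO_MEMORY: return WSAENOBUFS;
    case PROV_LINK_DOWN: return WSAENETDOWN;
    case PROV_QP_ERROR:  return WSAECONNRESET;
    }
    return WSAENETDOWN;  /* unknown status: still report a failure */
}
```

The important property is that no provider status other than success ever
maps to 0, so the application always sees that something went wrong.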

In any case, back to our problem: the function ib_poll_cq() is a "might"
fail function, and therefore I don't see how its callers could be anything
else (that is, the callers should also return some kind of error). Please
note that deciding that it is a must succeed function means that all the
functions that it uses (and so on recursively) are must succeed. It also
means that we have to verify that no errors can be returned from the
hardware (for example, if we have a problem with a cable). Moreover, we
also have to verify that all these functions will remain must succeed in
future implementations.

As a consequence of all this, I believe that the interface (and
implementation) of ib_cq_comp() should also change to return an error. Of
course, this doesn't have to happen immediately.
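
The kind of interface change meant here can be sketched in C as follows.
This is not the real ib_cq_comp() or ib_poll_cq() signature; fake_cq_t,
cq_comp_count_only and cq_comp are hypothetical stand-ins. The sketch only
shows why a bare count is not enough (0 can mean "nothing completed" or
"polling failed") and how a separate status removes the ambiguity.

```c
/* Hypothetical status type, for illustration only. */
typedef enum { CQ_OK = 0, CQ_HW_ERROR = 1 } cq_status_t;

/* Hypothetical stand-in for a completion queue. */
typedef struct {
    int pending;    /* completions waiting to be reaped */
    int hw_failed;  /* non-zero if polling the hardware fails */
} fake_cq_t;

/* Count-only style: the caller cannot tell "empty" from "failed". */
int cq_comp_count_only(fake_cq_t *cq)
{
    int n;

    if (cq->hw_failed)
        return 0;               /* the error is silently lost */
    n = cq->pending;
    cq->pending = 0;
    return n;
}

/* Revised style: status is the return value, count is an out-parameter. */
cq_status_t cq_comp(fake_cq_t *cq, int *n_done)
{
    *n_done = 0;
    if (cq->hw_failed)
        return CQ_HW_ERROR;     /* the caller must propagate this */
    *n_done = cq->pending;
    cq->pending = 0;
    return CQ_OK;
}
```

With the revised form, an empty queue yields CQ_OK with a count of 0, while
a failed poll yields CQ_HW_ERROR, so every caller can pass the error on.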

Thanks
Tzachi
>-----Original Message-----
>From: Fab Tillier [mailto:ftillier at silverstorm.com]
>Sent: Friday, October 28, 2005 2:07 AM
>To: 'Dror Goldenberg'; Tzachi Dar; openib-windows at openib.org
>Subject: RE: [Openib-windows] RE: Errors handeling on Winsock direct
>
>> From: Dror Goldenberg [mailto:gdror at mellanox.co.il]
>> Sent: Thursday, October 27, 2005 3:50 PM
>>
>> > From: Fab Tillier [mailto:ftillier at silverstorm.com]
>> >
>> > > As a matter of fact the problem is harder, since the function
>> > > ib_cq_comp doesn't have a way to return an error. Please note
>> > > that returning 0 is not enough, because 0 is a legal result if
>> > > nothing was found. As a result, there is a need to change all
>> > > the callers of this function, so that the error will be
>> > > propagated to their callers.
>> >
>> > What will the callers do with the propagated value?  Would all
>> > cases of errors on the CQ be accompanied by async affiliated
>> > error notification for that CQ?  What about for the QP?
>>
>> Not sure I understand the question, but in general there are
>> problems that are:
>> - reported to the CQ (without async error), those are errors
>>    that have a WQE affiliated with them
>
>These will result in a work completion being retrieved by ib_poll_cq and
>are handled.
>
>> - reported as async errors, no CQE generated because there is no WQE
>> affiliated with them
>
>Async CQ errors aren't currently handled, but should be.  Handling these is
>going to be somewhat complicated, but certainly not impossible, due to the
>one-to-many CQ to QP relationship.
>
>> - reported as async errors because CQ overflows
>
>CQ overflow cannot happen due to the design of the provider.
>
>- Fab