[ofw] Opensm or umad bug

Hefty, Sean sean.hefty at intel.com
Thu Apr 28 08:14:51 PDT 2011


> We have the following code commented out at umad_receiver_stop:
> 
>         /* XXX hangs current thread - suspect umad_recv() ignoring wakeup.
> 
>         cl_thread_destroy(&p_ur->tid);
> 
>         */
> 
> How can one ensure that umad_receiver thread will not run after
> osm_vendor_delete was called ?

umad_recv() does the following basic operations that can block the calling thread:

	ResetEvent(port->overlap.hEvent);
	hr = port->prov->Receive(mad, sizeof(WM_MAD) + (size_t) *length, &port->overlap);
	if (hr == WV_IO_PENDING) {
		hr = WaitForSingleObject(port->overlap.hEvent, (DWORD) timeout_ms);
		if (hr == WAIT_TIMEOUT) {
			hr = umad_cancel_recv(port);
			// umad_cancel_recv does:
			// port->prov->CancelOverlappedRequests();
			// return port->prov->GetOverlappedResult(&port->overlap, &bytes, TRUE);

There are two blocking calls: WaitForSingleObject (obviously) and GetOverlappedResult.  The latter should not block for an extended period of time, since the overlapped request was canceled by the preceding call.

I don't see anything in the documentation for WaitForSingleObject indicating that signaling the thread unblocks the wait.  For Windows, we could allow the user to signal the underlying event directly, or expose the internal umad_cancel_recv call.  libibverbs had to expose similar OS-specific functionality.

Does anyone know how opensm unblocks the receive thread in Linux?

- Sean


