[ofa-general] Bogus Receive Completions

Roland Dreier rdreier at cisco.com
Tue Dec 4 22:20:19 PST 2007


Thanks for the excellent bug report!  With the test case to reproduce
this, resolving the issue should be pretty quick.

 > I have weird behavior of libibverbs + libmthca, which makes me suspicious
 > about either libmthca or the HCA firmware.

I was able to reproduce this here with libibverbs 1.1.1 and libmthca
1.0.4, but only on HCAs running in non-mem-free mode, which makes me
think it must be a firmware issue.  In fact, I get the failure you see
almost instantly on a system with HCA running FW 4.8.917, but if I
take the exact same system and just update the HCA to FW 5.2.917 (so
the HCA runs in mem-free mode) then the test runs for a long time with
no problem (up to iteration 150327477 so far).

I added a little debugging patch to src/cq.c in libmthca, and I found
that when the failure happened, the CQE had a WQE address that was out
of sequence -- the RQ has size 0x200 with 0x20 byte WQEs, and the CQEs
had WQE address 0x100 then WQE address 0x0; or address 0x0 then 0x140;
or even 0x80 twice in a row.
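
For illustration, the ordering check described above can be sketched in C
as follows.  This is not the actual debugging patch; the helper names are
hypothetical, and it assumes receive completions arrive in the order the
WQEs were posted, so each offset should be the previous one plus 0x20,
wrapping at 0x200.

```c
#include <assert.h>
#include <stdint.h>

/* Values from the report: a 0x200-byte receive queue of 0x20-byte WQEs. */
#define RQ_BYTES  0x200u
#define WQE_BYTES 0x20u

/* Hypothetical helper: offset of the WQE expected after 'prev',
 * wrapping around at the end of the ring. */
static uint32_t next_wqe_offset(uint32_t prev)
{
    assert(prev < RQ_BYTES && prev % WQE_BYTES == 0);
    return (prev + WQE_BYTES) % RQ_BYTES;
}

/* Returns 1 if 'cur' directly follows 'prev' in ring order, else 0. */
static int wqe_in_sequence(uint32_t prev, uint32_t cur)
{
    return cur == next_wqe_offset(prev);
}
```

With this check, all three observed pairs (0x100 then 0x0, 0x0 then 0x140,
and 0x80 then 0x80) are flagged as out of sequence, while 0x1e0 then 0x0 is
accepted as a normal wraparound.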

Mellanox: can you take this test case and see if it is indeed a
firmware issue?  I could believe that there is a bug in libmthca's
mthca_tavor_post_recv() function too...

BTW, here are a few comments about things I had to fix to run the test
case:

 > 	memset(&attr1,0,sizeof(attr1));

I needed to add "#include <string.h>" to get a prototype for memset()...

 > 	hints.ai_family=AF_UNSPEC;
 ...
 > 	struct sockaddr remote_saddr;
 > 	socklen_t remote_saddrlen=sizeof(remote_saddr);
 > 	int hsock=accept(ssock,&remote_saddr,&remote_saddrlen);
 > 	close(ssock);
 > 	assert(hsock>=0);
 > 	assert(remote_saddrlen==sizeof(remote_saddr));

On my system at least, using AF_UNSPEC led to accept() returning an
IPv6 socket address, and actually sizeof (struct sockaddr_in6) is 28,
which is bigger than sizeof (struct sockaddr), so this last assert
failed for me.  I fixed this by setting hints.ai_family to AF_INET.
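
As an alternative to forcing AF_INET (not what was done here), the accept()
call could take a struct sockaddr_storage, which is large enough for any
address family.  A minimal sketch, where accept_any() is a hypothetical
wrapper and not part of the test case:

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical wrapper: accept a connection without assuming the
 * peer's address family.  Returns the new socket, or -1 on error. */
static int accept_any(int ssock, struct sockaddr_storage *peer,
                      socklen_t *peerlen)
{
    *peerlen = sizeof(*peer);
    int hsock = accept(ssock, (struct sockaddr *)peer, peerlen);

    /* On success the returned length never exceeds the buffer,
     * whether the peer is IPv4 or IPv6. */
    if (hsock >= 0)
        assert(*peerlen <= sizeof(*peer));
    return hsock;
}
```

The caller can then inspect peer->ss_family to decide whether to treat the
address as a struct sockaddr_in or a struct sockaddr_in6.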

 > 	printf("accepted connection from %i.%i.%i.%i\n",remote_saddr.sa_data[2],remote_saddr.sa_data[3],remote_saddr.sa_data[4],remote_saddr.sa_data[5]);

It didn't cause anything but a cosmetic issue, but on my system at
least, sa_data is an array of signed char, so if any of the octets in
the remote address are >= 128, they printed out as negative numbers.  I
fixed this by adding casts to uint8_t here.
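
A sketch of what such a cast-based fix might look like is below.  The
helper names are hypothetical, and it assumes (as the original printf does)
that the address is IPv4 and that its four octets sit in sa_data[2] through
sa_data[5].

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/socket.h>

/* Hypothetical helper: reinterpret a possibly signed char as an
 * octet in the range 0..255. */
static int octet_as_int(char c)
{
    return (uint8_t)c;
}

/* Hypothetical helper: format the IPv4 octets held in sa_data[2..5]
 * as a dotted quad.  Returns what snprintf() returns. */
static int format_ipv4_octets(const struct sockaddr *sa,
                              char *buf, size_t len)
{
    return snprintf(buf, len, "%i.%i.%i.%i",
                    octet_as_int(sa->sa_data[2]),
                    octet_as_int(sa->sa_data[3]),
                    octet_as_int(sa->sa_data[4]),
                    octet_as_int(sa->sa_data[5]));
}
```

Without the cast, a byte such as 0xC0 is promoted to int as a negative
value when char is signed; with it, the same byte prints as 192.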

 - R.


