[openib-general] [PATCH] libmthca: optimize calls to htonl with constant parameter

Thu Feb 22 22:09:28 PST 2007

 > GCC seems to be unable to propogate constants across calls to htonl.
 > So it turns out to be worth the while to replace htonl with
 > a hand-written macro in case of constant parameter.

I'm wondering why this helps you.  On my system (which has Debian's
old glibc 2.3.6, certainly nothing particularly fancy), I see in
my <netinet/in.h>:

	/* Get machine dependent optimized versions of byte swapping functions.  */
	#include <bits/byteswap.h>

	#ifdef __OPTIMIZE__
	/* We can optimize calls to the conversion functions.  Either nothing has
	   to be done or we are using directly the byte-swapping functions which
	   often can be inlined.  */
	# if __BYTE_ORDER == __BIG_ENDIAN
	//...
	# else
	#  if __BYTE_ORDER == __LITTLE_ENDIAN
	#   define ntohl(x)	__bswap_32 (x)

and so on (and gcc defines __OPTIMIZE__ if you pass it any -O flag
including -Os).  And in <bits/byteswap.h> I have

	/* Swap bytes in 32 bit value.  */
	#define __bswap_constant_32(x) \
	     ((((x) & 0xff000000) >> 24) | (((x) & 0x00ff0000) >>  8) |		      \
	      (((x) & 0x0000ff00) <<  8) | (((x) & 0x000000ff) << 24))

and variations of __bswap_32() that look roughly like

	#  define __bswap_32(x) \
	     (__extension__							      \
	      ({ register unsigned int __v, __x = (x);				      \
		 if (__builtin_constant_p (__x))				      \
		   __v = __bswap_constant_32 (__x);				      \
		 else								      \

and so on.  (The point of all this being that for constants, htonl()
should expand to roughly the same thing as your CONSTANT_HTONL() --
the only difference is that you don't have the & for the << 24 and >>
24 parts, which I guess just has the potential to bite us if someone
did something like CONSTANT_HTONL(1L) on a 64-bit system).

As a quick test I compiled the code

	#include <netinet/in.h>

	enum {
		Y = 5
	};

	uint32_t foo(uint32_t x)
	{
		return x | htonl(Y);
	}

with gcc -c -O and the disassembly of foo() looks like

	0000000000000000 <foo>:
	   0:	89 f8                	mov    %edi,%eax
	   2:	0d 00 00 00 05       	or     $0x5000000,%eax
	   7:	c3                   	retq   

and so everything works exactly the way we would want.  (32-bit i386
also just does or with a constant too).

In fact for libmthca I just checked that the preprocessor output of
places like the following (which your patch converts)

			((wr->send_flags & IBV_SEND_SIGNALED) ?
			 htonl(MTHCA_NEXT_CQ_UPDATE) : 0) |

is

   ((wr->send_flags & IBV_SEND_SIGNALED) ?
    (__extension__ ({ register unsigned int __v, __x = (MTHCA_NEXT_CQ_UPDATE); if (__builtin_constant_p (__x)) __v = ((((__x) & 0xff000000) >> 24) | (((__x) & 0x00ff0000) >> 8) | (((__x) & 0x0000ff00) << 8) | (((__x) & 0x000000ff) << 24)); else __asm__ ("bswap %0" : "=r" (__v) : "0" (__x)); __v; })) : 0) |

And if I compare the generated assembly for libmthca with and without
your patch (on both x86-64 and i386), I don't see any significant
difference (the size is exactly the same, I just see things like the
compiler using eax and edx in the opposite order and trivial things
like that).

So what is different in your setup that causes this patch to make a
difference for you?

(BTW, one thing I did notice while looking at the i386 assembly is
that one micro-optimization that might make sense to use something
like __attribute__((regparm(3))) for internal function calls within
libibverbs and libmthca on i386, since otherwise we waste instructions
pushing stuff on the stack for no reason other than compliance with
the crufty old i386 ABI.  Something like a FASTCALL macro in
<infiniband/arch.h> perhaps... if anyone really cares about 32-bit
i386 performance any more)

 - R.