[ofa-general] [RFC] opensm: cl_qlock_pool benchmark
Sasha Khapyorsky
sashak at voltaire.com
Sun Dec 9 06:18:22 PST 2007
Hi,
I looked at the possibility of optimizing and simplifying SA request
processing in OpenSM and found that a very common practice there is to
use cl_qlock_pool* as a record allocator (it must be locked because
requests of the same type share the pool). It is also used as a MAD
allocator (via osm_mad_pool).
Looking at the implementation of q[lock_]pool, I thought it would be
interesting to compare its performance with standard malloc, which by
itself should be reasonably fast. So I wrote a simple program,
test_pool.c (do_nothing() here is there to prevent a smart optimizer
from dropping some cycles):
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#include <complib/cl_qlockpool.h>
#include <complib/cl_qpool.h>

/* Select the allocator variant to benchmark: USE_MALLOC, USE_QPOOL,
 * or neither (plain cl_qlock_pool). */
#define USE_MALLOC 1
#define USE_QPOOL 1

#ifdef USE_MALLOC
/* malloc/free variant; relies on the local 'item' variable for the size */
#define cl_qlock_pool_get(p) malloc(sizeof(*item))
#define cl_qlock_pool_put(p, mem) free(mem)
#else
#ifdef USE_QPOOL
/* unlocked cl_qpool variant */
#define cl_qlock_pool_t cl_qpool_t
#define cl_qlock_pool_construct(p) cl_qpool_construct(p)
#define cl_qlock_pool_init(p, a, b, c, d, e, f, g) \
	cl_qpool_init(p, a, b, c, d, e, f, g)
#define cl_qlock_pool_destroy(p) cl_qpool_destroy(p)
#define cl_qlock_pool_get(p) cl_qpool_get(p)
#define cl_qlock_pool_put(p, mem) cl_qpool_put(p, mem)
#endif
#endif

typedef struct item {
	cl_pool_item_t pool_item;
	char data[64];
} item_t;

#define POOL_MIN_SIZE 32
#define POOL_GROW_SIZE 32
#define N_TESTS 1000000000

static void do_nothing(struct item *items[], unsigned n)
{
	unsigned i;

	for (i = 0; i < n; i++) {
		if (!strcmp(items[i]->data, "12345678"))
			printf("Yes!!!\n");
	}
}

static int pool_get_and_put_items(cl_qlock_pool_t *p, unsigned n)
{
	struct item *items[n];
	struct item *item;
	unsigned i;

	for (i = 0; i < n; i++) {
		item = (struct item *)cl_qlock_pool_get(p);
		if (!item)
			return -1;
		memset(item->data, 0, sizeof(item->data));
		items[i] = item;
	}

	do_nothing(items, n);

	for (i = 0; i < n; i++)
		cl_qlock_pool_put(p, &items[i]->pool_item);

	return 0;
}

static int test_pool(void)
{
	cl_qlock_pool_t pool;
	cl_status_t status;
	int i;

	cl_qlock_pool_construct(&pool);
	status = cl_qlock_pool_init(&pool, POOL_MIN_SIZE, 0, POOL_GROW_SIZE,
				    sizeof(struct item), NULL, NULL, NULL);
	if (status != CL_SUCCESS)
		return -1;

	for (i = 0; i < N_TESTS; i++)
		if (pool_get_and_put_items(&pool, 1000000) < 0)
			return -i;

	cl_qlock_pool_destroy(&pool);
	return 0;
}

int main(void)
{
	return test_pool();
}
And I got these typical numbers:
* with cl_qlock_pool:
real 0m0.541s
user 0m0.488s
sys 0m0.056s
* with cl_qpool:
real 0m0.350s
user 0m0.288s
sys 0m0.060s
cl_qpool is much faster, as expected, since the locking cycle is
skipped there.
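For context, cl_qlock_pool is essentially cl_qpool plus a spinlock
taken around every get/put. Here is a self-contained sketch of that
pattern; the intrusive free list and the C11 atomic-flag lock are
simplified stand-ins, not the actual complib implementation:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Minimal stand-in for cl_pool_item_t: an intrusive free-list link. */
typedef struct pool_item {
	struct pool_item *next;
} pool_item_t;

/* Unlocked pool (the cl_qpool idea): just a LIFO free list. */
typedef struct qpool {
	pool_item_t *free_list;
} qpool_t;

static pool_item_t *qpool_get(qpool_t *p)
{
	pool_item_t *item = p->free_list;
	if (item)
		p->free_list = item->next;
	return item;	/* a real cl_qpool would grow here instead of failing */
}

static void qpool_put(qpool_t *p, pool_item_t *item)
{
	item->next = p->free_list;
	p->free_list = item;
}

/* The "qlock" variant: the same pool, but every get/put spins on a lock. */
typedef struct qlock_pool {
	atomic_flag lock;
	qpool_t pool;
} qlock_pool_t;

static pool_item_t *qlock_pool_get(qlock_pool_t *p)
{
	pool_item_t *item;
	while (atomic_flag_test_and_set(&p->lock))
		;	/* spin */
	item = qpool_get(&p->pool);
	atomic_flag_clear(&p->lock);
	return item;
}

static void qlock_pool_put(qlock_pool_t *p, pool_item_t *item)
{
	while (atomic_flag_test_and_set(&p->lock))
		;	/* spin */
	qpool_put(&p->pool, item);
	atomic_flag_clear(&p->lock);
}
```

Even with no contention, every get/put pays an atomic
read-modify-write plus a release store, which is consistent with the
gap between the cl_qlock_pool and cl_qpool numbers above.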
* with regular malloc/free:
real 0m0.292s
user 0m0.216s
sys 0m0.072s
And this one is the *fastest*.
In this test I used various numbers for subsequent test cycles and
different optimization flags - the ratios between the numbers stayed
similar.
This shows that regular malloc/free is the fastest allocator when it
doesn't require locking (in OpenSM all these allocations are per
individual request), and that it is more than twice as fast as the
current cl_qlock_pool.
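The malloc/free side is easy to sanity-check in isolation with a
minimal self-contained timing loop over same-sized objects (the
72-byte record size and the counts below are arbitrary illustrative
choices, not taken from the OpenSM code):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

struct rec { char data[72]; };	/* roughly sizeof(item_t) above */

/* Allocate and free n records, rounds times; return CPU seconds spent. */
static double bench_malloc(unsigned rounds, unsigned n)
{
	struct rec **recs = malloc(n * sizeof(*recs));
	clock_t start;
	unsigned i, r;

	if (!recs)
		return -1.0;

	start = clock();
	for (r = 0; r < rounds; r++) {
		for (i = 0; i < n; i++) {
			recs[i] = malloc(sizeof(struct rec));
			memset(recs[i]->data, 0, sizeof(recs[i]->data));
		}
		for (i = 0; i < n; i++)
			free(recs[i]);
	}

	free(recs);
	return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Repeated same-size malloc/free tends to stay on the allocator's
small-chunk fast path, which helps explain why it competes with a
dedicated pool here.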
So the obvious question is: why not convert away from cl_qlock_pool?
Or are there holes in this test? Any thoughts?
Sasha