[ofa-general] [RFC] opensm: cl_qlock_pool benchmark
Sasha Khapyorsky
sashak at voltaire.com
Sun Dec 9 06:18:22 PST 2007
Hi,
I looked at the possibility of optimizing and simplifying SA request
processing in OpenSM and found that a very common practice there is to
use cl_qlock_pool* as a record allocator (it must be locked because
requests of the same type share the pool). It is also used as a MAD
allocator (via osm_mad_pool).
Looking at the implementation of q[lock_]pool, I thought it would be
interesting to compare its performance with standard malloc, which by
itself should be reasonably fast. So I wrote a simple program,
test_pool.c (do_nothing() here is there to prevent a smart optimizer
from dropping some cycles):
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#include <complib/cl_qlockpool.h>
#include <complib/cl_qpool.h>

/* Select the allocator variant to benchmark: USE_MALLOC, USE_QPOOL,
 * or neither (plain cl_qlock_pool). */
#define USE_MALLOC 1
#define USE_QPOOL 1

#ifdef USE_MALLOC
/* malloc/free variant; relies on the local 'item' variable for the size */
#define cl_qlock_pool_get(p) malloc(sizeof(*item))
#define cl_qlock_pool_put(p, mem) free(mem)
#else
#ifdef USE_QPOOL
/* unlocked cl_qpool variant */
#define cl_qlock_pool_t cl_qpool_t
#define cl_qlock_pool_construct(p) cl_qpool_construct(p)
#define cl_qlock_pool_init(p, a, b, c, d, e, f, g) \
	cl_qpool_init(p, a, b, c, d, e, f, g)
#define cl_qlock_pool_destroy(p) cl_qpool_destroy(p)
#define cl_qlock_pool_get(p) cl_qpool_get(p)
#define cl_qlock_pool_put(p, mem) cl_qpool_put(p, mem)
#endif
#endif

typedef struct item {
	cl_pool_item_t pool_item;
	char data[64];
} item_t;

#define POOL_MIN_SIZE 32
#define POOL_GROW_SIZE 32
#define N_TESTS 1000000000

static void do_nothing(struct item *items[], unsigned n)
{
	unsigned i;

	for (i = 0; i < n; i++) {
		if (!strcmp(items[i]->data, "12345678"))
			printf("Yes!!!\n");
	}
}

static int pool_get_and_put_items(cl_qlock_pool_t *p, unsigned n)
{
	struct item *items[n];
	struct item *item;
	unsigned i;

	for (i = 0; i < n; i++) {
		item = (struct item *)cl_qlock_pool_get(p);
		if (!item)
			return -1;
		memset(item->data, 0, sizeof(item->data));
		items[i] = item;
	}

	do_nothing(items, n);

	for (i = 0; i < n; i++)
		cl_qlock_pool_put(p, &items[i]->pool_item);

	return 0;
}

static int test_pool(void)
{
	cl_qlock_pool_t pool;
	cl_status_t status;
	int i;

	cl_qlock_pool_construct(&pool);
	status = cl_qlock_pool_init(&pool, POOL_MIN_SIZE, 0, POOL_GROW_SIZE,
				    sizeof(struct item), NULL, NULL, NULL);
	if (status != CL_SUCCESS)
		return -1;

	for (i = 0; i < N_TESTS; i++)
		if (pool_get_and_put_items(&pool, 1000000) < 0)
			return -i;

	cl_qlock_pool_destroy(&pool);
	return 0;
}

int main(void)
{
	return test_pool();
}
And I got these typical numbers:
* with cl_qlock_pool:
real 0m0.541s
user 0m0.488s
sys 0m0.056s
* with cl_qpool:
real 0m0.350s
user 0m0.288s
sys 0m0.060s
cl_qpool is much faster, as expected, since the locking cycle is
skipped there.
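For context, cl_qlock_pool is essentially cl_qpool plus a spinlock
taken around every get/put. Here is a self-contained sketch of that
pattern; the intrusive free list and the C11 atomic-flag lock are
simplified stand-ins, not the actual complib implementation:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Minimal stand-in for cl_pool_item_t: an intrusive free-list link. */
typedef struct pool_item {
	struct pool_item *next;
} pool_item_t;

/* Unlocked pool (the cl_qpool idea): just a LIFO free list. */
typedef struct qpool {
	pool_item_t *free_list;
} qpool_t;

static pool_item_t *qpool_get(qpool_t *p)
{
	pool_item_t *item = p->free_list;
	if (item)
		p->free_list = item->next;
	return item;	/* a real cl_qpool would grow here instead of failing */
}

static void qpool_put(qpool_t *p, pool_item_t *item)
{
	item->next = p->free_list;
	p->free_list = item;
}

/* The "qlock" variant: the same pool, but every get/put spins on a lock. */
typedef struct qlock_pool {
	atomic_flag lock;
	qpool_t pool;
} qlock_pool_t;

static pool_item_t *qlock_pool_get(qlock_pool_t *p)
{
	pool_item_t *item;
	while (atomic_flag_test_and_set(&p->lock))
		;	/* spin */
	item = qpool_get(&p->pool);
	atomic_flag_clear(&p->lock);
	return item;
}

static void qlock_pool_put(qlock_pool_t *p, pool_item_t *item)
{
	while (atomic_flag_test_and_set(&p->lock))
		;	/* spin */
	qpool_put(&p->pool, item);
	atomic_flag_clear(&p->lock);
}
```

Even with no contention, every get/put pays an atomic
read-modify-write plus a release store, which is consistent with the
gap between the cl_qlock_pool and cl_qpool numbers above.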
* with regular malloc/free:
real 0m0.292s
user 0m0.216s
sys 0m0.072s
And this one is the *fastest*.
In this test I used various numbers for subsequent test cycles and
different optimization flags - the ratios between the numbers stayed
similar.
This shows that regular malloc/free is the fastest allocator when it
doesn't require locking (in OpenSM all these allocations are per
individual request), and that it is more than twice as fast as the
current cl_qlock_pool.
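The malloc/free side is easy to sanity-check in isolation with a
minimal self-contained timing loop over same-sized objects (the
72-byte record size and the counts below are arbitrary illustrative
choices, not taken from the OpenSM code):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

struct rec { char data[72]; };	/* roughly sizeof(item_t) above */

/* Allocate and free n records, rounds times; return CPU seconds spent. */
static double bench_malloc(unsigned rounds, unsigned n)
{
	struct rec **recs = malloc(n * sizeof(*recs));
	clock_t start;
	unsigned i, r;

	if (!recs)
		return -1.0;

	start = clock();
	for (r = 0; r < rounds; r++) {
		for (i = 0; i < n; i++) {
			recs[i] = malloc(sizeof(struct rec));
			memset(recs[i]->data, 0, sizeof(recs[i]->data));
		}
		for (i = 0; i < n; i++)
			free(recs[i]);
	}

	free(recs);
	return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Repeated same-size malloc/free tends to stay on the allocator's
small-chunk fast path, which helps explain why it competes with a
dedicated pool here.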
So the obvious question is: why not convert away from cl_qlock_pool?
Or are there holes in this test? Any thoughts?
Sasha