[openib-general] FW: [PATCH 1 of 3] mad: large RMPP support

Jack Morgenstein jackm at mellanox.co.il
Wed Feb 8 08:28:39 PST 2006


Sorry for breaking the thread (Outlook is problematic).
Jack

-----Original Message-----
From: Jack Morgenstein 
Sent: Wednesday, February 08, 2006 6:23 PM
To: 'Sean Hefty'
Cc: Michael S. Tsirkin; 'rolandd at cisco.com'
Subject: RE: [PATCH 1 of 3] mad: large RMPP support

Sorry for not echoing to openib -- I'm having problems with mutt and our
server (replying to this from Outlook will not place the reply in the
thread).

I would much rather use the linked list.
We may need to allocate a rather large contiguous array (the
ib_mad_segments segment array) for queries involving a large cluster,
and such an allocation has a higher probability of failure.

For example, a 1000-host cluster with 2 ports per HCA will have at
least 4000 records in a SubnAdmGetTableResp for all PortInfo records on
the network (2000 for the HCAs, and at least 2000 for the switch
ports).  Such a query response will generate an RMPP of size 256K --
1000 segments, or a 4K buffer on an x86 machine just for the
segment-pointer array (assuming one allocation per RMPP segment --
N=1).

b. Regarding using buffers which each contain N RMPP segments, this
becomes a management nightmare: if we choose N too large, we may fail
to allocate segments for a large RMPP, so that the entire RMPP fails
(where it could succeed with N=1).  Having N=1 guarantees that if the
allocation can succeed at all, it will.  I do not consider using a
variable-size N within a single RMPP, since this would be very
complicated and error-prone.

We could re-allocate everything if some N does not work -- also very
complex.
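For illustration, here is roughly what the N=1 path looks like with the
list-based ib_mad_multipacket_seg from the patch.  This is a sketch
only -- the function name, GFP handling, and error unwinding are mine,
not the posted code:

#include <linux/list.h>
#include <linux/slab.h>

/* Sketch: allocate RMPP segments one at a time (N=1) and chain them on
 * a list.  Each request is small, and a failure is cleanly unwound, so
 * if the allocations can succeed at all, they will.
 */
static int alloc_rmpp_segments(struct list_head *seg_list, int nseg,
			       size_t seg_size, gfp_t gfp)
{
	struct ib_mad_multipacket_seg *seg;
	int i;

	for (i = 0; i < nseg; i++) {
		seg = kzalloc(sizeof(*seg) + seg_size, gfp);
		if (!seg)
			goto err;	/* no large contiguous alloc to fail */
		seg->size = seg_size;
		list_add_tail(&seg->list, seg_list);
	}
	return 0;

err:
	/* free whatever was already allocated */
	while (!list_empty(seg_list)) {
		seg = list_entry(seg_list->next,
				 struct ib_mad_multipacket_seg, list);
		list_del(&seg->list);
		kfree(seg);
	}
	return -ENOMEM;
}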

Regarding the O(N^2) algorithm for finding the next RMPP segment to
send, MST and I agree that this is not acceptable.  We are considering
an algorithm which stores the current segment pointer in "struct
ib_mad_send_wr_private", so that getting the next segment simply
follows the "next" link.  We're still ironing out proper handling of
the "last acknowledged" processing (maintaining a pointer to the
last-acked segment and advancing it when a new ack arrives -- this
might still involve linear searches).
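A rough sketch of that cursor idea (seg_list, cur_seg, and last_ack are
hypothetical field names, not fields from the posted patch):

/* Sketch: keep cursors into the segment list so that fetching the next
 * segment to send is O(1), following one "next" link instead of
 * searching from the head.
 */
struct ib_mad_send_wr_private {
	/* ... existing fields ... */
	struct list_head		seg_list;  /* all RMPP segments */
	struct ib_mad_multipacket_seg	*cur_seg;  /* last segment sent */
	struct ib_mad_multipacket_seg	*last_ack; /* last segment acked */
};

static struct ib_mad_multipacket_seg *
get_next_seg(struct ib_mad_send_wr_private *wr)
{
	if (wr->cur_seg->list.next == &wr->seg_list)
		return NULL;		/* no more segments to send */
	wr->cur_seg = list_entry(wr->cur_seg->list.next,
				 struct ib_mad_multipacket_seg, list);
	return wr->cur_seg;
}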

Regarding the payload pointer, I agree. It is also trivial to move it to
the ib_mad_send_wr_private structure, hiding it from the user.

Regarding the 64-byte boundary, why is this important?

Jack


-----Original Message-----
From: Sean Hefty [mailto:sean.hefty at intel.com] 
Sent: Wednesday, February 08, 2006 3:01 AM
To: Jack Morgenstein
Cc: openib-general at openib.org
Subject: RE: [PATCH 1 of 3] mad: large RMPP support

Based on what you've done, I'd like to suggest changing the interface
to something similar to that shown below.  I believe that this could be
done with minor changes to the current patches.  Detailed comments that
led to suggesting this change are inline in my responses.

struct ib_mad_segments {
	u32			num_segments;	/* entries in segment[] */
	u32			segment_size;	/* data bytes per segment */
	void			*segment[0];	/* per-segment data buffers */
};

struct ib_mad_send_buf {
	...
	void			*mad;		/* first MAD segment */
	struct ib_mad_segments	*segments;	/* RMPP segments > 1 */
	...
};

This will avoid walking through a list to find segments, and allows for
efficient allocation of the segment data buffers.  Multiple segments
could be allocated through a single kzalloc.  (For example, every n-th
segment would start a new allocation, making deallocation easy as
well.)
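A sketch of what that n-per-allocation scheme might look like
(SEGS_PER_ALLOC, the function name, and the unwinding convention are
illustrative, not proposed API):

#include <linux/kernel.h>
#include <linux/slab.h>

#define SEGS_PER_ALLOC	8	/* segments backed by one kzalloc */

static int alloc_segment_array(struct ib_mad_segments *segs, gfp_t gfp)
{
	void *block = NULL;
	u32 i, n;

	for (i = 0; i < segs->num_segments; i++) {
		if (i % SEGS_PER_ALLOC == 0) {
			/* start a new backing allocation every n-th segment */
			n = min(segs->num_segments - i, (u32) SEGS_PER_ALLOC);
			block = kzalloc(n * segs->segment_size, gfp);
			if (!block)
				return -ENOMEM;	/* caller kfrees segment[k]
						   for k % SEGS_PER_ALLOC == 0 */
		}
		segs->segment[i] = block +
			(i % SEGS_PER_ALLOC) * segs->segment_size;
	}
	return 0;
}

Deallocation then only touches every SEGS_PER_ALLOC-th pointer, which
is what makes it easy.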


>+struct ib_mad_multipacket_seg {
>+	struct list_head list;
>+	u32 size;
>+	u8 data[0];
>+};

Should we ensure that the data alignment is on a 64-byte boundary?
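If we do, one possibility is forcing the member alignment -- a sketch
only, and it assumes the kzalloc'ed block itself comes back 64-byte
aligned at these sizes:

struct ib_mad_multipacket_seg {
	struct list_head list;
	u32 size;
	/* pad the header so data[] starts on a 64-byte boundary
	 * within the allocation */
	u8 data[0] __attribute__((aligned(64)));
};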

> struct ib_mad_send_buf {
> 	struct ib_mad_send_buf	*next;
>-	void			*mad;
>+	void			*mad; /* RMPP: first segment,
>+					 including the MAD header */
>+	void			*mad_payload; /* RMPP: changed per segment */

mad_payload doesn't appear to be directly accessible by the user.  It
should be hidden.

- Sean


