[openib-general] ib_mad.c comments

Hal Rosenstock halr at voltaire.com
Sat Sep 11 04:22:35 PDT 2004


On Wed, 2004-09-08 at 23:07, Roland Dreier wrote:
> Huh... I didn't even notice it was checked in... 

It was checked in for potential use at the SWG meeting. Thanks for the
comments on the "early" code. They will help get us there more quickly.

> anyway, my comments follow after some comments on Sean's comments:
> 
>     Sean> Need to lock when checking/setting version/class/methods.
> 
> I agree for the initial implementation.  Ultimately RCU seems better
> but I would recommend sticking with locking to start with since it's
> much easier to code correctly.

I will put replacing locking with RCU on the futures list for this.

>     Sean> We should return the error code from ib_post_send in order
>     Sean> to handle overruns differently.
> 
> What did we decide about how to handle someone posting more sends than
> the underlying work queue can hold?  

Last I recall, we deferred this issue. Perhaps we should continue to
defer the implementation but decide now what should be done.

A conservative implementation would ensure all sends could be posted
before allowing any of them to be posted. This has an extra cost to
determine whether they all can be posted and doesn't keep the QP as full
as possible. That's one end of the spectrum. 

What happens if only some of the sends get out initially? The client
would need to repost the remainder. If the remainder couldn't be posted
soon enough, some timeout might occur. This seems like a more optimistic
implementation. Does anyone see any issues with this? Is it reasonable
to start with this approach?

An intermediate approach would defer the sends which couldn't be posted
immediately. Perhaps, if all the sends couldn't be posted before some
timeout, it would be treated as an error.

> In any case I agree with this.

Sent patch for this.

>     Sean> Should we avoid casting the list_head to a structure where
>     Sean> possible?
> 
> Yes, definitely.  It's much better to do something like
> &mystruct->list rather than relying on the fact that mystruct has a
> struct list_head as its first member. In fact the usage of list.h is
> pretty broken throughout ib_mad.c, see below.

This is on my short-term TODO list.

>     Sean> Not sure why these calls search for the corresponding work request.
> 
> Yes -- we know the next request to complete will always be the oldest
> one we have around, right?

That's true on the send side, but not on the receive side, since there
are posts for multiple QPs. Maybe there should be a list per QP; then
this would hold there too, eliminating the need to walk the list. The
implementation is rapidly heading in this direction.

>     Sean> Not sure if we need to support max_sge on the send queue.
>     Sean> This may be substantially larger than what we need.  At a
>     Sean> minimum, I think that we need 2 for optimal RMPP support.
>     Sean> I'm not sure where the trade-off is between SGE versus
>     Sean> copying into a single buffer lies.
> 
> I'm not sure there's much practical difference between copying and
> using the HCA to do a gather/scatter on a buffer of size 256.

Any idea at what buffer size there is a difference?

> The big difference is memory per WQE (at least for mthca): supporting the
> max_sge means each WQE will be about 1 KB, while using a smaller
> number means each WQE could be about 128 bytes.

I fixed this and sent a patch.

> OK, my comments (which are based on only a quick read and therefore
> focused mostly on low-level coding details):
> 
> 	kmem_cache_t *ib_mad_cache;
> 
> seems to be unused -- should be static anyway.

I changed this to static. I will remove it if it remains unused.

> 	static u32 ib_mad_client_id = 0;
> 
> needs to be protected by a lock when used later

This will be locked. Rather than a linked list, this will become an
indexed table, as that will be more efficient when looking up the
client in the receive path.

> 	#define IB_MAD_DEVICE_LIST_LOCK_VAR	unsigned long ib_mad_device_list_sflags
> 	#define IB_MAD_DEVICE_LIST_LOCK()	spin_lock_irqsave(&ib_mad_device_list_lock, ib_mad_device_list_sflags)
> 	#define IB_MAD_DEVICE_LIST_UNLOCK()	spin_unlock_irqrestore(&ib_mad_device_list_lock, ib_mad_device_list_sflags)
> 
> Don't use this idiom ... just use the spinlock functions directly.  It
> makes locking code harder to read and review, it leads to wasteful
> stuff like the below (in ib_mad_reg()):
> 
> 	IB_MAD_DEVICE_LIST_LOCK_VAR;
> 	IB_MAD_AGENT_LIST_LOCK_VAR;
> 
> and besides, Documentation/CodingStyle says
> 
>    "macros that depend on having a local variable with a magic name
>     might look like a good thing, but it's confusing as hell when one
>     reads the code and it's prone to breakage from seemingly innocent
>     changes."

This is on my short term TODO list.

> 	/*
> 	 * ib_mad_reg - Register to send/receive MADs.
> 	 * @device - The device to register with.
> 
> Start with /** for kernel doc to pick this up.  Might be better to put
> it in a header file so that it's easier to find the documentation (but
> it's OK to leave it in a .c).

It's a copy of what is in ib_mad.h. I will eliminate it from ib_mad.c.

> 	struct ib_mad_device_private *entry, *priv = NULL,
> 				     *head = (struct ib_mad_device_private *) &ib_mad_device_list;
> 
> This definition of head is totally broken, since ib_mad_device_list is
> declared as:
> 
> 	static struct list_head ib_mad_device_list;
> 
> so trying to use it as a struct ib_mad_device_private is just going
> off into random memory.  However there's no reason to even have a
> variable named head, since it seems you only use it in:
> 
> 	list_for_each(entry, head) {
> 
> This really should be
> 
> 	list_for_each_entry(entry, &ib_mad_device_list, list) {
> 
> and the definition of struct ib_mad_device_private needs to be fixed from
> 
> 	struct ib_mad_device_private {
> 		struct ib_mad_device_private *next;
> 
> to
> 
> 	struct ib_mad_device_private {
> 		struct list_head list;
> 
> (you don't have to use the name list for your struct list_head member;
> that's just my habit).
> 
> 	list_for_each(entry2, head2) {
> 		if (entry2->agent == mad_agent_priv->agent) {
> 			list_del((struct list_head *)entry2);
> 			break;
> 		}
> 	}	
> 
> This is broken for a couple of reasons: misuse of list_for_each as
> just described; also, you can't delete items from a list while walking
> through it with list_for_each (use list_for_each_safe instead);
> finally, there's no reason to walk a list to find the entry you just
> added in the same function -- just call list_del on the entry
> directly, since you should still have it around.
> 
> Pretty much all of these comments apply to all use of the list.h
> macros in the file -- most look wrong.

This is on my short-term TODO list.

> What context is it allowed to call ib_mad_post_send() from?  We never
> discussed this, but since the current implementation allocates work
> requests with
> 
> 	mad_send_wr = kmalloc(sizeof *mad_send_wr, GFP_KERNEL);
> 
> right now it can only be called from process context with no locks
> held.  This seems like it violates the principle of least surprise,
> because ib_post_send() can be called from any context.  

I will make ib_mad_post_send also callable from any context.

> Also, the failure case
> 
> 	if (!mad_send_wr) {
> 		printk(KERN_ERR "No memory for ib_mad_send_wr_private\n");
> 		return -ENOMEM;	
> 	}
> 
> needs to set bad_send_wr.

Fixed. Sent patch for this.

> ib_mad_recv_done_handler() seems to be missing a call to pci_unmap_single().

Yes, I didn't get that far (the receive path is incomplete). I will
finish it this weekend.

> 	static u8 convert_mgmt_class(struct ib_mad_reg_req *mad_reg_req)
> 	{
> 		u8 mgmt_class;
> 	
> 		/* Alias IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE to 0 */
> 		if (mad_reg_req->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) {
> 			mgmt_class = 0;
> 		} else {
> 			mgmt_class = mad_reg_req->mgmt_class;
> 		}
> 		return mgmt_class;
> 	}
> 
> I'd rewrite this as
> 
> 	static inline u8 convert_mgmt_class(struct ib_mad_reg_req *mad_reg_req)
> 	{
> 		return mad_reg_req->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE ?
> 			0 : mad_reg_req->mgmt_class;
> 	}
> 
> or just open code it in the two places it's used.

I chose the inline approach. Sent patch for this.

> 	static int allocate_method_table(struct ib_mad_mgmt_method_table **method)
> 	{
> 		/* .. */
> 			return ENOMEM;
> 
> probably should be -ENOMEM;

I also found more that should have been negative. Sent patch for this.

> 	static void ib_mad_completion_handler(struct ib_mad_device_private *priv)
> 	{
> 	
> 		/*
> 		 * For stack overflow safety reason, WC is static here.
> 		 * This callback may not be called more than once at the same time.
> 		 */
> 		static struct ib_wc wc;
> 
> Seems like a bad plan to me -- on an SMP machine with multiple HCAs
> (or even multiple ports on a single HCA) it seems like we want to
> multithread MAD processing rather than serializing it (In fact Yaron
> has made a lot of noise about running on giant SGI NUMA machines with
> millions of HCAs, where this looks especially bad).  Also the comment
> seems to be wrong -- there seems to be one thread per HCA so multiple
> copies of the callback can run at once.

There used to be a problem with some old Linux kernels when this wasn't
done. I will eliminate the static WC and the comment. Sent patch for
this.

> 	static int ib_mad_thread(void *param)
> 	{
> 		struct ib_mad_device_private *priv = param;
> 		struct ib_mad_thread_data *thread_data = &priv->thread_data;
> 	
> 		lock_kernel();
> 		daemonize("ib_mad-%-6s-%-2d", priv->device->name, priv->port);
> 		unlock_kernel();
> 
> Just use kthread_create() to start your thread and handle all this.
> Even though the current Topspin stack uses a MAD processing thread per
> HCA, I'm not sure it's the best design.  

This design uses a thread per port. 

> Why do we need to defer work to process context?

To minimize the time spent in non-process context. Is this an overly
conservative approach?
	
> 		sema_init(&thread_data->sem, 0);
> 
> Seems like a race condition here ... what happens if someone else
> tries to use the semaphore before the thread has gotten a chance to
> run?  

Eliminated this race by moving the semaphore init into thread
initialization. Sent patch.

> In any case...
> 
> 		while (1) {
> 			if (down_interruptible(&thread_data->sem)) {
> 				printk(KERN_DEBUG "Exiting ib_mad thread\n");
> 				break;
> 			}
> 
> I don't think it's a good idea to use a semaphore and signals to
> control the worker thread.  Better would be a wait queue and something
> like wait_event().

Is there a correctness problem with this, or is it an efficiency issue?

> 	#define IB_MAD_DEVICE_SET_UP(__device__) {\
> 		IB_MAD_DEVICE_LIST_LOCK_VAR;\
> 		IB_MAD_DEVICE_LIST_LOCK();\
> 		(__device__)->up = 1;\
> 		IB_MAD_DEVICE_LIST_UNLOCK();}
> 
> 	#define IB_MAD_DEVICE_SET_DOWN(__device__) {\
> 		IB_MAD_DEVICE_LIST_LOCK_VAR;\
> 		IB_MAD_DEVICE_LIST_LOCK();\
> 		(__device__)->up = 0;\
> 		IB_MAD_DEVICE_LIST_UNLOCK();}
> 
> These don't seem to merit being macros.  If you really want they
> could be inline functions but I don't see any use of the "up" member
> outside of the macros anyway, so maybe you can just kill them.  It
> seems hard to think of how to test "up" in a way that's not racy.

up was to be used to decide whether to proceed with posting MADs to
send. Why is it racy if the check also takes the lock?

> 	for (i = 0; i < num_ports; i++) {
> 		ret = ib_mad_device_open(device, i);
> 
> This is wrong -- for a CA you need to handle ports 1 ... num_ports,
> while a switch just uses port 0.

Fixed. Sent patch.

-- Hal
