[Openib-windows] wsd over mt23108 data corruption issue (x64)

Fabian Tillier ftillier at silverstorm.com
Wed May 10 10:52:53 PDT 2006


Hi Guy,

On 5/10/06, Guy Corem <guyc at voltaire.com> wrote:
>
> I'll try to explain why I suspected a memory pin-down problem.
>
> On buffers bigger than 32MB, the IoAllocateMdl call in register_segment
> (mosal_iobuf.c) fails.
>
> MOSAL_iobuf_register then tries to call register_segment again with half
> the segment size.
>
> When trying to call with buffers >= 64MB, there are at least two
> failures of IoAllocateMdl.

That's expected - the code tries to use as few MDLs as possible, but
MDLs have a limit to how much memory they can reference.  So what ends
up happening is we build a chain of MDLs.

That said, it looks like only the first segment is rounded so that it
ends on a page boundary (if it failed to register in its entirety).  I
think there's an issue here with subsequent segments.

Say you have a starting VA of 0x12345678, and a length of 123456789.

The limit for the MDL is a touch under 32MB on a 64-bit system, let's
round to 32MB for the purpose of this example (33554432).

When we enter, we have the following values:
seg_num = 0;
seg_size = 123456789;
rdc = 123456789;
seg_va = 0x12345678

The first registration will try to allocate an MDL for the full range,
and fail.  The seg_size is cut in half:
seg_size = 61728394

And then the size is rounded so that the segment ends on a page boundary:
delta = (0x12345678 + 61728394) & 4095 = 3330
seg_size = 61729160

Now we try again and fail again.  seg_size is again cut in half:
seg_size = 30864580

And again rounded so the segment ends on a page boundary:
delta = (0x12345678 + 30864580) & 4095 = 2876
seg_size = 30865800

We register, and now succeed.
rdc = 92590989
seg_va = 0x140B5000 (note that it is page aligned - great)

We've now registered from 0x12345678 to 0x140B4FFF.
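
The two halve-and-round steps above can be checked with a few lines of C.
This is only a sketch of the arithmetic as I've described it, assuming a
4096-byte page; shrink_segment is a made-up helper name, not a function in
mosal_iobuf.c, and the clamp against rdc is left out for brevity.

```c
#include <stdint.h>

/* Hypothetical helper: halve seg_size, then adjust it so that
 * seg_va + seg_size falls on a page boundary. */
static uint64_t shrink_segment(uint64_t seg_va, uint64_t seg_size,
                               uint64_t page_size)
{
    uint64_t delta;

    seg_size >>= 1;                                /* lessen the size */
    delta = (seg_va + seg_size) & (page_size - 1); /* bytes into the last page */
    seg_size -= delta;                             /* back up to the page start */
    seg_size += page_size;                         /* then take the whole page */
    return seg_size;
}
```

Feeding it 0x12345678 and 123456789 gives 61729160, and feeding that
result back in gives 30865800, matching the values above.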

Time for the next segment - note that we don't update the seg size
here - and this size is not a multiple of pages, so we'll end in the
middle of a page.

So on we go, try to register the second segment (which will succeed
since seg_size is less than our limit).

We've now registered from 0x140B5000 to 0x15E24987.  Note that we
didn't end on a page boundary.  This means that the page starting at
0x15E24000 is the last page in this segment.

Let's update our values and keep going.
rdc = 61725189;
seg_va = 0x15E24988;

The third segment registers from 0x15E24988 to 0x17B9430F -- not page
aligned on either side!!!
rdc = 30859389
seg_va = 0x17B94310

This time, since rdc < seg_size, we adjust seg_size to register the
rest of the buffer:
seg_size = 30859389

The fourth segment registers from 0x17B94310 to 0x1990238C.
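
To replay the whole walk-through, here is a rough simulation of the flow
as I read it. It is a sketch, not the driver code: it assumes that a
"registration" succeeds exactly when seg_size is at most 32MB, assumes
4096-byte pages, and the names split_segments, struct ex_seg, EX_PAGE and
EX_MDL_LIMIT are invented for the example.

```c
#include <stdint.h>

#define EX_PAGE      4096ULL
#define EX_MDL_LIMIT 33554432ULL   /* assumed: 32MB, as in the example */

struct ex_seg { uint64_t va; uint64_t size; };

/* Split [va, va + size) into segments the way the current code appears
 * to: the size is rounded to a page boundary only while seg_num == 0. */
static int split_segments(uint64_t va, uint64_t size,
                          struct ex_seg *out, int max_segs)
{
    uint64_t seg_va = va, seg_size = size, rdc = size, delta;
    int seg_num = 0;

    while (rdc > 0 && seg_num < max_segs) {
        if (seg_size <= EX_MDL_LIMIT) {     /* stand-in for a successful register */
            out[seg_num].va = seg_va;
            out[seg_num].size = seg_size;
            seg_num++;
            rdc -= seg_size;
            seg_va += seg_size;
            if (seg_size > rdc)
                seg_size = rdc;
            continue;
        }
        seg_size >>= 1;                     /* failed: lessen the size */
        if (seg_num == 0) {                 /* rounding only for the first segment */
            delta = (seg_va + seg_size) & (EX_PAGE - 1);
            seg_size -= delta;
            seg_size += EX_PAGE;
            if (seg_size > rdc)
                seg_size = rdc;
        }
    }
    return seg_num;
}
```

Called with 0x12345678 and 123456789 this yields four segments starting at
0x12345678, 0x140B5000, 0x15E24988 and 0x17B94310, which is the sequence
traced above.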

Assuming the virtual addresses have a 1:1 mapping with DMA addresses
(keeps this example simple), the MTT for the region will look like
this:

0x12345678
...
0x140B5000
...
0x15E24000
0x15E24000
...
0x17B94000
0x17B94000
...
0x19901000
0x19902000

Note the repeated pages!!!
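
A quick way to spot this is to compare the last page of each segment with
the first page of the next one. The snippet below is a sketch under the
same assumptions as the example (4096-byte pages, VA equal to DMA
address); count_repeated_pages and EX_PAGE_SZ are names I made up here.

```c
#include <stdint.h>

#define EX_PAGE_SZ 4096ULL

/* Each row of segs is {start_va, size}. Returns how many segments begin
 * in the same page the previous segment ended in, i.e. how many page
 * addresses would show up twice in a row in the page list. */
static int count_repeated_pages(const uint64_t segs[][2], int n)
{
    uint64_t prev_last = UINT64_MAX;   /* no previous segment yet */
    uint64_t first, last;
    int repeats = 0;
    int i;

    for (i = 0; i < n; i++) {
        first = segs[i][0] & ~(EX_PAGE_SZ - 1);
        last = (segs[i][0] + segs[i][1] - 1) & ~(EX_PAGE_SZ - 1);
        if (first == prev_last)
            repeats++;
        prev_last = last;
    }
    return repeats;
}
```

For the four segments in this example it reports two repeats, one at
0x15E24000 and one at 0x17B94000.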

This means that a request to/from that memory will never transfer the
data in pages 0x19901000 and 0x19902000, instead sending the data in
pages 0x15E24000 and 0x17B94000 twice.

Someone please check my logic here, but it does seem like this is the
flow given the current code.

If the receiver's registration is offset by 512 bytes compared to the
sender's, you get the same page range but a different offset.

So an RDMA read would read from 0x12345678 into 0x12345878.  This
means that the data from 0x15E24000 to 0x15E24DFF would be read into
0x15E24200 to 0x15E24FFF, the data from 0x15E24E00 to 0x15E24FFF would
overwrite 0x15E24000 to 0x15E241FF, and then you'd have a repeat.
0x15E24000 to 0x15E24DFF would be read into 0x15E24200 to 0x15E24FFF
again, and then 0x15E24E00 to 0x15E24FFF would be read into 0x15E25000
to 0x15E251FF.  Thus, the data at 0x15E24E00 to 0x15E24FFF would
appear twice in the received buffer - once at 0x15E24000 to 0x15E241FF
and a second time at 0x15E25000 to 0x15E251FF.
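
The effect of the 512-byte offset can be modelled by translating the same
byte offset through the sender's and the receiver's page lists. This is a
simplified sketch: it uses a cut-down four-entry page list around the
repeated page rather than the full region, assumes 4096-byte pages, and
mtt_translate and EX_PG are hypothetical names.

```c
#include <stdint.h>

#define EX_PG 4096ULL

/* Map a byte offset within a registered region to an address.
 * mtt is the page list and first_off is the offset of the region's
 * first byte within mtt[0]. */
static uint64_t mtt_translate(const uint64_t *mtt, uint64_t first_off,
                              uint64_t byte)
{
    uint64_t pos = first_off + byte;

    return mtt[pos / EX_PG] + (pos % EX_PG);
}
```

With the page list {0x15E23000, 0x15E24000, 0x15E24000, 0x15E25000}, a
sender offset of 0x678 and a receiver offset of 0x878, byte offsets 0x1788
and 0x2788 both come from 0x15E24E00 on the sender, but land at 0x15E24000
and 0x15E25000 on the receiver. That is the last 512 bytes of the repeated
page showing up twice, one page apart.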

When you saw your 512 bytes duplicated, were they back to back, or
apart by 1 page?

What threw me originally is that the MTHCA driver has identical
registration code, but it is dead code, never referenced.

Try this, and let me know if you see the issue:

Index: hw/mt23108/vapi/mlxsys/mosal/os_dep/win/mosal_iobuf.c
===================================================================
--- hw/mt23108/vapi/mlxsys/mosal/os_dep/win/mosal_iobuf.c	(revision 334)
+++ hw/mt23108/vapi/mlxsys/mosal/os_dep/win/mosal_iobuf.c	(working copy)
@@ -270,7 +270,7 @@
   MT_virt_addr_t seg_va = va;	// current segment start
   MT_size_t seg_size = size;	// current segment size
   MT_size_t rdc = size;			// remain data counter - what is rest to lock
-  MT_size_t delta;				// he size of the last not full page of the first segment
+  MT_size_t delta = 0;			// he size of the last not full page of the first segment
   MOSAL_iobuf_seg_t iobuf_seg_p; 	// pointer to current segment object
   unsigned page_size;

@@ -306,6 +306,11 @@
   	if (rc == MT_OK) {
 	  	rdc -= seg_size;
 	  	seg_va += seg_size;
+		if( delta )
+		{
+			seg_size += delta;
+			delta = 0;
+		}
 	  	new_iobuf->seg_num++;
 	  	if (seg_size > rdc)
 	  		seg_size = rdc;
@@ -320,7 +325,7 @@
   		// lessen the size
   		seg_size >>= 1;
   		// round the segment size to the page boundary (only for the first segment)
-  		if (new_iobuf->seg_num == 0) {
+  		//if (new_iobuf->seg_num == 0) {
 			rc = MOSAL_get_page_size( prot_ctx, seg_va, &page_size );
 			if (rc != MT_OK)
 			  	break;
@@ -329,7 +334,7 @@
   			seg_size += page_size;
 		  	if (seg_size > rdc)
 		  		seg_size = rdc;
-  		}
+  		//}
   		continue;
 	}
 

