gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback machine copies. A new forum can be found here.

Coding > CpuFastSet, volatile and inline

#113825 - kusma - Sun Dec 31, 2006 1:53 am

Hi there, I've been debugging this one issue almost all day long, and I'm finally getting somewhere. The problem I have is CpuFastSet not working properly (or rather not working at all) in some cases when inlined into a function. I've made a small test-application, and put it up here:

http://pimpmobile.kjip.no/temp/volatile_buffer.zip

This illustrates the problem. With a word count that is not a multiple of eight, everything seems to go fine, but as soon as I try to clear a multiple of eight words, everything goes poo.

Now, try playing around with the flags INLINE_FJALL, INLINE_CLEAR_BUFFER and VOLATILE_BUFFER and see what happens. The conclusions I managed to make was that with full inlining everything goes fine with or without volatile. With full inlining and no volatile, only one element gets corrupted when doing 16 words. With 15 all is fine. with volatile, everything is fine.

What amazes me, is the fact that volatile changes anything at all. This is global memory, no restricts or consts or anything to mess with guarantees. Any function should at any time be able to change that memory.

Does anybody have any ideas about this one?

#113829 - keldon - Sun Dec 31, 2006 2:59 am

http://nocash.emubase.de/gbatek.htm#biosmemorycopy

Source address and destination address must be word aligned.

Ahh:
Code:
   for (i = words & 7; i; --i)

Will only ever iterate from 7-1. Is that intended? This would explain why one value gets corrupted.

Code:
for (i = 0; i < 16; ++i)
   {
      buffer[i] = 1;
   }

Should be:
Code:
for (i = 0; i < 16; i++)
   {
      buffer[i] = 1;
   }


You are increasing i before it enters the loop.

Also if ( i & ~7) is equivalent to if (i >= 8).

I understand that your method, clear_buffer aims to clear the words that CpuFastSet does not. Beware of 'clever code' and anything that looks 'difficult'!

Maybe change for:
Code:
   for (i = 0; i < ( words&7); i++ )
   {
      *dst++ = 0;
   }

   if ( words >= 8 ) fjall(src, target, DMA_SRC_FIXED | (words & ~7));

#113833 - kusma - Sun Dec 31, 2006 3:32 am

keldon wrote:

Source address and destination address must be word aligned.

They are, since this is arrays of words ;)

keldon wrote:

Ahh:
Code:
   for (i = words & 7; i; --i)

Will only ever iterate from 7-1. Is that intended? This would explain why one value gets corrupted.


This is what is intended. And no, it does not explain it ;)


keldon wrote:

Code:
for (i = 0; i < 16; ++i)
   {
      buffer[i] = 1;
   }

Should be:
Code:
for (i = 0; i < 16; i++)
   {
      buffer[i] = 1;
   }


You are increasing i before it enters the loop.


...nope, they are equalient in this case. In general ++i is better programming practice in for-loops, because ++i does not return the old object, and hence does not need to take a copy of it. for integers, this doesn't really matter, but for data-structure iterators etc it does matter a lot.

keldon wrote:

Also if ( i & ~7) is equivalent to if (i >= 8).

I understand that your method, clear_buffer aims to clear the words that CpuFastSet does not. Beware of 'clever code' and anything that looks 'difficult'!

True, and I would like to point out that this isn't "production"-code, it's me trying to bang some sense into the problem ;)

keldon wrote:

Maybe change for:
Code:
   for (i = 0; i < ( words&7); i++ )
   {
      *dst++ = 0;
   }

   if ( words >= 8 ) fjall(src, target, DMA_SRC_FIXED | (words & ~7));


No cigar, still the same issues :(

Edit: I forgot to say thanks for atleast having a go at it, and for giving some good pointers... so thanks ;)

#113840 - Peter - Sun Dec 31, 2006 5:25 am

kusma wrote:

Code:

fjall(src, target, DMA_SRC_FIXED | (words & ~7));


DMA_SRC_FIXED should probably be 1<<24.

I tried a few variations and it seems to me that it is fjall that causes the problem. I added the noinline attribute to fjall's implementation:
Code:

void __attribute__((noinline)) fjall(const BUFFER_STORAGE void *src, void BUFFER_STORAGE *dst, int mode)
{
  ...
}

and then it worked no matter if clear_buffer gets inlined or not, the volatile keyword also didn't broke anything.

Removed noinline again and took a quick look with no$gba at it. The value in r3 is what gets displayed for the elements of buffer in the seconds printf loop when fjall failed. (i suck at thumb assember)

Maybe it has to do with r3, which, for example, gets clobbered but not restored again.

Just an idea.

#113849 - kusma - Sun Dec 31, 2006 11:38 am

Peter wrote:
DMA_SRC_FIXED should probably be 1<<24.

Yeah, it is. It's the normal define from libgba.

Peter wrote:
Maybe it has to do with r3, which, for example, gets clobbered but not restored again.

Just an idea.


Yeah, it seems like r3 getting destroyed might be the problem here, but how come? It's in the clobber-list in the inline asm-statement. And beyond that, I have basically no control over the register allocation and clobbering...

[edit:] Hmm, no... r3 gets destroyed because it's not being guaranteed to be kept alive by the calling convention (r0-r3 is for the first 4 parameters).

#113851 - kusma - Sun Dec 31, 2006 12:13 pm

Okay, I think I've found out exactly what goes wrong. It seems to me that the variable zero (allocated on the stack, sp+4) doesn't get initialized when not doing the little loop first. Here's the assembly generated for clear_buffer in my current case (the same as I posted, inlining of fjall but not clear_buffers, no volatile), only beautified a bit.

Code:

   .align   2
   .global   clear_buffer
   .code 16
   .thumb_func
   .type   clear_buffer, %function
clear_buffer:
   push   {r4, r5, r6, lr}
   mov   r3, #7   @ tmp114,
   mov   r5, r1   @ i, words
   sub   sp, sp, #8   @,,
   mov   r6, r0   @ dst, target
   and   r5, r3   @ i, tmp114
   beq   .skip_loop   @,

   /* setup registers for copy loop */
   mov   r3, #0   @ tmp116,                 /* NOTE THIS*/
   str   r3, [sp, #4]   @ tmp116, zero  /* sp+4 doesn't get initialized any other place! */
   mov   r2, r0   @ dst.66, dst
   mov   r3, r5   @ i.65, i
   mov   r4, #0   @ tmp117,

   /* copy loop */
.loop:
   sub   r3, r3, #1   @ i.65,
   stmia   r2!, {r4}   @, tmp117
   cmp   r3, #0   @ i.65,
   bne   .loop   @,
   
   lsl   r3, r5, #2   @ tmp118, i,
   add   r6, r0, r3   @ dst, target, tmp118

.skip_loop:
   mov   r3, #7   @ tmp119,
   bic   r1, r3   @ words, tmp119
   mov   r5, r1   @ D.1457, words
   beq   .done   @,
   mov   r3, #128   @ tmp121,
   lsl   r3, r3, #17   @ tmp121, tmp121,
   add   r4, sp, #4   @ tmp120,,
   orr   r5, r5, r3   @ tmp122, tmp121
   
   /* this is the beginning of my inline code */

   mov r0, r4     @ tmp120
   mov r1, r6     @ dst
   mov r2, r5    @ tmp122
   swi 0xC

   /* this is the end of my inline code */

   .code   16
.done:
   add   sp, sp, #8   @,,
   @ sp needed for prologue   @
   pop   {r4, r5, r6}
   pop   {r0}
   bx   r0
   .size   clear_buffer, .-clear_buffer


If i move the initin of sp+4 up a couple of lines (before the branch), it all works. Now, what on earth made the compiler forget to initialize that variable?

#113853 - Cearn - Sun Dec 31, 2006 1:13 pm

kusma wrote:

If i move the initin of sp+4 up a couple of lines (before the branch), it all works. Now, what on earth made the compiler forget to initialize that variable?

It's part of the optimizer somehow, something along the lines of "we don't ever actually use the contents of the location of zero anywhere, so it's okay to put its value in a register". Something like that, anyway. To fix, make the zero variable a volatile, so it's never optimized into a register:
Code:

   volatile u32 zero = 0;
   fjall(&zero, dst, DMA_SRC_FIXED | (words & ~7));

Or use something like this:
Code:

void CpuFastFill(u32 src, void *dst, u32 mode)
{
   /* call BIOS-function CpuFastSet() to clear buffer */
   asm (
      "push {%[src]}   \n"
      "mov r0, sp      \n"
      "mov r1, %[dst]  \n"
      "mov r2, %[mode] \n"
      "swi 0xC         \n"
      "add sp, #4      \n"
      : /* no output */
      : /* inputs */
         [src]  "r"(src),
         [dst]  "r"(dst),
         [mode] "r"(mode)
      : "r0", "r1", "r2", "r3" /* clobbers */
   );
}

// usage
CpuFastFill(0, dst, DMA_SRC_FIXED | (words & ~7));

#113854 - kusma - Sun Dec 31, 2006 1:23 pm

Cearn wrote:
kusma wrote:

If i move the initin of sp+4 up a couple of lines (before the branch), it all works. Now, what on earth made the compiler forget to initialize that variable?

It's part of the optimizer somehow, something along the lines of "we don't ever actually use the contents of the location of zero anywhere, so it's okay to put its value in a register". Something like that, anyway.

Well, an assembly-block that takes in a pointer to a variable may very well mess with the contents of that pointer, so I don't see how the compiler is allowed to do this optimization. Every variable that is taken a pointer to, is kept in memory. Atleast it seems so. And the variable zero seems to get the right address, it's just it's content that isn't initialized.

Cearn wrote:
To fix, make the zero variable a volatile, so it's never optimized into a register:
Code:

   volatile u32 zero = 0;
   fjall(&zero, dst, DMA_SRC_FIXED | (words & ~7));



OK, the work-around I came up with was simply making zero static const. That way it's not dynamically allocated on the stack, but rather put in the data-section. IMO that's cleaner in the first place.

Cearn wrote:

Or use something like this:
Code:

void CpuFastFill(u32 src, void *dst, u32 mode)
{
   /* call BIOS-function CpuFastSet() to clear buffer */
   asm (
      "push {%[src]}   \n"
      "mov r0, sp      \n"
      "mov r1, %[dst]  \n"
      "mov r2, %[mode] \n"
      "swi 0xC         \n"
      "add sp, #4      \n"
      : /* no output */
      : /* inputs */
         [src]  "r"(src),
         [dst]  "r"(dst),
         [mode] "r"(mode)
      : "r0", "r1", "r2", "r3" /* clobbers */
   );
}

// usage
CpuFastFill(0, dst, DMA_SRC_FIXED | (words & ~7));



Yes, well, that's one of the things I wanted to avoid in the first place. ;)

Anyway, thanks for the help so far. I have a somewhat robust work-around, but I'd still like to hear an explanation to why the compiler does this in the first place, if anyone knows.

#113855 - Peter - Sun Dec 31, 2006 1:57 pm

kusma wrote:
Peter wrote:
DMA_SRC_FIXED should probably be 1<<24.

Yeah, it is. It's the normal define from libgba.

From my devkitpro installation /libgba/include/gba_dma.h, line 66:
Code:

#define DMA_SRC_FIXED   (2<<23)

Strange, I have probably an old version or something :)


Last edited by Peter on Sun Dec 31, 2006 2:02 pm; edited 1 time in total

#113856 - kusma - Sun Dec 31, 2006 2:00 pm

Peter wrote:
kusma wrote:
Peter wrote:
DMA_SRC_FIXED should probably be 1<<24.

Yeah, it is. It's the normal define from libgba.

From my devkitpro installation /libgba/include/gba_dma.h, line 66:
Code:

#define DMA_SRC_FIXED   (2<<23)

Strange :)


no, not strange at all ;)

Code:
>>> (2 << 23) == (1 << 24)
True


2 = 1 << 1
so
2 << 23 = (1 << 1) << 23 = 1 << 24


Last edited by kusma on Sun Dec 31, 2006 2:04 pm; edited 1 time in total

#113857 - Peter - Sun Dec 31, 2006 2:03 pm

kusma wrote:

Code:
>>> (2 << 23) == (1 << 24)
True

Correct. I didn't think it through :)