gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback Machine copies. A new forum can be found here.

C/C++ > GCC is afraid of efficient memory access

#11654 - tepples - Wed Oct 15, 2003 4:35 am

I wasn't sure whether to put this in the C++ section or the ASM section, but here goes:

One limitation of gcc version 3.2.2 (DevKit Advance R5 Beta 3) is that it does *not* like to use LDMIA to load memory into variables, even when LDMIA is obviously the best answer. Instead, it generates successive LDR instructions, wasting cycles on address generation.

Test case: Compile this to assembly language with
arm-agb-elf-gcc -Wall -O3 -marm -mthumb-interwork -S ldmia.c -o ldmia.s
Code:

/* ldmia_test() **************************
   XORs successive pairs of ints in src, placing the result in dst.
   Reads 2*len ints (8*len bytes) from src and stores len ints (4*len bytes) to dst.
*/
void ldmia_test(unsigned int *dst, const unsigned int *src, unsigned int len)
{
  /* force variables into ascending-ordered registers */
  register unsigned int x;
  register unsigned int y asm("ip");

  for(; len > 0; len--)
  {
    /* GCC *should* emit an LDMIA instruction that pulls in
       both x and y.  */
    x = *src++;
    y = *src++;
    *dst++ = x ^ y;
  }
}

results in
Code:

   .file   "ldmia.c"
   .text
   .align   2
   .global   ldmia_test
   .type   ldmia_test,function
ldmia_test:
   @ Function supports interworking.
   @ args = 0, pretend = 0, frame = 0
   @ frame_needed = 0, uses_anonymous_args = 0
   @ link register save eliminated.
   cmp   r2, #0
   @ lr needed for prologue
   bxeq   lr
.L6:
   ldr   r3, [r1], #4
   ldr   ip, [r1], #4
   subs   r2, r2, #1
   eor   r3, r3, ip
   str   r3, [r0], #4
   bxeq   lr
   b   .L6
.Lfe1:
   .size   ldmia_test,.Lfe1-ldmia_test
   .ident   "GCC: (GNU) 3.2.2 (DevKit Advance R5 Beta 3)"


Can't this
Code:

   ldr   r3, [r1], #4
   ldr   ip, [r1], #4

be replaced with this?
Code:

   ldmia   r1!, {r3, ip}

And how would I get GCC to generate LDMIA instructions without resorting to inline assembly language? I still can't get my head around how to specify pre- and post-conditions of inline assembly.

I've also had problems coaxing GCC into using register FP to hold a variable.
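
On the inline-assembly question, here is a hedged sketch of how the pre- and post-conditions could be spelled out with GCC extended asm. The helper name load_pair is hypothetical and not from any library. The "=r" outputs are what the asm produces, "+r" marks the pointer as both read and written (because of the ! writeback), and the "memory" clobber tells GCC that the asm reads memory it cannot see. Both outputs are pinned to registers because LDMIA fills the listed registers in ascending register-number order, whatever order they are written in. On anything other than ARM-mode GCC the sketch falls back to plain C so that it still builds.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: loads src[0] and src[1] into *a and *b and
   returns the advanced pointer (src + 2). */
const unsigned int *load_pair(const unsigned int *src,
                              unsigned int *a, unsigned int *b)
{
    assert(src != NULL && a != NULL && b != NULL);

#if defined(__GNUC__) && defined(__arm__) && !defined(__thumb__)
    /* Pin both outputs so the register list is in ascending order. */
    register unsigned int x __asm__("r3");
    register unsigned int y __asm__("ip");

    __asm__("ldmia %[p]!, {%[x], %[y]}"
            : [p] "+r"(src),   /* post-condition: src advanced by 8 bytes */
              [x] "=r"(x),     /* post-condition: x holds the first word  */
              [y] "=r"(y)      /* post-condition: y holds the second word */
            :                  /* no pure inputs; src is covered by "+r"  */
            : "memory");       /* the asm reads memory behind GCC's back  */
    *a = x;
    *b = y;
#else
    /* Portable fallback with the same observable behaviour. */
    *a = src[0];
    *b = src[1];
    src += 2;
#endif
    return src;
}
```

Since all three operands are outputs of one kind or another, GCC has to give them distinct registers, which keeps the base register out of the load list.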
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.

#11656 - Burton Radons - Wed Oct 15, 2003 6:18 am

This is a code generation bug, since it does know how to do it:

Code:

void ldmia_test (unsigned int *dst, const unsigned int *src, unsigned int len)
{
    unsigned int x;
    unsigned int y;

    for(; len > 0; len--)
    {
        /* GCC *should* emit an LDMIA instruction that pulls in
        both x and y.  */
        x = src [0];
        y = src [1];
        *dst++ = x ^ y;
        src += 2;
    }
}


Doing this makes it generate the right code:

Code:

ldmia_test:
    @ args = 0, pretend = 0, frame = 0
    @ frame_needed = 0, uses_anonymous_args = 0
    @ link register save eliminated.
    cmp r2, #0
    @ lr needed for prologue
    moveq   pc, lr
.L536:
    ldmia   r1, {r3, ip}
    eor r3, r3, ip
    subs    r2, r2, #1
    str r3, [r0], #4
    add r1, r1, #8
    moveq   pc, lr
    b   .L536


I'll report it; however, I don't know if it can be fixed at the point the code generator gets it. It's clearly missing a pattern that tolerates the pointer increment sitting between the two loads.

And just a note: if you want to pin several variables to particular registers, name a register explicitly for every one of them, because the allocator's choice for the unpinned ones might change between versions or optimization levels. You so very much do not want this kind of bug in your code.
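
A minimal sketch of that advice, assuming GCC's explicit register variable syntax: both temporaries name their register, so neither depends on what the allocator happens to pick. The PIN macro and the function name xor_pairs are hypothetical, added so the file still builds on non-ARM hosts. Note that current GCC documentation only guarantees the pinning at inline-assembly operands, so elsewhere it is best treated as a strong hint.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper macro: names a register only when building
   ARM-mode code with GCC, and expands to nothing elsewhere. */
#if defined(__GNUC__) && defined(__arm__) && !defined(__thumb__)
#define PIN(reg) __asm__(reg)
#else
#define PIN(reg)
#endif

/* XORs successive pairs of ints in src into dst, as ldmia_test() does. */
void xor_pairs(unsigned int *dst, const unsigned int *src, unsigned int len)
{
    /* Name a register for BOTH temporaries, lower register first. */
    register unsigned int x PIN("r3");
    register unsigned int y PIN("ip");

    assert(dst != NULL && src != NULL);

    for (; len > 0; len--) {
        x = src[0];
        y = src[1];
        *dst++ = x ^ y;
        src += 2;
    }
}
```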

#11661 - Paul Shirley - Wed Oct 15, 2003 2:11 pm

removed

Last edited by Paul Shirley on Sun Mar 28, 2004 8:49 pm; edited 1 time in total

#11682 - tepples - Wed Oct 15, 2003 8:50 pm

Thanks for all the tips.

I guess it's just a bug that GCC doesn't see
Code:

a = x[0];
x++;
b = x[0];
x++;

as equivalent to
Code:

a = x[0];
b = x[1];
x += 2;

so in practice the second fragment compiles to faster code.

This change shaved a few cycles off my draw function's inner loop. Another operation to avoid on ARM is repeatedly shifting a variable (e.g. bits >>= 8); instead, use bits, bits >> 8, bits >> 16, bits >> 24.

I've boiled down my findings to a general guideline: For fastest ARM code with GCC, make as few modifications to temporary variables as possible.
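
The shift advice can be sketched like this. ARM data-processing instructions can shift their second operand as part of the instruction, so constant shifts of an unmodified variable have a chance of folding into the instruction that consumes them, whereas bits >>= 8 creates a chain in which each step waits on the previous one. The function names are hypothetical, and whether a given GCC version actually emits the folded form is not something this sketch verifies.

```c
#include <assert.h>

/* Repeatedly modifies the temporary: every XOR depends on the
   shift immediately before it. */
unsigned int fold_bytes_shifting(unsigned int bits)
{
    unsigned int acc = bits;
    bits >>= 8;  acc ^= bits;
    bits >>= 8;  acc ^= bits;
    bits >>= 8;  acc ^= bits;
    return acc & 0xFFu;
}

/* Leaves bits untouched and shifts by a constant at each use. */
unsigned int fold_bytes_fixed(unsigned int bits)
{
    unsigned int acc = bits ^ (bits >> 8) ^ (bits >> 16) ^ (bits >> 24);
    return acc & 0xFFu;
}

/* Usage check: both forms XOR the four bytes of bits together,
   so they must agree for any input. */
void check_fold(unsigned int bits)
{
    assert(fold_bytes_shifting(bits) == fold_bytes_fixed(bits));
}
```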

#11689 - poslundc - Thu Oct 16, 2003 12:16 am

tepples wrote:
I've boiled down my findings to a general guideline: For fastest ARM code with GCC, make as few modifications to temporary variables as possible.


That's useful advice. Were your tests strictly on ARM code, or Thumb as well?

Dan.

#11703 - tepples - Thu Oct 16, 2003 5:54 am

ARM only. I've never written a program that needed more than a couple of kilobytes of heavily optimized ARM code, so I haven't had to resort to optimizing the generation of Thumb code yet. I would guess that some of this may not apply there, because Thumb has fewer registers and cannot shift an operand as part of another instruction.

That said, I did get my full-screen mode 4 environment mapper (your stereotypical "lens" or "tunnel" effect) up to 24 fps at 240x160 pixels on hardware. I could of course go to 60fps and beyond easily by halving horizontal resolution and interlacing vertically. (Beyond is useful when I want to add sound to a demo, or when I want to precompute stuff for the next shot.)