gbadev.org forum archive

I wasn't sure whether to put this in the C++ section or the ASM section, but here goes:

One limitation of gcc version 3.2.2 (DevKit Advance R5 Beta 3) is that it does *not* like to use LDMIA to load from memory into variables, even when LDMIA would obviously be the best choice. Instead, it generates successive LDR instructions, wasting cycles on address generation.

Test case: Compile this to assembly language with
arm-agb-elf-gcc -Wall -O3 -marm -mthumb-interwork -S ldmia.c -o ldmia.s

Code:

/* ldmia_test() **************************
   XORs successive pairs of ints in src, placing the result in dst.
   Reads 2*len ints (8*len bytes) from src and stores len ints (4*len bytes) to dst.
*/
void ldmia_test(unsigned int *dst, const unsigned int *src, unsigned int len)
{
  /* force variables into ascending-ordered registers */
  register unsigned int x;
  register unsigned int y asm("ip");

  for(; len > 0; len--)
  {
    /* GCC *should* emit an LDMIA instruction that pulls in
       both x and y. */
    x = *src++;
    y = *src++;
    *dst++ = x ^ y;
  }
}

results in

Code:

.file "ldmia.c"
.text
.align 2
.global ldmia_test
.type ldmia_test,function
ldmia_test:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
cmp r2, #0
@ lr needed for prologue
bxeq lr
.L6:
ldr r3, [r1], #4
ldr ip, [r1], #4
subs r2, r2, #1
eor r3, r3, ip
str r3, [r0], #4
bxeq lr
b .L6
.Lfe1:
.size ldmia_test,.Lfe1-ldmia_test
.ident "GCC: (GNU) 3.2.2 (DevKit Advance R5 Beta 3)"

Can't this

Code:

ldr r3, [r1], #4
ldr ip, [r1], #4

be replaced with this?

Code:

ldmia r1!, {r3, ip}

And how would I get GCC to generate LDMIA instructions without resorting to inline assembly language? I still can't get my head around how to specify pre- and post-conditions of inline assembly.
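For reference, here is a minimal sketch of what the inline-assembly route might look like. It is an illustration only, not something posted in this thread and not tested on DevKit Advance. The function name xor_pairs_ldmia is made up, and the choice of r3 and ip simply mirrors the listing above. The output list ("=r") names what the instruction writes, "+r" marks src as both read and updated by the writeback, and the "memory" clobber tells GCC that the instruction touches memory. The block falls back to plain C when it is not being compiled as ARM-state code, so it also builds on a host compiler.

```c
/* Hypothetical sketch: load two words with a single LDMIA via GCC inline asm.
   The ARM path assumes x and y can be pinned to r3 and ip (ascending order,
   as in the listing above); every other target uses the plain C fallback. */
void xor_pairs_ldmia(unsigned int *dst, const unsigned int *src, unsigned int len)
{
#if defined(__arm__) && !defined(__thumb__)
    /* Bind each variable explicitly so the register list is known. */
    register unsigned int x __asm__("r3");
    register unsigned int y __asm__("ip");
#else
    unsigned int x, y;
#endif

    for (; len > 0; len--) {
#if defined(__arm__) && !defined(__thumb__)
        __asm__ volatile (
            "ldmia %[src]!, {%[x], %[y]}"
            : [src] "+r" (src),   /* read and written (writeback)       */
              [x] "=r" (x),       /* written by the load                */
              [y] "=r" (y)        /* written by the load                */
            :                     /* no pure inputs                     */
            : "memory");          /* the instruction reads from memory  */
#else
        /* Portable equivalent of the LDMIA with writeback. */
        x = src[0];
        y = src[1];
        src += 2;
#endif
        *dst++ = x ^ y;
    }
}
```

The named operands (%[src] and friends) need a GCC recent enough to support them; with an older compiler the positional form (%0, %1, %2) would be the fallback.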

I've also had problems coaxing GCC into using register FP to hold a variable.
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.

This is a code generation bug, since it does know how to do it:

Code:

void ldmia_test (unsigned int *dst, const unsigned int *src, unsigned int len)
{
  unsigned int x;
  unsigned int y;

  for(; len > 0; len--)
  {
    /* GCC *should* emit an LDMIA instruction that pulls in
       both x and y. */
    x = src [0];
    y = src [1];
    *dst++ = x ^ y;
    src += 2;
  }
}

Doing this makes it generate the right code:

Code:

ldmia_test:
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
cmp r2, #0
@ lr needed for prologue
moveq pc, lr
.L536:
ldmia r1, {r3, ip}
eor r3, r3, ip
subs r2, r2, #1
str r3, [r0], #4
add r1, r1, #8
moveq pc, lr
b .L536

I'll report it; however, I don't know if it can be fixed at the point the code generator gets it. It's clearly missing a pattern that skips certain garbage it doesn't like in the middle.

And just a note: if you want to force several variables into specific registers, bind every one of them explicitly rather than relying on the allocator to place the unbound ones where you expect, because the allocation scheme might change. You so very much do not want this kind of bug in your code.

removed

Last edited by Paul Shirley on Sun Mar 28, 2004 8:49 pm; edited 1 time in total

Thanks for all the tips.

I guess it's just a bug that GCC doesn't see

Code:

a = x[0];
x++;
b = x[0];
x++;

as equivalent to

Code:

a = x[0];
b = x[1];
x += 2;

so the second fragment ends up compiling to faster code.
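Since the point is that both forms mean the same thing and differ only in the code GCC emits, a quick host-side check can confirm the equivalence before swapping one for the other. This is only a sketch; the names xor_pairs_postinc, xor_pairs_indexed and xor_pairs_agree are made up for illustration.

```c
#include <stddef.h>

/* Post-increment form: the pointer is bumped after each load. */
void xor_pairs_postinc(unsigned int *dst, const unsigned int *src, unsigned int len)
{
    for (; len > 0; len--) {
        unsigned int x = *src++;
        unsigned int y = *src++;
        *dst++ = x ^ y;
    }
}

/* Indexed form: fixed offsets, with a single pointer bump per pass. */
void xor_pairs_indexed(unsigned int *dst, const unsigned int *src, unsigned int len)
{
    for (; len > 0; len--) {
        unsigned int x = src[0];
        unsigned int y = src[1];
        *dst++ = x ^ y;
        src += 2;
    }
}

/* Returns 1 if both forms produce identical output for a small sample. */
int xor_pairs_agree(void)
{
    static const unsigned int sample[8] = {
        0x00000000u, 0xFFFFFFFFu, 0x12345678u, 0x87654321u,
        0xDEADBEEFu, 0xDEADBEEFu, 0x00000001u, 0x80000000u
    };
    unsigned int a[4], b[4];
    size_t i;

    xor_pairs_postinc(a, sample, 4);
    xor_pairs_indexed(b, sample, 4);

    for (i = 0; i < 4; i++) {
        if (a[i] != b[i])
            return 0;
    }
    return 1;
}
```

Compiling both functions with -S and comparing the loops is then a straightforward way to see whether a given GCC build emits LDMIA for either form.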

This change shaved a few cycles off my draw function's inner loop. Another operation to avoid on ARM is repeatedly shifting a variable (e.g. bits >>= 8); instead, use bits, bits >> 8, bits >> 16, bits >> 24.
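To make the shifting advice concrete, here is a sketch of the two styles applied to unpacking four bytes from a word. The function names are hypothetical. The reasoning, as I understand it, is that ARM data-processing instructions can shift their second operand as part of the instruction, so bits >> 16 need not cost a separate step, while bits >>= 8 rewrites the variable on every use.

```c
/* Repeated-shift form: bits is modified between uses, so each step
   depends on the result of the previous one. */
void unpack_bytes_repeated(unsigned int bits, unsigned char out[4])
{
    out[0] = (unsigned char)(bits & 0xFFu);
    bits >>= 8;
    out[1] = (unsigned char)(bits & 0xFFu);
    bits >>= 8;
    out[2] = (unsigned char)(bits & 0xFFu);
    bits >>= 8;
    out[3] = (unsigned char)(bits & 0xFFu);
}

/* Independent-shift form: bits is never modified; every byte is taken
   straight from the original value with its own shift amount. */
void unpack_bytes_independent(unsigned int bits, unsigned char out[4])
{
    out[0] = (unsigned char)(bits & 0xFFu);
    out[1] = (unsigned char)((bits >> 8) & 0xFFu);
    out[2] = (unsigned char)((bits >> 16) & 0xFFu);
    out[3] = (unsigned char)((bits >> 24) & 0xFFu);
}
```

Both functions return the same bytes; whether the second one is actually faster depends on the compiler version and flags, so it is worth checking the generated assembly.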

I've boiled down my findings to a general guideline: For fastest ARM code with GCC, make as few modifications to temporary variables as possible.

tepples wrote:

I've boiled down my findings to a general guideline: For fastest ARM code with GCC, make as few modifications to temporary variables as possible.

That's useful advice. Were your tests strictly on ARM code, or Thumb as well?

Dan.

ARM only. I've never written a program that needed more than a couple of kilobytes of heavily optimized ARM code, so I haven't had to resort to optimizing generation of Thumb code yet. I would guess that some of this may not apply to Thumb, because it has fewer registers available and lacks the shift applied during the register-fetch phase.

That said, I did get my full-screen mode 4 environment mapper (your stereotypical "lens" or "tunnel" effect) up to 24 fps at 240x160 pixels on hardware. I could of course go to 60fps and beyond easily by halving horizontal resolution and interlacing vertically. (Beyond is useful when I want to add sound to a demo, or when I want to precompute stuff for the next shot.)

C/C++ > GCC is afraid of efficient memory access

#11654 - tepples - Wed Oct 15, 2003 4:35 am

#11656 - Burton Radons - Wed Oct 15, 2003 6:18 am

#11661 - Paul Shirley - Wed Oct 15, 2003 2:11 pm

#11682 - tepples - Wed Oct 15, 2003 8:50 pm

#11689 - poslundc - Thu Oct 16, 2003 12:16 am

#11703 - tepples - Thu Oct 16, 2003 5:54 am