gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback machine copies. A new forum can be found here.

C/C++ > memcpy

#176613 - GLaDOS - Sat Aug 27, 2011 2:54 am

I am having lag problems with my game, so I am trying to optimize it. In the profiling I did, the top result appears to be part of the memcpy function. How can I tell where memcpy is being called from? How can I tell when it is possible to optimize?

#176615 - Dwedit - Sat Aug 27, 2011 4:11 am

I saw that earlier. Implicit calls to memcpy are added whenever you assign a struct or class. If you overload the = operator, it won't call memcpy.

Also, make sure your structs and classes that are copied are padded out to a multiple of 4-bytes in length, otherwise they use the slow memcpy for unaligned data.

Look in the disassembly for popular "BL" instructions for memcpy. (BL = branch and link, it's the ARM's call instruction)
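A minimal sketch of the overloaded-assignment idea, assuming a hypothetical 8-byte `SpriteAttr` struct (the name and layout are illustrative and not taken from the game's code). With a user-defined `operator=` the copy is spelled out field by field. Whether GCC emits a `memcpy` call for the implicit version depends on compiler version, flags and alignment, so the disassembly remains the final check.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sprite attribute struct: four 16-bit fields, 8 bytes total.
struct SpriteAttr {
    std::uint16_t attr0;
    std::uint16_t attr1;
    std::uint16_t attr2;
    std::int16_t  fill;

    // Explicit field-wise assignment: the copy is written out as four
    // halfword moves instead of being left to the compiler's block copy.
    SpriteAttr& operator=(const SpriteAttr& other) {
        attr0 = other.attr0;
        attr1 = other.attr1;
        attr2 = other.attr2;
        fill  = other.fill;
        return *this;
    }
};

// Small helper so the assignment has an obvious call site to look for
// in the disassembly.
void copy_sprite(SpriteAttr& dst, const SpriteAttr& src) {
    assert(&dst != &src);  // self-copy would be harmless, but is never intended here
    dst = src;
}
```

After rebuilding, searching the disassembly of `copy_sprite` for `bl` instructions shows whether a `memcpy` call is still present.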
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

#176616 - GLaDOS - Sat Aug 27, 2011 5:53 am

The most frequent call to memcpy appears to be from the function SetFizzler, which is one of the functions responsible for updating sprite data. The asm doesn't seem to bear any resemblance to the original function, so I'm still trying to figure out what the deal is.

Code:
void SetFizzler(s16 x, s16 y, u32 frame)
{
    for(s32 top=-32; top<32; top+=8)
    {
        AddSprite(x-8, y+top, FIZZLER, ATTR0_WIDE | ATTR0_REG, ATTR1_SIZE1 | ATTR1_HFLIP, frame*2);
        AddSprite(x-8, y+top, FIZZLER, ATTR0_WIDE | ATTR0_REG, ATTR1_SIZE1 | ATTR1_VFLIP, frame*2);
    }
}


Code:
000039E4 08004f5c <_Z10SetFizzlerssj>:
000039E4  8004f5c:   e92d47f0    push   {r4, r5, r6, r7, r8, r9, sl, lr}
000039E4  8004f60:   e1a01801    lsl   r1, r1, #16
000039E4  8004f64:   e1a06800    lsl   r6, r0, #16
000039E4  8004f68:   e2466702    sub   r6, r6, #524288   ; 0x80000
000039E4  8004f6c:   e1a02882    lsl   r2, r2, #17
000039E4  8004f70:   e2414602    sub   r4, r1, #2097152   ; 0x200000
000039E4  8004f74:   e2818602    add   r8, r1, #2097152   ; 0x200000
000039E4  8004f78:   e24dd010    sub   sp, sp, #16
000039E4  8004f7c:   e1a07822    lsr   r7, r2, #16
000039E4  8004f80:   e1a04824    lsr   r4, r4, #16
000039E4  8004f84:   e1a08828    lsr   r8, r8, #16
000039E4  8004f88:   e1a06846    asr   r6, r6, #16
000039E4  8004f8c:   e3a0aa01    mov   sl, #4096   ; 0x1000
000039E4  8004f90:   e3a09a02    mov   r9, #8192   ; 0x2000
0001CF20  8004f94:   e1a05804    lsl   r5, r4, #16
0001CF20  8004f98:   e59f1074    ldr   r1, [pc, #116]   ; 8005014 <_Z10SetFizzlerssj+0xb8>
0001CF20  8004f9c:   e3a02004    mov   r2, #4
0001CF20  8004fa0:   e1a05845    asr   r5, r5, #16
0001CF20  8004fa4:   e28d0008    add   r0, sp, #8
0001CF20  8004fa8:   eb00096a    bl   8007558 <memcpy>
0001CF20  8004fac:   e2844008    add   r4, r4, #8
0001CF20  8004fb0:   e3a03901    mov   r3, #16384   ; 0x4000
0001CF20  8004fb4:   e1a01005    mov   r1, r5
0001CF20  8004fb8:   e59d2008    ldr   r2, [sp, #8]
0001CF20  8004fbc:   e1a00006    mov   r0, r6
0001CF20  8004fc0:   e58da000    str   sl, [sp]
0001CF20  8004fc4:   e58d7004    str   r7, [sp, #4]
0001CF20  8004fc8:   e1a04804    lsl   r4, r4, #16
0001CF20  8004fcc:   ebfffde4    bl   8004764 <_ZL9AddSpritess12FramePalDatattt>
0001CF20  8004fd0:   e59f103c    ldr   r1, [pc, #60]   ; 8005014 <_Z10SetFizzlerssj+0xb8>
0001CF20  8004fd4:   e3a02004    mov   r2, #4
0001CF20  8004fd8:   e28d000c    add   r0, sp, #12
0001CF20  8004fdc:   eb00095d    bl   8007558 <memcpy>
0001CF20  8004fe0:   e1a04824    lsr   r4, r4, #16
0001CF20  8004fe4:   e1a00006    mov   r0, r6
0001CF20  8004fe8:   e1a01005    mov   r1, r5
0001CF20  8004fec:   e59d200c    ldr   r2, [sp, #12]
0001CF20  8004ff0:   e3a03901    mov   r3, #16384   ; 0x4000
0001CF20  8004ff4:   e58d9000    str   r9, [sp]
0001CF20  8004ff8:   e58d7004    str   r7, [sp, #4]
0001CF20  8004ffc:   ebfffdd8    bl   8004764 <_ZL9AddSpritess12FramePalDatattt>
0001CF20  8005000:   e1540008    cmp   r4, r8
0001CF20  8005004:   1affffe2    bne   8004f94 <_Z10SetFizzlerssj+0x38>
000039E4  8005008:   e28dd010    add   sp, sp, #16
000039E4  800500c:   e8bd47f0    pop   {r4, r5, r6, r7, r8, r9, sl, lr}
000039E4  8005010:   e12fff1e    bx   lr
00000000  8005014:   08018820    stmdaeq   r1, {r5, fp, pc}

#176617 - Dwedit - Sat Aug 27, 2011 7:13 am

AddSprite is being inlined. Anything can get inlined, whether you declare it that way or not.

Also, avoid using 16-bit local variables if possible. The compiler often needs to add extra code to bit-mask 32-bit numbers down to 16-bit. You can see that GCC added left shifts, did the arithmetic, then shifted right just so it can stay a 16-bit variable. The compiler won't do that if you use a 32-bit int. 8-bit local variables also have the same problem.

But once variables leave the registers and go into RAM, 16-bit is fine. Feel free to use 16-bit fields in a class or struct, or 16-bit arrays. But for temporary variables, use ints.
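The advice above can be sketched as follows, assuming a hypothetical `OamEntry` buffer (not the thread's real `AddSprite` interface). Loop counters and temporaries are plain `int`; values are narrowed to 16 bits only at the point where they are stored to memory.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical in-RAM sprite record: 16-bit fields are fine here.
struct OamEntry {
    std::uint16_t x;
    std::uint16_t y;
};

// Writes one column of 8 entries spaced 8 pixels apart and returns the
// number written. All arithmetic happens in 32-bit ints.
int fill_column(OamEntry* out, int x, int y) {
    assert(out != nullptr);
    int count = 0;
    for (int top = -32; top < 32; top += 8) {
        // Narrow to 16 bits only when storing to the struct.
        out[count].x = static_cast<std::uint16_t>(x - 8);
        out[count].y = static_cast<std::uint16_t>(y + top);
        ++count;
    }
    return count;
}
```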

Edit:
Wow, memcpy with size 4? WTF. Is there a char array somewhere?

#176618 - Cearn - Sat Aug 27, 2011 8:47 am

To expand on what Dwedit said: there are about a dozen shifts in the ASM code that are purely the result of not using 32-bit variables as parameters. In small functions, that can be quite a lot.

Also, this looks like ARM code, and very unoptimized ARM code at that. Even at -O1 the compiler should know how to combine 0x1000 and 0x2000. What exactly are your compiler flags? Compiling with '-mthumb -mthumb-interwork' and with either '-O2' or '-O3' might go a long way. The template makefiles should already be set up for this.

The memcpy's could be the result of assigning structs that aren't aligned to 4 bytes. If your sprite struct is simply four u16, a memcpy would be the result. See here for an example.

#176619 - GLaDOS - Sat Aug 27, 2011 12:10 pm

Here are my compile options. I thought I already had -O2 on, but maybe I didn't do it properly?
Code:
ARCH   :=   -marm

CFLAGS   :=   -g -Wall -O2\
      -mcpu=arm7tdmi -mtune=arm7tdmi\
       -fomit-frame-pointer\
      -ffast-math \
      $(ARCH)

CFLAGS   +=   $(INCLUDE)

CXXFLAGS   :=   $(CFLAGS) -fno-rtti -fno-exceptions -std=c++0x


Also, I was using 16-bit variables there because I figured the sprite fields were 16-bit anyway, but I guess that wasn't a good idea. I'll switch them to 32-bit and see if that helps.

As for alignment, I've added an alignment statement but it made no discernible difference.
Code:
struct OBJ_ATTR
{
    u16 attr0;
    u16 attr1;
    u16 attr2;
    s16 fill;
} __attribute__(( aligned(4) )) ;
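One way to see what the attribute changes is to compare `sizeof` and `alignof` for the struct with and without it. The sketch below uses two hypothetical copies of the struct and assumes a GCC-compatible compiler. Note also that the mangled name `_ZL9AddSpritess12FramePalDatattt` in the disassembly decodes to a third parameter of type `FramePalData`, and the 4-byte `memcpy` destination is loaded into `r2` right before the call, so the copy in question may involve that type rather than `OBJ_ATTR`.

```cpp
#include <cassert>
#include <cstdint>

// Same fields as OBJ_ATTR, without the alignment attribute.
struct ObjAttrPlain {
    std::uint16_t attr0;
    std::uint16_t attr1;
    std::uint16_t attr2;
    std::int16_t  fill;
};

// Same fields, with the attribute from the post above.
struct ObjAttrAligned {
    std::uint16_t attr0;
    std::uint16_t attr1;
    std::uint16_t attr2;
    std::int16_t  fill;
} __attribute__((aligned(4)));

// Size is 8 bytes either way; only the required alignment differs.
static_assert(sizeof(ObjAttrPlain) == 8, "four halfwords");
static_assert(sizeof(ObjAttrAligned) == 8, "four halfwords");
static_assert(alignof(ObjAttrPlain) == 2, "halfword alignment by default");
static_assert(alignof(ObjAttrAligned) == 4, "word alignment with the attribute");

// Runtime check that a given object really sits on a 4-byte boundary.
bool is_word_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 4 == 0;
}
```

A size that is a multiple of 4 does not by itself guarantee a 4-byte starting address; the alignment requirement is what lets the compiler assume word accesses are safe.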

#176620 - Ruben - Sat Aug 27, 2011 4:44 pm

Since this code is running from ROM, I strongly suggest you use THUMB code over ARM code. ROM has a 16-bit bus, meaning that to read ARM code (32-bit opcodes) it takes two fetches which can affect speed dramatically.

Secondly, try to use 32-bit variables wherever possible. If the variables absolutely *must* be something else, then make the variable 32-bit and follow these rules:

-Addition/subtraction = Leave it as it is
-Multiplication/division = Shift left+shift right with 32-n [where n is the bit count] [alternatively, use as normal but make casts to the type before multiplication/division].
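A sketch of the shift-left/shift-right rule for n = 16, with hypothetical helper names. The value lives in a 32-bit variable and is brought back into 16-bit range only after a multiplication. It assumes arithmetic right shift of negative values and modular unsigned-to-signed conversion, which is how GCC behaves.

```cpp
#include <cassert>
#include <cstdint>

// Re-normalise a 32-bit working value to the signed 16-bit range:
// shift left by 32-16, then arithmetic shift right by the same amount.
inline std::int32_t wrap_s16(std::int32_t v) {
    const std::uint32_t shifted = static_cast<std::uint32_t>(v) << 16;
    return static_cast<std::int32_t>(shifted) >> 16;
}

// Unsigned variant: a logical shift right clears the upper 16 bits.
inline std::uint32_t wrap_u16(std::uint32_t v) {
    return (v << 16) >> 16;
}

// Multiply two values that are each within 16-bit range, then wrap the
// product the way a real s16 variable would.
std::int32_t scale_s16(std::int32_t value, std::int32_t factor) {
    assert(value >= -32768 && value <= 32767);
    assert(factor >= -32768 && factor <= 32767);
    return wrap_s16(value * factor);
}
```

For example, `scale_s16(300, 300)` gives 24464, the same result as storing 90000 into an `s16`.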

Also, you don't need to have that particular struct explicitly aligned:
sizeof(s16/u16) = 2. You have four of them. 2*4 = 8, which is word-aligned.

Finally, try to use a custom memcpy routine such as Cearn's and place it in IWRAM. The one provided by gcc is placed in ROM (ie, slow) and can be quite nasty when it doesn't meet its exact criteria which, IIRC, means that data must be aligned to 32 bytes for it to do 8 word copies (ie, fast).
_________________
I'm 18 and have Asperger's, so if I don't get something at first, just bear with me. *nod*
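For illustration only, a word-at-a-time copy might look like the sketch below. This is not Cearn's routine (which is hand-written assembly) and the function name is hypothetical; it only shows the idea of copying aligned data in 32-bit units with a small unrolled loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Copies `words` 32-bit words from src to dst.
// Both pointers must be 4-byte aligned and the ranges must not overlap.
void copy_words(std::uint32_t* dst, const std::uint32_t* src, std::size_t words) {
    assert(reinterpret_cast<std::uintptr_t>(dst) % 4 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 4 == 0);

    // Main loop: four words per iteration to reduce loop overhead.
    while (words >= 4) {
        dst[0] = src[0];
        dst[1] = src[1];
        dst[2] = src[2];
        dst[3] = src[3];
        dst += 4;
        src += 4;
        words -= 4;
    }
    // Tail: remaining zero to three words.
    while (words > 0) {
        *dst++ = *src++;
        --words;
    }
}
```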

#176621 - GLaDOS - Sat Aug 27, 2011 4:53 pm

Well during one of the early optimization rounds, I found that switching from thumb to arm provided a large speedup. I haven't tested recently but I assume it's still true.

I guess the optimal solution would be to figure out which pieces of code should be arm and which should be thumb, but I have no idea how to do that.

#176623 - Ruben - Sat Aug 27, 2011 6:52 pm

As a rule of thumb [no pun intended =P]:
ROM = THUMB code. Main program logic.
IWRAM = ARM code. Slow-ish algorithms (such as memory copies, sound mixing, maths code, sorting, etc).

When putting code in IWRAM, though, be sure to use the long call attribute on the function declaration.
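A sketch of how such a declaration might look with GCC. `FAST_CODE` and `sum_samples` are hypothetical names, and the example assumes the linker script maps a section called `.iwram` into IWRAM and that this source file is built as ARM code (for example with -marm). On non-ARM hosts the macro expands to nothing so the function can still be compiled and tested.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper macro: place the function in the .iwram section and
// make calls to it long calls, since IWRAM is out of range of a plain BL
// issued from ROM.
#if defined(__GNUC__) && defined(__arm__)
#define FAST_CODE __attribute__((section(".iwram"), long_call))
#else
#define FAST_CODE
#endif

// The attribute goes on the declaration so every caller sees it.
FAST_CODE std::int32_t sum_samples(const std::int16_t* samples, int count);

// Simple hot loop standing in for mixing or maths code.
std::int32_t sum_samples(const std::int16_t* samples, int count) {
    assert(samples != nullptr && count >= 0);
    std::int32_t total = 0;
    for (int i = 0; i < count; ++i) {
        total += samples[i];
    }
    return total;
}
```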

#176624 - GLaDOS - Sat Aug 27, 2011 7:03 pm

I changed all the 16-bit variables to 32-bit, giving a slight but measurable speed increase. (Somewhere around 3-5 scanlines).
The problem is that the sprite drawing code was already only a small portion of the time. Which is kind of strange since it showed up so high on the profiling.
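Scanline timings like the one above can be turned into a rough share of the frame. The sketch below assumes the two inputs are VCOUNT readings taken before and after the code being measured, and that the work takes less than one frame; the function names are hypothetical. The GBA display has 228 scanlines per frame (160 visible plus 68 in VBlank).

```cpp
#include <cassert>

constexpr int kLinesPerFrame = 228;  // 160 visible + 68 VBlank lines

// Scanlines elapsed between two VCOUNT readings, allowing for the
// counter wrapping from 227 back to 0.
int elapsed_lines(int start, int end) {
    assert(start >= 0 && start < kLinesPerFrame);
    assert(end >= 0 && end < kLinesPerFrame);
    int delta = end - start;
    if (delta < 0) {
        delta += kLinesPerFrame;
    }
    return delta;
}

// Share of one frame spent between the two readings, in whole percent
// (rounded down).
int cpu_percent(int start, int end) {
    return elapsed_lines(start, end) * 100 / kLinesPerFrame;
}
```

By this measure a saving of 5 scanlines is roughly 2% of a frame.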


Anyway, is there a way to find out why the compiler isn't doing the optimizations it should? I've got a bunch of templated code written on the assumption that the compiler would optimize away the redundant checks, so if it isn't doing that, it could be a very large source of slowdown.


Last edited by GLaDOS on Sat Aug 27, 2011 7:05 pm; edited 1 time in total

#176625 - Ruben - Sat Aug 27, 2011 7:04 pm

Can you upload a binary? .elf or .gba is fine.
I'll see if I can find what uses so much CPU.

#176626 - GLaDOS - Sat Aug 27, 2011 7:07 pm

How should I send it to you? Email?

#176627 - Ruben - Sat Aug 27, 2011 7:08 pm

E-mail or filesharing site is fine (preferably MediaFire).

#176628 - GLaDOS - Sat Aug 27, 2011 7:16 pm

http://www.mediafire.com/?8nk688acp96gprz

#176629 - Ruben - Sat Aug 27, 2011 7:21 pm

Alright, I can't get it to work. It seems that you load a null pointer at some point [immediately after the transition effect] which makes the CPU jump into an infinite loop.

EDIT:
It occurs at a function at 0800504C. Try using this to find what it is:
arm-eabi-nm -S -n PortalAdvance.elf > map.txt
It should dump every function location in address order.


Last edited by Ruben on Sat Aug 27, 2011 7:23 pm; edited 1 time in total

#176630 - GLaDOS - Sat Aug 27, 2011 7:22 pm

It works fine for me on VBA-M (956) on Windows.

#176633 - GLaDOS - Sat Aug 27, 2011 9:29 pm

It looks like a built in function.

Code:
08005020 T __aeabi_ldivmod
08005064 00000044 T __gnu_ldivmod_helper
080050a8 00000040 T __gnu_uldivmod_helper
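The lookup done by eye on the listing above can be sketched in code: given symbols sorted by address, the function containing an address is the last symbol that starts at or before it. The table holds only the three entries quoted in this post, and `symbol_for` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

struct Symbol {
    std::uint32_t address;
    const char*   name;
};

// Entries copied from the nm listing, already sorted by address.
static const Symbol kSymbols[] = {
    {0x08005020u, "__aeabi_ldivmod"},
    {0x08005064u, "__gnu_ldivmod_helper"},
    {0x080050a8u, "__gnu_uldivmod_helper"},
};

// Returns the name of the last symbol whose start address is <= pc,
// or an empty string if pc lies before the first symbol.
std::string symbol_for(std::uint32_t pc) {
    const char* best = "";
    for (const Symbol& s : kSymbols) {
        if (s.address > pc) {
            break;
        }
        best = s.name;
    }
    return best;
}
```

With this table, address 0x0800504C resolves to `__aeabi_ldivmod`, since it lies between 0x08005020 and 0x08005064.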

#176634 - Ruben - Sat Aug 27, 2011 9:46 pm

Ah, I see.
The problem there is that the behaviour is unpredictable.
Code:
stmfd sp!, {sp,lr}

Basically, one can't know if it will store the stack pointer before or after changing it. I guess no$ doesn't interpret it correctly.

After patching that to use the BIOS division, I was able to get it running... ish.

The first thing I noticed is that you're polling V-Count. It is a lot better to use swi_VBlankIntrWait as this puts the CPU into low-power mode and allows one to properly see how much CPU is being used. To use that, though, you'll have to set up an interrupt routine (which I don't think you have).

So give me a few minutes to patch that part...

EDIT:
Alright. It appears that CPU usage starts at about 30-40% then increases until it saturates and the whole thing comes to a screeching halt.
My guess is that you're adding elements to a list that is processed but you never remove them.

#176635 - Dwedit - Sat Aug 27, 2011 11:37 pm

Are you using NO$GBA? NO$GBA doesn't emulate the stmfd sp!,{sp} instruction correctly, and Wintermute refuses to accept patches to Libgcc which would remove that instruction.
Fork time?

#176636 - GLaDOS - Sat Aug 27, 2011 11:43 pm

^^ Uh, I don't do any dynamic memory allocations at all. I have no idea what you're talking about.

Furthermore, the cpu usage per tick should not change at all unless the game state changes. If you just sit there at the start it should be near constant.

#176641 - wintermute - Sun Aug 28, 2011 10:28 am

Dwedit wrote:
Are you using NO$GBA? NO$GBA doesn't emulate the stmfd sp!,{sp} instruction correctly, and Wintermute refuses to accept patches to Libgcc which would remove that instruction.
Fork time?


Why would I accept patches to the toolchain that work around bugs in emulators?

If code works on hardware there is no bug to fix.
_________________
devkitPro - professional toolchains at amateur prices
devkitPro IRC support
Personal Blog

#176642 - wintermute - Sun Aug 28, 2011 10:41 am

Ruben wrote:
Ah, I see.
The problem there is that the behaviour is unpredictable.
Code:
stmfd sp!, {sp,lr}

Basically, one can't know if it will store the stack pointer before or after changing it. I guess no$ doesn't interpret it correctly.


sp is r13, lr is r14

ARM Architecture Reference Manual wrote:

If <Rn> is the lowest numbered register specified in <register_list>, the original value of <Rn> is stored. Otherwise the stored value of <Rn> is unpredictable.


It's a NO$GBA cpu bug
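To make the quoted rule concrete, here is a toy model of `stmfd sp!, {sp, lr}` (STMDB with writeback). The `Cpu` type and function name are hypothetical and the model covers only the architecturally defined case, where the base register, if listed, is the lowest-numbered one. Since sp is r13 and lr is r14, that case applies and the original sp is what gets stored.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Toy CPU state: 16 registers and a sparse word-addressed memory.
struct Cpu {
    std::uint32_t r[16];
    std::map<std::uint32_t, std::uint32_t> mem;
};

// Models STMDB Rn!, {reglist}: registers are stored in ascending order at
// ascending addresses starting at Rn - 4*count, then Rn is written back.
void stmdb_writeback(Cpu& cpu, unsigned rn, std::uint16_t reglist) {
    // Only the defined case is modelled: if Rn is in the list, no
    // lower-numbered register may be in the list.
    assert(!(reglist & (1u << rn)) || (reglist & ((1u << rn) - 1u)) == 0);

    unsigned count = 0;
    for (unsigned i = 0; i < 16; ++i) {
        if (reglist & (1u << i)) ++count;
    }

    const std::uint32_t original_base = cpu.r[rn];
    std::uint32_t addr = original_base - 4u * count;
    for (unsigned i = 0; i < 16; ++i) {
        if (reglist & (1u << i)) {
            // Rn has not been updated yet, so its original value is stored.
            cpu.mem[addr] = cpu.r[i];
            addr += 4;
        }
    }
    cpu.r[rn] = original_base - 4u * count;  // writeback happens last
}
```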

#176643 - Ruben - Sun Aug 28, 2011 10:46 am

wintermute wrote:
Why would I accept patches to the toolchain that work around bugs in emulators?

no$ debug ;)

wintermute wrote:
It's a NO$GBA cpu bug

Fair enough - I figured, anyway. Why not just incorporate a bunch of libc functions into libgba/libnds to speed things up (ie, make an alias of division as __aeabi_ldivmod, etc) and get rid of these problems?

#176644 - wintermute - Sun Aug 28, 2011 11:29 am

1. There are absolutely no circumstances that will convince me modifying a toolchain based on the behaviour of an emulator is correct.
2. It's not that simple
3. devkitARM is more than just a GBA/NDS toolchain