gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback machine copies. A new forum can be found here.

DS development > Extra calls to memcpy - how do I get rid of these?

#134801 - simonjhall - Tue Jul 17, 2007 12:08 am

Me again ;-)
So I've removed as many strbs (and all the funny variants of strb) from my code as I can, and removed all the memcpys, memsets, strcpys, strncpys etc etc...but I'm still getting dodgy results.

My problems begin in one function which has no strbs and runs fine from normal memory. Here's the guts of the function (I have cut a lot away to replicate this funny condition btw):
Code:
void Mod_LoadPlanes (lump_t *l)
{
   int         i, j;
   mplane_t   *out;
   dplane_t    *in;
   int         count;
   int         bits;

   for ( i=0 ; i<count ; i++)
   {
      for (j=0 ; j<3 ; j++)
      {
         out->normal[j] = LittleFloat (in->normal[j]);
      }

   }
}
Again, this cut-down version doesn't function correctly, but the code here will still generate the weirdness.
The prototype for LittleFloat is
Code:
extern   float   (*LittleFloat) (float l);
ie, it's a function pointer, so it gets called via a bx to a register.
normal holds values of type fixed_point, and there's operator overloading which converts between float and fixed_point. The fixed_point type is four bytes in size.

Yet the disassembly looks like this:
Code:
void Mod_LoadPlanes (lump_t *l)
201bbb4:       e92d4070        stmdb   sp!, {r4, r5, r6, lr}
201bbb8:       e3a06000        mov     r6, #0  ; 0x0
201bbbc:       e24dd008        sub     sp, sp, #8      ; 0x8
201bbc0:       ea000013        b       201bc14 <_Z14Mod_LoadPlanesP6lump_t+0x60>
201bbc4:       e3a05000        mov     r5, #0  ; 0x0
201bbc8:       e7950004        ldr     r0, [r5, r4]
201bbcc:       e59f3054        ldr     r3, [pc, #84]   ; 201bc28 <.text+0x1b9e8>
201bbd0:       e593c000        ldr     ip, [r3]
201bbd4:       e1a0e00f        mov     lr, pc
201bbd8:       e12fff1c        bx      ip
201bbdc:       e28d4004        add     r4, sp, #4      ; 0x4
201bbe0:       e1a01000        mov     r1, r0
201bbe4:       e59f3040        ldr     r3, [pc, #64]   ; 201bc2c <.text+0x1b9ec>
201bbe8:       e1a00004        mov     r0, r4
201bbec:       e1a0e00f        mov     lr, pc
201bbf0:       e12fff13        bx      r3
201bbf4:       e0850004        add     r0, r5, r4
201bbf8:       e1a01004        mov     r1, r4
201bbfc:       e2855004        add     r5, r5, #4      ; 0x4
201bc00:       e3a02004        mov     r2, #4  ; 0x4
201bc04:       eb012e41        bl      2067510 <memcpy>  <---- memcpy for four bytes!
201bc08:       e355000c        cmp     r5, #12 ; 0xc
201bc0c:       1affffed        bne     201bbc8 <_Z14Mod_LoadPlanesP6lump_t+0x14>
201bc10:       e2866001        add     r6, r6, #1      ; 0x1
201bc14:       e1560004        cmp     r6, r4
201bc18:       baffffe9        blt     201bbc4 <_Z14Mod_LoadPlanesP6lump_t+0x10>
201bc1c:       e28dd008        add     sp, sp, #8      ; 0x8
201bc20:       e8bd4070        ldmia   sp!, {r4, r5, r6, lr}
201bc24:       e12fff1e        bx      lr
201bc28:       020ac70c        andeq   ip, sl, #3145728        ; 0x300000
201bc2c:       02076114        andeq   r6, r7, #5      ; 0x5

Using a bit of objdump and nm tells me that 20ac70c is the function pointer LittleFloat, so the ldr followed by the bx (at 201bbd0) does the call to LittleFloat.
2076114 (the target of the second bx) is the function which promotes float types to my fixed_point class (size 4 bytes). The resulting fixed_point is then stored in normal[j] (normal is of type fixed_point *).

So (if that made any sense)...what's with the four-byte memcpy? This is the code generated with -Os - I don't get the memcpy if I compile it without the option.

I wouldn't normally care about these memcpys too much, but they are breaking my slot-2 shenanigans.

So to reiterate:
- how do I get rid of four-byte memcpys? The whole point of me doing the floating/fixed point stuff was to make it fast - extra code ain't gonna help
- if I can't get rid of the memcpy, how can I tell it to use a different (ie my) memcpy? I could replace the first instruction of memcpy with a branch to my memcpy, but that's a bit hacky.

Ta all.

PS: if there are mistakes, it's cos I'm tired!
_________________
Big thanks to everyone who donated for Quake2

#134803 - kusma - Tue Jul 17, 2007 12:38 am

Weird. I've tried to replicate the issue based on what you're reporting here, and I can't get it to generate a memcpy. Could you give the complete set of build-flags?

#134804 - PeterM - Tue Jul 17, 2007 12:54 am

Since the DS is little endian and the code is probably quite far from being portable now, would it help to rip out all the LittleX function pointers and replace them with dummy inline functions or macros?
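A minimal sketch of that suggestion, assuming the build only ever targets a little-endian machine (as the DS is) and that nothing else in the code reassigns or takes the address of the LittleX pointers. Only LittleFloat appears in this thread; LittleLong and LittleShort are assumed names following the same pattern.

```cpp
// Sketch: on a little-endian-only build the "Little" helpers are no-ops,
// so they can be plain inline functions instead of function pointers.
// The matching "Big" helpers would still need a real byte swap (not shown).
inline float LittleFloat(float f) { return f; }
inline int   LittleLong(int l)    { return l; }   // assumed name
inline short LittleShort(short s) { return s; }   // assumed name
```

With the indirect call gone the compiler is free to inline the helper, though whether that also removes the memcpy seen above is not something this sketch can promise.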
_________________
http://aaiiee.wordpress.com/

#134838 - elhobbs - Tue Jul 17, 2007 2:42 pm

Do you need to overload the assignment operator (operator=) for your fixed point class?
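A rough sketch of what such an operator= could look like. The real fixed_point class is not shown in the thread, so the 16.16 layout, the raw member and to_float() here are all assumptions made for illustration.

```cpp
#include <stdint.h>

// Hypothetical 16.16 fixed-point type; the real class in the port may differ.
class fixed_point
{
public:
    fixed_point() : raw(0) {}
    fixed_point(float f) : raw(static_cast<int32_t>(f * 65536.0f)) {}

    // Explicit assignment: copy the single 32-bit member directly, which
    // gives the compiler a plain word store to emit rather than a
    // generic block copy of the whole object.
    fixed_point &operator=(const fixed_point &rhs)
    {
        raw = rhs.raw;
        return *this;
    }

    float to_float() const { return raw / 65536.0f; }

    int32_t raw;   // four bytes, matching the size quoted in the thread
};
```

Whether this actually stops GCC emitting the memcpy depends on the compiler version and flags; it has not been checked against the build discussed here.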

#134877 - simonjhall - Tue Jul 17, 2007 10:12 pm

kusma wrote:
Weird. I've tried to replicate the issue based on what you're reporting here, and I can't get it to generate a memcpy. Could you give the complete set of build-flags?
I can't seem to find a similar problem in older builds (in that function, it may just happen elsewhere). In this build the float->fixed overloading is bx'ed, then memcpy'd. In older builds the float->fixed is inlined and no memcpy is used.
This happens regardless of doing it in thumb or arm...

The build line I'm using is pretty much what's in the makefile you have.
Quote:
-Dstricmp=strcasecmp -I "c:/devkitPro/libnds/include" -I "c:/devkitPro/libnds/include/nds" -DARM9 -mthumb-interwork -fno-rtti -fno-exceptions -g -Os -mtune=arm9

Quote:
Do you need to overload the assignment operator (operator=) for your fixed point class?
Maybe..I'll have a go...
Quote:
would it help to rip out all the LittleX function pointers
Probably - but the thing I'm worried about is, if it can happen here, where else is it happening?

Hmm ;-)

EDIT: kusma, it happens in just three other places (bar model loading) - check out r_alias.c, line 413 (fixed point -> fixed point store), r_aclip.c, line 266 (float -> float store) and line 253 (float->float). Obv the source may have changed a little bit but just objdump the functions those line numbers lie in and search for bls to memcpy :-)

#134883 - kusma - Tue Jul 17, 2007 11:08 pm

To be honest, I only tried to reproduce it stand-alone. Perhaps I'll dig into the sources and give it a go. But now I feel like sleeping ;)

#134884 - Dwedit - Tue Jul 17, 2007 11:18 pm

memcpy does 32-bit writes as long as the source and destination are both word aligned and the number of bytes to copy is a multiple of 4.
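The condition described above can be sketched as follows. This is not the library's actual memcpy, just an illustration of the check; can_copy_as_words and copy_words are hypothetical names.

```cpp
#include <stddef.h>
#include <stdint.h>

// True when a copy can be done entirely with 32-bit loads and stores:
// both pointers word aligned and the length a multiple of 4.
inline bool can_copy_as_words(const void *dst, const void *src, size_t n)
{
    uintptr_t bits = reinterpret_cast<uintptr_t>(dst) |
                     reinterpret_cast<uintptr_t>(src) |
                     static_cast<uintptr_t>(n);
    return (bits & 3u) == 0;
}

// Copies n bytes using only 32-bit accesses. The caller must have checked
// can_copy_as_words() first. The destination is volatile so the compiler
// cannot turn the loop back into a call to memcpy.
inline void copy_words(void *dst, const void *src, size_t n)
{
    volatile uint32_t *d = static_cast<volatile uint32_t *>(dst);
    const uint32_t *s = static_cast<const uint32_t *>(src);
    for (size_t i = 0; i < n / 4; ++i)
        d[i] = s[i];
}
```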
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

#134885 - Lazy1 - Tue Jul 17, 2007 11:31 pm

A little off topic, but why -Os instead of -O2 or -O3?
If the correct code is generated with -O2, is there any advantage to smaller code size?

#134886 - simonjhall - Tue Jul 17, 2007 11:51 pm

Lazy1 wrote:
A little off topic, but why -Os instead of -O2 or -O3?
If the correct code is generated with -O2, is there any advantage to smaller code size?
The correct code was being generated with no -O at all - I haven't tried O2/3 yet. But anyway, I've always used -Os and thumb due to the complete lack of memory. But now I don't need to cos I've got THIRTY TWO MEGS EXTRA MEMORY!
Can you tell I've just got it working? Time to dig out the old thread again ;-)

BTW: I objdump/grep/addr2lined the elf, looking for memcpys and manually sorting out the cases where it happens (not the most robust of methods!)

#134887 - Dwedit - Wed Jul 18, 2007 12:08 am

-O2 pads the code with nops due to some idea that aligning branch targets to 8 byte boundaries improves performance. It might on some systems, but probably not on this one. -Os does not add padding.

#134889 - DekuTree64 - Wed Jul 18, 2007 12:12 am

Yay!

Probably would be a good idea to check for similar problems with memset too. I've had the compiler generate calls to that for things like initializing local structs to 0.

Also, have you gotten cache to work on the expanded memory? I remember someone was saying that you can't cache GBA cart space a while back, but I never got around to testing it.
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku

#134893 - masscat - Wed Jul 18, 2007 1:02 am

Looking at the GCC docs, it appears that GCC will happily generate calls to memcpy, memmove, memset and memcmp, expecting these to be provided externally.

You could replace those functions in libg.a with your own GBA slot safe versions. Something like:

Code:
ar dv libg.a lib_a-memcpy.o lib_a-memset.o lib_a-memcmp.o lib_a-memmove.o
ar rs libg.a custom_fns_obj_file.o
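
A minimal sketch of what one of those custom functions might look like, assuming (as the strb hunting earlier in the thread implies) that 8-bit stores are what break slot-2 memory. The name slot2_memcpy is hypothetical; to stand in for the library version it would have to be named memcpy in the object file that gets added to libg.a. The code also assumes a little-endian target.

```cpp
#include <stddef.h>
#include <stdint.h>

// Hypothetical replacement copy routine that never issues an 8-bit store.
// Every write is a 16-bit store to a halfword-aligned address; odd leading
// or trailing bytes are merged into the existing halfword first.
// Assumes a little-endian target (as on the DS).
extern "C" void *slot2_memcpy(void *dst, const void *src, size_t n)
{
    uintptr_t d = reinterpret_cast<uintptr_t>(dst);
    const uint8_t *s = static_cast<const uint8_t *>(src);

    if (n != 0 && (d & 1u)) {
        // Odd start: replace the high byte of the containing halfword.
        volatile uint16_t *h = reinterpret_cast<volatile uint16_t *>(d - 1);
        *h = static_cast<uint16_t>((*h & 0x00FFu) | (*s << 8));
        ++d; ++s; --n;
    }

    volatile uint16_t *h = reinterpret_cast<volatile uint16_t *>(d);
    for (; n >= 2; n -= 2, s += 2)
        *h++ = static_cast<uint16_t>(s[0] | (s[1] << 8));

    if (n != 0) {
        // Odd tail: replace the low byte of the final halfword.
        *h = static_cast<uint16_t>((*h & 0xFF00u) | *s);
    }
    return dst;
}
```

This favours safety over speed: it reads the source a byte at a time and writes halfwords, so a real replacement would probably add a 32-bit fast path for aligned copies like the four-byte case above.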