gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback Machine copies. A new forum can be found here.

DS development > ASM

#77351 - ProblemBaby - Thu Mar 30, 2006 3:57 pm

Hello

I want to optimize some code. First I have to ask about the ARM7.
All ARM7 code has to be put in some kind of IWRAM. Does this mean that all ARM7 code runs in ARM mode? If not, is there some memory that executes ARM code faster, as on the GBA?

I have the same question for the ARM9: where should I put ARM code to get the best performance?

Lastly, I just want to make sure that these are the correct values for measuring performance (as a percentage with two decimals, so Usage = 100 means 1%).

Code:

REG_TM3D = 0;
REG_TM3CNT = 0x81;

// Some code

Usage = REG_TM3D;
REG_TM3CNT = 0;
Usage = (((Usage<<6)*10000) / 1123584);
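
The arithmetic can be checked off-target with a plain C function. This is only a sketch under assumptions that do not come from this thread: that the DS timers count the 33.513982 MHz bus clock, and that one frame lasts 355 * 263 * 6 = 560190 of those cycles (figures as documented in GBATEK). The function name and constants are made up for illustration. The intermediate value is widened to 64 bits because (ticks << 6) * 10000 stops fitting in a 32-bit unsigned value once ticks goes above 6710.

```c
#include <stdint.h>

/* Assumed figures (from GBATEK, not from this thread): the timers count the
   33.513982 MHz bus clock, and one video frame lasts
   355 dots * 263 lines * 6 cycles = 560190 of those cycles. */
#define BUS_CYCLES_PER_FRAME 560190u
#define PRESCALER_SHIFT      6u   /* control value 0x81 selects clock / 64 */

/* Convert a raw timer count into hundredths of a percent of one frame,
   so a return value of 100 means 1%. */
uint32_t usage_centipercent(uint16_t ticks)
{
    /* Undo the prescaler to get bus cycles; 64-bit to avoid overflow below. */
    uint64_t cycles = (uint64_t)ticks << PRESCALER_SHIFT;
    return (uint32_t)(cycles * 10000u / BUS_CYCLES_PER_FRAME);
}
```

With these assumptions a count of 875 ticks (56000 bus cycles) comes out as 999, just under 10% of a frame.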

#77363 - acox - Thu Mar 30, 2006 5:39 pm

ProblemBaby wrote:

All ARM7 code has to be put in some kind of IWRAM. Does this mean that all ARM7 code runs in ARM mode? If not, is there some memory that executes ARM code faster, as on the GBA?


Your ARM7 code can be ARM or Thumb, as you like. The problem with the 4MB EWRAM on the DS is that only one processor can access it at a time. It has to be locked. So the default is to put all ARM7 code in the 64KB IWRAM.
You can see this here:
devkitPro/devkitARM/arm-elf/lib/ds_arm7.ld

I don't think there is anywhere even better you can put your ARM7 code.

Quote:

I have the same question for the ARM9: where should I put ARM code to get the best performance?


Since we have a cache now on DS, maybe the default is fine. But there is a fast chunk of memory for code: the ITCM. Put your code in the .itcm section to get it in there.

There is also a fast place for data, called DTCM, which you use the same way through the .dtcm section.
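
A rough sketch of what that placement can look like with GCC section attributes. The macro names FAST_CODE and FAST_DATA, the table, and the function are hypothetical; the sketch assumes the ARM9 linker script provides .itcm and .dtcm output sections and that the build defines ARM9. On any other target the macros expand to nothing, so the same file still compiles on a desktop machine.

```c
#include <stdint.h>

/* Hypothetical macro names. On an ARM9 build with GCC they place the symbol
   in the .itcm / .dtcm sections; elsewhere they expand to nothing. */
#if defined(ARM9) && defined(__GNUC__)
#define FAST_CODE __attribute__((section(".itcm"), long_call))
#define FAST_DATA __attribute__((section(".dtcm")))
#else
#define FAST_CODE
#define FAST_DATA
#endif

/* Small lookup table that the inner loop reads for every call. */
FAST_DATA int16_t volume_table[4] = { 0, 64, 128, 256 };

/* Scale samples in place; a table value of 256 is unity gain. */
FAST_CODE void scale_samples(int16_t *samples, int count, int level)
{
    int32_t gain = volume_table[level & 3];
    for (int i = 0; i < count; i++)
        samples[i] = (int16_t)((samples[i] * gain) / 256);
}
```

The long_call attribute is there because code in ITCM sits far away from code in main RAM, outside the range of a normal branch.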
_________________
3D on GBA

#77389 - ProblemBaby - Thu Mar 30, 2006 11:08 pm

Thanks for the answer, I just want to make sure...

1. Does the 64KB IWRAM for the ARM7 execute ARM code faster than Thumb code?

2. Are ITCM and DTCM only accessible by the ARM9?

3. When using DTCM (or any kind of RAM), is it possible to allocate and free in asm? I need a fairly big buffer with fast access.

#77390 - DekuTree64 - Thu Mar 30, 2006 11:17 pm

IWRAM has a 32-bit bus, so access time for ARM and THUMB instructions is the same, and ARM can generally do more work in the same number of instructions so it will win.

For ARM9, the cache does help a lot with running ARM code from main RAM, but main RAM is so slow it may still be faster to use THUMB anyway. I haven't done any profiling to verify though, and all the games I've worked on used ARM pretty much exclusively.

EDIT: Dang it, you edited your post while I was typing :)

2. Yes, ITCM and DTCM are ARM9-only. Also, you can store data in ITCM if you want, but executing and loading from it at the same time will stall one cycle. You can't execute from DTCM, but using it solves the stalling problem.

3. Depends entirely on your engine. Normally the stack is in DTCM, so just creating a local array will allocate some of it. Or you can make a large global array in DTCM and write a memory manager to manage dynamic allocation in it. Or just use the array like a secondary stack by making a global variable that points into it, and add/subtract from the pointer to push/pop things.
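
For the "global array plus memory manager" option, one of the simplest managers is a pool of fixed-size blocks kept on a free list. The sketch below is only an illustration of that idea, not anyone's actual engine code: the block size, block count, and function names are all made up, and on the DS the arena array would additionally need a .dtcm section attribute to end up in DTCM.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE  64u   /* bytes per block (hypothetical) */
#define BLOCK_COUNT 32u   /* 32 * 64 bytes = 2 KB arena (hypothetical) */

/* A free block stores the link to the next free block in its own bytes. */
typedef union Block {
    union Block *next;
    uint8_t bytes[BLOCK_SIZE];
} Block;

static Block arena[BLOCK_COUNT];   /* the large global array */
static Block *free_list;

/* Thread every block onto the free list. Call once at startup. */
void pool_init(void)
{
    free_list = NULL;
    for (size_t i = 0; i < BLOCK_COUNT; i++) {
        arena[i].next = free_list;
        free_list = &arena[i];
    }
}

/* Take one block, or return NULL when the pool is exhausted. */
void *pool_alloc(void)
{
    Block *b = free_list;
    if (b != NULL)
        free_list = b->next;
    return b;
}

/* Give a block obtained from pool_alloc() back to the pool. */
void pool_free(void *p)
{
    Block *b = (Block *)p;
    b->next = free_list;
    free_list = b;
}
```

Allocation and freeing are a couple of loads and stores each, and because every block is the same size the pool cannot fragment. The price is that a request larger than BLOCK_SIZE cannot be served at all.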
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku


Last edited by DekuTree64 on Thu Mar 30, 2006 11:25 pm; edited 1 time in total

#77391 - tepples - Thu Mar 30, 2006 11:19 pm

ProblemBaby wrote:
Does the 64KB IWRAM for the ARM7 execute ARM code faster than Thumb code?

ARM7 IWRAM, ARM9 ITCM, and ARM9 cached memory are all 32-bit 0-wait-state memories and will execute ARM instructions at the same rate as Thumb instructions. However, it usually takes more instructions to express a given algorithm with Thumb instructions than with ARM instructions, so you're better off going with ARM instructions for signal processing inner loops if you can afford the memory.
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.

#77396 - ProblemBaby - Fri Mar 31, 2006 1:39 am

Thanks for the replies.

DekuTree64 wrote:

3. Depends entirely on your engine. Normally the stack is in DTCM, so just creating a local array will allocate some of it. Or you can make a large global array in DTCM and write a memory manager to manage dynamic allocation in it. Or just use the array like a secondary stack by making a global variable that points into it, and add/subtract from the pointer to push/pop things.


A local stack: how is that done?
Also, is there a nice way to make a global buffer that is accessible from both processors without forcing it to a fixed main memory location?

#77398 - DekuTree64 - Fri Mar 31, 2006 2:07 am

ProblemBaby wrote:
A local stack: how is that done?

Code:
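static u8 stackMemory[4096];                           // backing store (size picked arbitrarily for this example)
static u8 *stack = stackMemory + sizeof(stackMemory);  // the top address; the stack grows downward from here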

void Monkey()
{
    u8 *arrayOf500Bytes;

    stack -= 500;   // Allocate space on the stack
    arrayOf500Bytes = stack;

    // ... Do stuff

    stack += 500;    // Free space from the stack
}

Although the real advantage is that you don't have to free the data at the end of the function. So when entering a menu, for example, you can claim some persistent variable space and keep it until you exit the menu. malloc works too, but I like the simplicity and predictability of a stack.

Usually I keep a stack like this in main RAM, so I can use it for any large local arrays in functions so I don't have to worry about the real stack overflowing. Would be good in DTCM too, so you can use run-time allocation without the overhead and fragmentation risk of a dynamic manager.
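
The same push/pop idea can be wrapped in two small helpers so that callers cannot run off the end of the buffer. This is a sketch only; the names, the 4 KB capacity, and the choice to round every size up to a multiple of 4 (to keep allocations word-aligned) are assumptions, not part of the code above.

```c
#include <stddef.h>
#include <stdint.h>

#define SCRATCH_SIZE 4096u   /* hypothetical capacity in bytes */

/* Word-aligned backing store; the stack grows downward from its end. */
static uint32_t scratch_mem[SCRATCH_SIZE / 4];
static uint8_t *scratch_top = (uint8_t *)scratch_mem + SCRATCH_SIZE;

/* Round a size up to a multiple of 4 so allocations stay word-aligned. */
static size_t align4(size_t n) { return (n + 3u) & ~(size_t)3u; }

/* Bytes still available. */
size_t scratch_free_bytes(void)
{
    return (size_t)(scratch_top - (uint8_t *)scratch_mem);
}

/* Reserve `size` bytes; returns NULL instead of overflowing the buffer. */
void *scratch_push(size_t size)
{
    if (size > SCRATCH_SIZE || align4(size) > scratch_free_bytes())
        return NULL;
    scratch_top -= align4(size);
    return scratch_top;
}

/* Release `size` bytes; pops must mirror the pushes in reverse order. */
void scratch_pop(size_t size)
{
    scratch_top += align4(size);
}
```

A caller would do `u8 *buf = scratch_push(500);`, use the buffer, and call `scratch_pop(500);` when done, or skip the pop and keep the space for as long as the menu or level that claimed it is alive.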

Quote:
Also, is there a nice way to make a global buffer that is accessible from both processors without forcing it to a fixed main memory location?

The easiest way I know is to add variables to the IPC struct, or make a custom shared struct and place it right after IPC so they both know where it is. Then you can make a global array on one processor, and during startup store a pointer to it in the shared data for the other processor to check.
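
A minimal sketch of that handshake, with the shared block passed in as a pointer so the logic can be tried on a PC. The struct layout and function names are hypothetical, and the address where both CPUs would find the block is deliberately left out. The buffer itself has to live in memory both CPUs can reach, so not in ITCM or DTCM, and real ARM9 code would also have to keep its data cache from hiding the update, which is not shown here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shared block that both CPUs reach through an agreed address. */
typedef struct SharedInfo {
    volatile uint32_t ready;    /* 0 until the owner has published */
    void * volatile   buffer;   /* address of the owner's global array */
    volatile uint32_t size;     /* length of that array in bytes */
} SharedInfo;

/* Owner side, at startup: fill in the fields, then set the flag last. */
void shared_publish(SharedInfo *s, void *buf, uint32_t size)
{
    s->buffer = buf;
    s->size   = size;
    s->ready  = 1;
}

/* Other side: returns the buffer once it is published, NULL before that. */
void *shared_lookup(SharedInfo *s, uint32_t *size_out)
{
    if (!s->ready)
        return NULL;
    *size_out = s->size;
    return s->buffer;
}
```

The second processor simply polls shared_lookup() during its own startup until it gets a non-NULL pointer.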

#77478 - HyperHacker - Sat Apr 01, 2006 6:12 am

DekuTree64 wrote:
The easiest way I know is to add variables to the IPC struct, or make a custom shared struct and place it right after IPC so they both know where it is.

What I do is just add "#ifndef IPC" and "#endif" around the IPC struct definition, then define my own in my header files. You have to keep some things for various code to work though.