gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback Machine copies. A new forum can be found here.

Coding > GBA Memory Management

#16943 - NitroSR - Fri Feb 27, 2004 7:28 pm

I would like to hear what sorts of solutions have proven to work nicely for a bit of a quandary I have regarding the usage of memory on the GBA. I am aware of how to specify the various memory areas where I may place my code and data so that my linker properly distributes them at program startup, but I cannot escape the feeling that I am potentially losing some vital control over my memory usage.

I imagine a situation where I have a series of ISRs resident in IWRAM for servicing a given state of my engine, and these ISRs, given their complexity, eat up a lot of IWRAM. On top of this I have other bits of this memory in use. Now... the state of my engine changes and thus so does my suite of ISRs. I would like to be able to know that I can swap my ISR code out of IWRAM and swap a new codebase in from ROM without blowing things up!

I guess I am basically talking about setting up a mini operating system on a GBA that would allow me to selectively load code & data as I please without being restricted to the compile/link-time suite that I seem limited to at this time.

What are the DOs and DON'Ts regarding this sort of memory manipulation? Do you have any suggestions for me as to some documentation I should look into, or simply some portable console development philosophy. Maybe I'm thinking too big.

However, when I look at some of the commercial games out there and the variety of modes and such that they go through, I seriously doubt that the code in EWRAM and IWRAM is always constant from link time.

Take it with ease, everyone!

#16944 - poslundc - Fri Feb 27, 2004 8:00 pm

NitroSR wrote:
What are the DOs and DON'Ts regarding this sort of memory manipulation?


DO get your code working first.
DON'T optimize prematurely.

:)

32K of IWRAM isn't an awful lot, but executable code (ARM code, anyway, especially if you've hand-coded it in assembler) isn't likely to be the biggest consumer of it, and ISRs should be kept short anyway.

In my current project I have 8 routines running from IWRAM so far - only a couple of which are ISRs, and some fairly long and complicated - but altogether they only consume a little over 2K of IWRAM.

That said, there was recent talk in another thread about doing this kind of thing. There's no reason you can't DMA executable code from ROM to IWRAM; it seems to me that the tricky bit is figuring out how many instructions need to be DMA'ed from within C code.

So while there's nothing to prevent you from manually inserting the length of the object code once you've compiled it, I don't know how you would go about automating this.

Dan.

#16946 - tom - Fri Feb 27, 2004 8:17 pm

poslundc wrote:
That said, there was recent talk in another thread about doing this kind of thing. There's no reason you can't DMA executable code from ROM to IWRAM; it seems to me that the tricky bit is figuring out how many instructions need to be DMA'ed from within C code.


jeff frohwein's link script defines the iwram0-iwram9 and ewram0-ewram9 sections, which are overlays.

the script creates the following symbols (shown only for iwram here):

__iwram_overlay_start, __iwram_overlay_end
these are the start and end addresses of the iwram overlay, that is, the area in iwram where you copy the overlays to.

__load_start_iwram0, __load_stop_iwram0,
__load_start_iwram1, __load_stop_iwram1,
__load_start_iwram2, __load_stop_iwram2,
__load_start_iwram3, __load_stop_iwram3,
__load_start_iwram4, __load_stop_iwram4,
__load_start_iwram5, __load_stop_iwram5,
__load_start_iwram6, __load_stop_iwram6,
__load_start_iwram7, __load_stop_iwram7,
__load_start_iwram8, __load_stop_iwram8,
__load_start_iwram9, __load_stop_iwram9,
these are the addresses of the overlay images in the rom.
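for reference, the overlay support in a GNU ld script looks roughly like this. a hedged sketch, not jeff frohwein's script verbatim: ld generates the __load_start_<section> and __load_stop_<section> symbols automatically for sections placed inside an OVERLAY directive, while the __iwram_overlay_start/_end symbols are ordinary script assignments (region names like iwram/rom are assumptions here).

Code:

```
__iwram_overlay_start = .;
OVERLAY : NOCROSSREFS
{
    .iwram0 { *(.iwram0) }
    .iwram1 { *(.iwram1) }
    /* ... through .iwram9 ... */
} >iwram AT>rom
__iwram_overlay_end = .;
```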

to load overlay 0 you'd do the following:

memcpy(&__iwram_overlay_start, &__load_start_iwram0, (char *)&__load_stop_iwram0 - (char *)&__load_start_iwram0);
(the casts keep the size in bytes; subtracting the symbols as int pointers would give a count of ints, so memcpy would only copy a quarter of the overlay)
(or use dma if you wish, but you get the idea...)

or so. seemed to work when i checked it last time (with dkar4)

#16952 - DekuTree64 - Fri Feb 27, 2004 9:36 pm

I tried overlays once and eventually gave up on making them work. I just used the DMA trick. You can declare an array of u32's the size of your function (one for each ARM instruction), and DMA to that, and until your function exits, that array will be on the stack, which is in IWRAM. Very handy if you have your game set up with a "main loop" function for each mode.
Otherwise you'll need to either use overlays or a memory manager. Rafael Baptista's manager is what I use for general stuff like that (Email me at dekutree64 AT hotmail.com if you can't find it anywhere). It's good for stuff like a little hunk of memory for sprites to store their movement scripts in, or a big area for battle mode, since it's a lot more dynamic and has things creating and destroying all the time, and it would be good for allocating blocks of code too.
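the stack-buffer trick above can be sketched in plain C. everything here is illustrative: mode_code_rom stands in for the function's ARM image in ROM, and on hardware the buffer would be a local array inside the mode's main-loop function (so it lives on the IWRAM stack) with the copy done by DMA rather than memcpy.

Code:

```c
#include <string.h>

/* Stand-in for a function's object code in ROM. The first two words are
   real ARM encodings (mov r0,#0 ; bx lr), chosen only for illustration. */
static const unsigned int mode_code_rom[4] = {
    0xE3A00000u, 0xE12FFF1Eu, 0u, 0u
};

/* Copy the code image into a caller-provided buffer and return the entry
   point, which on hardware you would cast to a function pointer and call.
   On the GBA the buffer would be an on-stack u32 array, one element per
   ARM instruction, and the copy would be a DMA transfer. */
void *load_mode_code(unsigned int *iwram_buf)
{
    memcpy(iwram_buf, mode_code_rom, sizeof mode_code_rom);
    return iwram_buf;
}
```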
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku

#16956 - poslundc - Fri Feb 27, 2004 10:02 pm

No one seems to have addressed the problem I raised, however, which is how to automatically determine the size of the object code you wish to copy.

The best way I can figure out how to do it is by using nm on the compiled code to calculate the size of my function, then manually inserting that value back into the routine that copies the code into IWRAM.

Dan.

#16958 - animension - Fri Feb 27, 2004 10:14 pm

How about this? First generate ASM from the C code you want to transfer to IWRAM using the -S flag in gcc, then edit the resulting source to add a symbol right after the last relevant instruction or directive, declaring it as global. This way you should be able to use that symbol as a memory address. Example:
Code:

.global functionname_endpoint
functionname:
mov r0,r1
sub r0,r0,#1

@ blah blah blah

bx lr
.pool
functionname_endpoint:
@ END OF FILE HERE


Granted, it's not automated, since you'll have to do this every time you recompile the code from C, but it should be a simple solution and won't require line counting. You should be able to DMA the memory from &functionname to &functionname_endpoint, so the size would be (&functionname_endpoint - &functionname) rounded up to the nearest multiple of 4 for 32-bit transfers (or 2 for 16-bit).
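with the end label in place, the transfer size is plain label arithmetic. a small sketch (the label names are the hypothetical ones from the example above; in C they would be declared as extern char functionname[], functionname_endpoint[]):

Code:

```c
#include <stddef.h>
#include <stdint.h>

/* Byte count between two labels, rounded up to a multiple of 4 so the
   copy can be done in 32-bit units; mask with ~1 instead of ~3 (and add
   1 instead of 3) to round for 16-bit transfers. */
size_t transfer_size_32(const void *start, const void *end)
{
    return (size_t)((((uintptr_t)end - (uintptr_t)start) + 3u)
                    & ~(uintptr_t)3u);
}
```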
_________________
"Beer is proof that God loves us and wants us to be happy."
-- Benjamin Franklin

#16960 - poslundc - Fri Feb 27, 2004 10:29 pm

I was thinking about something like this, except I don't like the idea of having to generate an intermediate file that I then have to manually modify and run through the assembler, which would obliterate my makefile's usefulness.

How about... doing it through some inline assembly at the end of a function? eg.

Code:
void myFunction(void)
{
   blah blah blah;
   more executable code;

   ...

   asm volatile (".global myFunctionEnd\nmyFunctionEnd:");
}


Then add an extra two instructions to the difference (in ARM mode, anyway) to account for popping off any saved registers from the stack and returning to the calling routine.

Would this work, or would the compiler find a way to screw it up and not stick the label where it belongs?

Dan.

#16962 - animension - Fri Feb 27, 2004 10:35 pm

I'm not entirely sure how the pool works, but doesn't it serve as temporary storage for addresses that are accessed outside of allowable immediate offsets? I'd be worried that the inline asm global directive you suggested might come before important things like the pool, etc. If so, wouldn't that royally screw up the function when it gets loaded without the pool data being at the offsets it expects it to be?
_________________
"Beer is proof that God loves us and wants us to be happy."
-- Benjamin Franklin

#16963 - poslundc - Fri Feb 27, 2004 10:43 pm

Bah, you're probably right.

Dan.

#16964 - punchy - Fri Feb 27, 2004 10:56 pm

Take a look at the source code for the spectrum emulator Foon. It has runtime IWRAM mem allocation functions and runtime code loading to IWRAM too.

#16965 - torne - Fri Feb 27, 2004 11:12 pm

I'm awfully tempted to try implementing the cache manager I described in a previous thread (the one which would automatically copy functions to IWRAM on first use and would overwrite them according to some sensible policy); it would give similar effects (for some access patterns, not all) as having a hardware instruction cache. If it worked reasonably well for a given program then the entire program could be written in ARM assembler and allowed to run from the IWRAM cache, which could be faster than running Thumb (especially given GCC's poor Thumb generation).

Would this interest anyone? It would likely not work in many (perhaps most) cases, but *might* prove to be a benefit. It would certainly be much easier to *use* than overlays or manual copying of functions as it would be all but transparent to the programmer.

T.

#16966 - poslundc - Fri Feb 27, 2004 11:26 pm

...

I'm not saying it's a bad idea; I just think it's an awful lot of work to implement a system that would be hard to make optimal. The tradeoff between the cost of transferring the instructions into memory and having a larger cache with fewer misses would be very difficult to fine-tune. I don't know how you would inexpensively trap branches that cause a miss, either.

Dan.

#16967 - tom - Fri Feb 27, 2004 11:26 pm

poslundc wrote:
Code:
void myFunction(void)
{
   blah blah blah;
   more executable code;

   ...

   asm volatile (".global myFunctionEnd\nmyFunctionEnd:");
}


Would this work, or would the compiler find a way to screw it up and not stick the label where it belongs?


this might work better. i still prefer overlays over such hacks, though =)

Code:

void myFunction(void) {
   blah blah blah;
   more executable code;

   ...
}
void myFunction_end(void) {}
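a hedged sketch of how the sentinel would be used, assuming the linker places myFunction_end directly after myFunction (exactly the assumption questioned below); the buffer name and size are illustrative:

Code:

```c
#include <string.h>

void myFunction(void)
{
    /* ... executable code ... */
}
void myFunction_end(void) {}  /* sentinel, assumed linked right after */

static unsigned char iwram_buf[256];  /* stand-in for the IWRAM target */

/* Copy myFunction's object code into the buffer. Returns the byte count
   copied, or -1 if the layout assumption (sentinel directly after the
   function, and the function small enough) doesn't hold. */
long load_myFunction(void)
{
    long size = (long)((char *)myFunction_end - (char *)myFunction);
    if (size <= 0 || size > (long)sizeof iwram_buf)
        return -1;
    memcpy(iwram_buf, (const char *)myFunction, (size_t)size);
    return size;
}
```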

#16968 - animension - Fri Feb 27, 2004 11:37 pm

I've been thinking about the overlays, and I think I understand how to make it work, but correct me if I am mistaken:
1) define a function to belong to section "iwram0" (up to 9) a la
Code:

void __attribute__((section(".iwram0"), long_call)) myfunc(void) ;

2) wherever needing to copy overlay-able function to the overlay region, do:
Code:

extern char __iwram_overlay_start[];
extern char __load_start_iwram0[], __load_stop_iwram0[];
extern char __load_start_iwram1[], __load_stop_iwram1[];

// etc etc

// declaring the symbols as char arrays keeps the subtraction in bytes
memcpy(__iwram_overlay_start, __load_start_iwram0, __load_stop_iwram0 - __load_start_iwram0);
// or DMA would work too

So then, the linker script would set aside a chunk of IWRAM the size of the largest of the sections declared as .iwram0 ~ 9, and generate symbols for __load_start_iwram0 ~ 9 and __load_stop_iwram0 ~ 9.

Is this correct?
_________________
"Beer is proof that God loves us and wants us to be happy."
-- Benjamin Franklin

#16969 - poslundc - Fri Feb 27, 2004 11:37 pm

Is it safe to assume that the compiler will put the label for the next function immediately following the previous function?

And yes, it's hacky, but this is hacky stuff we're trying to do. :P

EDIT: I was talking about the post before animension's... also, to animension: why not just declare a global array whatever size you want your instruction cache to be?

Dan.

#16970 - animension - Fri Feb 27, 2004 11:45 pm

poslundc wrote:

to animension: why not just declare a global array whatever size you want your instruction cache to be?

That had crossed my mind, although it would be nice to be able to minimize the size of the cache based on the size of the object code. But, one can't always have their cake and eat it too. =)

It would suffice to declare a static global array of say 4KB and then simply make sure that the code is not larger than that, and increase it as needed when the code wigs out because it's larger than the cache.
_________________
"Beer is proof that God loves us and wants us to be happy."
-- Benjamin Franklin

#16971 - tom - Fri Feb 27, 2004 11:47 pm

animension: yes, the size of the largest of the iwram0-iwram9 sections is the size of the memory reserved in iwram for these overlays.

poslundc wrote:
Is it safe to assume that the compiler will put the label for the next function immediately following the previous function?


gcc seems to behave like this. can't tell how *safe* it is to assume, though. another reason for me to use the overlay mechanism i described above and let the linker sort it out for me.

#16973 - Paul Shirley - Sat Feb 28, 2004 12:57 am

removed

Last edited by Paul Shirley on Sun Mar 28, 2004 9:09 pm; edited 1 time in total

#16991 - torne - Sat Feb 28, 2004 2:02 pm

poslundc wrote:
I'm not saying it's a bad idea; I just think it's an awful lot of work to implement a system that would be hard to make optimal. The tradeoff between the cost of transferring the instructions into memory and having a larger cache with fewer misses would be very difficult to fine-tune. I don't know how you would inexpensively trap branches that cause a miss, either.


The tradeoff wouldn't be between transfer cost and cache size; the manager would transfer single functions at a time (or several contiguous ones, if the linker and cache manager were smart enough and had been given hints) and thus it shouldn't be a problem to make the cache the entirety of IWRAM (except any space you need for globals). Trapping branches is easy and there are several ways to do it; you could make all function symbols point into a table of branch instructions in IWRAM and rewrite the branches as you load/unload stuff from the cache. This will add about a three-cycle penalty to cache hits (extra cycle to make a long jump to IWRAM, one cycle for the extra branch, one cycle lost in the extra branch delay slot). Alternatively you could use a global table of function pointers and use #define macros to rewrite direct function calls into indirect ones, which would only add one cycle for cache hits (loading the pointer from IWRAM). Handling cache misses in the latter case is slightly harder but probably still doable.

The actual cache manager code wouldn't be particularly slow, especially since I'm an asm programmer; it would be mostly DMA speed limited. You could even use EWRAM as a L2 cache, perhaps, by doing large speculative block transfers between the end of the program loop and the start of vblank. (computing how long you have left and trying to do the largest possible DMA)
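the second scheme (a global table of function pointers with a miss handler) can be sketched in plain C. all names here are illustrative, not from any real cache manager; on hardware the table would live in IWRAM and the miss handler would DMA the function into the cache rather than just patching the pointer.

Code:

```c
/* hot_add stands for a function whose ROM copy would be DMA'd into the
   IWRAM cache on a miss. */
static int hot_add(int a, int b) { return a + b; }

typedef int (*fn2)(int, int);

static int misses;                       /* miss counter, for illustration */

static int add_miss(int a, int b);       /* forward declaration */
static fn2 call_table[1] = { add_miss }; /* slot 0 starts out as a miss */

/* Miss handler: on hardware, copy the function into the cache, evicting
   something if needed, then repoint the slot at the cached copy. */
static int add_miss(int a, int b)
{
    misses++;
    call_table[0] = hot_add;
    return call_table[0](a, b);
}

/* Direct calls are rewritten into indirect ones through the table,
   e.g. via a macro, costing one extra load per call on a hit. */
#define CALL_ADD(a, b) (call_table[0]((a), (b)))
```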

Whether it would turn out to be a performance improvement for any particular code can't really be determined without practical testing; I personally think it might well work OK, but that's just my intuition.

I'll hopefully be able to give it a shot in about two weeks once term is over (too much real work atm).

#16997 - poslundc - Sat Feb 28, 2004 3:54 pm

torne wrote:
You could even use EWRAM as a L2 cache, perhaps, by doing large speculative block transfers between the end of the program loop and the start of vblank.


I doubt you'd want to do that, seeing as DMA from EWRAM is much slower than DMA (or general sequential loading) from ROM.

Also, a loader program is going to have to either know the size of every function or load the function line-at-a-time watching for "bx lr" and also making sure to pick up any pool data. Not fun....

Again, it sounds like an interesting idea, but I expect that the overhead will prove to be a bit impractical versus DMAing individual time-critical functions into IWRAM as needed. Let us know how it goes if you do try it, though.

Dan.

#16998 - NitroSR - Sat Feb 28, 2004 4:11 pm

Wow, the plethora of replies to my question has been rather impressive! In regards to the code-length issue, I would assume that a utility could be constructed to identify the function segments within an .o file and create auxiliary data that would essentially act as a table of functions and their lengths; when linked, the table finds its way into ROM and can be accessed by the program code through specifically designed functions. Then finding the code would be a matter of reading through the table to identify the functions you want to copy.

Essentially, I propose creating a .dll file embedded in ROM. How does gcc handle the creation of .so files? Perhaps a similar system would suffice. This would at least attack the problem of how much data to copy when swapping in new code.

#17004 - torne - Sat Feb 28, 2004 5:09 pm

poslundc wrote:
torne wrote:
You could even use EWRAM as a L2 cache, perhaps, by doing large speculative block transfers between the end of the program loop and the start of vblank.

I doubt you'd want to do that, seeing as DMA from EWRAM is much slower than DMA (or general sequential loading) from ROM.

So no, then. =) You could always transfer code to EWRAM too as it's still faster than ROM..

Quote:
Also, a loader program is going to have to either know the size of every function or load the function line-at-a-time watching for "bx lr" and also making sure to pick up any pool data. Not fun....

No, you compile your code with -ffunction-sections, which creates separate text sections for each function. The linker will then generate symbols which indicate their offsets and sizes without requiring you to do anything at all. =)

Quote:
Again, it sounds like an interesting idea, but I expect that the overhead will prove to be a bit impractical versus DMAing individual time-critical functions into IWRAM as needed. Let us know how it goes if you do try it, though.

I wasn't suggesting copying individual time-critical functions but rather *all* functions (a subset at any given time). Thus, you could compile your entire program as ARM code, avoiding the restrictions imposed on the optimiser by Thumb. Timer or vblank handling functions could even be precached just in advance of the relevant interrupt. The idea is that the overhead of doing the caching is offset by the ability to compile all code as ARM, with the additional benefit of simplicity for the programmer.

#17007 - tepples - Sat Feb 28, 2004 6:25 pm

torne wrote:
You could always transfer code to EWRAM too as it's still faster than ROM..

Say what? EWRAM works at 2/2 wait states. ROM works at 3/1 wait states and will on average run a touch under 50 percent faster. (The first number, denoting non-sequential access time, comes into play on branches, but ARM code, with its conditional execution of even non-branch instructions, has far fewer of those than Thumb code.)
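the arithmetic behind that claim, in a deliberately simplified model: one 16-bit bus access costs (wait states + 1) cycles, and a 32-bit ARM opcode fetch is two 16-bit accesses on these buses.

Code:

```c
/* Cycles for one 16-bit bus access given its wait-state count. */
static int access_cycles(int waits) { return waits + 1; }

/* A 32-bit ARM opcode fetch takes two 16-bit accesses: the first at the
   non-sequential or sequential rate as appropriate, the second always
   sequential. */
static int arm_fetch_cycles(int first_waits, int second_waits)
{
    return access_cycles(first_waits) + access_cycles(second_waits);
}
```

a sequential ROM fetch is arm_fetch_cycles(1, 1) = 4 cycles against EWRAM's arm_fetch_cycles(2, 2) = 6, i.e. 50 percent faster; right after a branch the ROM fetch is arm_fetch_cycles(3, 1) = 6, which is why the average advantage lands a touch under 50 percent.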

Quote:
No, you compile your code with -ffunction-sections, which creates separate text sections for each function. The linker will then generate symbols which indicate their offsets and sizes without requiring you to do anything at all. =)

Does -ffunction-sections need any special support from the linker script? And would it work with nested functions (a GCC extension) so that both a function and the helper functions it calls can be transferred at the same time?

Quote:
I wasn't suggesting copying individual time-critical functions but rather *all* functions (a subset at any given time). Thus, you could compile your entire program as ARM code, avoiding the restrictions imposed on the optimiser by Thumb.

This is exactly how a compiler toolchain for the Atari Jaguar worked near the end of the Jag's life.
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.

#17015 - torne - Sat Feb 28, 2004 10:10 pm

tepples wrote:
Does -ffunction-sections need any special support from the linker script? And would it work with nested functions (a GCC extension) so that both a function and the helper functions it calls can be transferred at the same time?

Nested functions are un-nested as they are compiled, so they will be in different sections. It should still be able to cache them both at the right times. You may need to tweak the linker script to make it generate the right symbols for the function sections, maybe not.