gbadev.org forum archive

I've got this killer function which takes loads of CPU time (it's in floating-point too), it's recursive, and has huge stack frames. There are cases where the function goes a little nuts and runs out of stack, crashing the game.

I've tried improving it's performance by converting it to fixed-point, but the compiler seems to generate lots of temporaries which are getting stored to the stack. This means the stack frames are even larger than normal and the game crashes even easier.

I can relocate the stack (for the duration of the function) by just moving $sp to a place in main memory, but obviously the performance dips a bit. I'd ideally like to use the DTCM as much as possible, and the ARM9's CP allows you to move the address of the DTCM so can I just move it to the very end of memory? That way when it'd run out of stack it'd just spill into main memory...

However (the reason I'm posting), what happens to interrupt handlers when I do this? Do they still expect the stack to exist in the previous position? And isn't the IPC struct at the very end of main memory? Anything else I should be worried about?
_________________
Big thanks to everyone who donated for Quake2

Yeah, the IRQ stack would be pointing to garbage if you yank DTCM out from under it. You could offset it by however much you move DTCM though. SVC stack (used by SWI functions) may be pointing to DTCM as well.
Make sure you disable interrupts while fiddling with things.

As for the IPC, you could just put DTCM overlapping it, so if the stack spills into main RAM, it will be below IPC. Since IPC is always(?) accessed in the non-cached mirror of main RAM anyway, and stack would be much better if cached, then the overlap shouldn't cause any trouble.

Of course, using main RAM as stack will be chopping down your total available memory, to make sure that address range is always available. I'd try to rewrite the function to do manual recursion first.
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku

It's probably an extremely bad idea to move DTCM during program execution, I certainly wouldn't attempt it.

To move dtcm you'll need to adjust the linkscript slightly. The easiest, and my personally recommended, way to do this is use a linkscript fragment in your project.

Create a file called dtcm.ld in the root of your arm9 code which contains this

Code:

MEMORY {
dtcm : ORIGIN = 0x023f8000, LENGTH = 16K
}

Then add ../dtcm.ld to LDFLAGS

You should get a " warning: redeclaration of memory region 'dtcm'" warning when your project gets linked and dtcm should now be at the end of main memory.

Done this way the interrupt handlers & stacks will be happily relocated at compile time. The only potential problem here is that malloc may give you memory which overlaps your stack but hopefully you're not tight enough on memory for that to be an issue.

I had to go through libnds and remove a bunch of absolute addresses when I moved dtcm before so everything in there should be fine.
_________________
devkitPro - professional toolchains at amateur prices
devkitPro IRC support
Personal Blog

Last edited by wintermute on Fri Nov 16, 2007 11:47 am; edited 1 time in total

Are you talking about one of the BSP checks? If so, those aren't too hard to convert to use a hand-made stack instead of native C recursion, which would save tons of stack space, and it's faster to boot.

Here's a Quake-like ray-vs-BSP test outline using native recursion and omitting terminal cases:

Code:

int bspTest(Ray* r, Node* n)
{
if(r is completely on inner side of n)
   return bspTest(r, n->inner);

if(r is completely on outer side of n)
   return bspTest(r, n->outer);

Ray nearRay = part of r on near side of n;

if(bspTest with nearRay shows collision)
   return result of nearRay test;

Ray farRay = part of r on far side of n;
return result of farRay test;
}

You could translate it to this, which is similar to what I use:

Code:

struct Iteration
{
Ray r;
Node* n;
};

int bspTest(Ray* wholeRay, Node* root)
{
Iteration stack[MAX_RECURSION_DEPTH];
Iteration* sp = &stack[0];

// Push initial iteration onto stack.
sp->r = *wholeRay;
sp->n = root;
++sp;

while(sp > &stack[0])
{
   pop an iteration off the stack and decrement sp;

   if(ray is completely on inner side of n)
   {
      push iteration for inner side;
      continue;
   }

   if(ray is completely on outer side of n)
   {
      push iteration for outer side;
      continue;
   }

   push iteration for far side;
   push iteration for near side;
}
}

Yeah, you're exactly right! It's the BSP collision stuff and is used for *everything*. The problem is I don't really know anything about BSPs so I can't do anything smart regarding optimising it - I can just do my usual stuff...
I had a go (again) last night at improving the performance/stack usage of the function but I'm still not happy. I may just post the function up here to see if the people who know what they're doing are able to improve it!

Btw ta wintermute, I'll have a go with this...
_________________
Big thanks to everyone who donated for Quake2

Translating to manual recursion is a mechanical process, really. You don't have to know anything about how it works, as long as the same operations are happening. Just about any recursive function can be translated into this format:

Code:

ReturnType recursiveFunction(Arguments args)
{
// "Iteration" struct described later.
Iteration stack[MAX_ITERATIONS];

// Stack pointer.
Iteration* sp = &stack[0];

sp->state = ST_ENTRY;
sp->args = args;
++sp;

while(sp > &stack[0])
{
   Iteration it = *sp;
   --sp;

   // 'state' enum described later.
   switch(it.state)
   {
   case ST_ENTRY:
      ...

   case ...
   }
}
}

When you're at the top of the 'while' loop, that's like re-entering the function, whether it's beginning a new call or returning from a nested call.

The 'state' enum describes what stage of execution the iteration is at: whether it's just beginning, or returning from the first recursive call, or from the second recursive call, etc.

Besides that, the Iteration struct contains the arguments for each recursive call, as well as any other would-be local variables that are needed upon returning from a nested call.

Of course, knowing how it the original function works often results in better optimizations and/or simpler code, but this translation alone will solve the problem, I guarantee. My BSP code was absolutely hopeless until it used manual recursion. Just a few iterations would blow the stack.

Btw I forgot to say thanks for the heads-up on this. I just finished converting the BSP traversal to this non-recursive format and it (obviously) no longer runs out of stack! The magic button which crashes the game (in e2m2) no longer works! Excellent :-)
It's a really interesting way of programming something and definately makes you think about what you're writing. Kinda reminds me of building and executing functions on the fly in Scheme whilst at uni! (a world of pain)

Now it's time to go to work on this function and make it faaast...
_________________
Big thanks to everyone who donated for Quake2

Would you mind telling us how you went from recursive version to a non-recursive version? I'm obviously interested. :)

I'm working on it right now, but once when I'm done I'll upload both the original (floating-point) recursive version and the flat (floating-point version) just so you can see how the transformation works. If you care, I'll also show the fixed-point version so people can scrutinise/fix it!

Basically, using the code in this thread as a guide, I maintain my own stack and push and pop data onto it. The recursion gets turned into a big switch statement which decides where abouts in the flow of the code you are (yeah, it sounds complicated - wait for the code). As the individual calls to the functions need to return true/false, this needs to be recorded too. Plus when a function does a return to the previous stack frame, that result isn't returned directly straight to 'the top' - the program may then take that return value and manipulate it. So this required the concept of a link register, to say where to continue from when returning to the previous 'stack frame'.

Again, it'll make sense once you see the code!

But having now trawled through the function, I know exactly what it does. Plus as it never calls itself (in the C sense) I know just where the function is called from (in the rest of the program) and how it can be called. This has made converting it to fixed-point a breeze! Still needs to be faster though...
_________________
Big thanks to everyone who donated for Quake2

gbadev.org forum archive

DS development > Relocating the stack with the PU

#124536 - simonjhall - Fri Apr 06, 2007 6:23 pm

#124539 - DekuTree64 - Fri Apr 06, 2007 7:26 pm

#124746 - wintermute - Sun Apr 08, 2007 4:27 am

#124754 - sajiimori - Sun Apr 08, 2007 6:00 am

#124878 - simonjhall - Mon Apr 09, 2007 2:23 pm

#124892 - sajiimori - Mon Apr 09, 2007 6:29 pm

#125371 - simonjhall - Thu Apr 12, 2007 9:46 pm

#125491 - Jevin - Fri Apr 13, 2007 10:28 pm

#125497 - simonjhall - Fri Apr 13, 2007 11:11 pm