gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback machine copies. A new forum can be found here.

C/C++ > Need info on C optimizing

#19613 - DiscoStew - Thu Apr 22, 2004 9:47 pm

Does anyone know of a website that deals with getting C code optimized for speed, or perhaps can tell me some efficient ways of optimizing C code?
I plan on releasing a demo sometime in the future of my sprite handler, of which I will not devulge any details yet.
I've optimized the code with what I knew, and I even placed it into IWRAM, but it is still not as fast as I would like it to be (31 to 41 scanlines for executing code in 3 different, but required, loops 28 times each, ugh!). I'm not quite ready to deal with Assembly yet since I don't know it, though I would guess that that would make a big difference.

If it would help, the following is what is being used a lot...
---Pointers to structs in both EWRAM and ROM
---Fixed Point Multiplying between 2 variables (no constants)
---Trigonometry, and angles outside of 360 degrees (0-359)
---Sqrt (for trig hypotenuse)
---Continual scanning by index in arrays
---for loops in the 3 main loops

Anything that would help would be greatly appreciated, even specific compiler options to make code faster, since all I use is the GBA Project AppWizard. Optimized Arctan2, Sqrt, and fixed-point Division functions would help. My ArcTan2 function uses the BIOS (for the moment), but my Sqrt and fixed-point Division functions are hand-coded. I'll show them if someone wants me to, and maybe someone could help me make them faster. thx
_________________
DS - It's all about DiscoStew


Last edited by DiscoStew on Thu Apr 22, 2004 9:54 pm; edited 1 time in total

#19615 - alek - Thu Apr 22, 2004 9:53 pm

you could use KevinW's sqrt routine. It's posted somewhere on the main site... It's supposed to be very fast.
From the readme

"It's a 3 cycles per bit method. The routine takes 51 cycles total to execute from IWRAM."

#19617 - poslundc - Thu Apr 22, 2004 10:05 pm

DiscoStew wrote:
---Trigonometry, and angles outside of 360 degrees (0-359)


Use look-up tables in EWRAM, and make them range from 0 to 256 instead of 360 (or 512 if you want more precision). Then you can just bitwise-AND your angle with 255 (or 511 for the more-precise version) to get the correct index in the range of your table (in case the angle is outside of your range).

Quote:
---Sqrt (for trig hypotenuse)


There are a couple of really fast distance approximation functions out there. They don't replace square-root for general purpose stuff, but if you're just trying to get the distance between two cartesian points then they can't be beat. Search the forum for these.

You need to profile your code. The easiest way to do this is to turn off the various display elements and change the colour of the backdrop (zeroth BG palette index, ie. 0x05000000) at the beginning and end of whatever parts of the code you want to profile. Don't worry about optimizing the 90% of your code that the CPU spends only 10% of its time in; instead fine-tune the 10% of your code that the CPU is spending 90% of its time in.

Dan.

#19621 - DiscoStew - Thu Apr 22, 2004 10:33 pm

About KevinW sqrt function, like I said before, I'm not ready for Assembly yet, but would I be able to incorporate his function the same way as calling BIOS functions from C? Like asm volatile {"..."} with "..." being the assembly code? this brings me to another thing. I can't remember the author, but this person made an assembly division function by the names of DIV and DIVS. I've trying incorporating those function much like the BIOS functions, but it doesn't work. Has anyone else got these to work in Devkitadv GCC?
_________________
DS - It's all about DiscoStew

#19622 - poslundc - Thu Apr 22, 2004 10:51 pm

It is much, much easier to compile assembly code in separate files than by using inline assembly. If you feel you must use inline assembly, start with the rules of engagement.

Dan.

#19653 - Miked0801 - Fri Apr 23, 2004 4:10 am

Quote:

If it would help, the following is what is being used a lot...
---Pointers to structs in both EWRAM and ROM
---Fixed Point Multiplying between 2 variables (no constants)
---Trigonometry, and angles outside of 360 degrees (0-359)
---Sqrt (for trig hypotenuse)
---Continual scanning by index in arrays
---for loops in the 3 main loops


Pointers to structs by themselves isn't bad per se. It's when you have pointers to pointers to pointers to structs in an inner loop. That'll cost you. Create a local var for those to save the cycles. Also, of course compile your code as ARM and in IWRAM.

For fixed point muls, it's not really that slow in ARM. Depending on Optimization levels (you are using either -O2 or -O3 right?) these will be fairly speedy.

Trig is a 256 entry LUT as per previous posts.

Sqrt - I rarely see anything in an inner loop that needs a sqrt. Either use Larger term plus 3/8 smaller term or use the squares of the terms for your answer (or better yet, do vector math instead to eliminate the need for them and lots of trig to boot)

Continual scan of arrays is going to be slow. Make sure that what you are scanning is in IWRAM (ideal) or ROM (next choice if scans are sequential.) Again - this makes think there are algorithm level optimizations to be had if you asking about this as continual monitoring of memory is gonna be pretty slow. Also, if your data is 4-byte aligned, read many entries at a time if possible to minimize memory load times.

For loops are pretty efficient - though if you are worried, unroll the loops a bit to get some performance back.

#19667 - DiscoStew - Fri Apr 23, 2004 7:59 am

To Miked0801:

- The pointers I have point directly to the structs needed, so I guess no problems there then.
- I've got my function in IWRAM, but that doesn't mean it is compiled as ARM? How would I go about doing that?
- Using -O3 instead of -O2 doesn't seem to help.
- The reason for asking about continual scanning of arrays is because I'm doing sprite priority. I'm using a sorting method, but I'm not swapping anything. I'm setting a priority value to each sprite which decides which place in OAM the sprite data will go. All sprites start at placement value 0. When comparing 2 sprites priority values, the sprite with the bigger value gets a +1 added to its placement value. After going through 2 for loops, each sprite will have a unique placement value, which sets where in OAM the data should go. For doing 28 sprites, this takes up a good 5-8 scanlines per frame to do. Not only that, but if a sprite is offscreen, I have to go through the entire lineup and do a -1 to all that have a placement value greater than that off-screen sprite, which is a problem considering that more off-screen sprite cause a slow-down. It's bad because this off-screen looping stuff has to be separate from the priority loops because of all the code in-between.

To poslundc:

- Is the distance formula you are talking about this...
Code:

(x < y) ? (x + (((y << 2) + y) >> 4)) : (y + (((x << 2) + x) >> 4))

I tried it, but now everything is off. I guess that 7% error really got me.

- I don't think I can convert to using a 256 or 512 LUT array at the moment because I am currently using the BIOS Arctan2, or perhaps I'm not quite understanding your method.



Here are my versions of Sqrt and Division, in case someone would have asked.
Code:

IN_IWRAM s32 Fast_Sqrt(s32 value)
{
  s32 result = 0;
  u32 SqrtInc = 1;
  if(value < 1024)
  {
    if(value < 0) return (0);
    while (value > 0)
    {
      value -= SqrtInc;
      SqrtInc += 2;
      result++;
    }
    if(value < 0)
      result--;
  }
  else
  {
    asm volatile ("mov r0, %1\nswi 0x080000\nmov %0, r0": "=r" (result): "r" (value): "r0");
  }
  return (result);
}

IN_IWRAM s32 Fast_Div(s32 numerator, s32 divisor, u32 firstbitcheck)
{
  s32 result = 0;
  u32 holdnumerator = numerator & 0x7FFFFFFF, holddivisor = divisor & 0x7FFFFFFF, divinc;
  if(!numerator || !divisor) return (0);
  holddivisor <<= firstbitcheck;
  divinc = 1 << firstbitcheck;
  while (divinc > 0)
  {
    while (holdnumerator > holddivisor)
    {
      holdnumerator -= holddivisor;
      result += divinc;
    }
    holddivisor >>= 1;
    divinc >>= 1;
  }
  if((numerator > 0) && (divisor > 0))
    return (result);
  if((numerator < 0) && (divisor < 0))
    return (result);
  return (-result);
}


I was up late on the night I did the division function, so the names may not be proper.
For the Sqrt function, I chose the decision value at 1024 because that is where my code becomes slower that the BIOS Sqrt. I guess I'd better try KevinW's sqrt method.

Another question, is there a faster way of getting the square of a variable without multiplying itself? -- var * var = var ^ 2
_________________
DS - It's all about DiscoStew

#19670 - alek - Fri Apr 23, 2004 8:26 am

DiscoStew wrote:
I can't remember the author, but this person made an assembly division function by the names of DIV and DIVS.


Do you mean http://www.peter-teichmann.de/ahinte.html

#19691 - DiscoStew - Fri Apr 23, 2004 4:25 pm

Yeah, thats him. I tried using his code through inline assembly, but it didn't work. Suggestions about separate files were made, so I'll have to try that, but should there be any modifications mae to work as a separate file when combining with C files? If I find out about that, then I could get KevinW's sqrt function added also. Like I said, I practically have no ASM knowledge.

About the whole compile to ARM stuff, I found this,

http://forum.gbadev.org/viewtopic.php?t=2842

and what I understood from torne (and you poslundc) was that by default the code is compiled to ARM. To compile to THUMB, use -mthumb. To allow ARM and THUMB code to work together, use -mthumb-interwork. Now the question is this, how can I compile code for both ARM and THUMB at the same time? Obviously all code in IWRAM should be in ARM, but all code in ROM should be THUMB. Does the compiler choose which code should be ARM or THUMB, or do I have to specify? If so, how do I specify, since I'm compiling both types under the compiler at the same time?
_________________
DS - It's all about DiscoStew

#19693 - poslundc - Fri Apr 23, 2004 4:43 pm

The simplest solution is don't compile both at the same time. Call gcc with the switch -c to generate intermediate output files, once for ARM/IWRAM and once again for Thumb/ROM, then call the linker ld (or just gcc again) on the output files to link them into your final file.

This is where makefiles become extremely useful.

On the other hand, if you write all of your ARM/IWRAM code in ASM and put them in their own separate files with the extension .s (or .S to run the C preprocessor on them), and all of your remaining C files are for Thumb/ROM, you can pass all of your files right into gcc with -mthumb -mthumb-interwork. The two gcc switches will apply to the C files only, since the assembly files state in them what type of code they generate and for where.

Dan.

#19697 - Miked0801 - Fri Apr 23, 2004 5:04 pm

Hmmm. I 'm pretty sure the BIOS divide is going to be faster than that C divide. I'll process the other comments a bit later after I feed my daughter :)

#19725 - DiscoStew - Fri Apr 23, 2004 9:44 pm

Well, I got that Distance formula working. The x and y values needed to be switched in the actual formula. But after using it, everything was about 1-2 pixels off from where they should have been, so unless there is a faster and more accurate method, I guess I'll stick with my Sqrt function, or use KevinW's sqrt when I add assembly to the project.

Miked0801, I just checked my Div function compared to the BIOS Div again, and it shows that mine is actually quicker. However, mine only does int divides, no remainder or other special stuff. I'm contempt with mine until I get Peter's Div working.

Thanks for the info poslundc, I'll have to try stuff that way.
_________________
DS - It's all about DiscoStew

#19727 - alek - Fri Apr 23, 2004 10:14 pm

DiscoStew wrote:
, or use KevinW's sqrt when I add assembly to the project.


if you want one that works with the GAS assembler I could e-mail it to you... Just let me know.

#19731 - Miked0801 - Fri Apr 23, 2004 11:34 pm

Lol. OK, more proof that Nintendo coded for space and not speed. I'd be curious to see the disassembly of your divide. :)

#19743 - DiscoStew - Sat Apr 24, 2004 7:15 am

Well, I was able to add both Peter's Div function and KevinW's Sqrt function through (gasp! dare I say) inline assembly. Rail me all you want, I'm glad to even have ASM in my project. They seem to work in my main() function. However, I am having problems when actually trying to call them from my function which requires the speed. When compiling with the inline assembly code called from my function, it brings up errors, as many as there are labels, saying "Symbol *somelabel* already defined. To keep it short, I'll show the beginning and ending of each function...
Code:

IN_IWRAM u32 AsmDiv(u32 dividend, u32 divisor)
{
    u32 result;
    {asm ("
      .ALIGN
      .ARM
      mov r0, %1
      mov r1, %2
      cmp r1,#1
      beq uo
      mov r2,#0
      cmp r1,r0,lsr#15
      ......
  u0:
      mov r1,r0
      mov r0,r2
      @movs pc,r14
      b ending
  uo:
      mov r1,#0
      @movs pc,r14
  ending:
      mov %0, r0
  ": "=r" (result): "r" (dividend), "r" (divisor): "r0", "r1", "r2"
  );}
return (result);
}

IN_IWRAM u32 AsmSqrt(u32 value)
{
    u32 result;
    {asm ("
      .ALIGN
      .ARM
      MOV r0, %1
      MOV r1,#(3 << 30)
      MOV r2,#(1 << 30)
      CMP r0,r2
      SUBHS r0,r0,r2
      ADC r2,r1,r2,LSL #1
      ......
      CMP r0,r2,ROR #(2 * 15)
      SUBHS r0,r0,r2,ROR #(2 * 15)
      ADC r2,r1,r2,LSL #1
      @ Rounding
      @IF :DEF:USE_ROUNDING
      @CMP r0,r2
      @ADC r2,r2,#1
      @ENDIF
      BIC r0,r2,#(3 << 30)
      MOV %0, r0
  ": "=r" (result): "r" (value): "r0", "r1", "r2"
  );}
return (result);
}

Note: Comments were placed because of compile problems and/or inaccuracy with what I wanted.

When I run code outside of IWRAM, I do a FarCall to the functions, but when I am calling IWRAM function from IWRAM, they're just normal calls. JFYI, my function is set into IWRAM, and I was just thinking that maybe calling the inline assembly code set in IWRAM from a function in IWRAM is probably the problem, though I don't know what that could be.
_________________
DS - It's all about DiscoStew

#19747 - poslundc - Sat Apr 24, 2004 1:35 pm

Post the code where you call the functions.

I just got an uneasy feeling... you aren't linking it to the module by #including the C code, are you?

Dan.

#19754 - DiscoStew - Sat Apr 24, 2004 8:49 pm

I admit it, I've been linking everything (even code) by #include, so I spent this morning getting rid of all unecessary #includes, and editing the MAK file made by the GBA AppWizard. Now it's working again, and the ASM code working in my function. The ASM divide saved only about <1 scanlines, though now I don't have to worry about my own divide function with the extra variable. The ASM sqrt, however, saved me about 3-4 scanlines, which really helped.

Now before I do any more optimizations, I'm going to try get the MAK file to separate my files for compiling to ARM and THUMB. Does THUMB code require that I add -mthumb -mthumb-interwork to work with ARM code?
_________________
DS - It's all about DiscoStew

#19755 - poslundc - Sat Apr 24, 2004 11:09 pm

DiscoStew wrote:
I admit it, I've been linking everything (even code) by #include, so I spent this morning getting rid of all unecessary #includes, and editing the MAK file made by the GBA AppWizard. Now it's working again, and the ASM code working in my function. The ASM divide saved only about <1 scanlines, though now I don't have to worry about my own divide function with the extra variable. The ASM sqrt, however, saved me about 3-4 scanlines, which really helped.


The only thing you should ever #include is header files (.h), and they should only ever contain structure definitions, macro definitions, function prototypes, and declarations of extern variables. They should never contain any actual, non-extern variable declarations, or any executable code (ie. functions). (Exception: static inline functions.)

Quote:
Now before I do any more optimizations, I'm going to try get the MAK file to separate my files for compiling to ARM and THUMB. Does THUMB code require that I add -mthumb -mthumb-interwork to work with ARM code?


Yes.

Dan.

#19888 - DiscoStew - Tue Apr 27, 2004 5:10 pm

Well, it's been a few days since I was programming, but now the files are separated so that they are split into ARM and THUMB code. However, I can't really check to see how much of an improvement this would get me, considering that I have very little code in THUMB (just the main(), sprite loading, and a few other functions). The only thing I'm currently profiling is that one function that does all the sprite updating.

While I continue to optimize my code, I would still like any other general suggestions that could help. thank you all
_________________
DS - It's all about DiscoStew

#19995 - poslundc - Thu Apr 29, 2004 3:23 pm

Unless you're writing an interrupt service routine or something very low-level like an audio mixer or mathematical function (such as the div function), there is probably little reason for you to compile to anything other than Thumb/ROM at this stage. Premature optimization is the root of all things evil.

Dan.