#44353 - staticboy - Tue May 31, 2005 10:52 pm
I'm certainly no C/C++ guru, but have reached the level where I'm thinking about what makes the best/fastest code for a GBA game.
My thoughts concern the pros and cons of function-like macros vs. inline functions (or normal functions, for that matter). Traditionally, most C/C++ developers would go for functions over macros because they can be debugged and are type safe. However, a real function call adds overhead (the branch, argument passing, saving registers), leading to (slightly) slower code. For desktop applications this is probably of no consequence. For the GBA it could be totally different! I have read that standard C header files can contain functions with macro alternatives (masking macros) that execute faster than the functions of the same name, but these should be reserved for use in release code.
All of this was triggered when I looked at my function to perform a DMA fast copy:
Code: |
//defines needed by DMAFastCopy
#define REG_DMA3SAD (*(volatile unsigned int*)0x40000D4)
#define REG_DMA3DAD (*(volatile unsigned int*)0x40000D8)
#define REG_DMA3CNT (*(volatile unsigned int*)0x40000DC)
#define DMA_ENABLE 0x80000000
#define DMA_TIMING_IMMEDIATE 0x00000000
#define DMA_16 0x00000000
#define DMA_32 0x04000000
#define DMA_32NOW (DMA_ENABLE | DMA_TIMING_IMMEDIATE | DMA_32)
#define DMA_16NOW (DMA_ENABLE | DMA_TIMING_IMMEDIATE | DMA_16)
// DMAFastCopy only for 16/32-bit immediate transfers
void DMAFastCopy(void* source, void* dest, unsigned int count, unsigned int mode)
{
    if (mode == DMA_16NOW || mode == DMA_32NOW)
    {
        REG_DMA3SAD = (unsigned int)source;
        REG_DMA3DAD = (unsigned int)dest;
        REG_DMA3CNT = count | mode;
    }
} |
So, I got to thinking that this could easily be re-written as a function-like macro. Definitely not so easy to read and would be almost impossible to debug, but acceptable risks for such a well-defined action if it proved to make for faster execution:
Code: |
// DMA Fast Copy MACRO
#define DMAFASTCOPY(source, dest, count, mode) if((mode)==DMA_16NOW||(mode)==DMA_32NOW){REG_DMA3SAD=(unsigned int)(source);REG_DMA3DAD=(unsigned int)(dest);REG_DMA3CNT = (count)|(mode);} |
During my research in the deeper depths of various online C references I discovered "inlining" functions. So, my DMAFastCopy could be transformed into an inline function by the simple addition of the inline directive:
Code: |
inline void DMAFastCopy(void* source, void* dest, unsigned int count, unsigned int mode)
{
    if (mode == DMA_16NOW || mode == DMA_32NOW)
    {
        REG_DMA3SAD = (unsigned int)source;
        REG_DMA3DAD = (unsigned int)dest;
        REG_DMA3CNT = count | mode;
    }
} |
As the inline directive is only a "hint", how do I find out what optimisation choice the compiler actually made? Did it perform an inline replacement each time I used the function or did it decide that going for a proper function call was the best way forward?
I see no difference in final binary sizes using any of these methods, nor any noticeable increase/decrease in performance. Maybe my game just doesn't have enough bells and whistles running yet to really tax the CPU for this sort of fine tuning to make any difference?
Any thoughts or experiences on this subject will be received with great interest!
#44358 - strager - Tue May 31, 2005 11:43 pm
When compiling your file, add a little -save-temps to the command line. It will make a file name.s, and you can look in there for your answer.
Me, I would prefer the use of standard functions. Why? It takes up less space in ROM, and it is readable (for the most part).
I don't know about inline, but it looks like C++ syntax to me, and that means it is un-trustable (IMO). Be warned.
#44362 - Dwedit - Wed Jun 01, 2005 12:02 am
You know, backslash + newline is the same as nothing, so to continue a line over multiple lines, just stick a backslash as the last character.
#define something \
line 1 \
line 2 \
line 3...
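Applied to the DMA macro from the first post, that might look like the sketch below. The fake_* variables are made-up stand-ins for the real REG_DMA3* registers so that it also builds on a PC; everything else follows the original macro.

```c
#include <assert.h>
#include <stdint.h>

#define DMA_16NOW 0x80000000u
#define DMA_32NOW 0x84000000u

/* Hypothetical stand-ins for REG_DMA3SAD/DAD/CNT (not real hardware). */
static volatile uintptr_t fake_sad, fake_dad;
static volatile unsigned int fake_cnt;

/* Same body as the one-line DMAFASTCOPY, one statement per line.
   Every line except the last ends in a backslash. */
#define DMAFASTCOPY(source, dest, count, mode)            \
    if ((mode) == DMA_16NOW || (mode) == DMA_32NOW) {     \
        fake_sad = (uintptr_t)(source);                   \
        fake_dad = (uintptr_t)(dest);                     \
        fake_cnt = (count) | (mode);                      \
    }

/* Issues one 32-bit copy request and returns the control word written. */
static unsigned int demo_copy(void *src, void *dst, unsigned int count)
{
    DMAFASTCOPY(src, dst, count, DMA_32NOW);
    assert(fake_dad == (uintptr_t)dst);
    return fake_cnt;
}
```

The preprocessor joins the lines back into one before the compiler sees them, so this should generate the same code as the single-line version.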
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."
#44365 - sajiimori - Wed Jun 01, 2005 12:11 am
Compile in C99 or C++ mode and use inline functions. They are 100% as fast as macros.
Far from being untrustworthy, C++ features can let you write code that is fast and good.
All that aside, forget about optimizing until you have a bottleneck. In the case of large DMA copies, the copying itself will be the majority of the work, not the setup, so inlining is not likely to be very important.
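A minimal sketch of the inline route, assuming GCC in C99 mode. It uses static inline rather than plain inline, since in C99 a plain inline definition does not by itself provide an out-of-line copy, which can lead to link errors when a call is not inlined. The FakeDma3 struct and the dma_fast_copy/dma_demo names are invented for illustration so the code also runs on a PC; on hardware you would write to the real registers.

```c
#include <assert.h>
#include <stdint.h>

#define DMA_ENABLE 0x80000000u
#define DMA_32     0x04000000u
#define DMA_16NOW  (DMA_ENABLE)
#define DMA_32NOW  (DMA_ENABLE | DMA_32)

/* Hypothetical stand-in for the three DMA3 registers. */
typedef struct {
    volatile uintptr_t sad;   /* source address */
    volatile uintptr_t dad;   /* destination address */
    volatile uint32_t  cnt;   /* count | mode */
} FakeDma3;

/* static inline: private to this file, so no separate out-of-line
   definition is needed if the compiler decides not to inline a call. */
static inline void dma_fast_copy(FakeDma3 *dma, const void *source,
                                 void *dest, uint32_t count, uint32_t mode)
{
    if (mode == DMA_16NOW || mode == DMA_32NOW) {
        dma->sad = (uintptr_t)source;
        dma->dad = (uintptr_t)dest;
        dma->cnt = count | mode;
    }
}

/* Returns the control word after one 32-bit copy request. */
static uint32_t dma_demo(void)
{
    static uint32_t src[4], dst[4];
    FakeDma3 dma = { 0, 0, 0 };
    dma_fast_copy(&dma, src, dst, 4, DMA_32NOW);
    assert(dma.sad == (uintptr_t)src);
    return dma.cnt;
}
```

When mode is a compile-time constant at the call site, an optimising build can usually drop the mode check entirely after inlining, but that is worth confirming in the generated assembly rather than assuming.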
#44366 - poslundc - Wed Jun 01, 2005 12:14 am
I googled up this entry from the GCC documentation.
According to it, you must be building with at least -O1 for functions to be inlined. This page suggests that it must be -O3 (unless you specify -finline-functions), although it may be that this applies only to GCC automatically inlining functions that it feels are worth inlining, without you providing the keyword.
From the first article, as to when functions can't be inlined:
Quote: |
Note that certain usages in a function definition can make it unsuitable for inline substitution. Among these usages are: use of varargs, use of alloca, use of variable sized data types (see Variable Length), use of computed goto (see Labels as Values), use of nonlocal goto, and nested functions (see Nested Functions). Using -Winline will warn when a function marked inline could not be substituted, and will give the reason for the failure. |
None of those are frequently-used features in GBA programming.
Inlined functions are generally preferable to macros, as they are less error-prone, provide features like type-checking, and are easy to switch to actual functions should you need to down the road.
Dan.
#44396 - Cearn - Wed Jun 01, 2005 9:08 am
sajiimori wrote: |
They are 100% as fast as macros. |
This isn't *quite* true. In pretty much all cases I've checked there is no difference in output between macros and inline functions, but not all. In particular, it doesn't seem to do all that is possible when you're using constants, even if the arguments have the const qualifier. This could cause ugly code when you try to piece together all the constants for a register: instead of folding all of them into a single value, you could end up with a bunch of loads, ORs and shifts. So that'd probably make them ~95% as fast as macros :). Note: I've only tested this with just C, not C++ or a proper C99 option, so that may be my problem.
Aside from that, I agree that inline functions are much friendlier than macros in use. And they can be a lot faster than normal functions if the body is short; and sometimes even produce smaller code because it can be optimised with the rest of the caller.
#44412 - Suboptimal - Wed Jun 01, 2005 3:22 pm
Note that there is a rather nasty potential land mine in this macro:
Code: |
// DMA Fast Copy MACRO
#define DMAFASTCOPY(source, dest, count, mode) if((mode)==DMA_16NOW||(mode)==DMA_32NOW){REG_DMA3SAD=(unsigned int)(source);REG_DMA3DAD=(unsigned int)(dest);REG_DMA3CNT = (count)|(mode);}
|
Let's say you tried to do something like this:
Code: |
if ( bFoo )
    DMAFASTCOPY( ... );
else
    DMAFASTCOPY( ... );
|
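Seems trivial enough, right? Unfortunately, it won't do what it looks like. With the semicolons as written, the stray ; after the macro's closing brace cuts the else off from any if, so it doesn't even compile. Drop the semicolons and it compiles, but the else pairs with the if inside the macro rather than with if ( bFoo ), so the second copy only runs when bFoo is true and the first mode check fails. It expands to this:
Code: |
if ( bFoo )
    if((mode)==DMA_16NOW||(mode)==DMA_32NOW)
    { .... }
    else
        if((mode)==DMA_16NOW||(mode)==DMA_32NOW)
        { .... }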
|
The macro could be re-written to avoid this, but I tend to steer away from macros just because I always end up shooting myself in the foot with errors like this, usually long after I have forgotten I bought a gun. I guess problems like that are rarer when you're doing the last wee bit of optimizing, since you're probably being more meticulous.
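One common way to do that re-write is to wrap the body in do { ... } while (0), so the macro behaves like a single statement and needs its trailing semicolon. A sketch, with the register writes swapped for two made-up variables (last_word, copies_issued) so it can be run on a PC:

```c
#include <assert.h>

#define DMA_16NOW 0x80000000u
#define DMA_32NOW 0x84000000u

/* Hypothetical stand-ins for the register writes. */
static unsigned int last_word;   /* last count|mode value "written" */
static int copies_issued;        /* how many times the body ran */

/* do { ... } while (0) makes the expansion one statement, so it is
   safe as the body of an if/else without extra braces. */
#define DMAFASTCOPY_SAFE(count, mode)                        \
    do {                                                     \
        if ((mode) == DMA_16NOW || (mode) == DMA_32NOW) {    \
            last_word = (count) | (mode);                    \
            copies_issued++;                                 \
        }                                                    \
    } while (0)

/* The if/else from the post above; both branches now work. */
static unsigned int pick_copy(int use_32)
{
    if (use_32)
        DMAFASTCOPY_SAFE(8u, DMA_32NOW);
    else
        DMAFASTCOPY_SAFE(8u, DMA_16NOW);
    return last_word;
}
```

An optimising compiler should remove the loop construct completely, since the condition is a constant zero.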
Incidentally, does anyone know if the code in inlined functions is run in the same scope as the calling function, or does it follow the traditional scoping rules of a called function? I would assume it scopes properly, otherwise it would behave differently when the compiler can actually inline it and when it can't, which would be evil.
#44415 - poslundc - Wed Jun 01, 2005 4:02 pm
Suboptimal wrote: |
Incidentally, does anyone know if the code in inlined functions is run in the same scope as the calling function, or does it follow the traditional scoping rules of a called function? I would assume it scopes properly, otherwise it would behave differently when the compiler can actually inline it and when it can't, which would be evil. |
Inlined functions follow the conventional scoping rules for functions.
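To make the difference concrete, here is a small sketch (all names invented for the example). The macro body is pasted into the caller and so touches the caller's variable; the inline function works on its own copy, whether or not the compiler inlines it.

```c
#include <assert.h>

/* Macro: "total" means whatever variable of that name is visible
   at the point of use. */
#define ADD_TO_TOTAL(n) (total += (n))

/* Inline function: has its own scope; parameters are copies. */
static inline int add_to(int total, int n)
{
    total += n;      /* modifies the local copy only */
    return total;
}

static int scope_demo_macro(void)
{
    int total = 10;
    ADD_TO_TOTAL(5);          /* changes the caller's local */
    return total;             /* 15 */
}

static int scope_demo_inline(void)
{
    int total = 10;
    (void)add_to(total, 5);   /* caller's total is untouched */
    return total;             /* 10 */
}
```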
With all the pro-inlined-functions talk in this thread, I want to add to my previous post to mention that macros have their time and place as well. The difference between the two is that macros were designed to provide a feature that can be used to solve a specific problem, whereas inlined functions were designed to solve a specific problem.
So, since inlined functions were designed for the specific problem of letting you create small functions without causing performance to degrade, use them for that. Use macros when you don't specifically want to do that, but need to represent some piece of code or value with a different, global token.
Dan.
#44427 - sajiimori - Wed Jun 01, 2005 6:45 pm
Quote: |
In pretty much all cases I've checked there is no difference in output between macros and inline functions, but not all. |
You caught me being idealistic again. ;) Inline functions should be 100% as fast as macros, but the reality of imperfect compilers messes that up a bit.
#44547 - staticboy - Thu Jun 02, 2005 8:12 pm
Thanks for all your replies, very informative and plenty of food for thought. I guess I'll just have to take it on a case-by-case basis, trying different methods and measuring the effect on final performance.
This is probably a silly question: Is there any way to benchmark ROMS?
#44561 - tepples - Thu Jun 02, 2005 9:49 pm
staticboy wrote: |
This is probably a silly question: Is there any way to benchmark ROMS? |
No$gba has a profiler, but most hobbyists can't afford it. Another technique, which works on No$gba freeware, VBA, and hardware is to execute something repeatedly in a loop and use either timers or vcount to see how long it takes.
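The loop-and-timer idea can be sketched as below. This is a PC-side sketch, not GBA code: read_ticks uses clock() from the C library as a stand-in, and on a GBA you would read a hardware timer counter or the vertical counter there instead. The names read_ticks, time_loop and work_under_test are made up for the example.

```c
#include <assert.h>
#include <time.h>

/* Tick source. clock() is only a stand-in for a hardware timer. */
static unsigned long read_ticks(void)
{
    return (unsigned long)clock();
}

/* Runs fn `iterations` times and returns the elapsed ticks. */
static unsigned long time_loop(void (*fn)(void), unsigned int iterations)
{
    unsigned long start = read_ticks();
    unsigned int i;
    for (i = 0; i < iterations; i++)
        fn();
    return read_ticks() - start;
}

/* volatile keeps the compiler from optimising the work away. */
static volatile unsigned int sink;

static void work_under_test(void)
{
    sink += 3;
}
```

Note that calling through a function pointer adds call overhead and stops the code from being inlined, so for a macro-versus-inline comparison it is probably better to write the loop directly around the code being measured.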
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.
#45085 - Mollusk - Tue Jun 07, 2005 4:24 pm
I actually tested some small bits of code with both macros and inline functions, with timers, on GBA. It looked like macros were a little bit faster than inline functions, not sure why.
#45086 - strager - Tue Jun 07, 2005 4:28 pm
I believe it is because inline functions take the time to push/pop registers that don't need pushing/popping. I've been in favor of macros throughout this discussion, and I guess I am on the winning team. :-)
#45100 - sajiimori - Tue Jun 07, 2005 6:10 pm
If the inline version is slower, you are either building in a way that prevents the compiler from optimizing properly, or it is a compiler issue.
There is no single reason that an inline version would produce slower code. It's not so much the "winning team" as the "placate the broken compiler" team, though that's occasionally necessary in real life.
Again, when it matters, always check the output.
#45115 - tepples - Tue Jun 07, 2005 7:41 pm
Have you tried compiling each version (with macros and with inline functions) with -S instead of -c, sending output to a .s file, and comparing the assembly language files that GCC generates?
#45296 - Cearn - Thu Jun 09, 2005 9:27 am
For some reason inline functions consistently put things inside a loop that shouldn't be there. For example:
Code: |
static inline void m3_plot(int x, int y, u16 clr)
{ vid_mem[y*240+x]= clr; }
// draw a rectangle
for(iy=8; iy<64; iy++)
    for(ix=8; ix<64; ix++)
        m3_plot(ix, iy, 0xdead);
|
Because 0x06000000 is too large a number for Thumb code to load in one go, it has to be constructed in two instructions. For some reason, this happens inside the loop(s), rather than outside. The same thing happens to macros once the loops get a little more complicated. Mind you, this is still 3x faster than if m3_plot were a real function.
Another reason for slower code is if you use non-word types for parameters, as they will probably have code for sign-extension.
Not using the const qualifier is also a possible slow-down, especially if you intend to combine them with arithmetic/logic operators: instead of all the constants folding together, you might actually get all the adds, ors, shifts etc that make up the code.
The optimiser can be a little fickle at times, especially when it comes to inline functions. Just changing the order of statements can sometimes make some difference. Macros may require a little more caution in use, but at least you know what you get.
Oh, and pushing/popping only happens for real functions, not inline. Perhaps when inline functions' code gets real hairy there may be stack handling too, but if that happens the code shouldn't be inlined anyway.
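If the compiler won't hoist the address calculation on its own, one thing to try is doing it by hand: compute a row pointer once per row and index it in the inner loop. A sketch, where vid_mem is an ordinary array standing in for the mode 3 frame buffer at 0x06000000 and m3_rect is a made-up helper:

```c
#include <assert.h>
#include <stdint.h>

#define SCREEN_W 240
#define SCREEN_H 160

/* Stand-in for the frame buffer so the sketch runs on a PC. */
static uint16_t vid_mem[SCREEN_W * SCREEN_H];

/* Fills the rectangle [left,right) x [top,bottom) with clr. */
static inline void m3_rect(int left, int top, int right, int bottom,
                           uint16_t clr)
{
    int ix, iy;
    for (iy = top; iy < bottom; iy++) {
        /* Row start is computed once per row, outside the inner loop. */
        uint16_t *row = &vid_mem[iy * SCREEN_W];
        for (ix = left; ix < right; ix++)
            row[ix] = clr;
    }
}
```

Calling m3_rect(8, 8, 64, 64, 0xdead) covers the same pixels as the m3_plot loops above. Whether this actually beats the per-pixel version depends on the compiler and flags, so check the output.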
#45337 - Miked0801 - Thu Jun 09, 2005 7:35 pm
This of course being a perfect place for an assembly function.
Of course, I took it a bit too far, but it was fun and I haven't had practice in a long while. Thumb is not nearly as fun as ARM :)
Code: |
@untested! - just spewed into the box for fun :)
@ preserve
push {r4,r5,r6}
@ Get start of video mem + 8 halfwords (r3 temp for now).
mov r3,#6
lsl r3,r3,#24
add r3,r3,#16
@ Load r2 with 480 for later increments
mov r2,#240
lsl r2,r2,#1
@ Get end y into r1 = vid_mem + 240 * 64 (*2 halfwords)
lsl r1,r2,#6
add r1,r1,r3
@ Get start y into r0 = vid_mem + 240 * 8 (*2 halfwords)
lsl r0,r2,#3
add r0,r0,r3
@ the stores below advance r0 by 56*2 = 112 bytes, so step by 480 - 112
sub r2,r2,#112
@r3 holds 32-bit 0xdeaddead
mov r3,#0xde
lsl r3,r3,#8
add r3,r3,#0xad
@ do last 16-bits at once (r4 as scratch so r1 keeps the end address)
lsl r4,r3,#16
add r3,r3,r4
@ copy into r4,r5,r6 to eliminate inner loop
mov r4,r3
mov r5,r3
mov r6,r3
lp:
@ blast - works because x is even; 7 stores of 8 pixels = 56 pixels
stmia r0!,{r3,r4,r5,r6}
stmia r0!,{r3,r4,r5,r6}
stmia r0!,{r3,r4,r5,r6}
stmia r0!,{r3,r4,r5,r6}
stmia r0!,{r3,r4,r5,r6}
stmia r0!,{r3,r4,r5,r6}
stmia r0!,{r3,r4,r5,r6}
@ increment loop (to the start of the next row) and check against end
add r0,r0,r2
cmp r0,r1
bne lp
@ cleanup
pop {r4,r5,r6}
bx lr
|
#45418 - jma - Fri Jun 10, 2005 3:11 pm
tepples wrote: |
Have you tried compiling each version (with macros and with inline functions) with -S instead of -c, sending output to a .s file, and comparing the assembly language files that GCC generates? |
They will almost never be the same. In very simple situations they can be, but consider the following:
Code: |
#define RAND(min, max) \
    ((min) + (rand() * ((max) - (min)) / RAND_MAX))
inline int my_rand(int min, int max) {
    return min + rand() * (max - min) / RAND_MAX;
} |
Now let's actually use each:
Code: |
int x1 = RAND(y * 100, y * 100 + 5);
int x2 = my_rand(y * 100, y * 100 + 5); |
The code generation for these will be terrible for the macro and much better for the inline function.
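Part of the reason is that the macro pastes the min argument in twice, so any work or side effect in that argument happens twice. A sketch that counts the evaluations follows. To keep the arithmetic from overflowing on systems with a large RAND_MAX, it uses a made-up small_rand() limited to 0..255 instead of rand() directly; next_min and the evals_* helpers are also invented for the example.

```c
#include <assert.h>
#include <stdlib.h>

#define SMALL_RAND_MAX 255

/* Hypothetical helper: random value in 0..SMALL_RAND_MAX. */
static int small_rand(void)
{
    return rand() % (SMALL_RAND_MAX + 1);
}

/* Same shape as the RAND macro above: min appears twice. */
#define RAND(min, max) \
    ((min) + (small_rand() * ((max) - (min)) / SMALL_RAND_MAX))

static inline int my_rand(int min, int max)
{
    return min + small_rand() * (max - min) / SMALL_RAND_MAX;
}

static int calls;   /* how many times next_min() has run */

static int next_min(void)
{
    calls++;
    return 100;
}

/* Number of times the min argument is evaluated by the macro. */
static int evals_with_macro(void)
{
    int x;
    calls = 0;
    x = RAND(next_min(), 105);
    (void)x;
    return calls;
}

/* Same measurement for the inline function. */
static int evals_with_inline(void)
{
    int x;
    calls = 0;
    x = my_rand(next_min(), 105);
    (void)x;
    return calls;
}
```

With a plain expression like y * 100 the compiler may well merge the duplicate work, so the cost in that case is something to check in the assembly rather than take for granted.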
Jeff M.
_________________
massung@gmail.com
http://www.retrobyte.org
#45496 - tepples - Sat Jun 11, 2005 5:55 am
If you're referring to recalculation of reused temporary values in a macro, then what about a macro using a compound statement within an expression (a [url=http://www-2.cs.cmu.edu/cgi-bin/info2www?(gcc.info)Statement%20Exprs]GCC extension to C[/url])? How does that perform?
Code: |
#define RAND(min, max) \
    ({int _min = (min); \
    (_min) + (rand() * ((max) - (_min)) / RAND_MAX);})
|
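For what it's worth, a statement expression does yield a value: the value of the last expression statement inside the braces. A tiny sketch, assuming GCC or a compatible compiler such as Clang (this is an extension, not standard C); SQUARE_ONCE and square_and_bump are made-up names:

```c
#include <assert.h>

/* GCC extension: ({ ... }) evaluates to its last expression.
   The argument is copied into _x, so it is evaluated only once. */
#define SQUARE_ONCE(x)   \
    ({ int _x = (x);     \
       _x * _x; })

/* Squares *n and post-increments it; the increment happens once. */
static int square_and_bump(int *n)
{
    return SQUARE_ONCE((*n)++);
}
```

Starting from n = 3, square_and_bump(&n) gives 9 and leaves n at 4, whereas a plain ((x) * (x)) macro would apply the increment twice.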
#45552 - jma - Sat Jun 11, 2005 11:19 pm
tepples wrote: |
If you're referring to recalculation of reused temporary values in a macro, then what about a macro using a compound statement within an expression... ? |
As I don't use GCC (and never have), I wouldn't know. However, I do find that terribly unreadable. Does it actually return a value (no sarcasm, just an honest question as to whether or not it works)? Why not just use the inline function which is portable across compilers? :)
Jeff M.
#51513 - ribrdb2 - Sat Aug 20, 2005 9:47 pm
tepples wrote: |
staticboy wrote: | This is probably a silly question: Is there any way to benchmark ROMS? |
No$gba has a profiler, but most hobbyists can't afford it. |
VBA also supports profiling with gprof. It's kind of tricky and somewhat limited, but very useful when you want to know where your time is being spent.