#2191 - animension - Fri Jan 31, 2003 12:22 am
Hey all,
I'm trying to benchmark some C code against ASM code that performs the same task, and I'm wondering what a good, accurate method would be. Specifically, I'd like to bench the BIOS DIV util vs the "/" operator in integer divide operations. I have written a crude benchmarking test program and I think there's something inaccurate with it, but I can't see anything out of the ordinary.
As I understand it, the BIOS util is supposed to be far superior to using the "/" operator. The code I have that uses the util is as follows:
Code: |
.code 32
.section .iwram
.align
.global biosdiv
biosdiv:
    swi 0x60000     @ BIOS Div (function 6; ARM-state SWI numbers go in bits 16-23)
    bx lr           @ quotient is returned in r0
|
As you can see, it's very simple: it's assembled as ARM code and placed in IWRAM for speedy ARM execution.
The code that benches the util against compiled "/" operations is as follows:
Code: |
#include "mygba.h"

extern "C" signed long biosdiv(
    signed long dividend,
    signed long divisor
);

// Shared with the VBL interrupt handler, so they are declared volatile.
volatile unsigned int aframes = 0, bframes = 0;
volatile bool adone = false, bdone = false;

const unsigned long MAXCOUNT = 1000000;
const unsigned long PLACEBO = 7;

// Runs once per frame: counts frames for whichever test is still running.
void vbl(void){
    if(!adone){
        aframes++;
    } else if (!bdone){
        bframes++;
    }
    ham_DrawText(0,0,"A: %d",aframes);
    ham_DrawText(0,1,"B: %d",bframes);
}

int doatask(int input){
    return (input / PLACEBO);
}

int dobtask(int input){
    return biosdiv(input,PLACEBO);
}

int main(void){
    ham_Init();
    ham_SetBgMode(0);
    ham_InitText(0);
    ham_StartIntHandler(INT_TYPE_VBL,(void*)&vbl);
    int dummy = 0;
    unsigned long i;
    // Test A: the C "/" operator
    for (i = 0; i < MAXCOUNT; i++){
        doatask(i);
    }
    adone = true;
    dummy = 0;
    // Test B: the BIOS divide through the ASM wrapper
    for (i = 0; i < MAXCOUNT; i++){
        dobtask(i);
    }
    bdone = true;
    while(1);
}
|
In a nutshell, this code first counts how many frames it takes to do MAXCOUNT (one million) arithmetic calculations, with each iteration dividing "i" (the loop iterator) by PLACEBO (in this case 7) using the C "/" operator. Every frame it updates the frame count on screen, and when the task is finished it proceeds to the second task, which does exactly the same thing but calls the function written in ASM.
The results were disappointing. I got 542 frames for the "/" operator versus 1826 frames for the ASM function. This test was done on hardware. Could there be anything that needs to be checked for inaccuracies? As far as I know, I'm subjecting both functions to the same tests with the same amount of overhead needed to conduct the tests themselves... I find it strange that everyone raves about the BIOS util when the bench shows it taking more than 3 times as long... something is not right with my code.
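To compare runs that use different MAXCOUNT values, frame counts can be turned into approximate CPU cycles per loop iteration. This is only a rough sketch: it assumes the usual GBA figure of 280,896 CPU cycles per frame (228 scanlines of 1232 cycles each), the helper name is made up for illustration, and the result includes loop, call and VBL-handler overhead, not just the divide.

```c
#include <stdint.h>

/* Assumption: one GBA frame lasts 280,896 CPU cycles (228 scanlines x 1232 cycles). */
#define CYCLES_PER_FRAME 280896ULL

/* Hypothetical helper: average cycles per loop iteration, rounded down.
   Everything that ran during those frames is counted, including the
   interrupt handler that draws the text. */
uint64_t cycles_per_iteration(uint64_t frames, uint64_t iterations) {
    return (frames * CYCLES_PER_FRAME) / iterations;
}
```

With the numbers above, cycles_per_iteration(542, 1000000) gives 152 and cycles_per_iteration(1826, 1000000) gives 512.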
_________________
"Beer is proof that God loves us and wants us to be happy."
-- Benjamin Franklin
#2193 - dragor - Fri Jan 31, 2003 12:45 am
Here's one idea. With the / test, the test is done immediately, whereas the bios test uses an extra function call before it starts executing. The extra time for the bios might be because of this. Try putting your asm inline.
_________________
Sum Ergo Cogito
#2195 - animension - Fri Jan 31, 2003 1:06 am
Ah, I see. However, I ran into another problem. I changed "dobtask" to contain:
Code: |
int dobtask(int input){
    int retval = 0;
    asm volatile(
        "add r0,%1,#0\n" // put value of input into r0
        "add r1,%2,#0\n" // put value of PLACEBO into r1
        "swi 6\n" // BIOS DIV util SWI
        "ldr r2,%0" // put address of retval into r2
        "str r0,[r2]" // store result of SWI 6 into memory address in r2
        : "=m" (&retval) // address of retval is %0
        : "r" (input), "r" (PLACEBO) // input is %1, PLACEBO is %2
        : "r0","r1","r2" // we smash and crush r0,r1, and r2 registers
    );
    return retval;
}
|
and what I get when I compile it is:
Code: |
D:/ham/gcc-arm/bin/arm-thumb-elf-gcc.exe -I D:/ham/gcc-arm/include -I D:/ham/gcc-arm/arm-thumb-elf/include -I D:/ham/include -I D:/ham/system -c -DHAM_HAM -DHAM_MULTIBOOT -DHAM_ENABLE_MBV2LIB -O2 -DHAM_WITH_LIBHAM -mthumb-interwork -mlong-calls -Wall -save-temps -fverbose-asm test.cpp
test.cpp: In function `int dobtask(int)':
test.cpp:26: output number 0 not directly addressable
test.cpp:37: confused by earlier errors, bailing out
make: *** [test.o] Error 1
|
the same bleeding problem I had last time I tried to inline ASM anything. The code looks correct and I just cannot figure out why the compiler is whining.
_________________
"Beer is proof that God loves us and wants us to be happy."
-- Benjamin Franklin
#2197 - Rich - Fri Jan 31, 2003 1:23 am
Try changing your code to...
Code: |
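int dobtask(int input)
{
    int retval = 0;
    asm volatile
    (
        "mov r0,%1\n"   // dividend
        "mov r1,%2\n"   // divisor
        "swi 6\n"       // BIOS divide
        "mov %0,r0"     // quotient comes back in r0
        : "=r" (retval)
        : "r" (input), "r" (PLACEBO)
        : "r0", "r1", "r3" // the BIOS divide overwrites r0, r1 and r3
    );
    return retval;
}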
|
#2201 - animension - Fri Jan 31, 2003 2:56 am
Wow, that worked. What made the difference? Using "=r" instead of "=m"? How does it make a difference? Also, isn't the mov instruction limited to 8-bit numbers that are/can be shifted?
I also added code that would spit out the value of the dummy variable and print it to see accuracy for the two tests. For the "/" test I got a value of -1586372603 and for the SWI 6 test I got 1783293664.
Huh?
_________________
"Beer is proof that God loves us and wants us to be happy."
-- Benjamin Franklin
#2203 - Rich - Fri Jan 31, 2003 3:27 am
Check out http://www.devrs.com/gba/files/asmrules.html for more info about inline asm, it explains things better than I can :)
Mov can move registers, which can contain any 32bit value, or immediate values which are limited to certain values.
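To make "certain values" concrete: in ARM state, the immediate form of mov holds an 8-bit value rotated right by an even number of bits. A small sketch of that rule in C follows; the function names are made up for illustration, and it only checks the plain mov encoding (an assembler may still accept other constants by substituting a different instruction).

```c
#include <stdbool.h>
#include <stdint.h>

/* Rotate a 32-bit value left by n bits (n in 0..31). */
static uint32_t rotl32(uint32_t v, unsigned n) {
    return n ? (v << n) | (v >> (32u - n)) : v;
}

/* True if v equals some 8-bit value rotated right by an even amount,
   which is what the immediate field of an ARM-state mov can hold. */
bool is_arm_immediate(uint32_t v) {
    for (unsigned rot = 0; rot < 32; rot += 2) {
        /* Undo a rotate-right by rot; the result must fit in 8 bits. */
        if (rotl32(v, rot) <= 0xFFu) return true;
    }
    return false;
}
```

So 7, 0x3FC and 0xFF000000 fit, while 0x101, 0x1FE and 1000000 do not and have to be loaded some other way.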
As for the discrepancies in your accuracy tests, it looks like a problem with overflow in signed / unsigned numbers. Can you post the values you were testing with and maybe the code you used to print the results?
Rich.
#2204 - animension - Fri Jan 31, 2003 3:38 am
I tried running this on hardware and it crashed when it reached the start of the ASM benchmark. No idea why. Worked perfectly in VBA.
When I do run it in VBA I get different results for both tests. The "A" test gets a value of -1586372603 and the "B" test gets a value of 1783293664.
Here is the code in full:
Code: |
#include "mygba.h"

// Shared with the VBL interrupt handler, so they are declared volatile.
volatile unsigned int aframes = 0, bframes = 0;
volatile bool adone = false, bdone = false;

const unsigned long MAXCOUNT = 1000000;
const unsigned long PLACEBO = 7;

// Runs once per frame: counts frames for whichever test is still running.
void vbl(void){
    if(!adone){
        aframes++;
    } else if (!bdone){
        bframes++;
    }
    ham_DrawText(0,0,"A: %d",aframes);
    ham_DrawText(0,1,"B: %d",bframes);
}

int doatask(signed long input){
    return int(input / PLACEBO);
}

int dobtask(signed long input)
{
    int retval = 0;
    asm volatile
    (
        "mov r0,%1\n"
        "mov r1,%2\n"
        "swi 6\n"
        "mov %0,r0"
        : "=r" (retval)
        : "r" (input), "r" (PLACEBO)
        : "r0", "r1", "r3" // the BIOS divide overwrites r0, r1 and r3
    );
    return retval;
}

int main(void){
    ham_Init();
    ham_SetBgMode(0);
    ham_InitText(0);
    ham_StartIntHandler(INT_TYPE_VBL,(void*)&vbl);
    signed long adummy = 0;
    signed long bdummy = 0;
    unsigned long i;
    // Test A: the C "/" operator
    for (i = 0; i < MAXCOUNT; i++){
        adummy += doatask(i);
    }
    adone = true;
    ham_DrawText(0,4,"AV: %10d",adummy);
    // Test B: the BIOS divide through inline ASM
    for (i = 0; i < MAXCOUNT; i++){
        bdummy += dobtask(i);
    }
    bdone = true;
    ham_DrawText(0,5,"BV: %10d",bdummy);
    while(1);
}
|
_________________
"Beer is proof that God loves us and wants us to be happy."
-- Benjamin Franklin
#2205 - Rich - Fri Jan 31, 2003 3:49 am
Okay firstly try changing the swi 6 to swi 0x60000
And secondly, try decreasing MAXCOUNT to something like 1000 to make sure that adummy and bdummy aren't overflowing. You can always increase MAXCOUNT bit by bit later.
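On the first point, the difference between swi 6 and swi 0x60000 comes from how the GBA BIOS reads the function number. As far as I know, in ARM state it looks only at bits 16-23 of the 24-bit SWI comment field, while in Thumb state the 8-bit field is the number itself. A sketch of that rule, with made-up helper names:

```c
#include <stdint.h>

/* Immediate to write after "swi" for BIOS function number fn.
   Assumption (GBA BIOS convention): ARM state uses bits 16-23 of the
   comment field, Thumb state uses the 8-bit field directly. */
uint32_t swi_immediate(uint8_t fn, int thumb_state) {
    return thumb_state ? (uint32_t)fn : ((uint32_t)fn << 16);
}

/* Function number the BIOS would see for an ARM-state "swi imm". */
uint8_t bios_fn_from_arm_immediate(uint32_t imm) {
    return (uint8_t)((imm >> 16) & 0xFFu);
}
```

If the file is built as ARM code (the compile line shown earlier has no -mthumb, which suggests it is), "swi 6" is seen by the BIOS as function 0, which is SoftReset rather than the divide. That would be consistent with the crash on hardware.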
#2206 - animension - Fri Jan 31, 2003 4:00 am
Ok that seemed to fix the problems. I do get both values being equal even with MAXCOUNT at one million. At one hundred thousand, it takes the hardware 55 frames to run test "A" and 151 frames to run test "B", which makes the BIOS util still slower... is there a way I can guarantee the function runs in IWRAM and is compiled for ARM using GCC?
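Both values being equal at one million does not mean the sum fits in 32 bits, only that both tests wrap the same way. A quick check that can be run on a PC (helper names made up, and assuming the 32-bit accumulator wraps in two's complement, which is what happens in practice although C does not guarantee it for signed types):

```c
#include <stdint.h>

/* Sum of i / divisor for i in [0, count), kept in 64 bits so it cannot wrap. */
int64_t true_quotient_sum(uint32_t count, uint32_t divisor) {
    int64_t sum = 0;
    for (uint32_t i = 0; i < count; i++) sum += i / divisor;
    return sum;
}

/* What a 32-bit two's-complement accumulator would hold for a
   non-negative 64-bit sum. */
int64_t wrapped_to_int32(int64_t sum) {
    int64_t low = sum & 0xFFFFFFFFLL;   /* keep the low 32 bits */
    return (low >= 0x80000000LL) ? low - 0x100000000LL : low;
}
```

For a count of 1000000 and a divisor of 7 the true sum is 71428071429, which wraps to -1586372603, the same value the "A" test printed earlier. So that figure was a correct sum that had overflowed, not a wrong divide.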
_________________
"Beer is proof that God loves us and wants us to be happy."
-- Benjamin Franklin
#2207 - Rich - Fri Jan 31, 2003 4:07 am
Code: |
#define IWRAM_CODE __attribute__ ((section (".iwram"), long_call))

IWRAM_CODE int dobtask(signed long input)
{
    int retval = 0;
    asm volatile
    (
        "mov r0,%1\n"
        "mov r1,%2\n"
        "swi 0x60000\n"
        "mov %0,r0"
        : "=r" (retval)
        : "r" (input), "r" (PLACEBO)
        : "r0", "r1", "r3" // the BIOS divide overwrites r0, r1 and r3
    );
    return retval;
}
|
You might need to put the dobtask function in a separate .c file
#2209 - animension - Fri Jan 31, 2003 4:17 am
I relocated the function to a different file and had it designated to IWRAM. Running it on hardware with MAXCOUNT at one hundred thousand, the "A" test takes 55 frames (as it did before) and the "B" test dropped to 125 frames, still over 2x slower.
Is there anything else that can be done? I'd like to discover the joy of a fast divide, but so far it's way slower than compiled C.
_________________
"Beer is proof that God loves us and wants us to be happy."
-- Benjamin Franklin
#2211 - Rich - Fri Jan 31, 2003 4:24 am
You might want to take a look at http://www.peter-teichmann.de/ahinte.html
Try putting that routine into iwram and seeing how that compares.
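For anyone wondering what such a routine has to do: the ARM7TDMI has no divide instruction, so every divide by a run-time value ends up in software somewhere. The following is NOT the routine from that page, just a plain shift-and-subtract unsigned divide in C as a rough illustration; the name udiv32 is made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Unsigned 32-bit divide, one quotient bit per loop pass.
   d must be nonzero. If rem is not NULL it receives the remainder. */
uint32_t udiv32(uint32_t n, uint32_t d, uint32_t *rem) {
    uint32_t q = 0;
    uint32_t r = 0;
    for (int bit = 31; bit >= 0; bit--) {
        /* Bring down the next bit of the dividend. */
        r = (r << 1) | ((n >> bit) & 1u);
        /* If the divisor fits, subtract it and set this quotient bit. */
        if (r >= d) {
            r -= d;
            q |= (uint32_t)1 << bit;
        }
    }
    if (rem != NULL) *rem = r;
    return q;
}
```

Hand-tuned ASM versions usually get their speed by unrolling this loop and skipping leading zero bits, which is also why the time taken can depend on the operands.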
Just out of interest, what results do you get if you run the test on an emulator?
#2213 - animension - Fri Jan 31, 2003 4:47 am
I'll try running that routine later this evening as I have to run out for a bit. To answer your question about how the bench runs in an emu, test "A" gets 18 frames and test "B" gets 96. There's a much bigger gap in the emu probably because the PC is able to divide faster than the GBA, but when running BIOS it has to emulate the BIOS ROM algorithms I imagine.
_________________
"Beer is proof that God loves us and wants us to be happy."
-- Benjamin Franklin
#2214 - Rich - Fri Jan 31, 2003 5:01 am
From the quick test that I've just done using a slightly modified version of that routine running in IWRAM, I'd say it's about twice as fast as the standard '/' routines when running on the VBA emulator.
Haven't got time tonight to dig out my GBA and do a hardware test, so I'd be interested to hear what you find out later.
Good luck :)
#2238 - col - Fri Jan 31, 2003 3:01 pm
animension wrote: |
Code: |
...
const unsigned long PLACEBO = 7;
...
int doatask(int input){
return (input / PLACEBO);
}
...
|
...The results were disappointing. I got 542 frames for the "/" operator versus 1826 frames for the ASM function. |
you are dividing by a const. gcc should optimise this to use a few shifts and adds!! check the asm output.
Change your code to use a 'volatile unsigned long' instead of a const...
(There is a section in this document about how to optimise divide by a const
http://infoeng.ee.ic.ac.uk/~gac1/Architecture/Progtech.pdf )
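To show the kind of thing the compiler can do with a constant divisor, here is a sketch of an unsigned divide by 7 done with one multiply and a few shifts and adds. Since PLACEBO is an unsigned long, "input / PLACEBO" is an unsigned divide. This is the textbook reciprocal-multiply trick; the exact instructions gcc emits may well differ, so check the asm output as suggested.

```c
#include <stdint.h>

/* Unsigned n / 7 without a divide routine.
   0x24924925 is the low 32 bits of ceil(2^35 / 7). */
uint32_t div7(uint32_t n) {
    /* High 32 bits of the 64-bit product n * 0x24924925. */
    uint32_t q = (uint32_t)(((uint64_t)n * 0x24924925u) >> 32);
    /* (n + q) / 2, written so the intermediate sum cannot overflow. */
    uint32_t t = ((n - q) >> 1) + q;
    return t >> 2;
}
```

A handful of cheap instructions like this will easily beat any general-purpose divide, BIOS or otherwise, which is why the const version of the test is not a fair comparison.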
I suggest you use a broad range of divisors and dividends for a more accurate test - a divide that is fast dividing by 7 might be VERY slow dividing by 65534.
Whereas another algorithm that looks worse using 7, might be better over a large range...
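A rough sketch of a test body along those lines, with a volatile divisor and a range of divisors. The names are made up, it is plain C so it can be tried on a PC first, and the frame counting around it would stay as in the original program.

```c
#include <stdint.h>

/* volatile forces a load of the divisor at run time, so the compiler
   cannot swap the "/" for a multiply-and-shift sequence. */
volatile uint32_t g_divisor = 7;

/* Hypothetical test body: divide every i in [0, count) by each divisor in
   [first, last] and return a checksum so the work is not optimised away.
   first must be nonzero and last must be below UINT32_MAX. */
uint32_t divide_checksum(uint32_t count, uint32_t first, uint32_t last) {
    uint32_t sum = 0;
    for (uint32_t d = first; d <= last; d++) {
        g_divisor = d;
        for (uint32_t i = 0; i < count; i++) {
            sum += i / g_divisor;
        }
    }
    return sum;
}
```

Running the same loop with the BIOS call in place of the "/" and comparing both the checksum and the frame count gives a like-for-like result.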
cheers
col
#2281 - animension - Sat Feb 01, 2003 2:43 am
Wow, that made a huge difference. The BIOS divide util was way ahead of the divide operation when the values were designated as volatile and unoptimized. Thanks a bunch!
_________________
"Beer is proof that God loves us and wants us to be happy."
-- Benjamin Franklin