#23331 - Cearn - Sat Jul 10, 2004 12:29 am
I've been scouring the forum for details about the speeds of the swi routines, but I'm having very little luck here. Does anyone know where I can find their cycle-costs in one neat little package? (preferably gift-wrapped, bow optional)
Last edited by Cearn on Sat Jul 10, 2004 11:13 am; edited 1 time in total
#23332 - Lord Graga - Sat Jul 10, 2004 12:42 am
Hmm, disassemble the BIOS and start counting! ;)
#23334 - Cearn - Sat Jul 10, 2004 12:47 am
Lord Graga wrote: |
Hmm, disassemble the BIOS and start counting! ;) |
Uhmmm ... I was kinda hoping I didn't have to. Really, really hoping.
#23335 - dagamer34 - Sat Jul 10, 2004 1:15 am
Cearn wrote: |
Lord Graga wrote: | Hmm, disassemble the BIOS and start counting! ;) |
Uhmmm ... I was kinda hoping I didn't have to. Really, really hoping. |
I'm guessing it's a complicated process. Mind explaining it to me? I don't have much to do at the moment...
_________________
Little kids and Playstation 2's don't mix. :(
#23337 - poslundc - Sat Jul 10, 2004 1:41 am
All I know is anecdotal stuff, but from what I've heard the BIOS functions tend to be algorithmically weak because they are optimized for space instead of speed, and they also do things like bounds checking that may be unnecessary for your application. Their speed is further impeded by the cost of issuing a SWI instruction, which (although I'm not entirely sure why) takes about 50 cycles or so for the dispatch and return.
Dan.
#23338 - Cearn - Sat Jul 10, 2004 1:51 am
poslundc wrote: |
All I know is anecdotal stuff, but from what I've heard the BIOS functions tend to be algorithmically weak because they are optimized for space instead of speed, and they also do things like bounds checking that may be unnecessary for your application. Their speed is further impeded by the cost of issuing a SWI instruction, which (although I'm not entirely sure why) takes about 50 cycles or so for the dispatch and return.
Dan. |
Yeah, that's about what I could make out from reading some of the posts too. I'd read that they seemed optimised for space rather than speed and had a large overhead, but I'd still like to know what their costs actually are. I know DekuTree compared CpuFastSet and fixed-source DMA; I thought some people might have taken a closer look at some of the other BIOS functions as well.
#23339 - DekuTree64 - Sat Jul 10, 2004 2:37 am
Nope, I've never seen any speed comparisons on any of the others. I am a little curious myself, but I don't really have time to do detailed testing on them. Perhaps someone could make a full SWI profiling program. I think the most parameters any of them takes is 3, so you could make a table of r0-r2 parameters, with a 4th entry that's just a flag for whether to run that SWI or not, and a swi function that you can give a number to in the 4th argument, like
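Code: |
#define FIRST_SWI_TO_TEST 6 // divide, though you could start anywhere
#define NUM_SWIS_TO_TEST 20 // or however far you want to go
#define SWI_DO 1
#define SWI_SKIP 0
const u32 table[NUM_SWIS_TO_TEST][4] = {
    {50, 7, 0, SWI_DO}, // for divide, do 50/7
    {0, 0, 0, SWI_SKIP}, // for divArm, skip it because it's only for compatibility
    {100, 0, 0, SWI_DO}, // for square root, do sqrt(100)
    // ... NUM_SWIS_TO_TEST sets of args (rows left out are zero-filled, i.e. SWI_SKIP)
};
// swi takes its number as an immediate, not a register, so each SWI needs its own instruction.
// "swi 0x06" is the Thumb form; ARM code would need "swi 0x060000".
#define SWI_CASE(n) case n: asm volatile("swi " #n : "+r"(r0), "+r"(r1), "+r"(r2) : : "r3", "memory"); break
void swiTest(u32 arg1, u32 arg2, u32 arg3, u32 number)
{
    // pin the arguments to r0-r2, where the BIOS expects them
    register u32 r0 asm("r0") = arg1;
    register u32 r1 asm("r1") = arg2;
    register u32 r2 asm("r2") = arg3;
    switch(number)
    {
    SWI_CASE(0x06); // divide
    SWI_CASE(0x08); // square root
    // ... one SWI_CASE per SWI enabled in the table
    }
}
// in your test function:
int i;
for(i = 0; i < NUM_SWIS_TO_TEST; i++)
{
    if(table[i][3])
    {
        swiTest(table[i][0], table[i][1], table[i][2], FIRST_SWI_TO_TEST + i);
    }
} |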
And set up a timer before calling swiTest in that loop, and then print it after. Call the swi a bunch of times too, to get a more accurate reading.
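A minimal sketch of the averaging step described above, written as plain C so it can be checked off-target. It assumes the count is read as a free-running 32-bit value (for instance two cascaded 16-bit GBA timers) and that you also time the same loop with the call removed. The helper names `elapsed_ticks` and `cycles_per_call` are made up for illustration, not part of any existing profiler.

```c
#include <stdint.h>

/* Ticks elapsed between two readings of a free-running 32-bit counter.
   Unsigned subtraction gives the right answer across one wraparound. */
uint32_t elapsed_ticks(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Average cost of one call in CPU cycles.
   ticks      : ticks measured around a loop of `iterations` calls
   baseline   : ticks measured around the same loop with the call removed
   prescale   : CPU cycles per timer tick (assumed 1, 64, 256 or 1024)
   Returns 0 if iterations is 0 or the baseline exceeds the measurement. */
uint32_t cycles_per_call(uint32_t ticks, uint32_t baseline,
                         uint32_t prescale, uint32_t iterations)
{
    if (iterations == 0 || baseline > ticks)
        return 0;
    uint64_t cycles = (uint64_t)(ticks - baseline) * prescale;
    return (uint32_t)(cycles / iterations);
}
```

Subtracting the empty-loop baseline keeps the loop counter and function-call cost out of the per-SWI figure; the more iterations you run, the less a stray interrupt skews the average.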
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku
#23384 - Miked0801 - Sun Jul 11, 2004 3:42 pm
Which particular calls were you looking for? I can single-step them in No$gba and let it add up the cycles for me. If you single-step, it doesn't branch off into other interrupts, and it gives you a cycle-accurate value. This is how I got the original cycle overhead values (and boy was I surprised!). It basically has to jump through 2 separate table-lookup jumps and a bunch of push/pops before it even gets to the code. Same on the way out.
#23386 - Cearn - Sun Jul 11, 2004 3:55 pm
Miked0801 wrote: |
Which particular calls were you looking for? |
Right now, 0x06 (Div) up to 0x0f (ObjAffineSet). I'd also like to know how the normal integer division (you know, when you do 42/13) measures up to the BIOS version, but that might be pushing my luck.
#23427 - Miked0801 - Mon Jul 12, 2004 5:35 pm
The divide parts I have studied in depth. I'll run some quick tests to be sure: GCC is just using the / for a divide, BIOS is calling the SWI, while ARM is a Thumb-callable ARM routine adapted from Jeff F.'s page. Note, the ARM routine calculates the modulus for free (useful in situations where you need to do x / 10, x % 10). (BTW, all cycle counts are from No$GBA's profiling features and cycle counters.)
Ok, GCC Divide of 42 / 13 gives 251 cycles
BIOS Divide returns at 109 cycles
ARM Divide came back at 79 cycles
Now, I'll choose some larger numbers:
42756 / 1083
GCC: 339 cycles
BIOS: 161 cycles
ARM: 78 cycles
One more
4275655 / 78
GCC: 507 cycles
BIOS: 291 cycles
ARM: 110 cycles
Ok, we'll worst case it:
427565555 / 3
GCC: 747 cycles
BIOS: 447 cycles
ARM: 142 cycles
As you see, the BIOS overhead eats you up on smaller divides, and on larger ones its compact nature loses to a better-written ARM routine. I've not tried the others, but could if needed.
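To put those figures side by side, here is a small host-side sketch that expresses each GCC and BIOS count as a percentage of the ARM routine's count. The cycle numbers are exactly the ones quoted above; the struct and function names are only illustrative.

```c
#include <stdio.h>

/* Cycle counts as quoted in this post (No$GBA measurements). */
struct div_sample {
    long num, den;
    int gcc, bios, arm;
};

const struct div_sample samples[] = {
    {42,        13,   251, 109,  79},
    {42756,     1083, 339, 161,  78},
    {4275655,   78,   507, 291, 110},
    {427565555, 3,    747, 447, 142},
};

/* cycles as a percentage of reference, rounded down (integer math only). */
int percent_of(int cycles, int reference)
{
    return cycles * 100 / reference;
}

void print_ratios(void)
{
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        const struct div_sample *s = &samples[i];
        printf("%ld / %ld: BIOS takes %d%% of the ARM time, GCC %d%%\n",
               s->num, s->den,
               percent_of(s->bios, s->arm), percent_of(s->gcc, s->arm));
    }
}
```

In the 42 / 13 case the BIOS call costs 30 cycles more than the ARM routine (109 vs 79); in the worst case the gap grows to 305 cycles (447 vs 142).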
Mike
#23439 - tepples - Tue Jul 13, 2004 1:05 am
Typically, the GCC divide is Thumb code in ROM. What happens if you pull out the GCC divide and put it in IWRAM?
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.
#23451 - FluBBa - Tue Jul 13, 2004 9:09 am
Quick note: the BIOS divide also returns the modulus in r1, if I'm not mistaken.
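For reference, a host-side model of the result layout, going by the commonly documented behaviour of Div (swi 0x06): quotient in r0, remainder in r1, absolute value of the quotient in r3. This is only a sketch of the register contract using ordinary C operators; the struct and function name are hypothetical, and on hardware the values come back in registers rather than in a struct.

```c
/* Hypothetical container mirroring the registers Div is documented to return. */
struct bios_div_result {
    int quotient;       /* r0: number / denom */
    int remainder;      /* r1: number % denom */
    unsigned abs_quot;  /* r3: abs(number / denom) */
};

/* Models the outputs with C's truncating division.
   The caller must avoid denom == 0 and INT_MIN / -1, which are
   undefined in C (and not meaningful inputs for this model). */
struct bios_div_result model_bios_div(int number, int denom)
{
    struct bios_div_result r;
    r.quotient  = number / denom;
    r.remainder = number % denom;
    /* negate in unsigned arithmetic so INT_MIN does not overflow */
    r.abs_quot  = r.quotient < 0 ? 0u - (unsigned)r.quotient
                                 : (unsigned)r.quotient;
    return r;
}
```

So a single call can stand in for both x / 10 and x % 10, as long as the calling wrapper actually reads r1 back.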
_________________
I probably suck, my not is a programmer.
#23471 - Miked0801 - Tue Jul 13, 2004 6:04 pm
Tepples:
If we pulled the GCC divide into IWRAM, it's still Thumb, so it wouldn't take advantage of ARM 32-bit access. I've looked at it and it's also a looped version (not unrolled), so it is losing performance there as well.
FluBBa:
Thanks. I've always called it through C code so I didn't know that :)
#23592 - batblaster - Thu Jul 15, 2004 10:24 pm
Very, very good test for the division, but can you tell me how to do a divide in ARM? Thanks a lot...
#23635 - Miked0801 - Fri Jul 16, 2004 5:47 pm
#23642 - batblaster - Fri Jul 16, 2004 9:05 pm
Many many Thanks for the division routines...
_________________
Batblaster / 7 Raven Studios Co. Ltd
------------------------------------------
#49593 - ymalik - Sun Jul 31, 2005 7:30 pm
Sorry for bumping this thread up, but I thought there wasn't any assembly instruction for division in ARM.
#49596 - tepples - Sun Jul 31, 2005 7:41 pm
There isn't. The GBA BIOS implements division in terms of repeated shifts and subtractions.
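As a rough illustration of what "repeated shifts and subtractions" means, here is a textbook restoring division in C. It is not the BIOS's actual code, just a minimal unsigned sketch of the same idea; the function name and the zero-divisor behaviour are my own assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Unsigned division by shift and subtract, one quotient bit per step.
   Writes the remainder through `rem` if it is non-NULL.
   For a zero divisor it returns 0 with remainder 0 instead of trapping. */
uint32_t shift_sub_div(uint32_t num, uint32_t den, uint32_t *rem)
{
    uint32_t quot = 0, r = 0;
    if (den != 0) {
        for (int bit = 31; bit >= 0; bit--) {
            /* bring down the next bit of the numerator */
            r = (r << 1) | ((num >> bit) & 1u);
            /* if the divisor fits, subtract it and set this quotient bit */
            if (r >= den) {
                r -= den;
                quot |= (uint32_t)1 << bit;
            }
        }
    }
    if (rem != NULL)
        *rem = r;
    return quot;
}
```

The loop always runs 32 steps here; faster routines skip leading zero bits or unroll the loop, which is where the cycle differences earlier in the thread come from. The remainder falls out for free, since it is whatever is left in `r` at the end.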