#40221 - DekuTree64 - Sat Apr 16, 2005 9:20 am
We argued, I made a tester.
Get it here, and run it on PassMe (it crashes Dualis, which apparently doesn't support ITCM, or doesn't like moving the TCMs around). It should build straight away if you run make in the main folder. The build.bat there is for me to build from MS Visual Studio, and has my specific paths in it, so it probably won't work for you.
The app just runs a series of tests and prints out the cycles they used (plus a few to start/stop the timer that counts them). If you know a little ARM assembly, it should be easy to add more tests. Just write a little function (following APCS), and branch to it the way all the others do. I marked the 2 places where you need to add things with CHANGEME, for easy searching too.
The code is copied to ITCM, which is mapped to address 0 (and mirrored up to 0x2000000). DTCM is mapped to 0x2FFC000 during the tests. No real reason for those addresses other than that I like them there, because they're both in branching range from main RAM, and easy to get to with immediates in assembly :)
The stack is put out in main RAM at 0x02300000, because of the DTCM being yanked out from under it.
So then, results. It seems to act just like I expected, with most things taking 1 cycle, or taking the same as GBA in the case of an interlock. For example, a lone ldr instruction takes 1 cycle, but an ldr followed by an add involving the loaded value takes 3 cycles.
A plain subs, bne loop takes 4 cycles each pass, same as GBA.
Main RAM waitstates are killer, taking 18 cycles to load a word, and locking the CPU during that. They're much friendlier on ldmia, taking 46 cycles to load 8 words.
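To put those main RAM numbers in perspective, here's a quick back-of-the-envelope sketch in Python (not part of the tester, and the function names are made up for illustration). It only uses the two figures measured above, and assumes the 8-register ldmia cost simply repeats for bigger blocks, which wasn't tested.

```python
# Cycle counts reported above for ARM9 loads from main RAM.
LDR_CYCLES_PER_WORD = 18    # one ldr loading one word
LDMIA_CYCLES_8_WORDS = 46   # one ldmia loading 8 registers


def cycles_with_ldr(words):
    """Estimated cycles to load `words` words one ldr at a time."""
    return words * LDR_CYCLES_PER_WORD


def cycles_with_ldmia(words):
    """Estimated cycles to load `words` words using 8-register ldmia.

    Assumption: only the 8-register case was measured, so `words`
    must be a multiple of 8 and each ldmia is taken to cost the same.
    """
    if words % 8 != 0:
        raise ValueError("only the 8-register ldmia case was measured")
    return (words // 8) * LDMIA_CYCLES_8_WORDS


if __name__ == "__main__":
    print(cycles_with_ldr(8))        # 144
    print(cycles_with_ldmia(8))      # 46
    print(LDMIA_CYCLES_8_WORDS / 8)  # 5.75 cycles per word
```

So by these numbers ldmia works out to 5.75 cycles per word against 18 for single loads, a bit over 3 times faster.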
An ldr from DTCM takes 1 cycle normally. 2 consecutive loads take 2 cycles. An ldmia with 8 regs takes 8 cycles. Awesomeness.
DSP instructions (I only checked smulxy and qadd) take 1 cycle, or 2 cycles if the next instruction uses the result. Bummer, because it would be really nice to just smulxy, qdadd, smulxy, qdadd, etc. to do sound filtering and stuff. Oh well, if you use smlaxy repeatedly and just make sure you won't overflow, it's only 1 cycle each, and won't interlock (not tested, but the docs say).
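A rough sketch of what that interlock costs in a filter loop, again in Python just to do the arithmetic (hypothetical function names, not from the tester). It assumes qdadd behaves like the qadd that was measured, that the smlaxy figure from the docs holds, and it ignores loads and loop overhead entirely.

```python
def cycles_smul_qdadd(taps):
    """Estimated cycles for `taps` smulxy/qdadd pairs.

    Each smulxy result is used by the qdadd right after it, so the
    smulxy costs 2 cycles. The qdadd result is not used by the next
    smulxy, so it costs 1 cycle (assumed same as the measured qadd).
    """
    return taps * (2 + 1)


def cycles_smla(taps):
    """Estimated cycles for `taps` back-to-back smlaxy instructions.

    Assumption: 1 cycle each with no interlock, as the docs say
    (not tested).
    """
    return taps * 1


if __name__ == "__main__":
    for taps in (8, 16, 32):
        print(taps, cycles_smul_qdadd(taps), cycles_smla(taps))
```

If those assumptions hold, the saturating version costs about 3 times as many multiply cycles per tap as the plain smlaxy chain.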
So, I think the only thing that will ever be slower than the ARM7 will be mov/ldr to the PC, which I think was 1 more cycle (forgot to try it, and too sleepy now). Since branch is the normal way to get around and it's still the same speed, ARM9 is better in almost every way.
_________________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku