#168822 - Tyler24 - Wed May 27, 2009 9:51 pm
I have a text routine that assembled into less than a kilobyte or so, but it is incredibly slow. It takes a couple of solid seconds to spew out 64 lines of 20 8x8 characters.
I think it's because the ARM7 and ARM9 processors are both running from main memory, and conflicting like no other. Therefore, I'd like to make use of the cache.
I know that you have to use the co-processor to enable the caches and configure the base for DTCM, but how does one go about moving a hunk of instructions to ITCM? Just copy a block of instructions to it, and move the program counter to the beginning of the block of instructions?
#168824 - Dwedit - Wed May 27, 2009 10:04 pm
It sounds like you're trying to do everything by hand from scratch. Not a very good idea. At the least, use devkitpro's crt0.s file and linkscript to help you. It sets up the cache and TCMs for you, then copies code to its eventual target address.
Then you just declare code to be in ITCM by adding a .section to your .s file:
.section .itcm, "ax", %progbits
Then the linkscript takes care of it.
Reminder: ITCM and DTCM are not the cache, they are separate fast memory areas. There's an old tutorial out there that has them confused with the cache.
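For C code, the same placement can be requested with a section attribute instead of a `.section` line. A minimal sketch, assuming a devkitARM-style link script that collects a section named `.itcm`; the macro name `ITCM_SKETCH` and the function `draw_glyph` are made up for illustration, and the attribute is compiled out on non-ARM hosts so the routine can also be tried on a PC:

```c
#include <stdint.h>

/* Hypothetical macro: on an ARM target, ask for the function to be placed
   in the section named .itcm (assumes the link script maps that section
   to ITCM). On any other host the macro expands to nothing. */
#if defined(__arm__)
#define ITCM_SKETCH __attribute__((section(".itcm")))
#else
#define ITCM_SKETCH
#endif

/* Draw one 8x8, 1-bit-per-pixel glyph into a 16-bit framebuffer.
   fb points at the glyph's top-left pixel; pitch is pixels per scanline. */
ITCM_SKETCH void draw_glyph(uint16_t *fb, int pitch,
                            const uint8_t glyph[8], uint16_t color)
{
    for (int row = 0; row < 8; row++) {
        uint8_t bits = glyph[row];
        for (int col = 0; col < 8; col++) {
            if (bits & (0x80u >> col))   /* leftmost pixel is the top bit */
                fb[row * pitch + col] = color;
        }
    }
}
```

Whether the call site needs anything extra to reach the routine depends on the toolchain and link script, so treat this only as the shape of the idea.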
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."
#168825 - kusma - Wed May 27, 2009 10:06 pm
Tyler24 wrote: |
...but how does one go about moving a hunk of instructions to ITCM? Just copy a block of instructions to it, and move the program counter to the beginning of the block of instructions? |
That's the general idea, but keep in mind that the code either has to be compiled (or assembled) as position independent, or linked to the specific address you're putting it at. It sounds to me like you might like to use something similar to the IWRAM-overlay stuff that devkitARM provides for GBA.
#168826 - Tyler24 - Wed May 27, 2009 11:53 pm
Dwedit wrote: |
It sounds like you're trying to do everything by hand from scratch. Not a very good idea. At the least, use devkitpro's crt0.s file and linkscript to help you. It sets up the cache and TCMs for you, then copies code to its eventual target address. |
I'm not trying to reinvent the wheel, just completely understand it :)
I understood the entirety of the crt0 script, except for one small part. I understand that the Instruction TCM is mirrorable to 0x01000000, but what is the advantage of doing that?
Also, EDIT: DTCM/ITCM/caches work :D. I just copied 100 MB of memory to the stack with everything enabled, and with everything disabled. When I enabled everything, it copied at about 12.93 MB/s. As expected, when I disabled everything it copied at about 8.58 MB/s. Not exactly the magnitude I was hoping for, but that should be amplified a lot when I move the actual program into DTCM and run it from there.
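To put the two figures side by side, here is a small sketch of the arithmetic. The function names are illustrative only; the numbers are the ones quoted above.

```c
#include <stdio.h>

/* Ratio of two throughputs measured in the same unit. */
double speedup(double fast_mb_per_s, double slow_mb_per_s)
{
    return fast_mb_per_s / slow_mb_per_s;
}

/* Seconds needed to move total_mb at the given rate. */
double copy_seconds(double total_mb, double mb_per_s)
{
    return total_mb / mb_per_s;
}

/* Print the comparison for the two measured rates. */
void print_copy_comparison(void)
{
    printf("speedup: %.2fx\n", speedup(12.93, 8.58));
    printf("100 MB, everything enabled:  %.1f s\n", copy_seconds(100.0, 12.93));
    printf("100 MB, everything disabled: %.1f s\n", copy_seconds(100.0, 8.58));
}
```

That works out to roughly a 1.5x gain for the copy test.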
#168829 - eKid - Thu May 28, 2009 5:34 am
Tyler24 wrote: |
But that should be amplified a lot when I move the actual program into DTCM and run it from there. |
ITCM you mean. :)
There's not really a lot of time to be saved by moving code into ITCM, the cache does a pretty good job at optimizing most things. (once a certain loop runs once it is loaded into the cache and it executes as fast as in ITCM)
#168839 - Miked0801 - Thu May 28, 2009 6:12 pm
As long as an interrupt doesn't come along and thrash your cache lines :)
ITCM has its uses. I've had 2 games that ran over 95% of their game code in ITCM due to clever profiling and segmentation. Having so little code touch the normal cache helps out everywhere else as well.
#168840 - Tyler24 - Thu May 28, 2009 6:41 pm
eKid wrote: |
Tyler24 wrote: | But that should be amplified a lot when I move the actual program into DTCM and run it from there. |
ITCM you mean. :)
There's not really a lot of time to be saved by moving code into ITCM, the cache does a pretty good job at optimizing most things. (once a certain loop runs once it is loaded into the cache and it executes as fast as in ITCM) |
In that case, something isn't clicking on my end. :(
I have a text routine that takes a zero-terminated ASCII string and writes the characters (8x8) to the framebuffer. With the ITCM/DTCM/caches disabled, it writes just as slowly as it does with the ITCM/DTCM/caches enabled, around 64 lines of text in about 5 seconds or so. And I optimized the heck out of the routine too... it's as small as it's going to get.
Miked0801 wrote: |
As long as an interrupt doesn't come along and thrash your cache lines :)
ITCM has its uses. I've had 2 games that ran over 95% of their game code in ITCM due to clever profiling and segmentation. Having so little code touch the normal cache helps out everywhere else as well. |
That's what I'm trying to do =/. I figure 32k of code should be more than enough for most of my needs.
I'm currently having trouble figuring out how to actually go about calling the code from the ITCM. Don't really understand Dwedit's notation...
Ideally, I'd like to be able to dynamically load code to the ITCM (during a loading screen or something), and call it when I need it.
I'm trying to look on Google and at the link script from devkitARM to figure out how to assemble code as position independent, but nothing has worked so far: the emulator just stalls, and the real hardware never actually runs the code. I tried assembling with arm-eabi-as --apic ... no dice. I'm currently using the code from ds_arm9_crt0.s that copies code into the ITCM, so it should be in the ITCM. It's just that whenever I call it with something like (mov r0, #0 / bx r0), the emulator stalls and the real hardware doesn't run it.
EDIT: ITCM is MIRRORED to 0x01000000, but it should still run code at 0x0, no?
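Mirroring here means the 32 KB block repeats across whatever address range the ITCM region is sized to cover, so an address only matters modulo 32 KB. A minimal sketch of that arithmetic, assuming the region has been sized through the coprocessor to reach from 0x0 up past 0x01000000; `itcm_offset` and `itcm_same_byte` are illustrative names, not part of any library.

```c
#include <stdint.h>

#define ITCM_SIZE 0x8000u  /* 32 KB of physical ITCM */

/* Offset inside the physical ITCM that a mirrored address refers to.
   Assumes addr lies inside the range the ITCM region is configured to cover. */
uint32_t itcm_offset(uint32_t addr)
{
    return addr & (ITCM_SIZE - 1u);
}

/* Nonzero when two addresses alias the same ITCM byte. */
int itcm_same_byte(uint32_t a, uint32_t b)
{
    return itcm_offset(a) == itcm_offset(b);
}
```

Under that assumption, 0x00000000 and 0x01000000 name the same first word of ITCM, so a branch to either should land on the same instruction.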
#168843 - Tyler24 - Fri May 29, 2009 3:22 am
Got it figured out. Code must have been flaky before.
ANYWHO, the ITCM code runs several, several times faster than the main memory code. I'm a little confused though, because everyone here said that small loops in the main memory should find their way into the cache, and run just about as fast as the ITCM code, no? Well, that doesn't appear to be the case for me. I've triple checked via text outputs on real hardware - caches ARE enabled...
Why would something like this run MUCH slower (maybe 10x, or more) in main memory? Again, tests were run on actual hardware...
Although I do not flush the cache before entering this loop, interrupts are disabled... I'm scratching my head.
Code: |
mov r0, #0xFF
add r0, r0, #0xFF00
add r0, r0, #0xFF0000
_someloop:
subs r0, r0, #1
bne _someloop |
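For a sense of scale, the loop above counts down from 0xFFFFFF. A rough sketch of the expected run time follows, where the 4 cycles per iteration (1 for the subs, 3 for the taken branch) and the roughly 67 MHz clock are assumptions about the ARM9 fetching from zero-wait memory, not measurements; the function names are illustrative.

```c
#include <stdint.h>

/* Value left in r0 by the mov/add/add sequence before the loop. */
uint32_t loop_count(void)
{
    uint32_t r0 = 0xFF;
    r0 += 0xFF00;
    r0 += 0xFF0000;
    return r0;
}

/* Rough run time in seconds for a given cost per iteration and clock rate. */
double loop_seconds(uint32_t iterations, double cycles_per_iter, double clock_hz)
{
    return (double)iterations * cycles_per_iter / clock_hz;
}
```

With those assumed figures, `loop_seconds(loop_count(), 4.0, 67.0e6)` comes out at about one second, which gives a baseline to compare the ITCM and main-memory timings against.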
#168847 - FluBBa - Fri May 29, 2009 11:49 am
In what regions is your instruction cache enabled? (might as well tell us where your data cache is enabled).
_________________
I probably suck, my not is a programmer.
#168849 - Miked0801 - Fri May 29, 2009 4:57 pm
That code, even if it happened across 2 cache lines due to location, should still cache after 1 iteration and run at speed. Dunno what to tell you. Are you calling a function from within your loop or is it as pure as you are showing?
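Whether a short loop lands in one cache line or two depends only on its start address and length. A small sketch of that check, assuming 32-byte cache lines; the addresses used to try it out are arbitrary examples and the function name is illustrative.

```c
#include <stdint.h>

#define CACHE_LINE_BYTES 32u  /* assumed line size (8 words) */

/* Number of cache lines touched by size bytes starting at addr (size > 0). */
uint32_t cache_lines_touched(uint32_t addr, uint32_t size)
{
    uint32_t first = addr / CACHE_LINE_BYTES;
    uint32_t last  = (addr + size - 1u) / CACHE_LINE_BYTES;
    return last - first + 1u;
}
```

A two-instruction loop is 8 bytes: starting at an offset of 0x18 within a line it fits in one line, while starting at 0x1C it straddles two. Either way it is only one or two line fills before the loop runs from cache.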
#168850 - Tyler24 - Fri May 29, 2009 5:18 pm
FluBBa wrote: |
In what regions is your instruction cache enabled? (might as well tell us where your data cache is enabled). |
I looked over my code after you said this, and right before I hit the submit button I realized that I was using hexadecimal (0x01000010) to enable the cacheability bits, when I should be using binary (0b01000010). Now it works like it should... thanks for the help.
So there's 8k of cache in the ARM946E-S?
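The hex-versus-binary mix-up above is easy to see by listing which bits each value sets. A minimal sketch, assuming the cacheability register carries one bit per protection region (0 to 7) in its low 8 bits; the helper names are illustrative, and the binary value is written as 0x42 because `0b` literals are not available in every C compiler.

```c
#include <stdint.h>

/* Bitmask of protection regions (0-7) selected by a cacheability value.
   Assumes one bit per region in the low 8 bits; higher bits are ignored here. */
uint8_t cacheable_regions(uint32_t value)
{
    return (uint8_t)(value & 0xFFu);
}

/* Nonzero when region n (0-7) is marked cacheable by value. */
int region_is_cacheable(uint32_t value, unsigned n)
{
    return (cacheable_regions(value) >> n) & 1u;
}
```

Under that assumption, 0x01000010 marks only region 4 (its other set bit, bit 24, falls outside the region bits), while 0b01000010 (0x42) marks regions 1 and 6.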