gbadev.org forum archive

Looking over my projects, I've wondered how I could gain some more speed from them, and for some reason I ended up looking in the build folder and found something really odd. The object file that contained all the code I wanted in ITCM was, to my surprise, almost 64KB. Considering that ITCM can only hold 32KB of code, this led me to remove everything that put code from this file into ITCM, and it made not one single difference in how my program ran (behavior-wise or speed-wise). It felt like I had been running this code from main RAM the whole time.

A silly question, but because of this, am I to assume that if code is too large to fit in ITCM, it will just not put it there (even a portion of the code), and put it in main RAM? I really hope so, because that means that I can adjust this section of my code so everything that definitely needs to be there can be there, and everything else can be moved off to main RAM.
_________________
DS - It's all about DiscoStew

You could always run arm-eabi-nm on your arm[#].elf to see whether a function is being put in ITCM or in EWRAM.
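
A rough sketch of how the addresses in that nm output could be sorted into ITCM or main RAM. The address ranges are assumptions based on the usual ARM9 memory map (ITCM and its mirrors below 0x02000000, 4MB of main RAM starting at 0x02000000), and the function names are made up for illustration; check the ranges against your own linker script.

```c
#include <stdint.h>
#include <stdlib.h>

typedef enum { REGION_ITCM, REGION_MAIN_RAM, REGION_OTHER } region;

/* ASSUMPTION: ITCM (and its mirrors) sits below 0x02000000 and main RAM
 * covers 0x02000000..0x023FFFFF. Adjust to match your linker script. */
static region classify_address(uint32_t addr)
{
    if (addr < 0x02000000u)
        return REGION_ITCM;
    if (addr < 0x02400000u)
        return REGION_MAIN_RAM;
    return REGION_OTHER;
}

/* Classify one line of nm output such as "01000120 T my_function".
 * Lines with no leading address (undefined symbols) come back as
 * REGION_OTHER. */
static region classify_nm_line(const char *line)
{
    char *end = NULL;
    unsigned long addr = strtoul(line, &end, 16);

    if (end == line)            /* no hex digits were parsed */
        return REGION_OTHER;
    return classify_address((uint32_t)addr);
}
```

In practice, simply reading the first column of the nm output by eye is enough; the helper only spells out the rule being applied.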

EDIT: generalized
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.

Last edited by tepples on Tue Nov 27, 2007 10:46 pm; edited 1 time in total

It's actually part of my arm9 code, but that shouldn't make a difference. I'll look into it when I get the chance today. thx.
_________________
DS - It's all about DiscoStew

I've seen the linker error when I have too much code in ITCM. It has never just silently moved code from ITCM to ram for me.
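
For reference, a minimal sketch of how a function is usually marked for ITCM with libnds. ITCM_CODE is the libnds macro (roughly a section(".itcm") attribute plus long_call); the empty fallback definition is only there so the snippet also compiles on a PC, and sum_words is a hypothetical function, not anything from this thread.

```c
#include <stdint.h>

/* On a real DS build this macro comes from <nds.h>. The fallback below is
 * an assumption for host builds only and places nothing in ITCM. */
#ifndef ITCM_CODE
#define ITCM_CODE
#endif

/* Hypothetical hot routine. The attribute goes on the declaration so the
 * linker script can collect the function into the ITCM section. */
ITCM_CODE uint32_t sum_words(const uint32_t *src, uint32_t count);

uint32_t sum_words(const uint32_t *src, uint32_t count)
{
    uint32_t total = 0;
    for (uint32_t i = 0; i < count; ++i)
        total += src[i];
    return total;
}
```

If the functions tagged this way add up to more than ITCM can hold, the link should fail rather than spill anything into main RAM.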

Object files are always bigger than the code they contain, due to link symbols and the like. If the code in your main loop is small enough, it will all fit into the instruction cache, which gives the same speed as ITCM. Alternatively, if you're doing many memory accesses over a large address range and not much processing, then data access time may be the limiting factor for the execution speed.
_________________
http://chishm.drunkencoders.com
http://dldi.drunkencoders.com

Well drat! After checking with tepples' suggestion, it turns out that it does fit into ITCM, as it only takes up about 8.5KB of space. If I'm correct, even if I don't place it in ITCM specifically, it will get cached and run just as fast, which is why I didn't see any change when I took it out of ITCM. But isn't the instruction cache only about 8KB big? How would my function fit in there if it is half a KB bigger?

Anyways there is a lot of accessing of the main RAM in my code, and although I've tried putting information into it as sequentially as possible, it still seems a little slow. Considering there is the cache for both instruction and data (8KB and 4KB), maybe I can put the data cache to some use, unless it is already in use. I just am not sure how to do it.
_________________
DS - It's all about DiscoStew

I think the advantage to ITCM was from the GBA days, where there was a lack of cache. Having this should make a big difference when you lack cache, but will likely not be too noticeable if it is present. So I don't think there's a mecha streisand benefit for the DS.

However,
- the instruction cache is small, and by using ITCM you can be sure that there's no wait whilst it "loads it into the cache" since it's already in the fast memory
- it won't pollute the real caches (right?) so you effectively get more cache
- it's 32KB of extra free RAM, which can always be handy
_________________
Big thanks to everyone who donated for Quake2

simonjhall wrote:

Having [a block of guaranteed fast instruction memory] should make a big difference when you lack cache, but will likely not be too noticeable if it is present. So I don't think there's a mecha streisand benefit for the DS.

Good point. So are you Robert Smith, or are you part of the disease?

Quote:

However,
- the instruction cache is small, and by using ITCM you can be sure that there's no wait whilst it "loads it into the cache" since it's already in the fast memory

Are you talking about compulsory misses (first load which takes less than a frame) or capacity misses (subroutine gets evicted by another subroutine)?

Quote:

- it won't pollute the real caches (right?) so you effectively get more cache

Has anyone done benchmarks to verify this?
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.

The cache is set not to do anything for code/data in either DTCM or ITCM. This is all set up for you by the crt0.
_________________
http://chishm.drunkencoders.com
http://dldi.drunkencoders.com

tepples wrote:

Good point. So are you Robert Smith, or are you part of the disease?

Robert Smith walked right past me at work once! My boss was bouncing around in his chair asking me "did you see him? That was ROBERT SMITH!" Who?

Quote:

Are you talking about compulsory misses (first load which takes less than a frame) or capacity misses (subroutine gets evicted by another subroutine)?

Both...

Quote:

Has anyone done benchmarks to verify this?

I did some dozy tests to see how much better code runs from ITCM, but the performance of the code was quite dependent on the locality of the data so there wasn't a huge difference. However I've done a load of cycle timings on all the different types of RAM and the speed of DTCM-style memory is comparable to that of cached main memory, just without the start up cost.
_________________
Big thanks to everyone who donated for Quake2

To get the best performance from the ITCM you have to take care with the code that you put in it. Quoting the ARM946E-S tech ref:

Quote:

5.5.2 Data accesses to Instruction TCM
Data accesses to the Instruction TCM can either be reads or writes. Data access to the Instruction TCM can introduce stall cycles to the ARM946E-S processor.

5.5.3 Stall cycles for Instruction TCM accesses
Simultaneous instruction fetch and data reads of the Instruction TCM incur a single stall cycle. This is because the Instruction TCM is a single port memory, which can only return a single word of memory per clock cycle. This is shown in Figure 5-3 on page 5-10.

GCC can generate ldr instructions that load data from within the machine code. It does this to load a register with some immediate value, for example the base address for further memory accesses or some mask/comparison value.
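
As a small illustration of the kind of code that leads to such a load (the mask value and function name are made up, and the exact instructions depend on the compiler version and flags):

```c
#include <stdint.h>

/* Hypothetical mask. This value cannot be encoded as a single ARM
 * immediate operand, so when targeting the ARM9 GCC will typically place
 * it in a literal pool next to the function and fetch it with a
 * PC-relative ldr, something like:
 *
 *     ldr  r3, .L3        @ data read from within the code
 *     and  r0, r0, r3
 *     bx   lr
 * .L3:
 *     .word 0x12345678
 *
 * If the function lives in ITCM, that ldr is a data access to ITCM. */
#define STATUS_MASK 0x12345678u

uint32_t masked_status(uint32_t raw)
{
    return raw & STATUS_MASK;
}
```

Disassembling the object file (for example with objdump) is the way to see what your compiler actually produced for a given function.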

The size of the code does not indicate how much performance it will gain from the presence of a cache. It is the structure of, and the path through, the code that matters.
For example, a loop (smaller than the cache) that is executed a number of times will benefit from the cache, as it will be loaded into the cache on the first iteration and then happily sit there for each subsequent iteration.
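
A toy model of that effect, as a sketch only. It assumes an 8KB, 4-way set-associative instruction cache with 32-byte lines and round-robin replacement, and it assumes the loop body is fetched straight through from start to end on every iteration, which real code rarely does. count_loop_misses is a made-up helper, not part of any library.

```c
#include <stdint.h>
#include <string.h>

/* ASSUMED cache geometry: 8KB, 4 ways, 32-byte lines -> 64 sets. */
#define LINE_BYTES  32u
#define NUM_WAYS    4u
#define CACHE_BYTES (8u * 1024u)
#define NUM_SETS    (CACHE_BYTES / (LINE_BYTES * NUM_WAYS))

typedef struct {
    uint32_t tag[NUM_SETS][NUM_WAYS];
    uint8_t  valid[NUM_SETS][NUM_WAYS];
    uint8_t  next_victim[NUM_SETS];   /* round-robin replacement pointer */
} icache_model;

/* Count line fills for a loop of loop_bytes starting at base (which
 * should be line-aligned), fetched sequentially for the given number of
 * iterations. */
unsigned long count_loop_misses(uint32_t base, uint32_t loop_bytes,
                                unsigned iterations)
{
    icache_model c;
    unsigned long misses = 0;

    memset(&c, 0, sizeof c);
    for (unsigned it = 0; it < iterations; ++it) {
        for (uint32_t off = 0; off < loop_bytes; off += LINE_BYTES) {
            uint32_t line = (base + off) / LINE_BYTES;
            uint32_t set  = line % NUM_SETS;
            uint32_t tag  = line / NUM_SETS;
            int hit = 0;

            for (unsigned w = 0; w < NUM_WAYS; ++w) {
                if (c.valid[set][w] && c.tag[set][w] == tag) {
                    hit = 1;
                    break;
                }
            }
            if (!hit) {
                /* Fill the line, evicting the round-robin victim. */
                unsigned w = c.next_victim[set];
                c.tag[set][w] = tag;
                c.valid[set][w] = 1;
                c.next_victim[set] = (uint8_t)((w + 1u) % NUM_WAYS);
                ++misses;
            }
        }
    }
    return misses;
}
```

Under these assumptions an 8KB loop only misses on its first pass (256 line fills in total), while an 8.5KB loop keeps missing in the 16 sets that now have to hold five lines each, about 80 of its 272 line fetches per pass; the other 192 still hit. So code slightly larger than the cache does not lose the whole benefit, and a routine whose hot path touches only part of its 8.5KB may fit comfortably.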

Generally, if you want something to run faster always start at the larger scale and make sure you are doing things as efficiently as you know how. Poking around with the ITCM and cache may give benefits but requires that you know the path through your code at the machine code level and the operation of the target CPU.

DS development > ITCM, and code meant to go there that is too big

#146021 - DiscoStew - Tue Nov 27, 2007 8:27 pm

#146023 - tepples - Tue Nov 27, 2007 8:33 pm

#146038 - DiscoStew - Tue Nov 27, 2007 10:45 pm

#146047 - ingramb - Wed Nov 28, 2007 12:05 am

#146055 - chishm - Wed Nov 28, 2007 2:29 am

#146162 - DiscoStew - Thu Nov 29, 2007 9:52 pm

#146174 - simonjhall - Fri Nov 30, 2007 1:05 am

#146175 - tepples - Fri Nov 30, 2007 2:27 am

#146179 - chishm - Fri Nov 30, 2007 2:44 am

#146197 - simonjhall - Fri Nov 30, 2007 10:52 am

#146198 - masscat - Fri Nov 30, 2007 11:09 am