gbadev.org forum archive

I've finally gotten off my arse and have written the profiler I needed to improve the performance of Q2. If y'all remember I wrote a function-level instrumenting profiler nearly two years ago which, after being run, spits out a list of all the functions called, how many times they were run, how long each call took etc. Well the problem I had was that although it would often clearly point me to the slow functions I often had no idea whereabouts within the function the bottlenecks were.
And here's the tool that fixes that :-)

This new tool runs live on your DS and returns per-instruction information about how many times any given instruction was called, how many stall cycles it incurred, how the instruction and data caches perform and dumps out a fat spreadsheet telling you all about how your program performed.

Still needs a lot of work, hence me not releasing it *right now* plus I've got a few questions for you guys first!
- I'm reading the ARM9 manual and although there's loads of juicy information about data hazards and interlocks (which are essential to figuring out how long a piece of code takes to run) there's no mention of the condition/status fields in the CPSR. Can the CPSR cause RAW/WAW/WAR hazards? (I'm thinking of a MULS followed by an ADDS)
- the memory timings on gbatek look quite different from what I get, from all memory spaces (apart from cached access). Anyone got any way of verifying the information that's on there?
- gbatek also mentions bus wait states and how "it might go down to 1.6MHz" or something. Err...what?

Any help/perls of wisdom would be appreciated.
Simon
_________________
Big thanks to everyone who donated for Quake2

The memory timings on gbatek are mostly for arm7/gba, the arm9 has a bunch of different timings for its instructions. The cycle timings on there also don't pay much attention to the arm9's separate code/data buses.

About that 1.4mhz thing, I'm not really sure what he meant.. I'm not really sure how the cache fetches instructions either, but if it has to load 8 words [cache miss] with the 9 cycle main memory [code] access time, then I can imagine it being a pretty heavy load (?? 33mhz cycles) when the cache misses.

EDIT:
I think I see it now: (2^25) / 24 [cycles] = ~1.4mhz
24 cycles = 10 + 2*7, (1N + 7S data cycles [8 word cache load])

Oh right, yeah, that's what I would have thought too. The web page makes it sound a lot worse than what's really going on I guess! The bus is working as expected.
And re: the timings he gets versus my timings, I think I'll have to have a play a bit until I'm happy one of us is right!
_________________
Big thanks to everyone who donated for Quake2

Will your profiler work for arm7 programs too? :)

Haha, maybe! It'll execute them 100% correctly (as it's done by the cpu itself) but at the moment all the pipelining is done with the details in the ARM9 manual. Some instructions may need a wiggle (eg multiply) when going from ARM9->ARM7...
ARM9 thumb probably comes first though as I can't yet handle functions that branch to thumb.
_________________
Big thanks to everyone who donated for Quake2

How the heck...

I guess an interrupt could keep glancing at r15 to see which exact instruction you're on... but how are you telling whether that particular instruction caused a stall or a cache miss??

Haha, that's not a bad idea actually. You could stick a timer on the fastest setting possible and get it to cause interrupts and then rock it that way. Sounds a bit hairy, though!

Nah the one I'm cooking is more like an emulator. Sequences of boring instructions (eg data processing ones that's don't use r15) are run live on the machine with those blocks profiled by disassembling them. Only juicy instructions such as loads/stores, branches and instructions that read from or write to r15 are emulated. Conditional execution of instructions requires access to the status bits so they have to be run on their own.
This obviously doesn't run in real-time :-)
_________________
Big thanks to everyone who donated for Quake2

Oh, fancy! =)

Haha!
Sounds very interesting! (And fun to program. ;^D)
Yeah, it doesn't have to run in realtime, as long as you get the numbers it's ok. :^D ... Given you have a demo playbeack system in your game. ':^/
_________________
July 5th 08: "Volumetric Shadow Demo" 1.6.0 (final) source released
June 5th 08: "Zombie NDS" WIP released!
It's all on my page, just click WWW below.

This sounds like it will be super useful. I'm sure you will make a lot of developers (even official ones) very happy when it's released.

simonjhall wrote:

Nah the one I'm cooking is more like an emulator. Sequences of boring instructions (eg data processing ones that's don't use r15) are run live on the machine with those blocks profiled by disassembling them. Only juicy instructions such as loads/stores, branches and instructions that read from or write to r15 are emulated.

And with ARM's "constant pool" architecture, programs are LDRing through r15 all the time. It's not like MIPS, where a program will typically load a 16-bit constant << 16 and then OR in another 16-bit constant. How do you plan to handle that?
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.

simon - have you taken a look at the microsoft giano source code? it includes code to emulate arm processors. not sure if it would be helpful...

elhobbs wrote:

simon - have you taken a look at the microsoft giano source code? it includes code to emulate arm processors. not sure if it would be helpful...

That looks pretty cool actually, although I'm not really too sure who it's meant for :-) (I read about it last night, but I'm kinda ill atm and was probably out of it at the time...)

@tepples - at first I was replacing just pc-relative load/stores, but now I emulated ALL of them as you can never be sure of what address a program is trying to load until you decode the operands and read from the register file etc.
eg you could do:

Code:

mov lr, pc
ldr lr, #32

...and that'd be pc-relative despite it going through r14.
_________________
Big thanks to everyone who donated for Quake2

Just some initial findings from playing around with this thing, as I've just got instruction and data caching into the works. I've got this loop where I allocate 123 bytes of memory at a time until I run out.

Code:

do
{
//    unsigned int output = execute_function(func_ptr, args, &icache, &dcache);
   unsigned int output = (unsigned int)malloc(args[0]);

   total_insts += instructions_run;
   total_clocks += clocks_elapsed;

   ptr = (void *)output;
   runs++;

   if ((runs % 100) == 0)
      printf("%d (%d %d)\n", runs, total_insts, total_clocks);
} while (ptr != NULL);

(too lazy to remove comments)
Now despite malloc actually being a small function with a suprisingly-good run time (just 86 instructions/call on average) yet it makes terrible use of the cache!

Here's some useless numbers:
- icache: 4k total, 32b/line, 4-way set assoc = 79% hit rate
- dcache: 4k total, 32b/line, 4-way set assoc = 53% hit rate *

- icache: 4k total, 32b/line, 8-way set assoc = 85% hit rate
- dcache: 4k total, 32b/line, 8-way set assoc = 90% hit rate

- icache: 4k total, 32b/line, fully assoc = 99% hit rate
- dcache: 4k total, 32b/line, fully assoc = 91% hit rate

- icache: 8k total, 32b/line, 4-way set assoc = 84% hit rate *
- dcache: 8k total, 32b/line, 4-way set assoc = 63% hit rate

- icache: 8k total, 32b/line, 8-way set assoc = 85% hit rate
- dcache: 8k total, 32b/line, 8-way set assoc = 90% hit rate

- icache: 8k total, 32b/line, fully assoc = 99% hit rate
- dcache: 8k total, 32b/line, fully assoc = 91% hit rate

* means this is what gbatek says the ARM9 has.
Although this is quite a crappy example, it does show some interesting results. I think the low dcache hit rate is from conflict misses - malloc must write some magic number per-allocation at a place which hashes to the same address or something. What I thought was bizarre was that although going from 4->8-way set associativity on the icache doesn't really do a lot, going from 8->16 is a HUGE jump! (16 not shown btw)
Also, with this example, cache capacity doesn't really affect the results too much which is not what I had expected...

Btw changing the line size doesn't really do too much.

Time to run this on Quake's renderer :-)
_________________
Big thanks to everyone who donated for Quake2

Even if it's just a toy benchmark, that's surprising about the number of pages being so important compared to the cache size -- I had always assumed the opposite.

Shall we optimize by arranging objects to fall on pseudorandom addresses to reduce page clashes?? I'll start padding struct sizes to the next prime number! ;)

Slight update on this - I've now fixed a bunch of tiny bugs which allow me to profile huge amounts of (ARM) code without error and whatever I try it on gives me the same kind of result: assuming all memory accesses take just one cycle (causing a one-cycle interlock if the result is used in the next instruction) most code has a CPI of about ~1.2, which is pretty damn good!

However as soon as you stick the instruction and data cache into the mix, performance really starts to go south! I had never thought of it before but if you run a straight-line piece of code for the first time, you'll get an icache miss every eight instructions (in ARM mode) which is terrible! I guess the message is: get your frequently-run code into itcm, and stick you main-memory code in THUMB mode and make it short and loopy. Pretty obvious, but it's nice to see numbers confirming this.
Quake's renderer gives about ~80% hit rate in the icache.

Also the data cache performs really badly in Quake's renderer (despite it being the fastest code in the game), with like a ~50% hit rate. More testing is needed here...hmm...

Anyone got any examples of whether or not the PLD opcode works or not on the ARM9?
_________________
Big thanks to everyone who donated for Quake2

But without the instruction cache, code in mainram would be much slower (since arm9 doesn't support sequential opcode fetches), that's about +16/+7 (arm/thumb) cycles each instruction.

I used the PLD opcode once, I'm not really sure how it works, but sticking it in semi random places sped up my code.. :P (but putting it in some other reasonable places caused slow downs :\).

I'm fairly certain that PLD does nothing on the DS.

I would think gbatek would leave a note in the PLD description if it didn't exist. :\

I'm still having a hard time figuring out exactly how the memory interface on the ARM9 works, as it's *definately* not as clear cut as gbatek seems to make it! I have a feeling that although the caches are tiny, I don't think that the hardware in there is necesarily basic. For instance, due to the behaviour of load-multiple and possibly instuction fetching I think that the cpu employs an early restart / critical word first scheme to reduce miss penalties. There's more stuff going on, so I'll stop rambling and show some examples at some point! I'll also test for pld-ness.

Anyway - the good stuff! Seems my numbers I listed before for malloc were wrong (but the relationships between the results were correct) so please ignore my info on the cache performance of malloc. I've now profiled my version of Quake 1's renderer and the results are quite shocking. Although this is useful to me, it may give some ideas for other people trying to speed up their code.

Instruction cache misses: 80% hit rate with a 4-way 8k cache; this is not good! Even if I drop the cache size to 32 BYTES the hit rate doesn't suffer too much! The game seems to get decent icache results when I set the cache at around the 32k mark - the size of itcm! I originally put all the expensive and often-called functions into itcm back in v3 of Q1 but seems I chose the functions poorly, as only 24% of a frame of Quake is run from itcm. After a few hours of profiling 75% of all code is now run from itcm, icache misses have dropped by a factor of 5 and hit rate is up at 92%!

Data cache performs absolutely terribly but I'm still working on that so not much info there, except that although the hit rate's only 52% on average, there are less data misses that instruction misses (in Q1v3) hence me concentrating on the icache misses.

I'll be releasing this tool in a while for people to play with. Can't believe I didn't write this sooner!

PS: just saw The Dark Knight. For a Batman film, he's not really in it that much is he?!
_________________
Big thanks to everyone who donated for Quake2

Simonjhall: I find this project and reading your progress on it to be very fascinating. Keep it up! :)

BTW, since it looks like your port of Quake is currently the test for the profiler, is there a slight chance the improvements you find might work there way into an unoriginality unplanned for, slightly improved QuakeDS? :)

Simon: Yes.. very interesting!

Soooo How much faster is it?!

Very interesting. I sure will have a look at it when you release it. :^)
Keep it up, man!
_________________
July 5th 08: "Volumetric Shadow Demo" 1.6.0 (final) source released
June 5th 08: "Zombie NDS" WIP released!
It's all on my page, just click WWW below.

Doom5 wrote:

BTW, since it looks like your port of Quake is currently the test for the profiler, is there a slight chance the improvements you find might work there way into an unoriginality unplanned for, slightly improved QuakeDS? :)

Too early to tell man...plus still waiting for adhoc and Stephen Stair's new wifi lib!
_________________
Big thanks to everyone who donated for Quake2

simonjhall wrote:

However as soon as you stick the instruction and data cache into the mix, performance really starts to go south! I had never thought of it before but if you run a straight-line piece of code for the first time, you'll get an icache miss every eight instructions (in ARM mode) which is terrible! I guess the message is: get your frequently-run code into itcm, and stick you main-memory code in THUMB mode and make it short and loopy.

So it's not much different from the GBA, where the message was also: get your frequently-run code into iwram, and compile the main-memory code in Thumb.
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.

tepples wrote:

So it's not much different from the GBA, where the message was also: get your frequently-run code into iwram, and compile the main-memory code in Thumb.

Spot on. I'm annoyed though that if I compile code for ITCM in THUMB it moans at me unless I stick it in ARM...
(what's shocked me though with the whole thing is that I *never* care about instruction caching since I normally program for architectures where you don't have to care about instruction prefetch)
_________________
Big thanks to everyone who donated for Quake2

gbadev.org forum archive

DS development > A new kind of profiler

#160785 - simonjhall - Tue Jul 22, 2008 10:32 am

#160788 - eKid - Tue Jul 22, 2008 12:15 pm

#160817 - simonjhall - Wed Jul 23, 2008 12:18 am

#160819 - eKid - Wed Jul 23, 2008 12:53 am

#160834 - simonjhall - Wed Jul 23, 2008 7:52 am

#160880 - sajiimori - Wed Jul 23, 2008 11:25 pm

#160921 - simonjhall - Thu Jul 24, 2008 3:23 pm

#160933 - sajiimori - Thu Jul 24, 2008 6:12 pm

#160934 - silent_code - Thu Jul 24, 2008 6:31 pm

#160959 - ingramb - Thu Jul 24, 2008 10:43 pm

#160976 - tepples - Fri Jul 25, 2008 1:05 pm

#160991 - elhobbs - Fri Jul 25, 2008 7:07 pm

#161031 - simonjhall - Sat Jul 26, 2008 1:15 pm

#161041 - simonjhall - Sat Jul 26, 2008 8:59 pm

#161042 - sajiimori - Sat Jul 26, 2008 10:05 pm

#161108 - simonjhall - Mon Jul 28, 2008 1:34 pm

#161112 - eKid - Mon Jul 28, 2008 2:26 pm

#161122 - ingramb - Mon Jul 28, 2008 6:59 pm

#161143 - eKid - Tue Jul 29, 2008 5:15 am

#161222 - simonjhall - Thu Jul 31, 2008 12:57 am

#161228 - Doom5 - Thu Jul 31, 2008 1:39 am

#161229 - TwentySeven - Thu Jul 31, 2008 1:58 am

#161258 - silent_code - Thu Jul 31, 2008 10:42 am

#161261 - simonjhall - Thu Jul 31, 2008 11:36 am

#161357 - tepples - Fri Aug 01, 2008 10:14 pm

#161361 - simonjhall - Fri Aug 01, 2008 11:12 pm