#82232 - Chris Holmes - Thu May 04, 2006 6:17 pm
I found this thread:
http://forum.gbadev.org/viewtopic.php?t=5983
But the snippets of provided code aren't working for me.
Does anyone have a generic timing class built for the DS?
I'm working on one, and I plan to change it to take the duration as a parameter and to use multiple timers, but I can't get the timer to fire.
The code I'm using is pulled from that other thread and looks like:
Code: |
irqInit();
irqEnable(IRQ_VBLANK);
irqSet(IRQ_TIMER0, TimerFunc);
TIMER0_CR = TIMER_ENABLE | TIMER_DIV_256;
TIMER0_DATA = TIMER_FREQ_256(100);
irqEnable(IRQ_TIMER0);
|
The function TimerFunc is not getting called.
I think I'm missing something simple and obvious, but I'm not sure what. The previous post says to set bit 6 of TIMER0_CR; I tried that as well, but I still don't get the interrupt firing.
This isn't working on either Dualis or hardware.
Thanks for helping,
Chris
#82246 - wintermute - Thu May 04, 2006 7:32 pm
Code: |
#include <nds.h>
#include <stdio.h>

volatile int count=0;

//---------------------------------------------------------------------------------
void TimerFunc() {
//---------------------------------------------------------------------------------
    count++;
}

//---------------------------------------------------------------------------------
int main(void) {
//---------------------------------------------------------------------------------
    irqInit();
    irqSet(IRQ_TIMER0, TimerFunc);

    TIMER0_CR &= ~TIMER_ENABLE; // not strictly necessary if the timer hasn't been enabled before
    TIMER0_DATA = TIMER_FREQ_256(100);
    TIMER0_CR = TIMER_ENABLE | TIMER_DIV_256 | TIMER_IRQ_REQ;

    irqEnable(IRQ_TIMER0 | IRQ_VBLANK);

    videoSetMode(0); //not using the main screen
    videoSetModeSub(MODE_0_2D | DISPLAY_BG0_ACTIVE); //sub bg 0 will be used to print text
    vramSetBankC(VRAM_C_SUB_BG);

    SUB_BG0_CR = BG_MAP_BASE(31);

    BG_PALETTE_SUB[255] = RGB15(31,31,31); //by default font will be rendered with color 255

    //consoleInit() is a lot more flexible but this gets you up and running quick
    consoleInitDefault((u16*)SCREEN_BASE_BLOCK_SUB(31), (u16*)CHAR_BASE_BLOCK_SUB(0), 16);

    while(1) {
        swiWaitForVBlank();
        iprintf("\x1b[0;0H%d",count);
    }

    return 0;
}
|
Things to note
- The timer is disabled before setting the data, not strictly necessary unless the timer has been previously enabled.
- Timer data is set before the timer is enabled
- Variables adjusted in the interrupt are marked volatile
- The irq request flag is set.
_________________
devkitPro - professional toolchains at amateur prices
devkitPro IRC support
Personal Blog
#82257 - Chris Holmes - Thu May 04, 2006 8:52 pm
That worked perfectly, thanks much!
Is there any way to get fairly precise timing profiles of different bits of code? There's always the fallback of using the most precise timers available internally, running the code 10,000 times (or some arbitrarily large number of times), and dividing the total time by 10,000 to estimate the time it takes. Since other interrupts and such are still running, it won't be a precise estimate, but if all I care about is knowing that algorithm A is roughly 4 times faster than algorithm B, that should be precise enough.
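Something along those lines might look like this, with the two hardware timers cascaded into a single 32-bit counter (a rough sketch only; the flag names are libnds defines, and benchmark_me() is a hypothetical stand-in for the code under test):
Code: |
#include <nds.h>

// Cascade TIMER1 on TIMER0 to form one 32-bit counter, run the code
// under test n times, then average the tick count per call.
u32 average_ticks(void (*benchmark_me)(void), int n) {
    int i;
    u32 ticks;

    TIMER0_DATA = 0;
    TIMER1_DATA = 0;
    TIMER0_CR = TIMER_ENABLE | TIMER_DIV_1;     // tick every bus cycle
    TIMER1_CR = TIMER_ENABLE | TIMER_CASCADE;   // counts TIMER0 overflows

    for (i = 0; i < n; i++)
        benchmark_me();

    TIMER0_CR = 0;                              // stop counting
    TIMER1_CR = 0;

    ticks = TIMER0_DATA | ((u32)TIMER1_DATA << 16);
    return ticks / n;                           // average ticks per call
}
|
At TIMER_DIV_1 the ticks are bus cycles (roughly 33.5 MHz), so dividing by the iteration count gives a rough per-call figure; interrupts firing during the loop will still inflate the numbers slightly.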
Alternatively, there is the no$gba emulator that, in theory anyway, can provide very good code profiles.
Are there any other working DS profilers?
And thanks again for the help,
Chris
#82261 - Mighty Max - Thu May 04, 2006 9:28 pm
Chris Holmes wrote: |
all I care about is to know that algorithm A is roughly 4 times faster than algorithm B, that should be precise enough.
|
If you need to compare algorithm times, you shouldn't care about constant (scalar) differences. It doesn't really matter if, for any given input, algorithm A is 4 times faster than B, because that factor is just down to the specific implementation and can easily be changed.
What you should care about is whether an algorithm's running time grows as you increase the limiting factor, or stays constant. Let's take an example. Do you know the Fibonacci sequence?
0 1 1 2 3 5 8 13 21 34 55 89 ....
Its naive algorithm is
fib(n) = (n<2)?n:fib(n-1)+fib(n-2) ;
Its time cost grows at the same rate as the value it calculates.
However, there is an iterative solution that solves fib(n) in (n-1) steps, so its time cost is linear in n - a quick sketch of both is below.
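Roughly, in C (a quick sketch, not tied to any particular implementation):
Code: |
// Naive recursive version: the number of calls grows like fib(n) itself.
unsigned long long fib_recursive(int n) {
    return (n < 2) ? n : fib_recursive(n - 1) + fib_recursive(n - 2);
}

// Iterative version: one addition per step, so the time cost is linear in n.
unsigned long long fib_iterative(int n) {
    unsigned long long a = 0, b = 1;
    int i;
    for (i = 0; i < n; i++) {
        unsigned long long next = a + b;
        a = b;
        b = next;
    }
    return a;
}
|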
I wouldn't recommend testing the speed difference the way you mentioned; you should keep analysing the algorithm itself. Why? If you ran that test with even a moderate n=100, you wouldn't want to wait for it: the naive fib(n) would call itself on the order of fib(100) = 354224848179261915075 times.
For speeding up implementations, I would recommend using the compiler's built-in optimizer.
_________________
GBAMP Multiboot
#82264 - Chris Holmes - Thu May 04, 2006 9:43 pm
Or you could say, "Look up Big-Oh notation."
Since you're obviously curious, what I'm testing is my alternative math libraries for the DS against the current lookup table, and the various ways of implementing the lookups. There are a couple of options: direct lookup into full tables for both sine and cosine; direct lookup for sine, with cosine derived from the sine table (incurring a single addition); or using 1/4 of the sine table and incurring about 2 branches. Also, while I'm at it, I want to know whether gcc is bright enough to treat % power_of_2 as a >> or not. If it's really doing a divide, that would incur a significant penalty, and that would be bad and slow.
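Roughly, the quarter-table approach looks like this (a sketch only; QUARTER and quarter_sin are hypothetical names, the table holds 16.16 fixed-point values, and angles are in units where a full circle is 4*QUARTER):
Code: |
#define QUARTER 128

// quarter_sin[i] = sin(i * 90deg / QUARTER) in 16.16 fixed point
extern const int quarter_sin[QUARTER + 1];

int fixed_sin(unsigned angle) {
    unsigned a   = angle & (4 * QUARTER - 1);   // wrap to one full turn
    unsigned idx = a & (QUARTER - 1);           // offset within the quadrant
    int s;

    if (a & QUARTER)                            // quadrants 1 and 3: mirror the index
        idx = QUARTER - idx;
    s = quarter_sin[idx];
    if (a & (2 * QUARTER))                      // quadrants 2 and 3: negate
        s = -s;
    return s;
}

int fixed_cos(unsigned angle) {
    return fixed_sin(angle + QUARTER);          // cos(x) = sin(x + 90deg): the single addition
}
|
The wrap and quadrant tests are masks and compares only because the table size is a power of two, which is where the % question comes in.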
But while I'm at it, yes, I know what Big-Oh notation is. I asked about code profiling, not Big-Oh. There's a difference.
Chris
#82268 - wintermute - Thu May 04, 2006 10:06 pm
You mean % power of 2 as & surely?
gcc can be pretty clever with certain code - the best way to investigate is to have a look at the assembly output for a given piece of code. Add -save-temps to your CFLAGS and the assembly output will be saved.
Code: |
unsigned int modulo(unsigned int test) {
    unsigned int x = test % 128;
    unsigned int y = x % 64;
    x = y % 32;
    return x;
}

.code 16
.thumb_func
.type modulo, %function
modulo:
.LVL0:
mov r3, #31
and r0, r0, r3
.LVL1:
@ lr needed for prologue
.loc 1 12 0
@ sp needed for prologue
bx lr
.LFE57:
.size modulo, .-modulo
|
it gets less clever with signed values.
division and multiplication by powers of two are optimised to shifts where appropriate.
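For instance (a hypothetical snippet), both of these typically reduce to a single shift for unsigned operands:
Code: |
unsigned int times8(unsigned int x) { return x * 8; }   // lsl #3
unsigned int div8(unsigned int x)   { return x / 8; }   // lsr #3
|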
_________________
devkitPro - professional toolchains at amateur prices
devkitPro IRC support
Personal Blog
#82272 - tepples - Thu May 04, 2006 10:25 pm
Chris Holmes wrote: |
Since you're obviously curious, what I'm testing is my alternative math libraries for the DS against the current look up table and various ways of implementing them. There are a couple of ways to handle the lookups: direct lookup of full tables, direct lookup for sine but use the sine table for cosine (incurs a single addition cost), using 1/4 of the sine table and incurring about 2 branches. |
What looks like a branch may not be a branch. If you're compiling to ARM code, GCC can often optimize a branch around a single line of code into a conditional execution. For instance, mapping a full cosine table lookup into a half cosine table (0..180) lookup would be a CMP then a RSBGT or similar, always taking two cycles.
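As a concrete (hypothetical) sketch of that kind of mapping, the wrapper below folds 0..359 degrees onto a 0..180 table; compiled as ARM code, the branch around the single assignment is exactly the sort of thing that becomes a CMP followed by a conditional RSB:
Code: |
extern const int half_cos_table[181];   // hypothetical table, cos(0deg)..cos(180deg)

int cos_lookup(int angle)               // angle assumed to be 0..359 degrees
{
    if (angle > 180)                    // cos(a) == cos(360 - a)
        angle = 360 - angle;            // candidate for CMP + RSBGT
    return half_cos_table[angle];
}
|
Whether gcc actually emits the conditional form depends on compiling as ARM rather than Thumb, as noted above.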
Also consider space-time tradeoffs: if you optimize one part of the program for space, then you can optimize another part of the program for time. This depends on how often one method is called compared to another method. The difference can be dramatic: when I change tetanus.c from Tetanus On Drugs for GBA to use -O3 (optimize for time) instead of -O2 (optimize for a balance of space and time), it's roughly 10 KB bigger on top of the 35 KB total code size (before appending assets).
Quote: |
Also, while I'm at it, I want to know if gcc is bright enough to treat % power_of_2 as a >> or not. |
Depends. If it's unsigned, then it'll probably optimize. It's less likely to do so for signed types because of rounding issues. Look at the asm code that GCC emits (use -save-temps) to make sure.
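A small illustration of the rounding issue (a hypothetical test program, not DS-specific): gcc's signed % truncates toward zero, so a plain mask gives a different answer for negative values and the compiler has to emit fix-up code.
Code: |
#include <stdio.h>

int main(void) {
    int n = -5;

    // signed % truncates toward zero: -5 % 32 is -5,
    // while the plain mask -5 & 31 gives 27
    printf("-5 %% 32 = %d, -5 & 31 = %d\n", n % 32, n & 31);
    return 0;
}
|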
The simplest, most reliable method of profiling on the GBA and Nintendo DS is to read VCOUNT at the start and end of an inner loop from within the game (so that differences in cache behavior don't affect anything).
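A minimal sketch of that idea (VCOUNT lives at 0x04000006 on both GBA and DS; algorithm_a() is a hypothetical stand-in for the code under test):
Code: |
// Read the current scanline before and after the code under test.
// Only meaningful if the code finishes within the current frame,
// so the count doesn't wrap back to 0.
#define VCOUNT (*(volatile unsigned short *)0x04000006)

extern void algorithm_a(void);

int scanlines_taken(void)
{
    unsigned start = VCOUNT;
    algorithm_a();
    unsigned end = VCOUNT;
    return (int)(end - start);
}
|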
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.
#82354 - Chris Holmes - Fri May 05, 2006 2:49 pm
The problem I have with letting GCC optimize code is that since everyone is using GCC, that gives me a reliable measure of how optimized it will be. GCC is a pretty lousy optimizing compiler, especially compared to things like Intel's compiler. When the best GCC can do is code that runs at about half the speed of Intel's compiler, well, that's just not super.
And I generally assume that almost everyone is building at -O2, not -O3 or anything different and special. So if I can optimize something that I consider to be a support library, then it's probably worth a few minutes of my time.
I didn't think ARM supported conditional operations at all. That's an awfully CISC thing to put into a RISC core.
And yes, I'm aware of all the optimizations possible, the time-space trade-off, etc. I didn't spend all those years recently getting a CS and CmpE degree for nothing. And those degrees taught me everything that GCC isn't and almost can't do to optimize code.
Chris
#82367 - wintermute - Fri May 05, 2006 5:36 pm
Here's an interesting article.
http://www.coyotegulch.com/reviews/linux_compilers/index.html
GCC definitely does not produce code that runs at half the speed of Intel's compiler across the board.
Most 32-bit ARM instructions can be conditional; it's a little like some RISC DSPs I've used in that regard.
_________________
devkitPro - professional toolchains at amateur prices
devkitPro IRC support
Personal Blog
#82368 - Chris Holmes - Fri May 05, 2006 5:54 pm
Well, gcc has definitely made some gains. But in the end, gcc is almost always playing catch-up and will never be able to match Intel's compiler. The fact that Intel only really has to support one architecture, whereas gcc has to support a generic back-end, is enough to guarantee that.
http://en.wikipedia.org/wiki/ARM_architecture
Fascinating. Every opcode (more or less) can be made conditional - that IS interesting. I bet they have a pipeline just for handling those branch conditions: they can just execute all the instructions and then either commit them or not depending on the branch condition. That's a very interesting idea.
Chris
#82375 - tepples - Fri May 05, 2006 8:36 pm
Chris Holmes wrote: |
Fascinating. Every opcode (more or less) can be made conditional. That IS fascinating. That's interesting. I bet they have a pipeline just for handling those branch conditions. Then they can just execute all the instructions and then either commit them or not depending on the branch condition. |
A non-taken instruction always uses one cycle, even if it would otherwise be a 3-cycle ldr, 2-cycle str, 3-cycle branch, 2+n cycle ldm, 1+n cycle stm, or a longer multiply.
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.
#82379 - Chris Holmes - Fri May 05, 2006 9:01 pm
Where do you get that from?
At minimum, there's the cost of the conditional statement. So a non-taken instruction costs 1 cycle beyond the cost of determining the conditional. That just means that immediately after fetch, the condition is evaluated and the instruction is dropped from the pipeline if the condition is false.
And why does it feel like we're arguing about something simple?
Chris
#82382 - tepples - Fri May 05, 2006 9:16 pm
Chris Holmes wrote: |
Where do you get that from? |
ARM7TDMI-S official manual.
Quote: |
At minimum, there's the cost of the conditional statement. |
Which happens in parallel with the rest of the instruction.
Quote: |
And why does it feel like we're arguing about something simple? |
Because unless everyone in a discussion has read the relevant ARM manual, it's hard for them to agree on a consistent set of terminology.
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.
#82399 - ecurtz - Sat May 06, 2006 12:11 am
I'd really recommend the Furber book if you're going to be doing a lot of GBA or DS work (not that I have, relatively speaking). It's a much better introduction than just leaping in with the technical references.
ARM System-on-Chip Architecture
#82407 - Chris Holmes - Sat May 06, 2006 1:20 am
Quote: |
Which happens in parallel with the rest of the instruction. |
Right.
if( [some_random_memory_location_that's_not_cached] == 42 )
i++
Assuming that GCC turns that into a conditional i++, it's going to take a few cycles to load the memory location and compare it to 42.
That result gets saved into a register and that register is used for the next conditional.
If the conditional is false, then it costs 1 cycle. I haven't looked, but it's pretty safe to assume that the conditional value has to be stored in a register, which is exactly how they can check the condition in just one cycle.
Your statement is misleading. After the conditional is computed, then yes, the next instruction will either execute or only incur a 1 cycle penalty. That doesn't mean magic happens and you can suddenly execute a 30 cycle conditional expression in 1 cycle.
Yep, I'm right. http://work.de/nocash/gbatek.htm
The conditional evaluation looks at the flags register. Therefore, the cost of any conditional opcode is either [conditional cost] + 1 (for not taken) or [conditional cost] + [opcode execution time].
Therefore, saying that an opcode not taken takes one cycle is extremely misleading. Sure, it can take 1 cycle if the flags register is already set, but if you have to set it (aka, the condition), then it still requires time to compute the condition to set the flags register.
Chris
#82414 - wintermute - Sat May 06, 2006 2:14 am
Chris Holmes wrote: |
Quote: | Which happens in parallel with the rest of the instruction. |
Right.
if( [some_random_memory_location_that's_not_cached] == 42 )
i++
|
That's not an instruction. That's a large piece of C.
Code: |
int i;
if ( *( u8 *)42 == 42) i++;
|
compiles to :-
Code: |
mov r3, #0
ldrb r2, [r3, #42] @ zero_extendqisi2
ldr r1, .L5
cmp r2, #42
ldreq r3, [r1, #0]
addeq r3, r3, #1
streq r3, [r1, #0]
|
the last three instructions take 1 cycle each when the condition is false.
Not too shabby really.
Quote: |
Assuming that GCC turns that into a conditional i++, it's going to take a few cycles to load the memory location and compare it to 42.
|
Loading the memory location and comparing it to 42 are several instructions, separate from the increment.
Quote: |
Therefore, saying that an opcode not taken takes one cycle is extremely misleading. Sure, it can take 1 cycle if the flags register is already set, but if you have to set it (aka, the condition), then it still requires time to compute the condition to set the flags register.
|
Only if you decide that ARM can somehow shoehorn several C statements into one instruction.
Saying that an opcode not taken takes one cycle is extremely accurate. There's no other way to interpret it; it does what it says on the tin. Saying that setting the condition happens in parallel is perhaps a bit misleading - it depends on what's being tested, which is perhaps where you were going when you got a bit sidetracked.
In any case, if you want to worry about cycle timings at this sort of level perhaps you'd be better off writing assembly directly.
For many, many tips on gcc optimisation methods look around the forums for posts by Cearn.
Here's one to start you off
http://forum.gbadev.org/viewtopic.php?p=43025&highlight=#43025
_________________
devkitPro - professional toolchains at amateur prices
devkitPro IRC support
Personal Blog
#82424 - Chris Holmes - Sat May 06, 2006 4:06 am
Quote: |
the last three instructions take 1 cycle each when the condition is false.
Not too shabby really. |
Like I said, it's a fascinating hardware optimization. It's just proof that CISC isn't quite dead yet.
Quote: |
Saying that an opcode not taken takes one cycle is extremely accurate |
I agree, but he said instruction, not opcode. I was just taking issue with his wording. I would certainly consider if( [location] == 42) to be a single instruction, but that breaks down into multiple opcodes, which was exactly my point.
Chris
#82430 - Mighty Max - Sat May 06, 2006 7:35 am
Chris Holmes wrote: |
The conditional evaluation looks at the flags register. Therefore, the cost of any conditional opcode is either [conditional cost] + 1 (for not taken) or [conditional cost] + [opcode execution time].
|
Nope.
The conditional check is done in parallel with preparing the operation path in the ALU. The condition check uses only a few fast gates; if the condition code is not met by the time the ALU is ready for the operation, the operation aborts with no extra cost.
The cost is therefore either 1 or [opcode execution time].
And yes, this is completely parallel, as the condition is part of the instruction itself and not an additional prefix.
_________________
GBAMP Multiboot
#82453 - wintermute - Sat May 06, 2006 3:06 pm
Quote: |
I would certainly consider if( [location] == 42) to be a single instruction, but that breaks down into multiple opcodes, which is exactly what my point was.
|
It's a C statement, not an instruction. It's also not a conditional check, it's a comparison.
_________________
devkitPro - professional toolchains at amateur prices
devkitPro IRC support
Personal Blog