gbadev.org forum archive

I was wondering just how much of an impact a branch instruction would have in a loop. For instance, what if you had something like:

Code:

mov r0,#0
looptop:
cmp r0,#128 @ loop 128 times
beq endloop
@ do stuff here ...
cmp r1,#0 @ some other arbitrary number
beq loopbottom @ skip if equal
@ do other stuff here
loopbottom:
add r0,r0,#1
b looptop
endloop:
@ finished with loop here

How much would the cmp and beq in the middle of the loop cost, aside from extra instructions? Would there be a big impact with the cache, etc? I'm trying to optimize the inner loop in a time critical task, and I'm thinking it might be better if I sort out certain criteria before hand and run different loops based on these criteria. The only problem is that the code will bloat if I have too many different versions of the loop not to mention the fact that it'll be a maintenance nightmare :/

Any suggestions?
_________________
"Beer is proof that God loves us and wants us to be happy."
-- Benjamin Franklin

GBA has no cache, so you can count cycles exactly. Also, any conditional instruction that doesn't get executed takes 1 instruction, so that's what it will take unless it's executed, and then it will be more. 2S + 1N, according to GBATEK, which in IWRAM will be 3 cycles.
But surely it would be better to do the main looping check and branch if NOT equal, at the end of the loop instead of branching to the top and then checking if you need to branch. And then if you loop from 128 down instead of 0 up, you can use the condition setting ability of the ARM to cut the cmp entirely

Code:

mov r0,#128
looptop:
@ do stuff here ...
cmp r1,#0 @ some other arbitrary number
beq loopbottom @ skip if equal
@ do other stuff here
loopbottom:
subs r0,r0,#1
bne looptop
@ finished with loop here

Or if you're working in THUMB, sub always sets the condition flags, so you still don't need to cmp, but I assume you're working in ARM in IWRAM if it's speed-critical.
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku

If the branch isn't taken it only takes 1 cycle to execute, just like most ARM instructions. If it is taken then it causes a pipeline stall which I think takes an extra 2 cycles (or possibly 3 - I can't remember off the top of my head). The GBA's processor has no cache, so you don't need to worry about that.

If "beq loopbottom" is very rare, the code as shown takes 8 cycles/loop. The first optimisation I'd make would be to make the loop counter count to zero rather than up to 128, i.e.:

Code:

mov r0,#128
looptop:
@ do stuff here ...
cmp r1,#0? ?@ some other arbitrary number
beq loopbottom @ skip if equal
@ do other stuff here
loopbottom:
subs r0,r0,#1
bgt looptop
@ finished with loop here

With the same assumptions mentioned above, this would take 6 cycles/loop. If your loop is short you could unroll it - with the code you've specified this would take just 2 cycles/loop - i.e. only the "cmp" and the "beq loopbottom" would remain.

How much of an effect removing the cmp and beq will have on the overall performance depends on how many cycles are taken up by the other stuff in your inner loop. If it's trivial to do it in advance then do it, but if doing it in advance requires more complex maths than doing it in the inner loop then you'll have to work out how many CPU cycles it would take with each approach and then make your decision based on that. However, in most cases it's a good idea to remove as much stuff as possible from your inner loop.

(By the way, all the timings above assume you're running the code from IWRAM.)

EDIT: Drat, DekuTree beat me to it. ;)

Last edited by jd on Tue Feb 10, 2004 4:06 am; edited 3 times in total

There is no cache on the GBA.

Branch instructions flush the pipeline, which basically stalls the processor for two additional cycles.

There is an additional performance hit if your code is running from ROM and you use a branch, because the next instruction needs to be read non-sequentially which takes a lot longer than a sequential read. (I'd imagine this penalty would also be incurred whenever you use the ldr/str commands, anyone care to confirm?)

Furthermore, if you turn on ROM-prefetch (which drains the battery quicker) then sequential instructions have a zero-waitstate (which means they're loaded immediately), but again you lose this advantage every time you branch.

The upshot of all of this is:

- Avoid branches as much as possible, especially when running from ROM. You can avoid backwards branches by unrolling your loops, and you can avoid forward branches by using conditional execution (ARM mode only).

- If you are optimizing a time-critical inner loop like you say, you should be doing it in ARM code in IWRAM. Branches have less of a penalty there because of the zero waitstate, but you should still try to unroll and use conditional execution as much as possible. It wasn't clear if your routine was ARM or Thumb, but if it's performance-critical then make it ARM and make it in IWRAM.

- Testing and branching forward (like the one you have in the middle of your loop) for special cases is worthwhile if you are skipping a significant amount of code, and the special case occurs often enough for it to be worthwhile. Some judgement is needed here.

- You obviously need to make an acceptable tradeoff between code bloat and optimization for speed. Yes, you only have 32K of IWRAM, but hey, this is precisely the kind of thing it's there for.

The only structural alteration I would make to your loop setup is to move your test to the end of the loop, so that you're saving a branch instruction when the loop terminates, for what it's worth. If you aren't actually using the value in your counter, I would also make it a decrementing counter so that you can save the cmp instruction. ie.:

Code:

mov    r0, #128
looptop:
@ do stuff
cmp    r1, #0
beq    loopbottom
@ do more stuff
loopbottom:
subs    r0, r0, #1
bne    looptop

@ finished with loop here

Hope this helps,

Dan.

EDIT: Figures I'd be the last one to get my response in... :P

LOL, I figured I was racing Dan to the punch, but I didn't expect James too^_^
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku

LOL thanks for racing to answer my question guys. :)

Well, I'll keep all of these things you 3 mentioned in consideration. Hopefully I'll be able to squeeze more juice out of my routine. Thanks!
_________________
"Beer is proof that God loves us and wants us to be happy."
-- Benjamin Franklin

gbadev.org forum archive

ASM > Branches within loops

#16166 - animension - Tue Feb 10, 2004 1:34 am

#16170 - DekuTree64 - Tue Feb 10, 2004 3:52 am

#16171 - jd - Tue Feb 10, 2004 3:56 am

#16172 - poslundc - Tue Feb 10, 2004 3:58 am

#16176 - DekuTree64 - Tue Feb 10, 2004 4:17 am

#16213 - animension - Tue Feb 10, 2004 7:14 pm