gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback machine copies. A new forum can be found here.

ASM > ARM - Branching vs. conditional operands question

#19589 - isildur - Thu Apr 22, 2004 2:52 pm

I am a bit new to ARM assembly and I read somewhere that it's better to avoid branching as much as possible because branching is slow. But I was wondering which of these two versions of the same code (taken from Dooby's lib) is faster.

This is the original, no branching:
Code:

tst r0, #0x1        @ halfword aligned?

@ not aligned
ldrneh   r1, [r0, #-1]!      @ grab short holding byte we're writing
bicne   r1, r1, #0xff00   @ blank pixel we're writing
orrne   r1, r1, r2, lsl #8  @ insert pixel we're writing
strneh   r1, [r0], #2        @ write & leave aligned to next halfword

@ aligned
ldreqh   r1, [r0]               @ grab halfword
biceq   r1, r1, #0x00ff       @ clear left pixel
orreq   r1, r1, r2, lsr #24   @ insert left pixel
streqh   r1, [r0]               @ plot

bx r14



And this modification with just one branch and no conditions in the instructions:

Code:

tst r0, #0x1   @ halfword aligned?
beq L_Aligned

@ not aligned
ldrh r1, [r0, #-1]!      @ grab short holding byte we're writing
bic r1, r1, #0xff00      @ blank pixel we're writing
orr r1, r1, r2, lsl #8   @ insert pixel we're writing
strh r1, [r0], #2        @ write & leave aligned to next halfword
bx r14

@ aligned
L_Aligned:
ldrh   r1, [r0]      @ grab halfword
bic   r1, r1, #0x00ff   @ clear left pixel
orr   r1, r1, r2, lsr #24   @ insert left pixel
strh   r1, [r0]      @ plot
bx r14


So which one is faster/better and why?

#19591 - col - Thu Apr 22, 2004 3:25 pm

The second one is faster

A taken ARM branch costs 3 cycles and an untaken one costs 1 cycle, whereas the four conditional instructions that get skipped in the first example always cost 4 cycles (1 cycle each).
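The cycle counts can be tallied with a rough model. This is only a sketch: it assumes ARM7TDMI timings with zero wait states (e.g. code and data both in IWRAM), namely 1 cycle for an ALU instruction, 3 for a load, 2 for a store, 3 for a taken branch or `bx`, and 1 for any instruction whose condition fails. The names `CYCLES`, `BODY`, `conditional_version` and `branching_version` are made up for illustration.

```python
# Assumed ARM7TDMI cycle costs with zero wait states (e.g. GBA IWRAM).
CYCLES = {
    "alu": 1,      # tst, bic, orr
    "ldr": 3,      # ldrh: 1S + 1N + 1I
    "str": 2,      # strh: 2N
    "branch": 3,   # taken b / bx: 2S + 1N
    "skipped": 1,  # any instruction whose condition fails (incl. untaken beq)
}

# Each path (aligned or not) does the same work: ldrh, bic, orr, strh.
BODY = ["ldr", "alu", "alu", "str"]


def body_cycles():
    """Cycles for the four instructions that actually execute."""
    return sum(CYCLES[op] for op in BODY)


def conditional_version():
    """First listing: tst, one block executed, one block skipped, bx."""
    skipped = len(BODY) * CYCLES["skipped"]
    return CYCLES["alu"] + body_cycles() + skipped + CYCLES["branch"]


def branching_version(aligned):
    """Second listing: tst, beq (taken only when aligned), one block, bx."""
    beq = CYCLES["branch"] if aligned else CYCLES["skipped"]
    return CYCLES["alu"] + beq + body_cycles() + CYCLES["branch"]
```

Under these assumptions `conditional_version()` gives 15 cycles on either path, while `branching_version(True)` gives 14 and `branching_version(False)` gives 12. Running from ROM or with other wait-state settings would change the numbers.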

IMO it should be possible to juggle this to achieve better performance than either of these examples - but it's impossible to say for sure because the code is out of context.

Col


Last edited by col on Thu Apr 22, 2004 3:37 pm; edited 1 time in total

#19592 - poslundc - Thu Apr 22, 2004 3:26 pm

Assuming you are running this code from RAM instead of ROM (which you should be, especially since ARM code is most effective when run from IWRAM), the second version is definitely faster than the first.

This is because there is no penalty in RAM for random-access, so even though a branch costs three memory-accessing cycles, you aren't hurt by having to jump to a different location in the code.

The break-even point between using conditional execution and just branching depends on the type of code and where it's being run from, but usually it's in the neighbourhood of three instructions or so.

You should also consider where the program flow is likely to go most of the time. In a pixel-plotting routine you'll probably take the aligned path about 50% of the time and the unaligned path the other 50%. But if you are handling a special case that only occurs 10% of the time, you obviously want to optimize for the other 90%.
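To put rough numbers on the break-even point and the 90%/10% case, here is a small sketch of the average overhead of guarding a block of instructions either way. It assumes zero-wait-state ARM7TDMI timings (3 cycles for a taken branch, 1 for an untaken branch, 1 for each instruction whose condition fails); the function names are hypothetical.

```python
TAKEN_BRANCH = 3    # assumed cost of a branch that jumps over the block
UNTAKEN_BRANCH = 1  # assumed cost of a branch that falls through
SKIPPED_INSTR = 1   # assumed cost of one instruction whose condition fails


def conditional_overhead(n_instr, p_skip):
    """Average wasted cycles when n_instr conditional instructions
    are skipped with probability p_skip."""
    return p_skip * n_instr * SKIPPED_INSTR


def branch_overhead(p_skip):
    """Average cycles spent on the branch itself when it jumps over
    the block with probability p_skip."""
    return p_skip * TAKEN_BRANCH + (1 - p_skip) * UNTAKEN_BRANCH
```

For a 4-instruction special case that is skipped 90% of the time, `conditional_overhead(4, 0.9)` comes to about 3.6 cycles against about 2.8 for `branch_overhead(0.9)`, so the branch wins. With only 2 instructions the conditional form costs about 1.8 cycles and wins instead, which matches a break-even of around three instructions under these assumptions.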

Edit: col beat me, but only because the Internet seems to be way slow today...

Dan.