gbadev.org forum archive

I am a bit new to ARM assembly and I read somewhere that it's better to avoid branching as much as possible because branching is slow. But I was wondering which of these two versions of the same code (taken from Dooby's lib) is faster.

This is the original, no branching:

Code:

tst r0, #0x1 @ halfword aligned?

@ not aligned
ldrneh r1, [r0, #-1]! @ grab short holding byte we're writing
bicne r1, r1, #0xff00 @ blank pixel we're writing
orrne r1, r1, r2, lsl #8 @ insert pixel we're writing
strneh r1, [r0], #2 @ write & leave aligned to next halfword

@ aligned
ldreqh r1, [r0] @ grab halfword
biceq r1, r1, #0x00ff @ clear left pixel
orreq r1, r1, r2, lsr #24 @ insert left pixel
streqh r1, [r0] @ plot

bx r14

And this modification with just one branch and no conditions in the instructions:

Code:

tst r0, #0x1 @ halfword aligned?
beq L_Aligned

@ not aligned
ldrh r1, [r0, #-1]!    @ grab short holding byte we're writing
bic r1, r1, #0xff00    @ blank pixel we're writing
orr r1, r1, r2, lsl #8 @ insert pixel we're writing
strh r1, [r0], #2    @ write & leave aligned to next halfword
bx r14

@ aligned
L_Aligned:
ldrh r1, [r0]    @ grab halfword
bic r1, r1, #0x00ff @ clear left pixel
orr r1, r1, r2, lsr #24 @ insert left pixel
strh r1, [r0]    @ plot
bx r14

So which one is faster/better and why?

The second one is faster.

In the second example the branch costs 3 cycles when it is taken (the aligned case) or 1 cycle when it is not (the unaligned case), whereas in the first example the four instructions whose condition fails always cost 4 cycles.
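For anyone who wants to check the arithmetic, here is a rough tally in Python. It is only a sketch: it assumes the code runs from zero-wait-state RAM, takes the costs as 3 cycles for a taken branch, 1 for a branch that is not taken and 1 for each conditional instruction that fails its condition, and ignores the instructions both versions have in common. The function names are made up for illustration.

```python
# Assumed ARM7TDMI costs with code in zero-wait-state RAM (figures from the thread).
BRANCH_TAKEN = 3       # taken branch: the pipeline has to be refilled
BRANCH_NOT_TAKEN = 1   # branch whose condition fails
SKIPPED_INSN = 1       # conditional instruction whose condition fails


def overhead_no_branch(skipped_insns=4):
    """Version 1: the four instructions for the other case are fetched and skipped."""
    return skipped_insns * SKIPPED_INSN


def overhead_one_branch(aligned):
    """Version 2: 'beq L_Aligned' is taken only when the address is aligned."""
    return BRANCH_TAKEN if aligned else BRANCH_NOT_TAKEN


for aligned in (True, False):
    case = "aligned" if aligned else "unaligned"
    print(case, "no-branch:", overhead_no_branch(),
          "one-branch:", overhead_one_branch(aligned))
```

Under these assumptions the one-branch version comes out 1 cycle ahead on the aligned path and 3 cycles ahead on the unaligned path.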

IMO it should be possible to juggle this to achieve better performance than either of these examples, but it's impossible to say for sure because the code is out of context.

Col

Last edited by col on Thu Apr 22, 2004 3:37 pm; edited 1 time in total

Assuming you are running this code from RAM instead of ROM (which you should be, especially since ARM code is most effective when run from IWRAM), the second version is definitely faster than the first.

This is because there is no penalty in RAM for random access, so even though a branch costs three memory-accessing cycles, you aren't hurt by having to jump to a different location in the code.

The break-even point between using conditional execution and just branching depends on the type of code and where it's being run from, but usually it's in the neighbourhood of three instructions or so.
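To see where a figure like three comes from, here is a back-of-envelope check in Python. It is a sketch under stated assumptions: zero-wait-state RAM, 3 cycles for a taken branch, and 1 cycle for each conditional instruction that fails its condition. It only looks at the case where a block of n instructions ends up being skipped, and the function name is hypothetical.

```python
BRANCH_TAKEN = 3   # assumed cost of branching over the block
SKIPPED_INSN = 1   # assumed cost of each instruction that fails its condition


def cheaper_way_to_skip(n):
    """Compare skipping n instructions by conditional execution vs. one taken branch."""
    conditional = n * SKIPPED_INSN
    if conditional < BRANCH_TAKEN:
        return "conditional"
    if conditional > BRANCH_TAKEN:
        return "branch"
    return "tie"


for n in range(1, 6):
    print(n, cheaper_way_to_skip(n))
```

When the block is actually executed, the branching version still pays 1 cycle for the untaken branch, so this covers only the skipping side. Wait states when running from ROM would shift the numbers again.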

You should also consider where the program flow is likely to go most of the time. In a pixel-plotting routine you'll probably take the aligned branch 50% of the time and the unaligned branch the other 50%. But if you are handling a special case in a routine that will only occur 10% of the time, you obviously want to optimize for the other 90%.
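A quick way to weigh this up is to total the branch overhead over a batch of calls. The Python sketch below does that for a routine with a single conditional branch, assuming 3 cycles for a taken branch and 1 for one that falls through in zero-wait-state RAM. The call counts are invented purely for illustration.

```python
BRANCH_TAKEN = 3       # assumed: path reached through the taken branch
BRANCH_NOT_TAKEN = 1   # assumed: fall-through path


def branch_overhead(taken_calls, fallthrough_calls):
    """Total branch cycles over a batch of calls to a one-branch routine."""
    return taken_calls * BRANCH_TAKEN + fallthrough_calls * BRANCH_NOT_TAKEN


print(branch_overhead(50, 50))  # even split between the two paths
print(branch_overhead(10, 90))  # common case falls through
print(branch_overhead(90, 10))  # common case takes the branch
```

With these made-up numbers, a 90/10 split costs 120 cycles per 100 calls when the common case falls through and 280 when it takes the branch, so the likely path is the one to leave as the fall-through.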

Edit: col beat me, but only because the Internet seems to be way slow today...

Dan.

ASM > ARM - Branching vs. conditional operands question

#19589 - isildur - Thu Apr 22, 2004 2:52 pm

#19591 - col - Thu Apr 22, 2004 3:25 pm

#19592 - poslundc - Thu Apr 22, 2004 3:26 pm