gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback Machine copies. A new forum can be found here.

Coding > ASM Performance Question

#170077 - Seahawk - Fri Aug 28, 2009 4:16 am

I am learning how to write ASM code for GBA. Leaving it to the compiler to make code for me is probably better, but I still want to try.

I've noticed a difference in the "style" of how the compiler generates code for loading data into an I/O port location.

For instance, if I write to the register the C way:
Code:

WAIT_CONTROL = 0x4003;


devkitPro generates the following code, as viewed in the VBA disassembler window:
Code:

080005ee ldr r2, [$080005f8] (=$00004003)
080005f0 ldr r3, [$080005fc] (=$04000204)
080005f2 strh r2, [r3, #0x0]
080005f4 b $080005f


The way I wrote it was:
Code:

080005ee mov r0, #0x80
080005f0 lsl r0, r0, #0x7
080005f2 add r0, #0x3
080005f4 mov r1, #0x80
080005f6 lsl r1, r1, #0x0a
080005f8 add r1, #0x1
080005fa lsl r1, r1, #0x7
080005fc add r1, #0x1
080005fe lsl r1, r1, #0x02
08000600 strh r0, [r1, #0x0]


My code uses 1 non-sequential access on ROM and 9 sequential accesses. devkitPro's code uses 6 non-sequential accesses and 1 sequential access.

Which version would generally be faster? I know sequential access on ROM is faster, but would the greater number of instructions in my code negate this advantage?

#170080 - Dwedit - Fri Aug 28, 2009 5:24 am

You optimize the code that runs 18,000,000 times, and forget about optimizing the code that runs 7 times.

But anyway, track down a copy of NO$GBA; it tells you how many cycles the code took to execute.

I almost never use Thumb ASM anyway.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

#170081 - eKid - Fri Aug 28, 2009 5:45 am

They should both take around the same time to execute, although the first one has that branch at the end which pushes the cycle count a bit above the second one. (but the first one is smaller...)

#170085 - kusma - Fri Aug 28, 2009 6:11 pm

eKid wrote:
They should both take around the same time to execute, although the first one has that branch at the end which pushes the cycle count a bit above the second one. (but the first one is smaller...)

AFAICT, the branch belongs to the code that follows, so it shouldn't be counted.

#170087 - Cearn - Fri Aug 28, 2009 8:03 pm

Cycle counting becomes a little complicated when it comes to memory accesses because you have to split them up into code and data parts, and then deal with waitstates and bus-sizes. You can find all these at gbatek:gbamemmap. In these cases, I think it comes down to the following:

Code:
@ Assuming ROM code, default waitstates.
@ Affixes 'c' and 'd' mean code and data.
@ The number indicates the section (like 8 for ROM)
@ 'w' and 'h' mean word and halfword access.
@ The strh doesn't matter, as it's part of both algorithms.
@ NOTE: just a theoretical estimate.

ldr     r0,=0x4003          @ Nc(8h) + Nd(8w) + I
ldr     r1,=0x04000204      @ Nc(8h) + Nd(8w) + I

Total : 2*N(8h) + 2*N(8w) + 2*I = 2*5 + 2*8 + 2*1 = 28


mov     r0, #0x80       
lsl     r0, r0, #0x7
add     r0, #0x3
mov     r1, #0x80
lsl     r1, r1, #0x0a
add     r1, #0x1
lsl     r1, r1, #0x7
add     r1, #0x1
lsl     r1, r1, #0x02

All 1S(8h) : 9*3 = 27


So the longer version is actually a teensy bit faster. You can even reduce that down to 6 instructions in this case:
Code:

mov     r0, #0x40           @ r0 = 0x0040
lsl     r1, r0, #17         @ r1 = 0x00800000
add     r1, r0              @ r1 = 0x00800040
lsl     r1, r1, #3          @ r1 = 0x04000200
lsl     r0, r0, #8          @ r0 = 0x4000
add     r0, #3              @ r0 = 0x4003

strh    r0, [r1, #4]
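
As a sanity check, here is a small C sketch that replays both Thumb sequences (the nine-instruction one from the original question and the six-instruction one above) on plain 32-bit integers. The function names `build_long`, `build_short` and `check_sequences` are made up for this example, and C shifts on `uint32_t` only stand in for the Thumb `lsl`; no flags or timing are modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Nine-instruction sequence from the original question.
   *r0 receives the value to store, *r1 the I/O address. */
void build_long(uint32_t *r0, uint32_t *r1)
{
    uint32_t a = 0x80;  /* mov r0, #0x80 */
    a <<= 7;            /* lsl r0, r0, #0x7   -> 0x00004000 */
    a += 0x3;           /* add r0, #0x3       -> 0x00004003 */

    uint32_t b = 0x80;  /* mov r1, #0x80 */
    b <<= 10;           /* lsl r1, r1, #0x0a  -> 0x00020000 */
    b += 0x1;           /* add r1, #0x1       -> 0x00020001 */
    b <<= 7;            /* lsl r1, r1, #0x7   -> 0x01000080 */
    b += 0x1;           /* add r1, #0x1       -> 0x01000081 */
    b <<= 2;            /* lsl r1, r1, #0x02  -> 0x04000204 */

    *r0 = a;
    *r1 = b;
}

/* Six-instruction sequence. *r1 receives the base address 0x04000200;
   the remaining 4 comes from the offset in "strh r0, [r1, #4]". */
void build_short(uint32_t *r0, uint32_t *r1)
{
    uint32_t a = 0x40;    /* mov r0, #0x40 */
    uint32_t b = a << 17; /* lsl r1, r0, #17  -> 0x00800000 */
    b += a;               /* add r1, r0       -> 0x00800040 */
    b <<= 3;              /* lsl r1, r1, #3   -> 0x04000200 */
    a <<= 8;              /* lsl r0, r0, #8   -> 0x00004000 */
    a += 3;               /* add r0, #3       -> 0x00004003 */

    *r0 = a;
    *r1 = b;
}

/* Both sequences should agree on the value and the final address. */
void check_sequences(void)
{
    uint32_t value, addr;

    build_long(&value, &addr);
    assert(value == 0x4003 && addr == 0x04000204);

    build_short(&value, &addr);
    assert(value == 0x4003 && addr + 4 == 0x04000204);
}
```

Calling `check_sequences()` from a host program should pass silently, which matches the register comments given next to each instruction.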

However, none of that matters because of what Dwedit said:
Dwedit wrote:
You optimize the code that runs 18,000,000 times, and forget about optimizing the code that runs 7 times.
This is not the kind of code you'd put in an inner loop, so it would have little influence on the overall speed. When choosing what to optimize, *jedi handwave* this is not the code you're looking for.

Start with the LDR versions for constants; they're easier to write and debug. Transfer to the other version when you find it really matters. Or, of course, if you really, really want to.
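
For anyone who wants to redo the arithmetic, here is a minimal C sketch of the estimate above. It only replays the same theoretical model (ROM code, default waitstates, `strh` left out); the enum and function names are made up for this example, and it is not a substitute for measuring in an emulator.

```c
#include <assert.h>

/* Cycle costs taken from the estimate above (ROM, default waitstates).
   The names are hypothetical labels for this sketch only. */
enum {
    ROM_N16 = 5, /* N(8h): non-sequential halfword access */
    ROM_N32 = 8, /* N(8w): non-sequential word access */
    ROM_S16 = 3, /* S(8h): sequential halfword access */
    CYCLE_I = 1  /* I: internal cycle */
};

/* Estimated cost of `count` Thumb "ldr rX, =const" loads:
   one code fetch, one literal-pool data fetch and one internal cycle each. */
int ldr_pool_cycles(int count)
{
    return count * (ROM_N16 + ROM_N32 + CYCLE_I);
}

/* Estimated cost of `count` mov/lsl/add instructions,
   each modelled as a single sequential halfword fetch. */
int alu_cycles(int count)
{
    return count * ROM_S16;
}

/* Reproduces the two totals quoted in the estimate. */
void check_estimates(void)
{
    assert(ldr_pool_cycles(2) == 28); /* two LDRs */
    assert(alu_cycles(9) == 27);      /* nine-instruction version */
}
```

Under the same assumptions, `alu_cycles(6)` gives 18 for the six-instruction version, though that figure is just this model's arithmetic and not something measured on hardware.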

#170091 - Ruben - Sat Aug 29, 2009 7:39 am

Cearn wrote:
Start with the LDR versions for constants; they're easier to write and debug. Transfer to the other version when you find it really matters. Or, of course, if you really, really want to.

Or if you're doing an optimization exercise and/or trying to learn how to write fast assembler. ^^'

#170103 - Pate - Mon Aug 31, 2009 6:57 am

Cearn wrote:

Code:

lsl     r1, r0, #17         @ r1 = 0x00800000
add     r1, r0              @ r1 = 0x00800040



Just wondering, is it not allowed to use the same source register twice, like this:

Code:

add r1, r0, r0, lsl #17


I'm just learning ASM at the moment, so please excuse me if that is just not possible.

Pate

#170105 - Ruben - Mon Aug 31, 2009 7:44 am

Yes, that is possible, which allows one to do stuff like this...

Code:
@ Aim: Build 0x11111111
mov r0,             #0x11 @ 0x00000011
orr r0, r0, r0, lsl #0x08 @ 0x00001111
orr r0, r0, r0, lsl #0x10 @ 0x11111111


However, this only works for ARM; for THUMB code, you must do all the shifting manually, like this:

Code:
mov r0,     #0x11 @ 0x00000011
lsl r1, r0, #0x08 @ 0x00001100
orr r0, r1        @ 0x00001111
lsl r1, r0, #0x10 @ 0x11110000
orr r0, r1        @ 0x11111111
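
To see that the two variants above end up with the same value, here is a small C sketch that mirrors them step by step. The function names are made up for this example; the ARM version folds each shift into the `|=` expression the way the shifted operand does, while the Thumb version keeps the shift in a separate temporary, like `r1` above.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the ARM sequence: the shift is part of the ORR operand. */
uint32_t build_arm_style(void)
{
    uint32_t r0 = 0x11; /* mov r0, #0x11             -> 0x00000011 */
    r0 |= r0 << 8;      /* orr r0, r0, r0, lsl #0x08 -> 0x00001111 */
    r0 |= r0 << 16;     /* orr r0, r0, r0, lsl #0x10 -> 0x11111111 */
    return r0;
}

/* Mirrors the Thumb sequence: each shift needs its own instruction
   and a scratch register. */
uint32_t build_thumb_style(void)
{
    uint32_t r0 = 0x11; /* mov r0, #0x11     -> 0x00000011 */
    uint32_t r1;

    r1 = r0 << 8;       /* lsl r1, r0, #0x08 -> 0x00001100 */
    r0 |= r1;           /* orr r0, r1        -> 0x00001111 */
    r1 = r0 << 16;      /* lsl r1, r0, #0x10 -> 0x11110000 */
    r0 |= r1;           /* orr r0, r1        -> 0x11111111 */
    return r0;
}

/* Both variants should produce the target constant. */
void check_patterns(void)
{
    assert(build_arm_style() == 0x11111111u);
    assert(build_thumb_style() == 0x11111111u);
}
```

The cost difference is visible in the shape of the code: three steps for the ARM form against five for the Thumb form, plus one extra register.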

#170106 - Pate - Mon Aug 31, 2009 8:08 am

Ruben wrote:

However, this only works for ARM; for THUMB code, you must do all the shifting manually, like this:


Ah, okay, I missed the fact that it was thumb code. Sorry about that, and thanks for the info!

Pate

#170639 - Ant6n - Fri Oct 09, 2009 5:06 am

Cearn wrote:
When choosing what to optimize, *jedi handwave* this is not the code you're looking for.

awesomeness. :)