gbadev.org forum archive

I am learning how to write ASM code for GBA. Leaving it to the compiler to make code for me is probably better, but I still want to try.

I've noticed a difference in the "style" of how the compiler generates code for loading data into an I/O port location.

For instance, if I use the C way of loading a register:

Code:

WAIT_CONTROL = 0x4003;

devkitPro generates the following code, as viewed in the VBA disassembler window:

Code:

080005ee ldr r2, [$080005f8] (=$00004003)
080005f0 ldr r3, [$080005fc] (=$04000204)
080005f2 strh r2, [r3, #0x0]
080005f4 b $080005f
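
For context, a store like this usually goes through a macro that casts the register address to a volatile 16-bit pointer, which would explain why the compiler ends with a halfword store (strh). The sketch below is only an illustration, not devkitPro's actual header: the macro body is an assumption, and fake_io is a hypothetical host-side array standing in for the I/O region so the code can run on a PC. On the GBA itself the macro would presumably point at 0x04000204, the address shown in the disassembly.

```c
#include <stdint.h>

/* Hypothetical stand-in for the I/O region that starts at 0x04000000, so
   this sketch can run on a PC. It is not part of any real GBA header. */
static uint16_t fake_io[0x400 / 2];

/* Assumed shape of the register macro: a volatile 16-bit lvalue at offset
   0x204. On hardware the address would be 0x04000204 instead. */
#define WAIT_CONTROL (*(volatile uint16_t *)&fake_io[0x204 / 2])

/* The C statement from the post: one 16-bit store of 0x4003. */
void set_wait_control(void)
{
    WAIT_CONTROL = 0x4003;
}

/* Read the value back through the same macro. */
uint16_t get_wait_control(void)
{
    return WAIT_CONTROL;
}
```

Because the lvalue is 16 bits wide, the compiler only has to get the constant and the address into registers and emit a single halfword store, which matches the two ldr instructions and the strh above.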

The way I wrote it was:

Code:

080005ee mov r0, #0x80
080005f0 lsl r0, r0, #0x7
080005f2 add r0, #0x3
080005f4 mov r1, #0x80
080005f6 lsl r1, r1, #0x0a
080005f8 add r1, #0x1
080005fa lsl r1, r1, #0x7
080005fc add r1, #0x1
080005fe lsl r1, r1, #0x02
08000600 strh r0, [r1, #0x0]

My code uses 1 nonsequential access and 9 sequential accesses on ROM. devkitPro's code uses 6 nonsequential accesses and 1 sequential access.

Which version would generally be faster? I know sequential access on ROM is faster, but would the greater number of instructions in my code negate this advantage?

You optimize the code that runs 18,000,000 times, and forget about optimizing the code that runs 7 times.

But anyway, track down a copy of NO$GBA; it tells you how many cycles the code took to execute.

I almost never use Thumb ASM anyway.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

They should both take around the same time to execute, although the first one has that branch at the end which pushes the cycle count a bit above the second one. (but the first one is smaller...)

eKid wrote:

They should both take around the same time to execute, although the first one has that branch at the end which pushes the cycle count a bit above the second one. (but the first one is smaller...)

AFAICT, the branch is just part of the code that follows, so it shouldn't be counted.

Cycle counting becomes a little complicated when it comes to memory accesses because you have to split them up into code and data parts, and then deal with waitstates and bus-sizes. You can find all these at gbatek:gbamemmap. In these cases, I think it comes down to the following:

Code:

@ Assuming ROM code, default waitstates.
@ Affixes 'c' and 'd' mean code and data.
@ The number indicates the section (like 8 for ROM)
@ 'w' and 'h' mean word and halfword access.
@ The strh doesn't matter, as it's part of both algorithms.
@ NOTE: just a theoretical estimate.

ldr r0,=0x4003 @ Nc(8h) + Nd(8w) + I
ldr r1,=0x04000204 @ Nc(8h) + Nd(8w) + I

Total : 2*N(8h) + 2*N(8w) + 2*I = 2*5 + 2*8 + 2*1 = 28

mov r0, #0x80
lsl r0, r0, #0x7
add r0, #0x3
mov r1, #0x80
lsl r1, r1, #0x0a
add r1, #0x1
lsl r1, r1, #0x7
add r1, #0x1
lsl r1, r1, #0x02

All 1S(8h) : 9*3 = 27
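
The arithmetic in the estimate above can be checked with a small C sketch. It assumes the default ROM waitstates are 4 for a nonsequential access and 2 for a sequential one (which is consistent with the 5, 8 and 3 used above), and that a 32-bit read over the 16-bit ROM bus is split into one nonsequential and one sequential halfword access. All function names are made up for illustration, and like the estimate itself this is theoretical, not a measurement.

```c
#include <assert.h>

/* Assumed default ROM waitstates: 4 for the first (nonsequential) access,
   2 for each following (sequential) access. */
enum { ROM_WAIT_N = 4, ROM_WAIT_S = 2 };

/* Cycles for one 16-bit ROM access. */
static int rom_n16(void) { return 1 + ROM_WAIT_N; }  /* N(8h) */
static int rom_s16(void) { return 1 + ROM_WAIT_S; }  /* S(8h) */

/* A 32-bit read on the 16-bit ROM bus: one N plus one S halfword access. */
static int rom_n32(void) { return rom_n16() + rom_s16(); }  /* N(8w) */

/* One "ldr rX,=const": opcode fetch Nc(8h) + literal read Nd(8w) + 1 I cycle. */
static int ldr_literal_cycles(void)
{
    return rom_n16() + rom_n32() + 1;
}

/* The two-ldr version: both loads cost the same. */
int ldr_version_cycles(void)
{
    return 2 * ldr_literal_cycles();
}

/* A run of Thumb mov/lsl/add instructions, each costing 1S(8h). */
int shift_version_cycles(int instructions)
{
    assert(instructions >= 0);
    return instructions * rom_s16();
}
```

Under these assumptions ldr_version_cycles() gives 28 and shift_version_cycles(9) gives 27, matching the totals above. Changing ROM_WAIT_N and ROM_WAIT_S shows how the comparison shifts with other waitstate settings.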

So the longer version is actually a teensy bit faster. You can even reduce that down to 6 instructions in this case:

Code:

mov r0, #0x40 @ r0 = 0x0040
lsl r1, r0, #17 @ r1 = 0x00800000
add r1, r0 @ r1 = 0x00800040
lsl r1, r1, #3 @ r1 = 0x04000200
lsl r0, r0, #8 @ r0 = 0x4000
add r0, #3 @ r0 = 0x4003

strh r0, [r1, #4]
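
As a sanity check on the register comments, the sequence can be replayed with plain 32-bit arithmetic in C. This is only a host-side sketch: each statement mirrors one instruction, the function names are hypothetical, and it ignores flags and timing.

```c
#include <stdint.h>

/* r1 after the first four instructions of the six-instruction version. */
uint32_t short_version_base(void)
{
    uint32_t r0 = 0x40;      /* mov r0, #0x40    */
    uint32_t r1 = r0 << 17;  /* lsl r1, r0, #17  */
    r1 += r0;                /* add r1, r0       */
    r1 <<= 3;                /* lsl r1, r1, #3   */
    return r1;
}

/* r0 after the last two instructions: the halfword that gets stored. */
uint32_t short_version_value(void)
{
    uint32_t r0 = 0x40;      /* mov r0, #0x40    */
    r0 <<= 8;                /* lsl r0, r0, #8   */
    r0 += 3;                 /* add r0, #3       */
    return r0;
}

/* r1 as built by the original nine-instruction version, for comparison. */
uint32_t long_version_address(void)
{
    uint32_t r1 = 0x80;      /* mov r1, #0x80      */
    r1 <<= 10;               /* lsl r1, r1, #0x0a  */
    r1 += 1;                 /* add r1, #0x1       */
    r1 <<= 7;                /* lsl r1, r1, #0x7   */
    r1 += 1;                 /* add r1, #0x1       */
    r1 <<= 2;                /* lsl r1, r1, #0x02  */
    return r1;
}
```

The strh with the #4 offset stores to short_version_base() + 4, which should equal long_version_address(), so both versions write the same value to the same register.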

However, none of that matters because of what Dwedit said:

Dwedit wrote:

You optimize the code that runs 18,000,000 times, and forget about optimizing the code that runs 7 times.

This is not the kind of code you'd put in an inner loop, so it would have little influence on the overall speed. When choosing what to optimize, *jedi handwave* this is not the code you're looking for.

Start with the LDR versions for constants; they're easier to write and debug. Transfer to the other version when you find it really matters. Or, of course, if you really, really want to.

Cearn wrote:

Start with the LDR versions for constants; they're easier to write and debug. Transfer to the other version when you find it really matters. Or, of course, if you really, really want to.

Or if you're doing an optimization exercise and/or trying to learn how to code fast assembler as an exercise. ^^'

Cearn wrote:

Code:

lsl r1, r0, #17 @ r1 = 0x00800000
add r1, r0 @ r1 = 0x00800040

Just wondering, is it not allowed to use the same source register twice, like this:

Code:

add r1, r0, r0, lsl #17

I'm just learning ASM at the moment, so please excuse me if that is just not possible.

Pate
_________________

Now working on DSx86 http://dsx86.patrickaalto.com
Get LineWarsDS from http://linewars.patrickaalto.com

Yes, that is possible, which allows one to do stuff like this...

Code:

@ Aim: Build 0x11111111
mov r0, #0x11 @ 0x00000011
orr r0, r0, r0, lsl #0x08 @ 0x00001111
orr r0, r0, r0, lsl #0x10 @ 0x11111111

However, this only works for ARM: for THUMB code, you must do all the shifting manually, like this:

Code:

mov r0, #0x11 @ 0x00000011
lsl r1, r0, #0x08 @ 0x00001100
orr r0, r1 @ 0x00001111
lsl r1, r0, #0x10 @ 0x11110000
orr r0, r1 @ 0x11111111
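
To see that the two sequences end up with the same value, here is a small C sketch that mirrors each instruction with ordinary 32-bit operations. The function names are hypothetical, and this only models the register contents, not instruction encodings or cycle counts.

```c
#include <stdint.h>

/* ARM style: the shift is folded into the orr as a shifted operand. */
uint32_t build_arm_style(void)
{
    uint32_t r0 = 0x11;   /* mov r0, #0x11              */
    r0 |= r0 << 8;        /* orr r0, r0, r0, lsl #0x08  */
    r0 |= r0 << 16;       /* orr r0, r0, r0, lsl #0x10  */
    return r0;
}

/* THUMB style: every shift is its own instruction and needs a scratch register. */
uint32_t build_thumb_style(void)
{
    uint32_t r0 = 0x11;   /* mov r0, #0x11      */
    uint32_t r1;
    r1 = r0 << 8;         /* lsl r1, r0, #0x08  */
    r0 |= r1;             /* orr r0, r1         */
    r1 = r0 << 16;        /* lsl r1, r0, #0x10  */
    r0 |= r1;             /* orr r0, r1         */
    return r0;
}
```

Both functions should return 0x11111111. The THUMB version takes five steps and a second register where the ARM version takes three steps and one register.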

Ruben wrote:

However, this only works for ARM: for THUMB code, you must do all the shifting manually like this

Ah, okay, I missed the fact that it was thumb code. Sorry about that, and thanks for the info!

Pate

Cearn wrote:

When choosing what to optimize, *jedi handwave* this is not the code you're looking for .

Awesomeness. :)

Coding > ASM Performance Question

#170077 - Seahawk - Fri Aug 28, 2009 4:16 am

#170080 - Dwedit - Fri Aug 28, 2009 5:24 am

#170081 - eKid - Fri Aug 28, 2009 5:45 am

#170085 - kusma - Fri Aug 28, 2009 6:11 pm

#170087 - Cearn - Fri Aug 28, 2009 8:03 pm

#170091 - Ruben - Sat Aug 29, 2009 7:39 am

#170103 - Pate - Mon Aug 31, 2009 6:57 am

#170105 - Ruben - Mon Aug 31, 2009 7:44 am

#170106 - Pate - Mon Aug 31, 2009 8:08 am

#170639 - Ant6n - Fri Oct 09, 2009 5:06 am