gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback machine copies. A new forum can be found here.

ASM > Strange?

#18669 - johsta - Wed Mar 31, 2004 12:54 pm

hi

These are two ways of doing the exact same thing...
The thing is that the asm code with 7 instructions is much slower than the one with 13 instructions... why??? Something to do with the pipeline? I don't have a clue. BTW, this is done in a tight loop...

the C/C++ expression is: relX=(tileX<<3)+x-objHalfWidth;



ASM - fast (13 instructions):

ldr r3, [r11, #-128]    @ r3 = tileX
str r3, [r11, #-112]    @ relX = tileX

ldr r3, [r11, #-112]
mov r3, r3, lsl #3      @ r3 = relX << 3
str r3, [r11, #-112]

ldr r2, [r11, #-112]
ldr r3, [r11, #-68]     @ r3 = x
add r3, r2, r3
str r3, [r11, #-112]

ldr r2, [r11, #-112]
ldr r3, [r11, #-88]     @ r3 = objHalfWidth
rsb r3, r3, r2          @ r3 = r2 - r3
str r3, [r11, #-112]

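ASM - slow (7 instructions):

ldr r3, [r11, #-128]    @ r3 = tileX
mov r2, r3, lsl #3      @ r2 = tileX << 3
ldr r3, [r11, #-68]     @ r3 = x
add r2, r2, r3
ldr r3, [r11, #-88]     @ r3 = objHalfWidth
rsb r3, r3, r2          @ r3 = r2 - r3
str r3, [r11, #-112]    @ relX = r3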

thanks,

#18673 - poslundc - Wed Mar 31, 2004 3:10 pm

There is no logical reason for the first version to run faster than the second version.

Are you certain about your findings? What was the approximate difference in speed? Did you try this on hardware? Emulators can be misleading in this stuff.

Dan.

#18676 - Lupin - Wed Mar 31, 2004 4:12 pm

The second version IS faster. Just count the cycles: you will (of course) end up with fewer cycles in the second version. You don't need any profiling to find out that 13 instructions are slower than 7 instructions (this is also because most operations on the ARM CPU only take 1-3 cycles; only mul takes a lot more :))

#18678 - poslundc - Wed Mar 31, 2004 5:37 pm

Lupin wrote:
The second version IS faster. Just count the cycles: you will (of course) end up with fewer cycles in the second version. You don't need any profiling to find out that 13 instructions are slower than 7 instructions (this is also because most operations on the ARM CPU only take 1-3 cycles; only mul takes a lot more :))


<shakes head>

MUL only takes between 2 and 5 cycles, depending on the operands.

Even SMLAL and UMLAL - the multiply-long-and-accumulate instructions - only take between 4 and 7 cycles.
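
The "depending on the operands" part can be sketched in C. This is only a rough model, assuming the documented ARM7TDMI timing for MUL (1S + mI, with early termination on the multiplier operand) and code running from zero-waitstate memory such as IWRAM. The function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Internal cycles (m) spent on the multiplier operand Rs.
   Early termination: m is 1 when bits 31..8 of Rs are all zeros or all
   ones, 2 for bits 31..16, 3 for bits 31..24, and 4 otherwise. */
int mul_internal_cycles(uint32_t rs)
{
    uint32_t top24 = rs >> 8;
    uint32_t top16 = rs >> 16;
    uint32_t top8  = rs >> 24;

    if (top24 == 0 || top24 == 0xFFFFFFu) return 1;
    if (top16 == 0 || top16 == 0xFFFFu)   return 2;
    if (top8  == 0 || top8  == 0xFFu)     return 3;
    return 4;
}

/* MUL costs 1S + mI: one sequential fetch cycle plus m internal cycles
   (assumes no waitstates on the instruction fetch). */
int mul_cycles(uint32_t rs)
{
    return 1 + mul_internal_cycles(rs);
}

/* Small self-check covering each range of operand size. */
void check_mul_cycles(void)
{
    assert(mul_cycles(8u) == 2);            /* fits in 8 bits  */
    assert(mul_cycles(0x1234u) == 3);       /* fits in 16 bits */
    assert(mul_cycles(0x123456u) == 4);     /* fits in 24 bits */
    assert(mul_cycles(0x12345678u) == 5);   /* full 32 bits    */
    assert(mul_cycles((uint32_t)-3) == 2);  /* small negative  */
}
```

So multiplying by a small value is close to the cost of a plain ALU instruction, and the worst case is 5 cycles.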

The waitstates for ROM and EWRAM can result in load, store and branch instructions taking considerably longer than a multiply instruction.

Dan.
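
To put rough numbers on the two listings from the first post, here is a small C sketch that just adds up cycles. It assumes the usual ARM7TDMI timings with zero waitstates for both code and data (e.g. everything in IWRAM): LDR = 1S + 1N + 1I = 3 cycles, STR = 2N = 2 cycles, and a data-processing instruction with an immediate shift = 1S = 1 cycle. The struct and function names are invented for this example.

```c
#include <assert.h>

/* Instruction mix of one code sequence. */
struct insn_mix {
    int ldr;  /* number of LDR instructions             */
    int str;  /* number of STR instructions             */
    int alu;  /* number of MOV/ADD/RSB etc. instructions */
};

/* Cycle estimate under the zero-waitstate assumption:
   LDR = 3 cycles, STR = 2 cycles, data processing = 1 cycle. */
int estimate_cycles(struct insn_mix m)
{
    return m.ldr * 3 + m.str * 2 + m.alu * 1;
}

/* The two listings: 13 instructions (6 LDR, 4 STR, 3 ALU)
   and 7 instructions (3 LDR, 1 STR, 3 ALU). */
void check_listing_estimates(void)
{
    struct insn_mix long_version  = { 6, 4, 3 };
    struct insn_mix short_version = { 3, 1, 3 };

    assert(estimate_cycles(long_version) == 29);
    assert(estimate_cycles(short_version) == 14);
}
```

Under these assumptions the 13-instruction version costs about 29 cycles and the 7-instruction version about 14, and almost all of the difference comes from the extra loads and stores. Waitstates in ROM or EWRAM would add to every fetch and data access, so the real figures there would be higher for both.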

#18706 - johsta - Thu Apr 01, 2004 12:49 pm

The code with 13 instructions is equivalent to this...

relX = tileX;
relX <<= 3;
relX += x;
relX -= objHalfWidth;

This is said to run faster in C++, and does run faster, but then why does it generate more instructions than

relX = (tileX<<3)+x-objHalfWidth;


:) ?


Last edited by johsta on Thu Apr 01, 2004 12:58 pm; edited 1 time in total

#18708 - poslundc - Thu Apr 01, 2004 12:52 pm

It would be a lot more helpful if you answered my questions about your findings first.

Dan.

#18709 - johsta - Thu Apr 01, 2004 12:57 pm

relX = (tileX<<3)+x-objHalfWidth;

This is supposed to generate temporary variables... and the other version only works with the reference.

But it must be slower to write the data back all the time with str... compared to only using the registers.

#18710 - johsta - Thu Apr 01, 2004 1:02 pm

This was done on the VBA emulator, and the speed difference is about 40% when the code runs in IWRAM and about 20% in ROM...

#18715 - torne - Thu Apr 01, 2004 2:31 pm

johsta wrote:
relX = (tileX<<3)+x-objHalfWidth;

This is supposed to generate temporary variables... and the other version only works with the reference.

But it must be slower to write the data back all the time with str... compared to only using the registers.

Who says this is 'supposed' to generate temporary variables? As the intermediate values are not accessible by the program, it's the compiler's right to optimise them out of existence, which is exactly what it's done. What optimisation settings are you using on the compiler? The former code is clearly being generated with poor optimisation, as otherwise the redundant copies would be removed. A compiler with good optimisation settings enabled will generate identical code (under alpha conversion) for your two C code examples.

VBA is not completely cycle-accurate and cannot reasonably be used for testing the performance of code. Test on hardware. The shorter code is faster.
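
As a rough illustration of the point about the two C forms, here they are as two small functions. The function names are hypothetical and the variables are assumed to be plain ints. Both compute the same value, so an optimising compiler is free to emit the same instructions for each. The frame-pointer-relative loads and stores in the listings above are what unoptimised output typically looks like.

```c
#include <assert.h>

/* Single-expression form. */
int rel_x_single(int tileX, int x, int objHalfWidth)
{
    return (tileX << 3) + x - objHalfWidth;
}

/* Step-by-step form, updating relX in place. */
int rel_x_stepwise(int tileX, int x, int objHalfWidth)
{
    int relX = tileX;
    relX <<= 3;
    relX += x;
    relX -= objHalfWidth;
    return relX;
}

/* Check that the two forms agree over a small range of inputs. */
void check_forms_agree(void)
{
    for (int tileX = 0; tileX < 32; ++tileX)
        for (int x = 0; x < 8; ++x)
            assert(rel_x_single(tileX, x, 4) == rel_x_stepwise(tileX, x, 4));
}
```

Compiling a file like this with optimisation enabled and asking for assembly output (with gcc, something like -O2 -S) is one way to compare what the compiler actually generates for each form.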