#51526 - rodolforg - Sun Aug 21, 2005 6:21 am
Hi!
Could anyone help me?
I have a vector of 4 unsigned bytes and I must sum it with another. The point is the sum must be upperbounded to 0xFF. It means it must not wrap.
Example:
Code: |
u8* a = {0x00,0xF0,0x00,0xFF};
u8* b = {0x00,0x10,0x00,0x01}; |
a"+"b (u8* sum) should be
Code: |
sum={0x00,0xFF,0x00,0xFF}
|
How can I do that? Should I use an auxiliar var of 16 bits and check if it's bigger than 0xFF? Or is switching to asm to check overflow flag better?
#51528 - DekuTree64 - Sun Aug 21, 2005 8:20 am
GBA/DS's registers are 32-bit, so you'll just have to do it by hand:
Code: |
for(int i = 0; i < 4; i++)
{
int tempSum = a[i] + b[i];
if (tempSum > 0xff)
tempSum = 0xff;
sum[i] = tempSum;
} |
Boring, but simple enough that a compiler should be able to generate the same code a human would for it anyway.
If it's going to be called lots and lots of times, unroll the loop and compile it as ARM. That way there won't be any branching at all.
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku
#51530 - rodolforg - Sun Aug 21, 2005 10:05 am
Thanks for the second tip, it'll be called a lot of time!
How can I force it to ARM? I must implement this function in a separated file to do it? Sorry this dumb question, I searched in the forum, but I'm not so good on it...
#51673 - Miked0801 - Mon Aug 22, 2005 10:46 pm
IF this is DS, there are saturation functions to handle this.
#51679 - gladius - Mon Aug 22, 2005 11:04 pm
To compile as arm, you simply need to pass a different flag into the compiler (-marm instead of -mthumb). You can give that file a special extension, such as .arm.c and then make a special target for it.
Or you could do it with some simple assembly, by including this snippet in a .s file. The C declaration for the function would be:
Code: |
void SumByteArray(u8 *a, u8 *b, u8 *sum);
|
Code: |
.MACRO sumByte
ldrb r3, [r0], #1
ldrb r4, [r1], #1
add r3, r3, r4
cmp r3, #0xff
movgt r3, #0xff
strb r3, [r2], #1
.ENDM
.GLOBAL SumByteArray
SumByteArray:
stmfd sp!, {r4}
sumByte
sumByte
sumByte
sumByte
ldmfd sp!, {r4}
bx lr
|
This is fully unrolled, you could make it loop and save some space if you needed to. Also, you'll want to have the code run from IWRAM if it is speed critical (any ARM code you have should be run from IWRAM).
#51696 - DekuTree64 - Tue Aug 23, 2005 1:44 am
I'll take that as a challenge, gladius ;)
Code: |
.GLOBAL SumByteArray
SumByteArray:
stmfd sp!, {r4-r5}
mov r5, #0xff
orr r5, r5, #0xff0000
ldr r3, [r0], #4
ldr r4, [r1], #4
and r0, r5, r3
and r1, r5, r4
add r0, r0, r1
tst r0, #0xff00
orrne r0, r0, #0xff
tst r0, #0xff000000
orrne r0, r0, #0xff0000
and r0, r0, r5
and r3, r5, r3, lsr #8
and r4, r5, r4, lsr #8
add r3, r3, r4
tst r3, #0xff00
orrne r3, r3, #0xff
tst r3, #0xff000000
orrne r3, r3, #0xff0000
and r3, r3, r5
orr r0, r0, r3, lsl #8
str r0, [r2]
ldmfd sp!, {r4-r5}
bx lr |
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku
#51716 - gladius - Tue Aug 23, 2005 4:04 am
:)
25 instructions for yours versus 27 for mine, and you only have two 32 bit reads and one 32 bit store versus 8 8 bit reads and 4 8 bit writes for my version.
So, all in all, a nice improvement. Just one thing to be careful of, the arrays have to be aligned now, and u8 arrays are not aligned by default.
#51732 - rodolforg - Tue Aug 23, 2005 5:31 am
Amazing both of you!
So, u8 arrays are not aligned? it explains another issues here ;)
And if I use an u32 var to do it? So, there's no problem at all, right?
Thank u both
#51749 - FluBBa - Tue Aug 23, 2005 9:04 am
He he he, I'll beat that Deku.. :P
Code: |
.GLOBAL SumByteArray
SumByteArray:
stmfd sp!, {r4-r5}
mov r5, #0xff00
orr r5, r5, #0xff000000
ldr r3, [r0], #4
ldr r4, [r1], #4
and r0, r5, r3,lsl#8
and r1, r5, r4,lsl#8
adds r0, r0, r1
orrcs r0, r0, #0xff000000
tst r0, #0xff0000
orrne r0, r0, #0xff00
and r0, r0, r5
and r3, r5, r3
and r4, r5, r4
adds r3, r3, r4
orrcs r0, r0, #0xff000000
tst r0, #0xff0000
orrne r0, r0, #0xff00
and r3, r3, r5
orr r0, r3, r0, lsr #8
str r0, [r2]
ldmfd sp!, {r4-r5}
bx lr |
_________________
I probably suck, my not is a programmer.
#51816 - Miked0801 - Tue Aug 23, 2005 7:17 pm
Use r12 instead of r5 to save a reg push/pop.
Return the added bytes instead of storing in an array to eliminate the rest of the push pops.
Code: |
ldr r2, [r0], #4
ldr r3, [r1], #4
mov r12, #0xff00
orr r12, r12, #0xff000000
and r0, r12, r2,lsl#8
and r1, r12, r3,lsl#8
adds r0, r0, r1
orrcs r0, r0, #0xff000000
tst r0, #0xff0000
orrne r0, r0, #0xff00
and r0, r0, r12
and r2, r12, r2
and r3, r12, r3
adds r2, r2, r3
orrcs r0, r0, #0xff000000
tst r0, #0xff0000
orrne r0, r0, #0xff00
and r2, r2, r12
orr r0, r2, r0, lsr #8
bx lr
|
I may have munged it a bit, but you get the idea :)
#52064 - strager - Fri Aug 26, 2005 1:50 am
DekuTree64 wrote: |
I'll take that as a challenge... |
I'm bored, so why not?
( 12 instructions )
Code: |
@
@ void Summit(const u8 *sum1, const u8 *sum2, u8 *dest, u32 size);
@ Adds to arrays with an overflow of FF
@
@ sum1, sum2 - Arrays to add
@ dest - Where to store the result
@ size - Number of elements to add
@
@ Random notes:
@ * The main loop (without counter math) has a steady execution speed,
@ regardless of overflow (I hope)
@
.ARM
.global Summit
Summit:
.start:
stmfd sp!, {r4, r5} @ Push registers
.loop:
ldrb r4, [r0], #1 @ I wish this and the next ins. could be combined (maybe they can...)
mov r4, r4, lsl #24 @ Shift it to the left-most byte
ldrb r5, [r1], #1 @ Now for sum2
adds r4, r4, r5, lsl #24 @ Add! (if this looks confusing, see below)
movcs r4, #0xFF @ If > FF, set to FF
movcc r4, r4, lsr #24 @
strb r4, [r2], #1 @ Store the result
subs r3, r3, #1 @ Decrement counter
bne .loop @ Non-zero loops
.end:
ldmfd sp!, {r4, r5} @ Pop registers
bx lr @ Return
@ EOF
|
Works on variable sizes and uses 'hardware 8-bit simulation' :)
Idea: If the two sum tables were interleaved, I could write up a piece of code that has a few less instructions (and probably faster code).
#52066 - sajiimori - Fri Aug 26, 2005 2:26 am
:) The goal is not to decrease the number of instructions in the function, but to decrease the number of instructions executed (or more precisely, the number of CPU cycles spent). What you've done is optimize for code size while sacrificing CPU, though that's not necessarily wrong in itself.
#52088 - strager - Fri Aug 26, 2005 12:35 pm
sajiimori wrote: |
:) The goal is not to decrease the number of instructions in the function, but to decrease the number of instructions executed (or more precisely, the number of CPU cycles spent)... |
Oh, right...
(D'oh!)
#52099 - ScottLininger - Fri Aug 26, 2005 5:58 pm
Funny little stream, guys. ;)
Back to the original question... don't bother optimizing performance until:
1. You are sure that your logic and structure is working in C.
2. You see a performance problem
Much better to get your game working first, even if it's slow. I've heard of many failed projects because the programmer gets sidetracked on some unnecessary optimization challenge.
:)
Scott
#52373 - FluBBa - Tue Aug 30, 2005 9:24 am
Strager if you look at Gladius code it only uses 3 instructions for the calculation while yours uses 4 instruction, but if you aim for size you can probably beat it by using C compiled to Thumb code anyway.
_________________
I probably suck, my not is a programmer.