gbadev.org forum archive

Hi!
Could anyone help me?
I have a vector of 4 unsigned bytes and I must sum it with another. The point is the sum must be upperbounded to 0xFF. It means it must not wrap.
Example:

Code:

u8* a = {0x00,0xF0,0x00,0xFF};
u8* b = {0x00,0x10,0x00,0x01};

a"+"b (u8* sum) should be

Code:

sum={0x00,0xFF,0x00,0xFF}

How can I do that? Should I use an auxiliar var of 16 bits and check if it's bigger than 0xFF? Or is switching to asm to check overflow flag better?

GBA/DS's registers are 32-bit, so you'll just have to do it by hand:

Code:

for(int i = 0; i < 4; i++)
{
int tempSum = a[i] + b[i];
if (tempSum > 0xff)
tempSum = 0xff;
sum[i] = tempSum;
}

Boring, but simple enough that a compiler should be able to generate the same code a human would for it anyway.

If it's going to be called lots and lots of times, unroll the loop and compile it as ARM. That way there won't be any branching at all.
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku

Thanks for the second tip, it'll be called a lot of time!
How can I force it to ARM? I must implement this function in a separated file to do it? Sorry this dumb question, I searched in the forum, but I'm not so good on it...

IF this is DS, there are saturation functions to handle this.

To compile as arm, you simply need to pass a different flag into the compiler (-marm instead of -mthumb). You can give that file a special extension, such as .arm.c and then make a special target for it.

Or you could do it with some simple assembly, by including this snippet in a .s file. The C declaration for the function would be:

Code:

void SumByteArray(u8 *a, u8 *b, u8 *sum);

Code:

.MACRO sumByte
ldrb r3, [r0], #1
ldrb r4, [r1], #1
add r3, r3, r4
cmp r3, #0xff
movgt r3, #0xff
strb r3, [r2], #1
.ENDM

.GLOBAL SumByteArray
SumByteArray:
stmfd sp!, {r4}
sumByte
sumByte
sumByte
sumByte
ldmfd sp!, {r4}
bx lr

This is fully unrolled, you could make it loop and save some space if you needed to. Also, you'll want to have the code run from IWRAM if it is speed critical (any ARM code you have should be run from IWRAM).

I'll take that as a challenge, gladius ;)

Code:

.GLOBAL SumByteArray
SumByteArray:
stmfd sp!, {r4-r5}
mov r5, #0xff
orr r5, r5, #0xff0000
ldr r3, [r0], #4
ldr r4, [r1], #4

and r0, r5, r3
and r1, r5, r4
add r0, r0, r1
tst r0, #0xff00
orrne r0, r0, #0xff
tst r0, #0xff000000
orrne r0, r0, #0xff0000
and r0, r0, r5

and r3, r5, r3, lsr #8
and r4, r5, r4, lsr #8
add r3, r3, r4
tst r3, #0xff00
orrne r3, r3, #0xff
tst r3, #0xff000000
orrne r3, r3, #0xff0000
and r3, r3, r5

orr r0, r0, r3, lsl #8
str r0, [r2]
ldmfd sp!, {r4-r5}
bx lr

_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku

:)

25 instructions for yours versus 27 for mine, and you only have two 32 bit reads and one 32 bit store versus 8 8 bit reads and 4 8 bit writes for my version.

So, all in all, a nice improvement. Just one thing to be careful of, the arrays have to be aligned now, and u8 arrays are not aligned by default.

Amazing both of you!
So, u8 arrays are not aligned? it explains another issues here ;)
And if I use an u32 var to do it? So, there's no problem at all, right?
Thank u both

He he he, I'll beat that Deku.. :P

Code:

.GLOBAL SumByteArray
SumByteArray:
stmfd sp!, {r4-r5}
mov r5, #0xff00
orr r5, r5, #0xff000000
ldr r3, [r0], #4
ldr r4, [r1], #4

and r0, r5, r3,lsl#8
and r1, r5, r4,lsl#8
adds r0, r0, r1
orrcs r0, r0, #0xff000000
tst r0, #0xff0000
orrne r0, r0, #0xff00
and r0, r0, r5

and r3, r5, r3
and r4, r5, r4
adds r3, r3, r4
orrcs r0, r0, #0xff000000
tst r0, #0xff0000
orrne r0, r0, #0xff00
and r3, r3, r5

orr r0, r3, r0, lsr #8
str r0, [r2]
ldmfd sp!, {r4-r5}
bx lr

_________________
I probably suck, my not is a programmer.

Use r12 instead of r5 to save a reg push/pop.

Return the added bytes instead of storing in an array to eliminate the rest of the push pops.

Code:

ldr r2, [r0], #4
ldr r3, [r1], #4

mov r12, #0xff00
orr r12, r12, #0xff000000

and r0, r12, r2,lsl#8
and r1, r12, r3,lsl#8
adds r0, r0, r1
orrcs r0, r0, #0xff000000
tst r0, #0xff0000
orrne r0, r0, #0xff00
and r0, r0, r12

and r2, r12, r2
and r3, r12, r3
adds r2, r2, r3
orrcs r0, r0, #0xff000000
tst r0, #0xff0000
orrne r0, r0, #0xff00
and r2, r2, r12

orr r0, r2, r0, lsr #8

bx lr

I may have munged it a bit, but you get the idea :)

DekuTree64 wrote:

I'll take that as a challenge...

I'm bored, so why not?

( 12 instructions )

Code:

@
@ void Summit(const u8 *sum1, const u8 *sum2, u8 *dest, u32 size);
@ Adds to arrays with an overflow of FF
@
@ sum1, sum2 - Arrays to add
@ dest - Where to store the result
@ size - Number of elements to add
@
@ Random notes:
@ * The main loop (without counter math) has a steady execution speed,
@ regardless of overflow (I hope)
@

.ARM
.global Summit

Summit:

.start:
stmfd sp!, {r4, r5} @ Push registers

.loop:
ldrb r4, [r0], #1 @ I wish this and the next ins. could be combined (maybe they can...)
mov r4, r4, lsl #24 @ Shift it to the left-most byte

ldrb r5, [r1], #1 @ Now for sum2

adds r4, r4, r5, lsl #24 @ Add! (if this looks confusing, see below)
movcs r4, #0xFF @ If > FF, set to FF
movcc r4, r4, lsr #24 @

strb r4, [r2], #1 @ Store the result

subs r3, r3, #1 @ Decrement counter
bne .loop @ Non-zero loops

.end:
ldmfd sp!, {r4, r5} @ Pop registers

bx lr @ Return

@ EOF

Works on variable sizes and uses 'hardware 8-bit simulation' :)

Idea: If the two sum tables were interleaved, I could write up a piece of code that has a few less instructions (and probably faster code).

:) The goal is not to decrease the number of instructions in the function, but to decrease the number of instructions executed (or more precisely, the number of CPU cycles spent). What you've done is optimize for code size while sacrificing CPU, though that's not necessarily wrong in itself.

sajiimori wrote:

:) The goal is not to decrease the number of instructions in the function, but to decrease the number of instructions executed (or more precisely, the number of CPU cycles spent)...

Oh, right...

(D'oh!)

Funny little stream, guys. ;)

Back to the original question... don't bother optimizing performance until:

1. You are sure that your logic and structure is working in C.

2. You see a performance problem

Much better to get your game working first, even if it's slow. I've heard of many failed projects because the programmer gets sidetracked on some unnecessary optimization challenge.

:)

Scott

Strager if you look at Gladius code it only uses 3 instructions for the calculation while yours uses 4 instruction, but if you aim for size you can probably beat it by using C compiled to Thumb code anyway.
_________________
I probably suck, my not is a programmer.

gbadev.org forum archive

C/C++ > upperbounded sum

#51526 - rodolforg - Sun Aug 21, 2005 6:21 am

#51528 - DekuTree64 - Sun Aug 21, 2005 8:20 am

#51530 - rodolforg - Sun Aug 21, 2005 10:05 am

#51673 - Miked0801 - Mon Aug 22, 2005 10:46 pm

#51679 - gladius - Mon Aug 22, 2005 11:04 pm

#51696 - DekuTree64 - Tue Aug 23, 2005 1:44 am

#51716 - gladius - Tue Aug 23, 2005 4:04 am

#51732 - rodolforg - Tue Aug 23, 2005 5:31 am

#51749 - FluBBa - Tue Aug 23, 2005 9:04 am

#51816 - Miked0801 - Tue Aug 23, 2005 7:17 pm

#52064 - strager - Fri Aug 26, 2005 1:50 am

#52066 - sajiimori - Fri Aug 26, 2005 2:26 am

#52088 - strager - Fri Aug 26, 2005 12:35 pm

#52099 - ScottLininger - Fri Aug 26, 2005 5:58 pm

#52373 - FluBBa - Tue Aug 30, 2005 9:24 am