gbadev.org forum archive

I'm looking for tips on speeding up pixel manipulation on the DS, including common math functions used for 2D effects. I have spent the past few days working on some simple pixel color reading and writing routines and trying to use them to draw effects such as blurring, water and other simple effects... and I'm finding my functions run quite slowly. I am using mode 5 graphics and doing buffer flipping. If it helps, here are my 16-bit color pixel read and write routines:

Code:

void set_Pixel(u16* buffer, u8 x, u8 y, u8 R, u8 G, u8 B)
{
    // Bit 15 is the alpha bit of a 16-bit bitmap pixel.
    buffer[y * 256 + x] = RGB15(R, G, B) | BIT(15);
}

u8 get_R(u16* buffer, u8 x, u8 y)
{
    // Red lives in bits 0-4.
    return buffer[y * 256 + x] & 0x1F;
}

u8 get_G(u16* buffer, u8 x, u8 y)
{
    // Green lives in bits 5-9: shift right, then mask.
    return (buffer[y * 256 + x] >> 5) & 0x1F;
}

u8 get_B(u16* buffer, u8 x, u8 y)
{
    // Blue lives in bits 10-14: shift right, then mask.
    return (buffer[y * 256 + x] >> 10) & 0x1F;
}

Additionally, I am using this code to write the backbuffer to the frontbuffer, and I'm not sure if it's the right thing to use...

Code:

dmaCopy(backBuffer,frontBuffer,256*256*2);

Verify that optimizations are turned on, and look at the asm output to see if the compiler is inlining the functions or not.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

Minimize load operations. Read only once from the buffer, and from that value extract the R, G and B (instead of reading thrice). I.e.:

Code:

col = buffer[y * 256 + x];
R = getRed(col);
G = getGreen(col);
B = getBlue(col);

and so on.
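
A minimal sketch of what those helpers could look like, assuming the 15-bit layout that RGB15 produces (red in bits 0-4, green in bits 5-9, blue in bits 10-14). The bodies of getRed/getGreen/getBlue and the channelSum function are only my illustration, and the u16 typedef is a stand-in for the libnds one:

```c
#include <stdint.h>

typedef uint16_t u16; /* stand-in for the libnds typedef */

/* Hypothetical helpers: each takes an already-loaded pixel value,
   so the buffer is read only once per pixel. */
static inline int getRed(int col)   { return col & 0x1F; }
static inline int getGreen(int col) { return (col >> 5) & 0x1F; }
static inline int getBlue(int col)  { return (col >> 10) & 0x1F; }

/* Sums the three channels of one pixel with a single buffer read. */
static inline int channelSum(const u16 *buffer, int x, int y)
{
    int col = buffer[y * 256 + x];   /* the only load */
    return getRed(col) + getGreen(col) + getBlue(col);
}
```

The point is that the three extractions work on a value that already sits in a register, so only one memory access is needed per pixel.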

Prioritize linear accesses (i.e., read a pixel, then the next, then the next, and so on), instead of random readings. And try using pointers instead of array accesses:

Code:

u16 *buf = &buffer[y * 256 + x];
col1 = *buf; buf++;
col2 = *buf; buf++;
col3 = *buf; buf++;
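
As a fuller sketch of the same pattern, here is a loop that walks a run of consecutive pixels with a pointer. The invertRun name and the choice of effect (inverting the colour) are made up for illustration, and the u16 typedef again stands in for the libnds one:

```c
#include <stdint.h>

typedef uint16_t u16; /* stand-in for the libnds typedef */

/* Hypothetical example: invert the colour of `count` consecutive pixels,
   walking the buffer with a pointer instead of recomputing y * 256 + x. */
static void invertRun(u16 *buffer, int x, int y, int count)
{
    u16 *p = &buffer[y * 256 + x];
    u16 *end = p + count;

    while (p < end) {
        int col = *p;                          /* one read per pixel */
        *p++ = (u16)((col ^ 0x7FFF) | 0x8000); /* flip R, G, B; keep bit 15 set */
    }
}
```

The index is computed once before the loop; inside it there is just one load, one store and a pointer increment per pixel.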

And try to make all variables used for handling data ints instead; ints work faster. I.e.:

Code:

void set_Pixel(u16* buffer,int x, int y, int R,int G,int B);

int get_B(u16* buffer, int x, int y) {
int b;
...

I'd also mention inlining the code and converting multiplications to shifts... but I guess that's the compiler's work. Go for maintainability.

And, well, that's my noob advice :)

If you use mode5, you should not copy the backbuffer to the frontbuffer at all.

Just tell the hardware that the backbuffer is now the frontbuffer and vice versa. IIRC there is an example of how to do that in the libnds examples.
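
The bookkeeping side of that can be sketched as a plain pointer swap. The FrameBuffers struct and flipBuffers function are hypothetical names, and the hardware side (pointing the background at the new front buffer) is deliberately left out because it depends on your setup:

```c
#include <stdint.h>

typedef uint16_t u16; /* stand-in for the libnds typedef */

/* Hypothetical bookkeeping for two equally sized bitmap buffers. */
typedef struct {
    u16 *front;  /* buffer currently being displayed */
    u16 *back;   /* buffer currently being drawn into */
} FrameBuffers;

/* Exchange the roles of the two buffers without copying any pixels.
   On real hardware you would also point the background at the new
   front buffer here; that register write is not shown. */
static void flipBuffers(FrameBuffers *fb)
{
    u16 *tmp = fb->front;
    fb->front = fb->back;
    fb->back = tmp;
}
```

Swapping two pointers costs a handful of instructions, whereas copying 256*256*2 bytes every frame costs a lot more.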
_________________
GBAMP Multiboot

Inline functions (or macros) FTW.

As normal functions, set_Pixel and get_R/G/B act as completely separate entities, even though their functionality has many common elements. As such, the compiler can't do the kind of optimizations that kevinc was talking about (nor a few others). The fact that the call overhead is a good deal larger than the actual content of the functions probably doesn't do you much good either. As a mild estimate, inlining these functions will speed up set_Pixel calls by a factor of 5, and get_R/G/B combos by 8 or more.

The compiler would inline these functions under -O3 and if the functions were in the same file as the functions that use them, but it's probably better to inline them manually:

Code:

// 'static inline' makes it inline; then put it in a header file.
// Also, using 32-bit types instead of u8 makes it faster.
static inline void set_Pixel(u16 *buffer, u32 x, u32 y, u32 R, u32 G, u32 B)
{
    buffer[y * 256 + x] = RGB15(R, G, B) | BIT(15);
}

I'd almost suggest doing the whole thing manually anyway; it's not exactly much work. Depending on the effect, you might be able to work on 2 pixels at once too.
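
To give an idea of the two-pixels-at-once trick, here is a rough sketch that halves the brightness of two pixels packed in one 32-bit word. The function names are made up, the effect is just an example, and it assumes an even pixel count and a 4-byte aligned buffer:

```c
#include <stdint.h>

typedef uint32_t u32; /* stand-in for the libnds typedef */

/* Halve the brightness of two 16-bit pixels packed in one 32-bit word.
   After the shift, the mask 0x3DEF drops the bit that leaked into each
   5-bit channel from its neighbour; bit 15 of each pixel is then set again. */
static inline u32 darkenPair(u32 twoPixels)
{
    return ((twoPixels >> 1) & 0x3DEF3DEFu) | 0x80008000u;
}

/* Darken `pixelCount` pixels (assumed even, buffer assumed 4-byte aligned). */
static void darkenRun(u32 *words, int pixelCount)
{
    for (int i = 0; i < pixelCount / 2; i++)
        words[i] = darkenPair(words[i]);
}
```

All six channels are processed with one shift, one AND and one OR, and the number of loads and stores is halved as well.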

Why is int faster than u8? Isn't u8 unsigned integer 0-255? Also... why does simply adding static inline to the function speed it up??

TheRain wrote:

Why is int faster than u8? Isn't u8 unsigned integer 0-255? Also... why does simply adding static inline to the function speed it up??

Because the ARM CPU has 32-bit registers.

yes, but... u8 is 8 bits, right? so it shouldn't take MORE register access than using a 32 bit value, right? or is the point simply that it's not necessary to call it 8 bits when it's in a 32 bit register...

I feel like I need to study ARM architecture now...

All of the registers are 32-bit, so in order to pass the value as if it were only 8 bits, the compiler generates instructions to mask off the upper bits of the value. That's why it's slower.
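
A small sketch of where that extra work comes from; add_u8 and add_int are made-up names and the u8 typedef stands in for the libnds one. The two functions must return different results for the same inputs, and the only way to get the 8-bit behaviour out of a 32-bit register is to discard the upper bits after the add:

```c
#include <stdint.h>

typedef uint8_t u8; /* stand-in for the libnds typedef */

/* The result must wrap at 8 bits, so after the 32-bit add the compiler
   has to emit extra work to throw away bits 8 and up. */
u8 add_u8(u8 a, u8 b)
{
    return (u8)(a + b);
}

/* The result is the register-sized sum, so the add alone is enough. */
int add_int(int a, int b)
{
    return a + b;
}
```

With inputs 200 and 100 the u8 version has to return 44 (300 wrapped to 8 bits), while the int version simply returns 300.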

On the GBA, doing 32-bit accesses to VRAM/EWRAM was a bit slower since they're on a 16-bit bus. On DS, I think the cache always loads and stores 32-byte chunks anyway, so it makes no difference. And on either system, during calculations when a variable will be kept in a register, 32-bit is much better.
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku

TheRain wrote:

Why is int faster than u8?

Because ints are the native datatype of ARM processors. In fact, ints are the only datatype of ARM processors; everything else (i.e., bytes and halfwords) has to be forced to its respective size with extra instructions.

TheRain wrote:

Isn't u8 unsigned integer 0-255?

Yeah, but that doesn't matter. What matters is the CPU register size.

TheRain wrote:

Also... why does simply adding static inline to the function speed it up??

'static inline' is the way you tell GCC to insert the routines directly into the callers at compile-time. They're basically macros with function syntax, but without the safety issues that macros are subject to.
As to why this would matter, consider this example:

Code:

// 16bit color via function
u16 rgbFun(u8 r, u8 g, u8 b)
{ return r | (g<<5) | (b<<10) | BIT(15); }

// 16bit color via inline function
static inline u16 rgbInl(u8 r, u8 g, u8 b)
{ return r | (g<<5) | (b<<10) | BIT(15); }

// Usage:
clr1= rgbFun(31,0,0);
clr2= rgbInl(31,0,0);

Both clr1 and clr2 would be 0x801F (full red, with bit 15 set), but one call would be much faster than the other. As I said earlier, functions are standalone items and GCC has to compile them as generically as possible because it can't guess what the input will be. Also, when calling a function it doesn't know what that function will do, so it can't make assumptions on that front either. The upshot of this is that there are hardly any optimizations GCC can do across the call. In this case, the whole procedure for the call to rgbFun() would be:

Put the function arguments (31,0,0) into registers.
Possibly mask the inputs down to u8 instead of the usual u32.
Call the function (slow compared to normal instructions).
Do the function's actual work, piece by piece.
Set up the return value.
Jump back to the caller (slow).
(Both the caller and the called function may have to perform (slow) stack work.)

In contrast, the inline function is integrated into the caller, at which time it becomes possible to do optimizations, and in this case the call to rgbInl() amounts to

clr2= 0x801F;

which is a single instruction.

TheRain wrote:

yes, but... u8 is 8 bits, right? so it shouldn't take MORE register access than using a 32 bit value, right? or is the point simply that it's not necessary to call it 8 bits when it's in a 32 bit register...

I feel like I need to study ARM architecture now...

These things are not specific to the ARM architecture; they are general programming concepts, usually covered in (game) programming books or sites. Google can help you with basic optimization tips.

Cearn wrote:

TheRain wrote:

Why is int faster than u8?

.

You are a brilliant person! Thanks, this makes much more sense now, and hopefully more will make sense in the future because of it.

Just another quick question... how can I take a look at the assembly code? I'm using makefiles from the libnds example code to compile and I don't see any assembler files being generated... are they being deleted?

Also, if I'm doing divisions... will it be much more optimal to use div32 than to use the / operator??

Add -save-temps to your CFLAGS; this will save the assembly files in the build directory.

div32 will be faster than the / operator, but be careful: these functions are neither atomic nor reentrant. At one time libnds and libgba used the BIOS functions to override the / and % operators, but this proved problematic with unsigned numbers.
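
Where the divisor is something you get to choose, such as the opacity range in a blend, another option is to avoid the divide altogether. This is a sketch of that different technique, not of div32 itself; blendChannel and its 0..32 alpha range are made up for illustration:

```c
/* Hypothetical opacity blend for one 5-bit channel.
   alpha runs from 0 (all dst) to 32 (all src); because the weights sum
   to 32, the normalising divide becomes a right shift by 5. */
static inline int blendChannel(int src, int dst, int alpha)
{
    return (src * alpha + dst * (32 - alpha)) >> 5;
}
```

Picking a power-of-two range for the weights means neither the / operator nor the hardware divider is needed in the inner loop.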
_________________
devkitPro - professional toolchains at amateur prices
devkitPro IRC support
Personal Blog

TheRain wrote:

Additionally, I am using this code to write the backbuffer to the frontbuffer, and I'm not sure if it's the right thing to use...

Code:

dmaCopy(backBuffer,frontBuffer,256*256*2);

You'll probably get better performance with an ldmia/stmia loop:

Code:

//in a C file somewhere
extern void FastCopy(void* Src, void* Dest, u32 NumBytes, u32 Size);

@ in an ASM (.s) file
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ FastCopy
@ Copies memory really fast.
@ Inputs:
@ -r0: Source
@ -r1: Destination
@ -r2: # of bytes
@ -r3: Copy size (0=byte, 1=short (2 bytes), 2=int (4 bytes))
@ Notes:
@ -When there are fewer than 44 bytes remaining, they're copied one unit at a time.
@ You can copy fewer than 44 bytes, but it's not going to be very fast.
@ -The size parameter is provided to get around memory access limitations of
@ the Nintendo DS. Limitations that I know of are:
@ -Cannot write 8 bits to VRAM
@ -Cannot write 8 or 32 bits to GBA ROM
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

.global FastCopy
.type FastCopy,function
FastCopy:
stmfd sp!, {r3-r12, lr}
sub sp, sp, #4
str r3, [sp] @We need this later, but for now, r3 can assist in copying.

.loop:
cmp r2, #44
blt .part2 @by checking here, we can copy less than 44 bytes in one call (but it's slow)
ldmia r0!, {r3-r12, lr}
stmia r1!, {r3-r12, lr}
sub r2, r2, #44 @copy 44 bytes at a time (!); the cmp at the top of the loop sets the flags
b .loop

@copy remaining bytes one byte/short/int at a time, so we get them all.
.part2:
@add r2, r2, #4
ldr r3, [sp] @Find unit size
add sp, sp, #4
cmp r3, #0
beq .loop2_byte @0=byte
subs r3, r3, #1
beq .loop2_short @1=short
@otherwise it must be 2=int (we'll just pretend anything else is 2)

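@r2 still counts bytes here and is assumed to be a multiple of the unit size.
.loop2_int:
subs r2, r2, #4 @one int = 4 bytes
bmi .end
ldr r3, [r0], #4 @post-indexed: load first, then advance the pointer
str r3, [r1], #4
b .loop2_int

.loop2_short:
subs r2, r2, #2 @one short = 2 bytes
bmi .end
ldrh r3, [r0], #2
strh r3, [r1], #2
b .loop2_short

.loop2_byte:
subs r2, r2, #1
bmi .end
ldrb r3, [r0], #1
strb r3, [r1], #1
b .loop2_byte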

.end:
@add r2, r2, #1
ldmfd sp!,{r3-r12,lr}
bx lr

.pool

_________________
I'm a PSP hacker now, but I still <3 DS.

Thanks everyone... all of your advice has been a big help! I have an opacity algorithm that's running decently now and hopefully I can improve it even more soon.

DS development > Advise on fast pixel manipulation

#91864 - TheRain - Mon Jul 10, 2006 8:22 am

#91866 - Dwedit - Mon Jul 10, 2006 8:48 am

#91874 - kevinc - Mon Jul 10, 2006 9:43 am

#91885 - Mighty Max - Mon Jul 10, 2006 12:17 pm

#91890 - Cearn - Mon Jul 10, 2006 1:08 pm

#91947 - TheRain - Mon Jul 10, 2006 6:48 pm

#91953 - kusma - Mon Jul 10, 2006 7:16 pm

#91957 - TheRain - Mon Jul 10, 2006 7:34 pm

#91960 - DekuTree64 - Mon Jul 10, 2006 8:02 pm

#91962 - Cearn - Mon Jul 10, 2006 8:03 pm

#91963 - TheRain - Mon Jul 10, 2006 8:35 pm

#91970 - TheRain - Mon Jul 10, 2006 9:41 pm

#91988 - wintermute - Mon Jul 10, 2006 11:40 pm

#92023 - HyperHacker - Tue Jul 11, 2006 4:45 am

#92048 - TheRain - Tue Jul 11, 2006 9:34 am