gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback machine copies. A new forum can be found here.

ASM > Memory tricks? GBA fill rate

#83822 - devmelon - Thu May 18, 2006 8:40 pm

Hi everyone.

I'm trying to fill the back buffer as fast as possible, but my routine is ridiculously slow. It looks like 5-6 fps or so. Are there faster memory accessing ways or some other trick to make it faster? How the hell can people make cool looking fullscreen effects at reasonable framerates when I fail? :) I'm puzzled.

Are there tricks for making this routine go faster?

r1 = pixel offset from video buffer start
r2 = pointer to vram back buffer
r3 = page size in bytes
r4 = pixel to be written

Code:
...
...
...
.L1:
   strh r4, [r2, r1]      @ Write pixel to video buffer
   add r1, r1, #0x2       @ Point to next pixel
   add r4, r4, #0x1       @ Change pixel color with something
   cmp r1, r3             @ If we have not reached the
   bne .L1                @    end of buffer, continue
...
...
...

#83830 - kusma - Thu May 18, 2006 9:02 pm

yes.
use pointer-walking rather than indexing. that way you can use the post-increment addressing mode. also unroll the loop like there was no tomorrow if the loop is time-critical. (usually around 8-16 unrolled iterations are enough). that way you don't get the penalty for the compare and branch for every pixel.

also, is the code running from iwram in arm-mode? if it's time-critical, then it really should. 5-6 fps sounds too slow even with the above optimizations...
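
The two suggestions above (walk a pointer instead of indexing, and unroll the loop) can be sketched in C. This is only an illustration: `fill16_unrolled` is a hypothetical helper, not code from the thread, and it fills with a constant color rather than the changing color in the original loop. A compiler targeting ARM can typically turn each `*dst++ = color` into a post-increment store.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: fill `count` 16-bit pixels starting at `dst`. */
void fill16_unrolled(volatile uint16_t *dst, uint16_t color, size_t count)
{
    size_t blocks = count / 8;   /* full groups of 8 pixels */
    size_t rest   = count % 8;   /* leftover pixels */

    while (blocks--) {
        /* Pointer-walking: every store also advances the pointer,
         * and the loop test is paid once per 8 pixels. */
        *dst++ = color; *dst++ = color; *dst++ = color; *dst++ = color;
        *dst++ = color; *dst++ = color; *dst++ = color; *dst++ = color;
    }
    while (rest--) {
        *dst++ = color;          /* fix-up for the remaining pixels */
    }
}
```

The unroll factor of 8 is an assumption picked from the 8-16 range mentioned above.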

#83832 - devmelon - Thu May 18, 2006 9:13 pm

kusma wrote:
yes.
use pointer-walking rather than indexing. that way you can use the post-increment addressing mode.

I'm quite new to ARM assembler. I've been doing MIPS a little before this.. But is it like
Code:
strh r4, [r1]!
or something like that?
kusma wrote:
also unroll the loop like there was no tomorrow if the loop is time-critical. (usually around 8-16 unrolled iterations are enough). that way you don't get the penalty for the compare and branch for every pixel.
This "unrolling" simply means to repeat several memwrites after each other? Well, I can see how that makes sense.
kusma wrote:
also, is the code running from iwram in arm-mode?
Can you move instructions elsewhere? How? I just have plain ol' instructions lined up in a .s file. No magic.
kusma wrote:
if it's time-critical, then it really should. 5-6 fps sounds too slow even with the above optimizations...

Yeah, it sounds quite slow... What fill-rate can one achieve with some tricks? Can I expect 24+ which is a reasonable rate for smooth animations?
Thanks a lot by the way! :) I'll get around and test some unrolling practices right away.

#83836 - Quirky - Thu May 18, 2006 9:36 pm

That's the idea. I filled triangles in a tile mode thingy with this:
Code:

.rept  31
  str   r0,[r1], #32   @ store the 8 pixels and move on a tile
.endr

and jump to a suitable place in that loop depending on line length. For regular bitmap modes it'd be ,#4.
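
Jumping into the middle of an unrolled run of stores can be approximated in C with a `switch` that falls through its cases. This is a rough sketch, not Quirky's routine: `store_run` is a hypothetical name, the run is limited to 8 stores instead of 31, and `stride` is counted in 32-bit words (8 words would match the `#32` byte step above).

```c
#include <stdint.h>

/* Hypothetical helper: store `value` n times (n = 0..8), moving `stride`
 * words after each store. Returns the advanced pointer. */
uint32_t *store_run(uint32_t *dst, uint32_t value, unsigned n, unsigned stride)
{
    switch (n) {   /* enter the unrolled run at the right depth */
    case 8: *dst = value; dst += stride; /* fall through */
    case 7: *dst = value; dst += stride; /* fall through */
    case 6: *dst = value; dst += stride; /* fall through */
    case 5: *dst = value; dst += stride; /* fall through */
    case 4: *dst = value; dst += stride; /* fall through */
    case 3: *dst = value; dst += stride; /* fall through */
    case 2: *dst = value; dst += stride; /* fall through */
    case 1: *dst = value; dst += stride; /* fall through */
    default: break;  /* n == 0 or n > 8: nothing (more) to store */
    }
    return dst;
}
```

There is no loop counter and no per-store compare; the only branch is the initial jump.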

Clear screen can be done even faster with the stmia instruction (push everything onto the stack, store the stack, set r1-14 to 0, stmia r0!,{r1-r14}, pop the lot)

But it's been a while since I coded in arm so I can't remember too much.

#83841 - devmelon - Thu May 18, 2006 10:23 pm

Ah, that's clever. I didn't know much about preprocessor directives (as I assume they are). Well, I came to the conclusion that I need more meat on my legs if I am about to stand a battle with assembling software for GBA. I'm currently plowing through the ARM7TDMI Data Sheet.

Could someone direct me to further reading? Link(s) would be greatly appreciated! :)

I am starting to grasp the difference between ARM/THUMB mode, and stuff like that. Also the CPSR and so. But I doubt this will go through preprocessor directives (?). It doesn't sound like it has anything to do with the processor model.

EDIT:
Can someone tell me the difference of MOV vs MOVS, SUB vs SUBS ? I could not find information about it in this pdf.

#83844 - DekuTree64 - Thu May 18, 2006 10:57 pm

Faster still:

Code:
stmfd sp!, {r4-r11, lr}

mov r0, #0
mov r1, #0
mov r2, #0
mov r3, #0
mov r4, #0
mov r5, #0
mov r6, #0
mov r7, #0
mov r8, #0
mov r9, #0
mov r10, #0
mov r11, #0

@ Probably want to make this an argument, if
@ you want to pass in an EWRAM backbuffer
mov r12, #0x6000000  @VRAM start
@ Size of buffer, in bytes (this is assuming mode3)
mov r14, #240 * 160 * 2

loop:
stmia r12!, {r0-r11}    @ Store 48 bytes, and update r12
stmia r12!, {r0-r11}
stmia r12!, {r0-r11}
stmia r12!, {r0-r11}
stmia r12!, {r0-r11}
stmia r12!, {r0-r11}
stmia r12!, {r0-r11}
stmia r12!, {r0-r11}
@ Subtract 12 regs, 4 bytes each, repeated 8 times
@ Be careful that the total size is a multiple of this amount
subs r14, r14, #12 * 4 * 8
bgt loop

ldmfd sp!, {r4-r11, lr}
bx lr

subs is subtract and set flags. Basically a sub and cmp in one.

Of course, the fastest of all is if your effect can redraw the entire screen so you don't have to clear it in the first place.

As for expected fill rate, with 240 * 160 pixels, and 280896 CPU cycles per 60Hz frame, that leaves you 7.315 cycles per pixel to run at 60Hz. Filling the way I did above should average slightly over 1 cycle per 16-bit pixel, leaving you with a bit over 6 cycles to do your effect.
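
The budget arithmetic above can be checked with a few lines of C. The function name and the fixed-point scaling by 1000 are my own choices for illustration; the cycle and pixel counts are the ones quoted in this post.

```c
#include <stdint.h>

#define CYCLES_PER_FRAME 280896u        /* CPU cycles per 60 Hz refresh */
#define SCREEN_PIXELS    (240u * 160u)  /* 38400 pixels */

/* Cycles available per pixel, scaled by 1000, when the screen is redrawn
 * once every `refreshes` display refreshes (1 -> 60 Hz, 2 -> 30 Hz). */
uint32_t cycle_budget_milli(uint32_t refreshes)
{
    uint64_t total = (uint64_t)CYCLES_PER_FRAME * refreshes * 1000u;
    return (uint32_t)(total / SCREEN_PIXELS);
}
```

With `refreshes` set to 1 this gives 7315 (7.315 cycles per pixel), and 14630 for a 30 Hz redraw.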

If you use a backbuffer in EWRAM, it will take significantly longer to fill, plus you'll have to copy the buffer to VRAM. By then you've probably used up your 7 cycles already, so you'll be 30Hz at most.

I'd recommend either using mode4, or stretching the mode3 layer 2x horizontally so you get half resolution, but can double buffer by scrolling it left and right each frame.
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku

#83853 - devmelon - Thu May 18, 2006 11:55 pm

So the -s suffix just lets the instruction write the CPSR condition flags? Cool. In fact, I just read that sorta, and popped back here. Still having a hard time understanding the usage for IRQs, but I'm sure it'll fill in when I read on.

Dang. 7 cycles per pixel isn't much bargain material. But then again, 25-30 fps is a reasonable frame rate for me and ~15 cycles are a little bit more than 7. I guess one must be extremely careful with overhead when coding for GBA, if you're attempting raw videobuffer writes.

I usually write to the back buffer in VRAM (using double buffered modes 4 or 5). I can clearly see how update regions could benefit for complex drawing algorithms.

If I understood this sheet correctly, branches flush the instruction fetch pipeline, causing it to refill at some extra cycle cost? Branches are evil :(

Hmm.. There is no way that you could make games with moderately complex graphics on the GBA. Not much left for the game logic if you waste cycles per pixel. But on the other hand, if you can spare one cycle per pixel, you get 38400 extra cycles for game logic, minus the overhead...

BTW: WRAM (In-chip), is it free to use any way I like? Do I keep track of all data put there or how do I manage memory usage? http://www.work.de/nocash/gbatek.htm#memorymap If I use mode 5, can I use palette memory for general purposes (considering its still faster than some other memories)?

I have so many questions, but asking too many is likely to confuse you as well as myself. Perhaps I shouldn't be concerned about all the magic optimizations yet, considering I have yet to learn the full language specification.

You know what? :)
I've grown fond of this! I love problems! :D

#83859 - kusma - Fri May 19, 2006 12:51 am

devmelon wrote:

If I understood this sheet correctly, branches flush the instruction fetch pipeline, causing it to refill at some extra cycle cost? Branches are evil :(


yes. conditional instructions are usually better for small chunks of conditional code. and unrolling really helps a lot on arm7s.

devmelon wrote:

Hmm.. There is no way that you could make games with moderately complex graphics on the GBA. Not much left for the game logic if you waste cycles per pixel. But on the other hand, if you can spare one cycle per pixel, you get 38400 extra cycles for game logic, minus the overhead...


well, keep in mind how your cycles scale. there's usually a LOT more pixels than triangles in a 3d scene, and even fewer game objects. i'm not saying it's easy to do complex 3d (or anything else for that matter), but it sure is possible to go beyond what we see in today's games. whether it's something to invest money into is a different question ;)

devmelon wrote:

BTW: WRAM (In-chip), is it free to use any way I like? Do I keep track of all data put there or how do I manage memory usage? http://www.work.de/nocash/gbatek.htm#memorymap If I use mode 5, can I use palette memory for general purposes (considering its still faster than some other memories)?


VRAM and IWRAM are in short supply, so you should really consider what to put there. in our "complex, high performance" 3d-stuff, we really just have the polyfillers and audio-mixers there, together with some small buffers and some coordinate-transform code.

#83893 - devmelon - Fri May 19, 2006 11:23 am

I'm not looking into making 3d just yet. I was more thinking about blending and various effects. But then again, small objects don't require the whole screen to be redrawn.

I'm thinking like particle effects. Sure, in palette mode, you could make easy addition of pixels given that some of the palette is in sequential color variation order (0..1..2..---...31 for say grayscale color values). But more complex blending would require so many memory accesses. I'm assuming that loading several registers in parallel (ldmfd) could speed up this process somewhat. But things like pixel[i] = (pixel[i] + src[i]) >> 1; would require a lot of work if done in 15 bit mode.

Masking out RGB values and changing them individually, then merging it back to one half word and rewrite would take more than 7 cycles.

One instruction for loading src from memory, another three for expanding RGB into three registers, repeat this for dst, add these rgbs together and shift 1, then slam back into 15 bit format and write back.... It quickly builds up :/

How fast is sprite blending? (Or was it BG blending?) Can I use tricks with ordered sprites that I write my data into and use their blend modes to speed things up?

What framerate should I aim for? 60 Hz seems a bit more than I really need.

#83895 - kusma - Fri May 19, 2006 11:35 am

devmelon wrote:

Masking out RGB values and changing them individually, then merging it back to one half word and rewrite would take more than 7 cycles.

One instruction for loading src from memory, another three for expanding RGB into three registers, repeat this for dst, add these rgbs together and shift 1, then slam back into 15 bit format and write back.... It quickly builds up :/


the trick here is to not actually do this separately for every component, and to do multiple pixels in parallel. for additive blending, do something like two pixels at a time with a bit-layout like this (it's a simple matter of masking bits):
Code:

register 1: 000000ggggg000000rrrrr00000bbbbb
register 2: 0rrrrr00000bbbbb000000ggggg00000

to prevent overflow into the other components when adding. then mask out overflow and or the pixels together. that way you split the cost of the masking over the two pixels, and you don't have to separate each component out for each pixel. there are also a lot of additional tricks on how to do fast 15 bit color manipulations.
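
A minimal C sketch of this split-and-add idea, assuming two 15-bit pixels packed into one 32-bit word. The mask constants are derived from the bit layout shown above; the function name is hypothetical, and overflow is simply discarded here (no saturation, which is a separate trick).

```c
#include <stdint.h>

#define MASK_A 0x03E07C1Fu  /* fields kept in "register 1" of the layout */
#define MASK_B 0x7C1F03E0u  /* fields kept in "register 2" of the layout */

/* Add two packed pixel pairs channel by channel. Each masked half has
 * zero gaps above every field, so a carry lands in a gap instead of in
 * the neighbouring channel, and the final mask throws it away. */
uint32_t add_pixels_x2(uint32_t dst, uint32_t src)
{
    uint32_t a = (dst & MASK_A) + (src & MASK_A);
    uint32_t b = (dst & MASK_B) + (src & MASK_B);
    return (a & MASK_A) | (b & MASK_B);
}
```

Six channels are added with two additions, and the masking cost is shared by both pixels.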

edit:
oh, and there's a quite neat trick to generate a saturation-mask based on overflow bits described by Mikael Kalms in some hugi (try the special edition - the coders digest, if you're interested in loads of old and mostly outdated coding docs :P).

#83911 - Cearn - Fri May 19, 2006 3:07 pm

Here's something :)

EDIT: argh, or maybe not. I thought I had to do it like that to make sure things didn't bleed into the rest of the fields, but apparently not. BRB.

#83915 - devmelon - Fri May 19, 2006 4:11 pm

Ah, I see. So, if I'd like to add some brightness (+1) to all pixels (all channels), I'd add 0x00200401 to register 1, and 0x04010020 to register 2, mask the result with AND to prevent overflow, then ORR together the two registers and write it to buffer?

Still, it's a lot of work to do per pixel :) On the other hand, we're working with 2 pixels at a time, so I see where you're going. That's pretty clever!

Thanks a lot all. I've gotten more information than I asked for and I am grateful. I'll take some time now to experiment with my new found powers and fool around a little. And maybe swig a beer ^^

#112526 - nornagon - Sun Dec 17, 2006 2:54 pm

Here's a nice memset routine I found in tonclib:

Code:

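@ void memset32(void *dst, u32 val, u32 n)
@ n is a count of 32-bit words, not bytes.
@ Note: .itcm is an NDS section name; on GBA the fast-RAM section is typically .iwram.
  .section .itcm,"ax", %progbits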
  .align 2
  .code 32
  .global memset32
memset32:
  and   r12, r2, #7
  movs  r2, r2, lsr #3
  beq  .Lres_set32
  stmfd sp!, {r4-r10}
  mov   r3, r1
  mov   r4, r1
  mov   r5, r1
  mov   r6, r1
  mov   r7, r1
  mov   r8, r1
  mov   r9, r1
  mov   r10, r1
.Lmain_set32:
    stmia r0!, {r3-r10}
    subs r2, r2, #1
    bhi .Lmain_set32
  ldmfd sp!, {r4-r10}
.Lres_set32:
    subs    r12, r12, #1
    stmcsia r0!, {r1}
    bcs     .Lres_set32
  bx    lr


There are 16-bit versions as well as memcpy routines.

#112530 - keldon - Sun Dec 17, 2006 4:28 pm

I once created a memory transfer routine where you do a certain amount of copies without checking where you are in the loop.

So you have something like:
Code:
void copy ( int *src, int *dst, int len ){
   int i;
   /* len is in bytes. The parentheses matter: < binds tighter than & */
   for ( i = 0; i < (len & (~127)); i += (4*16) ){
      *(dst++) = *(src++);
      *(dst++) = *(src++);
      *(dst++) = *(src++);
      *(dst++) = *(src++);
      *(dst++) = *(src++);
      *(dst++) = *(src++);
      *(dst++) = *(src++);
      *(dst++) = *(src++);
      *(dst++) = *(src++);
      *(dst++) = *(src++);
      *(dst++) = *(src++);
      *(dst++) = *(src++);
      *(dst++) = *(src++);
      *(dst++) = *(src++);
      *(dst++) = *(src++);
      *(dst++) = *(src++);
   }
   for ( ; i < len; i += 4 ){
      *(dst++) = *(src++);
   }
}

#112540 - Ant6n - Sun Dec 17, 2006 7:51 pm

i was thinking for 15-bit bgr one could keep the last bit in every color 0, i.e. use
0bbbb0gggg0rrrr0. Then if you want to blend you could add two colors and shift the result right by one. if you want to blend again afterwards you just AND with 0111101111011110. one could even do 2 pixels at a time.

you wanted links, here is something to read: http://www.ee.ic.ac.uk/pcheung/teaching/ee2_computing/arm/Progtech.pdf

#112569 - Dwedit - Sun Dec 17, 2006 11:21 pm

Keep in mind that intensity 0-15 is too dark to see on a regular GBA....
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

#112583 - Ant6n - Mon Dec 18, 2006 5:56 am

i meant to zero the least significant bit, not the most. so instead of 0,1,2,3,4,5...31 you'd have 0,2,4,..,30
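
The averaging idea can be sketched in C as follows. The function names are hypothetical; the mask is the bit pattern 0111101111011110 from the earlier post, applied with AND so that it clears the lowest bit of each channel.

```c
#include <stdint.h>

#define LSB_CLEAR 0x7BDEu  /* 0111101111011110: keeps the top 4 bits per channel */

/* Clear the low bit of every channel so the color can be averaged. */
uint16_t prepare(uint16_t c)
{
    return (uint16_t)(c & LSB_CLEAR);
}

/* Average two prepared colors. Because every channel's low bit is zero,
 * the carry out of one channel lands in the (empty) low bit of the next
 * and the single right shift moves it back where it belongs. */
uint16_t average_prepared(uint16_t a, uint16_t b)
{
    return (uint16_t)(((uint32_t)a + b) >> 1);
}
```

The result may have low bits set again, so it has to pass through `prepare` before it is averaged a second time.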

#128537 - Janekxx - Sat May 12, 2007 3:04 pm

Hi,

Can you explain to me how exactly this works?
Code:
len & (~127)


I'm not sure, I was trying to find an explanation on Google but without success.


Thank You.
_________________
sony.int.pl

#128546 - kusma - Sat May 12, 2007 4:05 pm

It returns len with the bottom 7 bits cleared. 127 in binary is 1111111, and ~127 is the bitwise complement, i.e. 11111111111111111111111110000000b. When you AND a number with that mask, you keep only the higher bits of that number.

It's used to find how many "full" iterations can be done. Then fix-up for the remaining iterations can be done afterwards.
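
As a small illustration of that split, here is a C sketch with two hypothetical helpers: one returns the part of `len` covered by complete 128-byte blocks, the other the leftover that the fix-up loop has to handle.

```c
#include <stddef.h>

/* Bytes covered by complete 128-byte blocks (bottom 7 bits cleared). */
size_t full_part(size_t len)
{
    return len & ~(size_t)127;
}

/* Bytes left over for the fix-up loop (only the bottom 7 bits kept). */
size_t remainder_part(size_t len)
{
    return len & (size_t)127;
}
```

For a length of 300, `full_part` gives 256 and `remainder_part` gives 44, and the two always add up to the original length.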