gbadev.org forum archive

Hi everybody,
I'm trying to optimize a scaler that ends up writing to the VRAM like this:

.. computations
STRH r9, [r0],#2
STRH r10,[r0],#2
STRH r11,[r0],#2
STRH r12,[r0],#2

(I write 256 pixels per line, i.e. 64 iterations of this.)

Now I have a few questions for you :)

1) Assuming a bit of swizzling beforehand is free, is it faster to store two 32-bit words to VRAM instead of four 16-bit halfwords?

2) Assuming it is, we should have
STR r9,[r0],#4
STR r10,[r0],#4

Would an STM be faster?
STM

3) Is it recommended to do all the computations in very fast memory (cached or TCM if I get it right) then DMA the result to VRAM?
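
For question 1, the swizzling amounts to packing two adjacent 16-bit pixels into one 32-bit word before the store. Here is a minimal C sketch of that idea, assuming little-endian byte order (so the pixel that belongs at the lower address goes in the low halfword) and a 4-byte-aligned destination. The names `pack_pixels` and `store_line32` are hypothetical and only for illustration.

```c
#include <stdint.h>
#include <stddef.h>

/* Pack two 16-bit pixels into one 32-bit word.
   Little-endian assumption: 'first' ends up at the lower address. */
static inline uint32_t pack_pixels(uint16_t first, uint16_t second)
{
    return (uint32_t)first | ((uint32_t)second << 16);
}

/* Write 'count' pixels (count must be even) using 32-bit stores.
   'dst' must be 4-byte aligned; on hardware it would point into VRAM. */
static void store_line32(volatile uint32_t *dst, const uint16_t *src, size_t count)
{
    for (size_t i = 0; i < count; i += 2)
        *dst++ = pack_pixels(src[i], src[i + 1]);
}
```

Whether the packing is really free depends on how the pixels come out of the scaler, so this only shows what the stores would look like, not that they are faster.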

Remarks:

From here:
http://meraman.dip.jp/index.php?cmd=read&page=M3DSS_GBATEK_NDS&p=1#content_1_12

I read these timings for VRAM:
N32=5 S32=2 N16=4 S16=1

STR seems to cost 2*N cycles
STM seems to cost (n-1)*S + 2*N cycles, where n is the number of registers stored

Can I deduce that the first version costs 4 * (2 * N16) = 32 cycles?
Can I deduce that the second version costs 2 * (2 * N32) = 20 cycles?
Can I deduce that the STM version costs S32 + 2 * N32 = 12 cycles?
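
To make the arithmetic explicit, the three estimates can be written out as a small C sketch. It only encodes the timings and cost formulas quoted above, which are assumptions taken from this post and not verified against hardware. The function names are hypothetical.

```c
/* VRAM timings as quoted in the post (unverified). */
enum { N32 = 5, S32 = 2, N16 = 4, S16 = 1 };

/* Post's assumption: a single STR or STRH costs 2*N cycles. */
static int str_cost(int n_cycles)
{
    return 2 * n_cycles;
}

/* Post's assumption: an STM of 'regs' registers costs (regs-1)*S + 2*N. */
static int stm_cost(int regs, int s_cycles, int n_cycles)
{
    return (regs - 1) * s_cycles + 2 * n_cycles;
}

/* Store cost of one iteration (4 pixels) for each variant. */
static int cost_four_strh(void) { return 4 * str_cost(N16); }
static int cost_two_str(void)   { return 2 * str_cost(N32); }
static int cost_stm2(void)      { return stm_cost(2, S32, N32); }
```

Under these assumptions the three variants come to 32, 20 and 12 cycles per iteration, or 2048, 1280 and 768 cycles for the 64 iterations of one line, counting the stores only.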

Thanks in advance for clarifying all this, I'm trying to make sense of all the info I find on the Web and it can be overwhelming. I'm used to much faster CPUs with complex pipelines and prefetching! :)

Cheers,
Tramb

My advice: Do it both ways and use a profiler to find out which one is faster. The quickest and dirtiest profiler consists of writing to BG_PALETTE[0] before and after a function, which draws a horizontal stripe across the background whose thickness is proportional to the function's runtime. The next simplest is based on starting a CPU timer before a function and reading it at the end.
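
The timer-based variant could look roughly like the C sketch below. It assumes a free-running 16-bit counter (the hardware timers are 16 bits wide) and reads it through a hypothetical `tick_reader` hook rather than a real register, so the same code can be tried off-target. `fake_counter`, `fake_read` and `fake_work` are stand-ins invented for this example.

```c
#include <stdint.h>

/* Hypothetical hook: returns the current value of a free-running
   16-bit counter. On hardware this would read a timer data register. */
typedef uint16_t (*tick_reader)(void);

/* Time one call of 'fn'. The difference is taken modulo 2^16, so a
   single counter wrap between the two reads is handled correctly;
   runs longer than 65535 ticks are not. */
static uint16_t profile_ticks(tick_reader read_ticks, void (*fn)(void))
{
    uint16_t start = read_ticks();
    fn();
    uint16_t end = read_ticks();
    return (uint16_t)(end - start);
}

/* Host-side stand-ins so the sketch can be exercised without hardware. */
static uint16_t fake_counter;
static uint16_t fake_read(void) { return fake_counter; }
static void fake_work(void) { fake_counter = (uint16_t)(fake_counter + 300); }
```

Calling `profile_ticks(fake_read, fake_work)` reports the 300 ticks that the stand-in "work" adds, even when the counter wraps in between. On the real machine the timer prescaler has to be chosen so that the function under test fits in one 16-bit period.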
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.


DS development > DS Beginner : Writing to VRAM

#146660 - Tramboi - Fri Dec 07, 2007 12:03 pm

#146671 - tepples - Fri Dec 07, 2007 1:46 pm