#146660 - Tramboi - Fri Dec 07, 2007 12:03 pm
Hi everybody,
I'm trying to optimize a scaler that ends up writing to the VRAM like this:
.. computations
STRH r9, [r0],#2
STRH r10,[r0],#2
STRH r11,[r0],#2
STRH r12,[r0],#2
(I write 256 pixels per line, i.e. 64 iterations of this)
Now I have a few questions for you :)
1) Assuming a bit of swizzling beforehand is free, is it faster to store two 32-bit words instead of four 16-bit halfwords to VRAM?
2) Assuming it is, we should have
STR r9,[r0],#4
STR r10,[r0],#4
Would an STM be faster?
STM
3) Is it recommended to do all the computations in very fast memory (cached or TCM, if I get it right) and then DMA the result to VRAM?
Remarks:
From here:
http://meraman.dip.jp/index.php?cmd=read&page=M3DSS_GBATEK_NDS&p=1#content_1_12
I read these timings for VRAM:
N32=5 S32=2 N16=4 S16=1
STR seems to cost 2*N cycles
STM seems to cost (n-1)*S + 2*N cycles, with n the number of registers stored
Can I deduce that the first version costs 4 * (2 * N16) = 32 cycles?
Can I deduce that the second version costs 2 * (2 * N32) = 20 cycles?
Can I deduce that the STM version costs S32 + 2 * N32 = 12 cycles?
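To double-check the arithmetic in those three estimates, here is a small Python sketch. It only encodes the VRAM timings quoted above and the two cost formulas I am assuming (2*N per single store, (n-1)*S + 2*N per STM); the function and variable names are made up for illustration, and whether the formulas really apply to the hardware is exactly what I am asking.

```python
# VRAM access timings as quoted in the post (cycles).
N32, S32, N16, S16 = 5, 2, 4, 1

ITERATIONS_PER_LINE = 64  # 256 pixels per line, 4 pixels per iteration


def str_cycles(n_access):
    """Cost of one STR/STRH under the assumed 2*N formula."""
    return 2 * n_access


def stm_cycles(num_regs, n_access, s_access):
    """Cost of one STM of num_regs registers under the assumed (n-1)*S + 2*N formula."""
    return (num_regs - 1) * s_access + 2 * n_access


# Cost of storing 4 pixels (one loop iteration) for each variant.
strh_version = 4 * str_cycles(N16)      # four STRH
str_version = 2 * str_cycles(N32)       # two STR
stm_version = stm_cycles(2, N32, S32)   # one STM of two registers

for name, cost in [("4x STRH", strh_version),
                   ("2x STR", str_version),
                   ("1x STM", stm_version)]:
    print(f"{name}: {cost} cycles/iteration, "
          f"{cost * ITERATIONS_PER_LINE} cycles/line")
```

Under these assumptions the per-iteration costs come out as 32, 20 and 12 cycles, i.e. 2048, 1280 and 768 cycles per 256-pixel line for the stores alone.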
Thanks in advance for clarifying all this. I'm trying to make sense of all the info I find on the Web, and it can be overwhelming. I'm used to much faster CPUs with complex pipelines and prefetching! :)
Cheers,
Tramb