gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback Machine copies.

DS development > Full screen redraw (effect-based) game - slow DMA copying!

#100850 - HotChilli - Thu Aug 31, 2006 3:00 am

Hi!

I'm trying to make an effect-based game or demo, i.e. smoke, optical effects or visualizations like in MediaPlayer or WinAmp. But there is a problem...

I use static arrays (as backbuffers) for both screens, but copying them to video memory is too slow! Copying one of them with dmaCopy takes 5.1 ms. For 60 fps I have to finish everything in 16 ms, and 16 - 2*5.1 = 5.8 ms is all that is left. That's too little for effect logic, game logic or anything else. :(

My question is: can I redraw a background (256*192 pixels) in less than 5 ms?
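
For reference, the budget arithmetic from the post as a small C sketch. The function name `remaining_budget_ms` is made up for illustration, and the 16 ms frame time is the rounded figure used above (the DS refreshes at just under 60 Hz, so a frame is closer to 16.7 ms).

```c
#include <assert.h>

/* Time left in one frame after copying `screens` backbuffers.
   All values are in milliseconds; the name is illustrative only. */
static double remaining_budget_ms(double frame_ms, double copy_ms, int screens)
{
    assert(screens >= 0);
    return frame_ms - copy_ms * (double)screens;
}
```

With 16 ms per frame, 5.1 ms per copy and two screens, this leaves about 5.8 ms for everything else.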

#100884 - bjoerngiesler - Thu Aug 31, 2006 11:08 am

Doesn't dmaCopy() return immediately, effectively parallelizing the copying and the rest of the code? If so, you have no problem.

#100888 - melw - Thu Aug 31, 2006 12:44 pm

If you use framebuffer with two banks, it's easy to achieve 60fps output no matter what you do... Something like this:

Code:
//screen init stuff:

u16* backBuffer;
u16* frontBuffer;
int screenBufferActive=0;

videoSetMode(MODE_FB1);
...
vramSetMainBanks(VRAM_A_LCD, VRAM_B_LCD, VRAM_C_SUB_BG, VRAM_D_SUB_SPRITE);

backBuffer = VRAM_A;
frontBuffer = VRAM_B;

// while blitting the screen:

swiWaitForVBlank();

if(screenBufferActive==1) {
   videoSetMode(MODE_FB1);
   frontBuffer = VRAM_B;
   backBuffer = VRAM_A;
   screenBufferActive = 0;
} else {
   videoSetMode(MODE_FB0);
   frontBuffer = VRAM_A;
   backBuffer = VRAM_B;
   screenBufferActive = 1;
}


This assumes you always draw to backBuffer.
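
The swap itself is only a pointer exchange, so nothing gets copied. A host-side C sketch of that idea, with two plain arrays standing in for the VRAM banks (`bank_a`, `bank_b`, `FrameBuffers`, `fb_init` and `fb_swap` are hypothetical names, not part of libnds):

```c
#include <assert.h>
#include <stdint.h>

enum { SCREEN_PIXELS = 256 * 192 };

/* Stand-ins for the two LCD-mapped banks (on hardware: VRAM_A and VRAM_B). */
static uint16_t bank_a[SCREEN_PIXELS];
static uint16_t bank_b[SCREEN_PIXELS];

typedef struct {
    uint16_t *front;  /* bank currently displayed */
    uint16_t *back;   /* bank being drawn into */
} FrameBuffers;

/* Same starting state as the snippet above: B is shown, A is drawn to. */
static void fb_init(FrameBuffers *fb)
{
    assert(fb != 0);
    fb->front = bank_b;
    fb->back = bank_a;
}

/* Exchange the roles of the two banks. On hardware this would be done
   right after swiWaitForVBlank(), together with the matching
   videoSetMode() call. */
static void fb_swap(FrameBuffers *fb)
{
    uint16_t *tmp = fb->front;
    fb->front = fb->back;
    fb->back = tmp;
}
```

Whatever was drawn into the back buffer becomes visible after the swap, and the cost per frame does not depend on the screen size.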

#100905 - josath - Thu Aug 31, 2006 5:04 pm

bjoerngiesler wrote:
Doesn't dmaCopy() return immediately, effectively parallelizing the copying and the rest of the code? If so, you have no problem.

No, the libnds dmaCopy* functions wait until the copy is done before they return.

#101037 - HyperHacker - Fri Sep 01, 2006 9:17 pm

This is what I use. I don't know how fast it is, but unlike with dmaCopy() I've yet to run out of VBlank time using it.

Code:
.arm
.align

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ FastCopy
@ Copies memory really fast.
@ Inputs:
@  -r0: Source
@  -r1: Destination
@  -r2: # of bytes
@  -r3: Copy size (0=byte, 1=short (2 bytes), 2=int (4 bytes))
@ Notes:
@  -When there are less than 44 bytes remaining, they're copied one at a time.
@   You can copy less than 44 bytes, but it's not going to be very fast.
@  -The size parameter is provided to get around memory access limitations of
@   the Nintendo DS. Limitations that I know of are:
@   -Cannot write 8 bits to VRAM
@   -Cannot write 8 or 32 bits to GBA ROM
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

.global FastCopy
.type FastCopy,function
FastCopy:
stmfd sp!, {r3-r12, lr}
sub sp, sp, #4
str r3, [sp] @We need this later, but for now, r3 can assist in copying.

.loop:
cmp r2, #44
blt .part2 @by checking here, we can copy less than 44 bytes in one call (but it's slow)
ldmia r0!, {r3-r12, lr}
stmia r1!, {r3-r12, lr}
sub r2, r2, #44 @copy 44 bytes at a time (!), and we need those flags updated
b .loop


@copy remaining bytes one byte/short/int at a time, so we get them all.
.part2:
@add r2, r2, #4
ldr r3, [sp] @Find unit size
add sp, sp, #4
cmp r3, #0
beq .loop2_byte @0=byte
subs r3, r3, #1
beq .loop2_short @1=short
@otherwise it must be 2=int (we'll just pretend anything else is 2)


.loop2_int:
subs r2, r2, #1
bmi .end
ldr r3, [r0, #1]!
str r3, [r1, #1]!
b .loop2_int


.loop2_short:
subs r2, r2, #1
bmi .end
ldrh r3, [r0, #1]!
strh r3, [r1, #1]!
b .loop2_short


.loop2_byte:
subs r2, r2, #1
bmi .end
ldrb r3, [r0, #1]!
strb r3, [r1, #1]!
b .loop2_byte


.end:
@add r2, r2, #1
ldmfd sp!,{r3-r12,lr}
bx lr

.pool


Prototyped like so:
Code:
#ifdef __cplusplus //Non-mangled names plz
extern "C" {
#endif

extern void FastCopy(void* Src, void* Dest, u32 NumBytes, u32 Size);

#ifdef __cplusplus
}
#endif //__cplusplus


I've been trying to figure out how to put it in ITCM for even more speed, no luck though.
_________________
I'm a PSP hacker now, but I still <3 DS.

#101043 - Cearn - Fri Sep 01, 2006 9:55 pm

Unless ARM instructions have changed considerably between ARM7 and ARM9, that code is very unsafe, and in some cases non-optimal.

Code:
.global FastCopy
.type FastCopy,function
FastCopy:
    stmfd sp!, {r3-r12, lr}
    sub sp, sp, #4
    str r3, [sp] @We need this later, but for now, r3 can assist in copying.

You're storing r3 twice here, for no apparent reason. Once is enough.

Code:
.loop:
  cmp r2, #44
  blt .part2 @by checking here, we can copy less than 44 bytes in one call (but it's slow)
  ldmia r0!, {r3-r12, lr}
  stmia r1!, {r3-r12, lr}
  sub r2, r2, #44 @copy 44 bytes at a time (!), and we need those flags updated
  b .loop

You're assuming r0 and r1 are word aligned, which would result in incorrect copying if they're not. They usually will be, but a general routine should check for misalignments.
Also, there is no need for two branches in the loop, one is enough. This goes for all the loops in the routine. The compare is also redundant, or at least could be taken out of the loop.
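
A rough C sketch of such a check, assuming the simplest policy: take the word path only when both pointers are word aligned, otherwise fall back to bytes. `is_word_aligned` and `copy_checked` are hypothetical names, and the byte fallback would not be usable for VRAM, which does not accept 8-bit writes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static int is_word_aligned(const void *p)
{
    return ((uintptr_t)p & 3u) == 0;
}

/* Copy nbytes from src to dst. Words are used only when both pointers
   are 4-byte aligned; the remainder (or everything, if misaligned) is
   copied byte by byte. */
static void copy_checked(void *dst, const void *src, size_t nbytes)
{
    uint8_t *d8 = dst;
    const uint8_t *s8 = src;
    size_t i;

    assert(dst != NULL && src != NULL);

    if (is_word_aligned(dst) && is_word_aligned(src)) {
        uint32_t *d32 = dst;
        const uint32_t *s32 = src;
        size_t words = nbytes / 4;

        for (i = 0; i < words; i++)
            d32[i] = s32[i];

        d8 += words * 4;
        s8 += words * 4;
        nbytes -= words * 4;
    }

    while (nbytes--)
        *d8++ = *s8++;
}
```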

Code:
  ldr r3, [sp]
  add sp, sp, #4

This is simply a 'pop {r3}'

Code:
.loop2_int:
  subs r2, r2, #1
  bmi .end
  ldr r3, [r0, #1]!
  str r3, [r1, #1]!
  b .loop2_int

This loop fails completely because words need a #4 offset, not #1; likewise the halfword loop needs #2. And since r2 is the byte count, not the word count, the loop runs four times as often as it should.
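
In C terms, the two things to get right are the stride and the iteration count. A sketch, assuming the pointers are already aligned for the chosen unit size (`copy_units` and its parameters are made up for illustration and mirror the 0/1/2 size argument of FastCopy):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy nbytes in units of (1 << unit_shift) bytes:
   0 = byte, 1 = halfword, 2 = word.
   Returns the number of bytes copied, i.e. nbytes rounded down
   to a whole number of units. */
static size_t copy_units(void *dst, const void *src, size_t nbytes,
                         unsigned unit_shift)
{
    size_t units, i;

    assert(unit_shift <= 2);
    units = nbytes >> unit_shift;   /* loop count is in units, not bytes */

    if (unit_shift == 0) {
        uint8_t *d = dst;
        const uint8_t *s = src;
        for (i = 0; i < units; i++) d[i] = s[i];   /* stride 1 */
    } else if (unit_shift == 1) {
        uint16_t *d = dst;
        const uint16_t *s = src;
        for (i = 0; i < units; i++) d[i] = s[i];   /* stride 2 */
    } else {
        uint32_t *d = dst;
        const uint32_t *s = src;
        for (i = 0; i < units; i++) d[i] = s[i];   /* stride 4 */
    }

    return units << unit_shift;
}
```

Indexing through a typed pointer makes the compiler pick the stride, which is exactly what the #1 offsets in the assembly got wrong.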

Code:
#ifdef __cplusplus //Non-mangled names plz
extern "C" {
#endif

extern void FastCopy(void* Src, void* Dest, u32 NumBytes, u32 Size);

#ifdef __cplusplus
}
#endif //__cplusplus

Use `const void*' for the source; this saves the users the trouble of having to cast away the constness of the data.

#101087 - HyperHacker - Sat Sep 02, 2006 7:25 am

There are good reasons for some of that, actually.

Cearn wrote:
Code:
.global FastCopy
.type FastCopy,function
FastCopy:
    stmfd sp!, {r3-r12, lr}
    sub sp, sp, #4
    str r3, [sp] @We need this later, but for now, r3 can assist in copying.

You're storing r3 twice here, for no apparent reason. Once is enough.

This is because I need to copy, then restore r3 and not all those other registers. I don't know of any way to push multiple registers and then only pop one, so I did push all, push r3, copy, pop r3, do some more, pop all.

Quote:
Code:
.loop:
  cmp r2, #44
  blt .part2 @by checking here, we can copy less than 44 bytes in one call (but it's slow)
  ldmia r0!, {r3-r12, lr}
  stmia r1!, {r3-r12, lr}
  sub r2, r2, #44 @copy 44 bytes at a time (!), and we need those flags updated
  b .loop

You're assuming r0 and r1 are word aligned, which would result in incorrect copying if they're not. They usually will be, but a general routine should check for misalignments.
Also, there is no need for two branches in the loop, one is enough. This goes for all the loops in the routine. The compare is also redundant, or at least could be taken out of the loop.

True, but I don't know what exactly I would do if they're not aligned. I don't see how I can remove the second branch, and the compare is how it knows when to stop (notice r2 is subtracted).

For the rest, it's mainly because I've never written ARM ASM before. >_>

#101092 - Cearn - Sat Sep 02, 2006 9:58 am

HyperHacker wrote:
There are good reasons for some of that, actually.

Cearn wrote:
Code:
.global FastCopy
.type FastCopy,function
FastCopy:
    stmfd sp!, {r3-r12, lr}
    sub sp, sp, #4
    str r3, [sp] @We need this later, but for now, r3 can assist in copying.

You're storing r3 twice here, for no apparent reason. Once is enough.

This is because I need to copy, then restore r3 and not all those other registers. I don't know of any way to push multiple registers and then only pop one, so I did push all, push r3, copy, pop r3, do some more, pop all.

stmfd will place the registers in the list in memory in the order of the register-indices, in your case r3 would have the lowest address (i.e., the one that sp will point to in the end), then r4 above that, then r5 etc. This should suffice:
Code:
FastCopy:
    push    {r3-r11,lr} @identical to stmfd sp!, {r3-r11,lr}
   
    @ copy main blocks
   
    pop     {r3}        @ pop r3    ( same as ldmfd sp!, {r3} )

    @ copy smaller stuff

    pop     {r4-r11,lr} @ pop rest  ( same as ldmfd sp!, {r4-r11,lr} )
    bx  lr


Quote:

Quote:
Code:
.loop:
  cmp r2, #44
  blt .part2 @by checking here, we can copy less than 44 bytes in one call (but it's slow)
  ldmia r0!, {r3-r12, lr}
  stmia r1!, {r3-r12, lr}
  sub r2, r2, #44 @copy 44 bytes at a time (!), and we need those flags updated
  b .loop

You're assuming r0 and r1 are word aligned, which would result in incorrect copying if they're not. They usually will be, but a general routine should check for misalignments.
Also, there is no need for two branches in the loop, one is enough. This goes for all the loops in the routine. The compare is also redundant, or at least could be taken out of the loop.

True, but I don't know what exactly I would do if they're not aligned. I don't see how I can remove the second branch, and the compare is how it knows when to stop (notice r2 is subtracted).

If both source and destination cannot be brought to word alignment then you're pretty much screwed; you'd have to do it the slow way then. Although you could always unroll part of the loop I guess. As for the basic copy loop, it usually looks something like this:

Code:
    cmp     r2, #4
    bcc     .Lcpy_end       @ Don't copy if there's nothing to do
.Lcpy_loop:
    ldr     r3, [r0], #4    @ tmp= *r0++;
    str     r3, [r1], #4    @ *r1++ = tmp
    subs    r2, r2, #4      @ count -= 4
    bhi     .Lcpy_loop      @ if(count>0) goto .Lcpy_loop
.Lcpy_end:
    @stuff

The body of the loop needs only a load, a store, a subtract and a jump. That you have to make a compare outside the loop to test if it's worth going into the loop is of little concern, as it'll only be executed once anyway. In some cases you might even get away with
Code:
.Lcpy_loop:
    subs    r2, r2, #4      @ count -= 4
    ldrcs   r3, [r0], #4    @ if(count>=0) tmp= *r0++;
    strcs   r3, [r1], #4    @ if(count>=0) *r1++ = tmp
    bhi     .Lcpy_loop      @ if(count>0)  goto .Lcpy_loop

if you already know r2 won't be used later.

Quote:
For the rest, it's mainly because I've never written ARM ASM before. >_>
In that case, be even more careful :P
It's very easy to make mistakes in asm, especially when it concerns either alignment of data or off-by-one errors in loops. In fact, I think I may have just discovered one of those in my own FastCopy routine :P. Always check for potential breaks in code; this is true in C, and is doubly true in asm.
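
To make the off-by-one risk concrete, here is a small C model of the two word loops shown earlier in this post. It only counts how many words each loop would store for a given byte count; the function names are invented, and the model assumes the usual meaning of the flags after `subs` (`cs` = no borrow, `hi` = no borrow and result not zero).

```c
#include <assert.h>
#include <stdint.h>

/* First variant: compare before the loop, then ldr/str/subs/bhi. */
static uint32_t words_stored_check_first(uint32_t count)
{
    uint32_t words = 0;

    if (count < 4)                 /* cmp r2, #4 / bcc .Lcpy_end */
        return 0;

    for (;;) {
        uint32_t before = count;
        words++;                   /* ldr / str */
        count -= 4;                /* subs r2, r2, #4 */
        if (before <= 4)           /* bhi loops again only if before > 4 */
            break;
    }
    return words;
}

/* Second variant: subs first, then ldrcs/strcs/bhi. */
static uint32_t words_stored_conditional(uint32_t count)
{
    uint32_t words = 0;

    for (;;) {
        uint32_t before = count;
        count -= 4;                /* subs r2, r2, #4 */
        if (before >= 4)           /* ldrcs / strcs */
            words++;
        if (before <= 4)           /* bhi */
            break;
    }
    return words;
}
```

For byte counts that are a multiple of 4 both variants store the same number of words. For a count of 6, the first stores two words (8 bytes, past the end) while the second stores one, so the two loops are not interchangeable unless the count is known to be a multiple of 4.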

#101162 - HotChilli - Sun Sep 03, 2006 2:37 pm

Thanks, guys. :) But here is a stupid question: how can I compile ARM asm code in devkitPro? :))

PS: _asm { } doesn't work either... Maybe it needs some special compiler parameters?

#101167 - gmiller - Sun Sep 03, 2006 3:02 pm

If you want to do it within your C++ code, make sure the file name tells gcc that it is C++ (like .cpp). If you want a standalone file, just name it .s or .S. Standard C does not define inline assembly, so putting the code in a C source file is problematic at best. There is a command line switch (-x) that forces the language so the driver does not need to guess from the extension. Sorry for not giving direct examples, but my msys is broken right now because I switched to another laptop and tried to just copy it (what a fool). Hope this helps.
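
With GCC specifically, inline assembly is available as a compiler extension in both C and C++, written as `asm("...")` or `__asm__("...")` with operand lists rather than as an `_asm { }` block. A minimal sketch, assuming an ARM-mode build (the function `add_one` is purely illustrative, and the plain C branch is there so the file still compiles for Thumb or on a PC):

```c
#include <assert.h>
#include <limits.h>

/* Returns x + 1. In an ARM-mode GCC build the addition is done with
   extended inline asm; everywhere else a plain C fallback is used. */
static int add_one(int x)
{
    assert(x < INT_MAX);
#if defined(__GNUC__) && defined(__arm__) && !defined(__thumb__)
    int result;
    /* %0 is the output register, %1 the input register. */
    __asm__ volatile ("add %0, %1, #1" : "=r"(result) : "r"(x));
    return result;
#else
    return x + 1;
#endif
}
```

For anything longer than a few instructions, a separate .s file is usually easier to maintain than inline asm.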

#101189 - HotChilli - Sun Sep 03, 2006 7:16 pm

gmiller wrote:
If you want to do it in with your C++ code make sure the name of the file is one that tells gcc that you are doing C++ (Like .cpp ). If you want a stand alone file just name it (.s or .S). C as a language does not support inline assembly so putting the code in a C source file is problematic at best. There is a command line switch that you can use to force the type so the language parser does not need to guess as well. Sorry for not giving direct examples but my msys is broken right now because I switched to another laptop and tried to just copy it (what a fool). Hope this helps.


Yeah, thanks. :) With a .s file the asm code from the message above compiles properly. The Makefile rule (with the compiler parameters) is:

asm.o: asm.s
	arm-eabi-gcc -x assembler-with-cpp -c $< -o $@

OK. But FastCopy is slower than its DMA counterpart (!) when used with video memory, and doesn't work at all (!!!) with main RAM (global arrays, for example).

Here are the values from my handmade profiler =) (tested on DS hardware; each function does a full redraw, i.e. updates 256*192 pixels):

FastCopy (RAM) - crash
FastCopy (VRAM) - 4.1 ms :(
dmaCopy (RAM) - 23.5 ms
dmaCopy (VRAM) - 3.1 ms
Copying in a loop - 13.9 ms
memcpy (RAM) - 8.8 ms
memcpy (VRAM) - 6.6 ms
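
To compare these timings more easily they can be converted to throughput. The sketch below assumes 16-bit pixels, so one full redraw moves 256*192*2 = 98304 bytes; that pixel size is an assumption, not something stated in the post, and `throughput_mib_per_s` is a made-up helper.

```c
#include <assert.h>

/* Convert "bytes moved in ms milliseconds" to MiB per second. */
static double throughput_mib_per_s(unsigned long bytes, double ms)
{
    assert(ms > 0.0);
    return ((double)bytes / (1024.0 * 1024.0)) / (ms / 1000.0);
}
```

Under that assumption the 3.1 ms dmaCopy result corresponds to roughly 30 MiB/s, and the 4.1 ms FastCopy result to roughly 23 MiB/s.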

And now I have investigated the memset functions:

memset (RAM) - 6.3 ms
memset (VRAM) - 4.4 ms
"dmaSet" (RAM) - 13.2 ms
"dmaSet" (VRAM) - 8.3 ms

And now one more problem: the DMA version of memset is slower than the software memset! Ehh...

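Here is my code for dmaSet.
Code:
// global variable used as the fixed DMA source (the fill value)
const uint32 dword_for_dmacopy[1] = { 0x00000000 };
...
...
DMA_SRC(3) = (uint32)dword_for_dmacopy;
DMA_DEST(3) = (uint32)BG_BMP_RAM_SUB(0);
// 32-bit transfers from a fixed source; the count is in 32-bit units
// (256*192/2 units, assuming a bitmap with 16-bit pixels)
DMA_CR(3) = DMA_ENABLE | DMA_32_BIT | DMA_SRC_FIX | DMA_START_NOW | (256 * 192 / 2);
while(DMA_CR(3) & DMA_BUSY); // busy-wait until the fill has finished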
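
The transfer count in the control word is easy to get wrong, because it is counted in transfer units rather than in bytes or pixels. A small sketch of the calculation (`dma_units` is a hypothetical helper; the 2 bytes per pixel in the example is an assumption about the bitmap format):

```c
#include <assert.h>
#include <stdint.h>

/* Number of DMA transfer units needed to cover a width x height bitmap.
   unit_bytes is 2 for 16-bit transfers and 4 for 32-bit transfers. */
static uint32_t dma_units(uint32_t width, uint32_t height,
                          uint32_t bytes_per_pixel, uint32_t unit_bytes)
{
    uint32_t total_bytes = width * height * bytes_per_pixel;

    assert(unit_bytes == 2 || unit_bytes == 4);
    assert(total_bytes % unit_bytes == 0);
    return total_bytes / unit_bytes;
}
```

For a 256x192 bitmap with 16-bit pixels and 32-bit transfers this gives 24576, which is the same value as the 256 * 192 / 2 used above.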


And now the updated question. =) How can we write memset and memcpy for VRAM that run faster than 4.1 ms and 3.1 ms respectively?


Last edited by HotChilli on Sun Sep 03, 2006 7:25 pm; edited 1 time in total

#101190 - Sausage Boy - Sun Sep 03, 2006 7:23 pm

You could just skip the wait and do other stuff there instead.

#101191 - HotChilli - Sun Sep 03, 2006 7:30 pm

Sausage Boy wrote:
You could just skip the wait and do other stuff there instead.


Yeah, that works in particular cases. But if, for example, I write to the bottom part of VRAM while the DMA is still running, it then erases my changes... That bugs me! :))
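
One way around that is to move the wait rather than drop it: start the transfer, run the logic that does not touch the destination, and poll the busy bit just before the first write to that buffer. A sketch of the polling part, written against a pointer to the control register so it can also be tried on a PC (`SKETCH_DMA_BUSY`, `dma_is_busy` and `wait_for_dma` are invented names; bit 31 is assumed to match the DMA_BUSY flag used in the code above):

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_DMA_BUSY (1u << 31)  /* assumed position of the busy/enable bit */

/* Nonzero while the channel behind this control register is still running. */
static int dma_is_busy(const volatile uint32_t *control)
{
    assert(control != 0);
    return (*control & SKETCH_DMA_BUSY) != 0;
}

/* Spin until the transfer has finished; returns how many polls it took. */
static unsigned wait_for_dma(const volatile uint32_t *control)
{
    unsigned polls = 0;

    while (dma_is_busy(control))
        polls++;
    return polls;
}

/* Intended order inside a frame (hardware calls shown as comments only):
     start the fill without the trailing while loop
     run game and effect logic that does not touch the buffer
     wait_for_dma(&DMA_CR(3));
     draw into the buffer                                              */
```

The wait then costs nothing if the logic in between took longer than the transfer, and the drawing code never races with the DMA.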

#101272 - HyperHacker - Mon Sep 04, 2006 3:08 am

FYI, when you do this:
while(something);
You're wasting CPU power doing essentially nothing, which is hard on the batteries if you do it a lot. I think swiDelay() (or was it swiSleep()? haven't done any DS stuff recently) actually stops the CPU for a while, which would be much easier on the batteries:
while(something) swiDelay(5);
Of course this means it could take up to 5 units (not sure what measurement that is) before it realizes DMA is done, so you'd want to adjust that number depending how fast you need the entire routine to be.

#101455 - HotChilli - Tue Sep 05, 2006 6:20 pm

HyperHacker wrote:
FYI, when you do this:
while(something);
You're wasting CPU power doing essentially nothing, which is hard on the batteries if you do it a lot. I think swiDelay() (or was it swiSleep()? haven't done any DS stuff recently) actually stops the CPU for a while, which would be much easier on the batteries:
while(something) swiDelay(5);
Of course this means it could take up to 5 units (not sure what measurement that is) before it realizes DMA is done, so you'd want to adjust that number depending how fast you need the entire routine to be.


Yeah, that looks better. ;)

PS: Have you tested your FastCopy function with different types of memory? VRAM? RAM? Or is it only for VRAM?

#101517 - HyperHacker - Wed Sep 06, 2006 6:34 am

I designed it for VRAM but I tested it with ordinary RAM too.