gbadev.org forum archive

Hi!

I'm trying to make an effect-based game or demo, i.e. smoke, optical effects, or visual effects like those in MediaPlayer or WinAmp. But there's a problem.

I use static arrays as back buffers for both screens, but copying them to video memory is too slow! Copying one static array with "dmaCopy" takes 5.1 ms. For 60 fps I have to finish everything in about 16 ms, and 16 - 2*5.1 = 5.8 ms. That's too little for effect logic, game logic, or anything else. :(

My question is: can I redraw a background (256*192 pixels) in less than 5 ms?
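
The frame budget above can be checked with a few lines of C. This is only a sketch: frame_budget_us is a hypothetical helper (not part of libnds), and it uses 1/60 s (about 16.7 ms) instead of the rounded 16 ms.

```c
#include <assert.h>

/* Microseconds left in one frame after `copies` buffer uploads
   that each take `copy_us` microseconds. Hypothetical helper. */
long frame_budget_us(long fps, long copy_us, long copies)
{
    assert(fps > 0);
    long frame_us = 1000000L / fps;      /* 16666 us at 60 fps */
    return frame_us - copies * copy_us;  /* time left for everything else */
}
```

With the numbers from the post, frame_budget_us(60, 5100, 2) leaves 6466 us, i.e. roughly 6.5 ms for effect and game logic.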

Doesn't dmaCopy() return immediately, effectively parallelizing the copying and the rest of the code? If so, you have no problem.

If you use framebuffer with two banks, it's easy to achieve 60fps output no matter what you do... Something like this:

Code:

//screen init stuff:

u16* backBuffer;
u16* frontBuffer;
int screenBufferActive=0;

videoSetMode(MODE_FB1);
...
vramSetMainBanks(VRAM_A_LCD, VRAM_B_LCD, VRAM_C_SUB_BG, VRAM_D_SUB_SPRITE);

backBuffer = VRAM_A;
frontBuffer = VRAM_B;

// while blitting the screen:

swiWaitForVBlank();

if(screenBufferActive==1) {
    videoSetMode(MODE_FB1);
    frontBuffer = VRAM_B;
    backBuffer = VRAM_A;
    screenBufferActive = 0;
} else {
    videoSetMode(MODE_FB0);
    frontBuffer = VRAM_A;
    backBuffer = VRAM_B;
    screenBufferActive = 1;
}

This assumes you always draw to backBuffer.
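
The swap logic can also be tried out on a PC. The following is a minimal sketch, not libnds code: two plain arrays stand in for VRAM banks A and B, and FrameBuffers, fb_init and fb_flip are hypothetical names. On hardware, the flip is where videoSetMode() would be called after swiWaitForVBlank().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SCREEN_PIXELS (256 * 192)

/* Plain arrays standing in for VRAM banks A and B (assumption for host testing). */
uint16_t bank_a[SCREEN_PIXELS];
uint16_t bank_b[SCREEN_PIXELS];

typedef struct {
    uint16_t *front;  /* buffer currently shown on screen */
    uint16_t *back;   /* buffer being drawn into */
} FrameBuffers;

/* Start by showing bank B and drawing into bank A, as in the post above. */
void fb_init(FrameBuffers *fb)
{
    assert(fb != NULL);
    fb->front = bank_b;
    fb->back = bank_a;
}

/* Swap the two roles; no pixel data is copied. */
void fb_flip(FrameBuffers *fb)
{
    assert(fb != NULL);
    uint16_t *tmp = fb->front;
    fb->front = fb->back;
    fb->back = tmp;
}
```

Because only two pointers change, the cost of presenting a frame does not depend on the screen size, which is the point of this approach.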

bjoerngiesler wrote:

Doesn't dmaCopy() return immediately, effectively parallelizing the copying and the rest of the code? If so, you have no problem.

No, the libnds dmaCopy* functions wait until the copy is done before they return.

This is what I use. I don't know how fast it is, but unlike with dmaCopy() I've yet to run out of VBlank time using it.

Code:

.arm
.align

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ FastCopy
@ Copies memory really fast.
@ Inputs:
@ -r0: Source
@ -r1: Destination
@ -r2: # of bytes
@ -r3: Copy size (0=byte, 1=short (2 bytes), 2=int (4 bytes))
@ Notes:
@ -When there are less than 44 bytes remaining, they're copied one at a time.
@ You can copy less than 44 bytes, but it's not going to be very fast.
@ -The size parameter is provided to get around memory access limitations of
@ the Nintendo DS. Limitations that I know of are:
@ -Cannot write 8 bits to VRAM
@ -Cannot write 8 or 32 bits to GBA ROM
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

.global FastCopy
.type FastCopy,function
FastCopy:
stmfd sp!, {r3-r12, lr}
sub sp, sp, #4
str r3, [sp] @We need this later, but for now, r3 can assist in copying.

.loop:
cmp r2, #44
blt .part2 @by checking here, we can copy less than 44 bytes in one call (but it's slow)
ldmia r0!, {r3-r12, lr}
stmia r1!, {r3-r12, lr}
sub r2, r2, #44 @copy 44 bytes at a time (!), and we need those flags updated
b .loop

@copy remaining bytes one byte/short/int at a time, so we get them all.
.part2:
@add r2, r2, #4
ldr r3, [sp] @Find unit size
add sp, sp, #4
cmp r3, #0
beq .loop2_byte @0=byte
subs r3, r3, #1
beq .loop2_short @1=short
@otherwise it must be 2=int (we'll just pretend anything else is 2)

.loop2_int:
subs r2, r2, #1
bmi .end
ldr r3, [r0, #1]!
str r3, [r1, #1]!
b .loop2_int

.loop2_short:
subs r2, r2, #1
bmi .end
ldrh r3, [r0, #1]!
strh r3, [r1, #1]!
b .loop2_short

.loop2_byte:
subs r2, r2, #1
bmi .end
ldrb r3, [r0, #1]!
strb r3, [r1, #1]!
b .loop2_byte

.end:
@add r2, r2, #1
ldmfd sp!,{r3-r12,lr}
bx lr

.pool

Prototyped like so:

Code:

#ifdef __cplusplus //Non-mangled names plz
extern "C" {
#endif

extern void FastCopy(void* Src, void* Dest, u32 NumBytes, u32 Size);

#ifdef __cplusplus
}
#endif //__cplusplus

I've been trying to figure out how to put it in ITCM for even more speed, no luck though.

Unless ARM instructions have changed considerably between ARM7 and ARM9, that code is very unsafe, and in some cases non-optimal.

Code:

.global FastCopy
.type FastCopy,function
FastCopy:
stmfd sp!, {r3-r12, lr}
sub sp, sp, #4
str r3, [sp] @We need this later, but for now, r3 can assist in copying.

You're storing r3 twice here, for no apparent reason. Once is enough.

Code:

.loop:
cmp r2, #44
blt .part2 @by checking here, we can copy less than 44 bytes in one call (but it's slow)
ldmia r0!, {r3-r12, lr}
stmia r1!, {r3-r12, lr}
sub r2, r2, #44 @copy 44 bytes at a time (!), and we need those flags updated
b .loop

You're assuming r0 and r1 are word aligned, which would result in incorrect copying if they're not. They usually will be, but a general routine should check for misalignments.
Also, there is no need for two branches in the loop, one is enough. This goes for all the loops in the routine. The compare is also redundant, or at least could be taken out of the loop.

Code:

ldr r3, [sp]
add sp, sp, #4

This is simply a 'pop {r3}'

Code:

.loop2_int:
subs r2, r2, #1
bmi .end
ldr r3, [r0, #1]!
str r3, [r1, #1]!
b .loop2_int

This loop fails completely because words need a #4 offset, not #1. The halfword loop has the same problem and needs #2. And since r2 is the byte count, not the word count, the word loop runs four times as many iterations as it should.

Code:

#ifdef __cplusplus //Non-mangled names plz
extern "C" {
#endif

extern void FastCopy(void* Src, void* Dest, u32 NumBytes, u32 Size);

#ifdef __cplusplus
}
#endif //__cplusplus

Use `const void*' for the source; this saves the users the trouble of having to cast away the constness of the data.
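
To make the stride and count issue concrete, here is a C sketch of what the tail loops are meant to do. It is not a replacement for the assembly: copy_units is a hypothetical name, and the parameters mirror FastCopy's (bytes in, unit size as 0/1/2). Each unit advances the pointers by its own size, and the iteration count is the byte count divided by the unit size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* unit_log2: 0 = byte, 1 = halfword, 2 = word (same encoding as FastCopy's r3). */
void copy_units(const void *src, void *dst, size_t num_bytes, unsigned unit_log2)
{
    assert(unit_log2 <= 2);
    size_t unit = (size_t)1 << unit_log2;
    assert(num_bytes % unit == 0);                          /* whole units only */
    assert(((uintptr_t)src | (uintptr_t)dst) % unit == 0);  /* aligned to unit  */
    size_t count = num_bytes >> unit_log2;                  /* bytes / unit size */

    if (unit_log2 == 0) {
        const uint8_t *s = src; uint8_t *d = dst;
        while (count--) *d++ = *s++;      /* pointers advance by 1 */
    } else if (unit_log2 == 1) {
        const uint16_t *s = src; uint16_t *d = dst;
        while (count--) *d++ = *s++;      /* pointers advance by 2 */
    } else {
        const uint32_t *s = src; uint32_t *d = dst;
        while (count--) *d++ = *s++;      /* pointers advance by 4 */
    }
}
```

Note that the source is declared const void*, as suggested above.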

There are good reasons for some of that, actually.

Cearn wrote:

Code:

.global FastCopy
.type FastCopy,function
FastCopy:
stmfd sp!, {r3-r12, lr}
sub sp, sp, #4
str r3, [sp] @We need this later, but for now, r3 can assist in copying.

You're storing r3 twice here, for no apparent reason. Once is enough.

This is because I need to copy, then restore r3 and not all those other registers. I don't know of any way to push multiple registers and then only pop one, so I did push all, push r3, copy, pop r3, do some more, pop all.

Quote:

Code:

.loop:
cmp r2, #44
blt .part2 @by checking here, we can copy less than 44 bytes in one call (but it's slow)
ldmia r0!, {r3-r12, lr}
stmia r1!, {r3-r12, lr}
sub r2, r2, #44 @copy 44 bytes at a time (!), and we need those flags updated
b .loop

You're assuming r0 and r1 are word aligned, which would result in incorrect copying if they're not. They usually will be, but a general routine should check for misalignments.
Also, there is no need for two branches in the loop, one is enough. This goes for all the loops in the routine. The compare is also redundant, or at least could be taken out of the loop.

True, but I don't know what exactly I would do if they're not aligned. I don't see how I can remove the second branch, and the compare is how it knows when to stop (notice r2 is subtracted).

For the rest, it's mainly because I've never written ARM ASM before. >_>

HyperHacker wrote:

There are good reasons for some of that, actually.

Cearn wrote:

Code:

.global FastCopy
.type FastCopy,function
FastCopy:
stmfd sp!, {r3-r12, lr}
sub sp, sp, #4
str r3, [sp] @We need this later, but for now, r3 can assist in copying.

You're storing r3 twice here, for no apparent reason. Once is enough.

This is because I need to copy, then restore r3 and not all those other registers. I don't know of any way to push multiple registers and then only pop one, so I did push all, push r3, copy, pop r3, do some more, pop all.

stmfd will place the registers in the list in memory in the order of the register-indices, in your case r3 would have the lowest address (i.e., the one that sp will point to in the end), then r4 above that, then r5 etc. This should suffice:

Code:

FastCopy:
push {r3-r11,lr} @identical to stmfd sp!, {r3-r11,lr}

@ copy main blocks

pop {r3} @ pop r3 ( same as ldmfd sp!, {r3} )

@ copy smaller stuff

pop {r4-r11,lr} @ pop rest ( same as ldmfd sp!, {r4-r11,lr} )
bx lr

Quote:

Code:

.loop:
cmp r2, #44
blt .part2 @by checking here, we can copy less than 44 bytes in one call (but it's slow)
ldmia r0!, {r3-r12, lr}
stmia r1!, {r3-r12, lr}
sub r2, r2, #44 @copy 44 bytes at a time (!), and we need those flags updated
b .loop

You're assuming r0 and r1 are word aligned, which would result in incorrect copying if they're not. They usually will be, but a general routine should check for misalignments.
Also, there is no need for two branches in the loop, one is enough. This goes for all the loops in the routine. The compare is also redundant, or at least could be taken out of the loop.

True, but I don't know what exactly I would do if they're not aligned. I don't see how I can remove the second branch, and the compare is how it knows when to stop (notice r2 is subtracted).

If source and destination cannot both be brought to word alignment then you're pretty much screwed; you'd have to do it the slow way then. Although you could always unroll part of the loop I guess. As for the basic copy loop, they usually look something like this:

Code:

cmp r2, #4
bcc .Lcpy_end @ Don't copy if there's nothing to do
.Lcpy_loop:
ldr r3, [r0], #4 @ tmp= *r0++;
str r3, [r1], #4 @ *r1++ = tmp
subs r2, r2, #4 @ count -= 4
bhi .Lcpy_loop @ if(count>0) goto .Lcpy_loop
.Lcpy_end:
@stuff

The body of the loop needs only a load, a store, a subtract and a jump. That you have to make a compare outside the loop to test if it's worth going into the loop is of little concern, as it'll only be executed once anyway. In some cases you might even get away with

Code:

.Lcpy_loop:
subs r2, r2, #4 @ count -= 4
ldrcs r3, [r0], #4 @ if(count>=0) tmp= *r0++;
strcs r3, [r1], #4 @ if(count>=0) *r1++ = tmp
bhi .Lcpy_loop @ if(count>0) goto .Lcpy_loop

if you already know r2 won't be used later.
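
For comparison, the same idea can be written in C: check the alignment once, copy whole words in the main loop, then deal with the leftover bytes. This is a sketch under assumptions, with hypothetical names (both_word_aligned, copy_checked) and memcpy standing in for "the slow way". The byte-wise tail is fine for main RAM but not for VRAM, which does not accept 8-bit writes as noted earlier in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero when both pointers are 4-byte aligned. */
int both_word_aligned(const void *src, const void *dst)
{
    return (((uintptr_t)src | (uintptr_t)dst) & 3u) == 0;
}

void copy_checked(const void *src, void *dst, size_t num_bytes)
{
    assert(src != NULL && dst != NULL);
    if (!both_word_aligned(src, dst)) {
        memcpy(dst, src, num_bytes);   /* stand-in for the slow path */
        return;
    }
    const uint32_t *s = src;
    uint32_t *d = dst;
    while (num_bytes >= 4) {           /* whole words only */
        *d++ = *s++;
        num_bytes -= 4;
    }
    const uint8_t *sb = (const uint8_t *)s;
    uint8_t *db = (uint8_t *)d;
    while (num_bytes--)                /* 0..3 leftover bytes; not VRAM-safe */
        *db++ = *sb++;
}
```

Testing num_bytes >= 4 before each word keeps the loop from copying past the end when the count is not a multiple of four.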

Quote:

For the rest, it's mainly because I've never written ARM ASM before. >_>

In that case, be even more careful :P
It's very easy to make mistakes in asm, especially when it concerns either alignment of data or off-by-one errors in loops. In fact, I think I may have just discovered one of those in my own FastCopy routine :P. Always check for potential breaks in code; this is true in C, and is doubly true in asm.

Thanks, guys. :) But here is a stupid question: how can I compile ARM asm code in devkitPro? :))

PS: _asm { } doesn't work either... Maybe it needs some special compiler parameters?

If you want to do it within your C++ code, make sure the file name tells gcc that you are doing C++ (like .cpp). If you want a standalone file, just name it .s or .S. C as a language does not support inline assembly, so putting the code in a C source file is problematic at best. There is also a command-line switch that forces the file type so the compiler driver does not need to guess. Sorry for not giving direct examples, but my msys is broken right now because I switched to another laptop and tried to just copy it (what a fool). Hope this helps.

gmiller wrote:

If you want to do it within your C++ code, make sure the file name tells gcc that you are doing C++ (like .cpp). If you want a standalone file, just name it .s or .S. C as a language does not support inline assembly, so putting the code in a C source file is problematic at best. There is also a command-line switch that forces the file type so the compiler driver does not need to guess. Sorry for not giving direct examples, but my msys is broken right now because I switched to another laptop and tried to just copy it (what a fool). Hope this helps.

Yeah, thanks. :) With a .s file, the asm code from the message above compiles properly. The makefile rule (with the compiler parameters) is:

asm.o: asm.s
	arm-eabi-gcc -x assembler-with-cpp -c $< -o $@

OK. But FastCopy is slower than its DMA counterpart (!) with video memory and doesn't work at all (!!!) with RAM (global arrays, for example).

Here are the values from my handmade profiler =) (tested on DS hardware; each function does a full redraw, i.e. updates 256*192 pixels):

FastCopy (RAM) - crash
FastCopy (VRAM) - 4.1 ms :(
dmaCopy (RAM) - 23.5 ms
dmaCopy (VRAM) - 3.1 ms
Copying in a loop - 13.9 ms
memcpy (RAM) - 8.8 ms
memcpy (VRAM) - 6.6 ms
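
To compare these timings with each other, they can be turned into throughput. The sketch below assumes a 16-bit-per-pixel screen (256*192*2 = 98304 bytes), which matches the word count used in the DMA code later in this post; bytes_per_ms is a hypothetical helper.

```c
#include <assert.h>

/* Bytes in one 256x192 screen at 16 bits per pixel (assumed pixel format). */
#define SCREEN_BYTES (256L * 192L * 2L)

/* Throughput in bytes per millisecond for a full-screen copy that took
   `tenths_ms` tenths of a millisecond (e.g. 31 for 3.1 ms). */
long bytes_per_ms(long tenths_ms)
{
    assert(tenths_ms > 0);
    return SCREEN_BYTES * 10L / tenths_ms;
}
```

Under that assumption, the 3.1 ms case works out to 31710 bytes per millisecond and the 23.5 ms case to 4183 bytes per millisecond.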

I have also measured the memset functions:

memset (RAM) - 6.3 ms
memset (VRAM) - 4.4 ms
"dmaSet" (RAM) - 13.2 ms
"dmaSet" (VRAM) - 8.3 ms

And here is another problem: the DMA counterpart of memset is slower than the software memset! Ehh...

Here is my code for the DMA set.

Code:

// global variable for dma
const uint32 dword_for_dmacopy[1] = { 0x00000000 };
...
...
DMA_SRC(3) = (uint32)dword_for_dmacopy;
DMA_DEST(3) = BG_BMP_RAM_SUB(0);
DMA_CR(3) = DMA_ENABLE | DMA_32_BIT | DMA_SRC_FIX | DMA_START_NOW | (256 * 192 / 2);
while(DMA_CR(3) & DMA_BUSY);

And now the updated question. =) How can we write a memset and a memcpy for VRAM that run faster than 4.1 ms and 3.1 ms respectively?
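
One software option for the fill side is a loop that writes whole 32-bit words, unrolled a few times. The sketch below only shows the technique; fill_words is a hypothetical name, and nothing here has been timed on DS hardware, so it makes no claim about beating the numbers above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill `words` 32-bit words at dst with `value`; dst must be 4-byte aligned. */
void fill_words(uint32_t *dst, uint32_t value, size_t words)
{
    assert(dst != NULL);
    assert(((uintptr_t)dst & 3u) == 0);
    while (words >= 4) {               /* four stores per loop iteration */
        dst[0] = value;
        dst[1] = value;
        dst[2] = value;
        dst[3] = value;
        dst += 4;
        words -= 4;
    }
    while (words--)                    /* 0..3 remaining words */
        *dst++ = value;
}
```

For a 16-bit 256*192 screen the word count would be 256 * 192 / 2, the same count used in the DMA code above. Only 32-bit stores are issued, so the VRAM restriction on 8-bit writes does not apply.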

Last edited by HotChilli on Sun Sep 03, 2006 7:25 pm; edited 1 time in total

You could just skip the wait and do other stuff there instead.

Sausage Boy wrote:

You could just skip the wait and do other stuff there instead.

Yeah. That works in some cases. But if, for example, I then write to the bottom part of VRAM, the DMA that is still running erases my changes... That bugs me! :))

FYI, when you do this:
while(something);
You're wasting CPU power doing essentially nothing, which is hard on the batteries if you do it a lot. I think swiDelay() (or was it swiSleep()? haven't done any DS stuff recently) actually stops the CPU for a while, which would be much easier on the batteries:
while(something) swiDelay(5);
Of course this means it could take up to 5 units (not sure what measurement that is) before it realizes DMA is done, so you'd want to adjust that number depending how fast you need the entire routine to be.

HyperHacker wrote:

FYI, when you do this:
while(something);
You're wasting CPU power doing essentially nothing, which is hard on the batteries if you do it a lot. I think swiDelay() (or was it swiSleep()? haven't done any DS stuff recently) actually stops the CPU for a while, which would be much easier on the batteries:
while(something) swiDelay(5);
Of course this means it could take up to 5 units (not sure what measurement that is) before it realizes DMA is done, so you'd want to adjust that number depending how fast you need the entire routine to be.

Yeah. That looks better. ;)

PS: Have you tested your FastCopy function with different types of memory? VRAM? RAM? Or is it only for VRAM?

I designed it for VRAM but I tested it with ordinary RAM too.

DS development > Full screen redraw (effect-based) game - slow DMA copying!

#100850 - HotChilli - Thu Aug 31, 2006 3:00 am

#100884 - bjoerngiesler - Thu Aug 31, 2006 11:08 am

#100888 - melw - Thu Aug 31, 2006 12:44 pm

#100905 - josath - Thu Aug 31, 2006 5:04 pm

#101037 - HyperHacker - Fri Sep 01, 2006 9:17 pm

#101043 - Cearn - Fri Sep 01, 2006 9:55 pm

#101087 - HyperHacker - Sat Sep 02, 2006 7:25 am

#101092 - Cearn - Sat Sep 02, 2006 9:58 am

#101162 - HotChilli - Sun Sep 03, 2006 2:37 pm

#101167 - gmiller - Sun Sep 03, 2006 3:02 pm

#101189 - HotChilli - Sun Sep 03, 2006 7:16 pm

#101190 - Sausage Boy - Sun Sep 03, 2006 7:23 pm

#101191 - HotChilli - Sun Sep 03, 2006 7:30 pm

#101272 - HyperHacker - Mon Sep 04, 2006 3:08 am

#101455 - HotChilli - Tue Sep 05, 2006 6:20 pm

#101517 - HyperHacker - Wed Sep 06, 2006 6:34 am