#38325 - ymalik - Sat Mar 26, 2005 4:42 am
Which is faster - DMA or CpuFastSet() - and which should you use when?
Thanks,
Yasir
#38326 - tepples - Sat Mar 26, 2005 5:20 am
Both have some overhead.
Use CpuFastSet() if you're clearing memory (as opposed to copying). Use CpuFastSet() if you have interrupts (e.g. sound, serial, hblank) that will need servicing before the copy completes, as DMA blocks interrupts until it finishes. Otherwise, feel free to use DMA 3 for memory-to-memory copies.
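For reference, here's roughly what the two options look like from C. This is a minimal sketch, not from any particular library: the register names follow GBATEK/common GBA headers, and cpu_fast_set is a hand-rolled wrapper for BIOS call 0x0C (a libgba-style CpuFastSet() would do the same job).
Code: |
typedef unsigned int u32;

#define REG_DMA3SAD  (*(volatile u32*)0x040000D4)  // DMA 3 source address
#define REG_DMA3DAD  (*(volatile u32*)0x040000D8)  // DMA 3 destination address
#define REG_DMA3CNT  (*(volatile u32*)0x040000DC)  // DMA 3 count + control
#define DMA_ENABLE   (1u << 31)
#define DMA_32       (1u << 26)                    // 32-bit units

// BIOS CpuFastSet: r0=src, r1=dst, r2=word count (multiple of 8),
// with bit 24 of r2 set = fill mode (read one word, write it repeatedly).
// (swi number as encoded from Thumb; ARM-state code uses swi 0x0C0000.)
static void cpu_fast_set(const void* src, void* dst, u32 ctrl)
{
    register const void* r0 asm("r0") = src;
    register void*       r1 asm("r1") = dst;
    register u32         r2 asm("r2") = ctrl;
    asm volatile("swi 0x0C" : "+r"(r0), "+r"(r1), "+r"(r2) : : "r3", "memory");
}

void examples(const u32* src, u32* vram)
{
    u32 zero = 0;
    cpu_fast_set(&zero, vram, (1u << 24) | 256);   // clear 1 KiB (fill mode)

    REG_DMA3SAD = (u32)src;                        // copy 1 KiB with DMA 3;
    REG_DMA3DAD = (u32)vram;                       // the CPU (and any IRQs)
    REG_DMA3CNT = DMA_ENABLE | DMA_32 | 256;       // stall until it finishes
}
|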
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.
#38450 - Miked0801 - Mon Mar 28, 2005 12:24 am
Use neither for copies under ~32 bytes - a structure-to-structure copy is faster in those cases in Thumb, and up to about 64 bytes in ARM.
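A minimal sketch of what that means, assuming word-aligned data (the type name is made up): for small fixed-size blocks, a plain struct assignment compiles to a short ldm/stm sequence with none of the DMA/BIOS setup cost.
Code: |
typedef unsigned int u32;
typedef struct { u32 w[8]; } Block32;   // 32 bytes, word-aligned

// Copies 32 bytes; GCC typically emits an ldmia/stmia pair for this,
// with no register setup, swi, or DMA-register overhead.
void copy_block(Block32* dst, const Block32* src)
{
    *dst = *src;
}
|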
#38468 - Steve++ - Mon Mar 28, 2005 5:20 am
I was about to start a new topic, then I realised this one is probably about the same (or similar) thing.
I've been using CpuFastSet for ROM-to-VRAM tile copies (64 bytes). I'm assuming this uses LDMIA/STMIA instructions. What I want to do is write some ARM asm code for copying a row of tiles at a time. To make this fast, I'll be using LDMIA/STMIA directly. This is my question: How many cycles does it take to copy 64 bytes from ROM to VRAM using LDMIA/STMIA (not including time to setup the registers)? I've never been able to find an answer to this, whether searching the forums or trying to figure it out from the available docs (which are vague on cartridge wait states). I'm asking this question because I'm about to write a tile engine that can load up to 2816 tiles per frame and I'm wondering if it's feasible.
#38489 - ymalik - Mon Mar 28, 2005 3:31 pm
Did you check out what I came up with regarding scrolling the massive image?
#38510 - Miked0801 - Mon Mar 28, 2005 7:05 pm
To Steve++
Here's what no$gba reports for various memory-location timings, all working off 32 bytes (double this, plus a few cycles of overhead, for 64 bytes if you don't unroll):
Copies (32 bytes):
Slow to Fast - 59 cycles
Fast to Slow - 59 cycles
Fast to Fast - 19 cycles
Slow to Slow - 99 cycles
ROM to Fast - 45 cycles
ROM to Slow - 85 cycles
Fast to VRAM - 27 cycles
Slow to VRAM - 67 cycles
ROM to VRAM - 53 cycles
Reads/writes (32 bytes):
Read from ROM - 39 cycles
Read from Slow RAM - 50 cycles
Read from Fast RAM - 10 cycles
Write to VRAM - 17 cycles
Write to Slow RAM - 49 cycles
Write to Fast RAM - 9 cycles
So ROM to VRAM for 64 bytes would be 53x2, plus about 5 cycles of overhead for the loop check (or 0 cycles if you unroll it) - about 106 cycles best case for 64 bytes. Also, there will be an occasional extra cycle when accessing VRAM, but how much this affects things I can't tell you.
Edit: What edit - did I edit something stupid away? ;)
Last edited by Miked0801 on Wed Mar 30, 2005 6:48 pm; edited 1 time in total
#38511 - poslundc - Mon Mar 28, 2005 7:09 pm
Steve++ wrote: |
I've been using CpuFastSet for ROM-to-VRAM tile copies (64 bytes). I'm assuming this uses LDMIA/STMIA instructions. What I want to do is write some ARM asm code for copying a row of tiles at a time. To make this fast, I'll be using LDMIA/STMIA directly. This is my question: How many cycles does it take to copy 64 bytes from ROM to VRAM using LDMIA/STMIA (not including time to setup the registers)? I've never been able to find an answer to this, whether searching the forums or trying to figure it out from the available docs (which are vague on cartridge wait states). I'm asking this question because I'm about to write a tile engine that can load up to 2816 tiles per frame and I'm wondering if it's feasible. |
Part of the problem is that the timing for ROM accesses can change depending on the setting in the WAITCNT register. According to the bible, most modern cartridges can support 3/1 cycle timing, and most games will use this setting. The default setting when the GBA turns on, however, is always 4/2.
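For what it's worth, flipping that setting looks something like this - a minimal sketch assuming GBATEK's bit layout (and note it clears every other WAITCNT field, including the prefetch bit, so real code would read-modify-write):
Code: |
#define REG_WAITCNT (*(volatile unsigned short*)0x04000204)

// Move ROM (wait state 0) from the 4/2 power-on default to 3/1:
// bits 2-3 = first access (1 -> 3 waits), bit 4 = second access (1 -> 1 wait).
void set_rom_waits_3_1(void)
{
    REG_WAITCNT = (1 << 2) | (1 << 4);
}
|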
Let's say you use LDM to load 8 registers from ROM. An LDM instruction takes nS + 1N + 1I:
- n = 8 since you are transferring 8 words
- S is the cycles to transfer a 32-bit sequential word, keeping in mind it must be split into two sequential 16-bit accesses (16-bit bus width on the ROM)
- N is the cycles to transfer a 32-bit non-sequential word, with the second access of the two being sequential
- I is an internal cycle (takes one clock cycle).
If you are using the default WAITCNT:
S = (2 + 1) + (2 + 1) = 6
N = (4 + 1) + (2 + 1) = 8
Total cycles: 8(6) + 8 + 1 = 57
If you are using the 3/1 setting for WAITCNT:
S = (1 + 1) + (1 + 1) = 4
N = (3 + 1) + (1 + 1) = 6
Total cycles: 8(4) + 6 + 1 = 39
Then to store the 8 registers to VRAM with an STM, timing is (n - 1)S + 2N. VRAM has a waitstate of 0/0 but the bus is 16-bit, so you are still required to perform two accesses in order to load/store 32 bits.
S = (0 + 1) + (0 + 1) = 2
N = (0 + 1) + (0 + 1) = 2
Total cycles: 7(2) + 2(2) = 18
So there you have it: a LDM/STM burst using 8 registers from ROM to VRAM will by default take 75 cycles, or 57 cycles if the 3/1 waitstate is being used.
I've done these calculations based on the ARM docs and what's in GBATEK; I leave it to my peers to correct any mistakes or misinterpretations I may have made.
Dan.
EDIT: Mike posted first... hm... his numbers seem roughly consistent with mine, so I am reasonably satisfied with them.
#38584 - Steve++ - Tue Mar 29, 2005 9:23 am
Thanks guys. I calculated how many cycles 60 frames of copying would consume using DMA and it was about 22 million. This is a lot better: assuming 57 cycles per tile, 2816 tiles x 60 frames x 57 cycles is about 9.6 million cycles each second - well within the roughly 16.8 million the CPU has available.
I wouldn't imagine the overhead will be too great. I'll write an assembler routine that loads a whole row (32 tiles) without any branching. I'm also interested in encoding rows of tiles as actual asm code. But that's probably better left for when I have a lot of time on my hands.
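Roughly what I have in mind - an untested sketch, assuming 32-byte (4bpp) tiles and code built as ARM (ldm with r8/r9 isn't available from Thumb):
Code: |
typedef unsigned int u32;

// Copy one row of 32 tiles (32 x 32 bytes) with no branches: .rept unrolls
// the ldmia/stmia pair 32 times at assembly time.
void copy_tile_row(const u32* src, u32* dst)
{
    asm volatile(
        ".rept 32              \n"
        "ldmia %0!, {r2-r9}    \n"   // 8 words = one tile
        "stmia %1!, {r2-r9}    \n"
        ".endr                 \n"
        : "+r"(src), "+r"(dst)
        : : "r2","r3","r4","r5","r6","r7","r8","r9","memory");
}
|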
Thanks again.
#38608 - Cearn - Tue Mar 29, 2005 4:01 pm
Miked0801 wrote: |
Fast to VRAM 67 cycles
Slow to VRAM 27 cycles
ROM to VRAM 53 cycles
|
Errr, that should be ROM 67, Slow 53 and Fast 27, right?
I've been doing a lot of speed testing recently and can't get near these figures - what exactly is the code you're using? Something like an eight-register ldmia/stmia ARM loop in IWRAM, right? If so, that might explain why the figures are so different from mine. In C, compiled with -mthumb -O2, I can only get around 112 cycles for EWRAM code, and perhaps 75 cycles in IWRAM, for 8x u32 Slow->VRAM copies.
#38625 - Miked0801 - Tue Mar 29, 2005 7:17 pm
Nope, I've got my figures correct (as far as I can test) - ROM access (when set correctly) is 3/1/1 and EWRAM is 2/2/2, so it will have an extra wait per 4 bytes over ROM (after the first 2 registers).
And yep, just a simple ARM in IWRAM loop of:
Code: |
ldmia r0!, {r2-r9}
stmia r1!, {r2-r9}
|
where r0 and r1 are the source and dest pointers, and you of course have safely pushed/popped everything beforehand.
For Thumb, it will be quite a bit slower. The best I've seen compilers do is using 3 registers for the copy:
Code: |
ldmia r0!, {r2-r5}
stmia r1!, {r2-r5}
|
Which means you'll get the 3-cycle start overhead of ROM access 6 times. Still, with the wait states set correctly, Thumb reading/writing to VRAM would only lose up to around 60 cycles. Check your assembly output and see exactly what sort of lameness it is doing for you. I bet it's doing something silly like:
Code: |
foo:
    ldr  r2, [r0], #4   @ 6 cycles
    str  r2, [r1], #4   @ 10 cycles
    sub  r3, r3, #1     @ 2 cycles
    cmp  r3, #0         @ 2 cycles
    bne  foo            @ 8 cycles the first 7 times, 2 the last
|
Keep in mind you get ROM waits when running code from ROM...
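To dodge those waits, put the copy routine itself in IWRAM. A rough sketch - the section name depends on your linker script (this is the devkitARM-style name, so treat it as an assumption):
Code: |
typedef unsigned int u32;

// Hot loop placed in IWRAM so instruction fetches don't pay ROM wait states.
__attribute__((section(".iwram"), long_call))
void fast_copy(u32* dst, const u32* src, int words)
{
    while (words--)
        *dst++ = *src++;
}
|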
#38630 - DekuTree64 - Tue Mar 29, 2005 7:31 pm
Cearn wrote: |
Miked0801 wrote: |
Fast to VRAM 67 cycles
Slow to VRAM 27 cycles
ROM to VRAM 53 cycles
|
Errr, that should be ROM 67, Slow 53 and Fast 27, right? |
I think ROM is correct here, just slow and fast RAM are swapped. You can count it up from the cycles for reading/writing alone (e.g. read-from-Fast 10 + write-to-VRAM 17 = 27, matching the Fast to VRAM figure; read-from-Slow 50 + 17 = 67).
_________________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku
#38773 - Cearn - Thu Mar 31, 2005 4:56 pm
Miked0801 wrote: |
Nope, I've got my figures correct (as far as I can test) - ROM access (when set correctly) is 3/1/1 and EWRAM is 2/2/2 so it will have an extra wait per 4 bytes over ROM (after the first 2 registers)
And yep, just a simple ARM in IWRAM loop of:
Code: | ldmia r0!, {r2-r9}
stmia r1!, {r2-r9}
|
where r0 and r1 are source and dest pointers and you of course have safely pushed/popped everything before hand.
|
Redid the tests in ARM asm manually and got 69 for the Slow->VRAM case; allowing for setting up the timers this is close enough.
DekuTree64 wrote: |
I think ROM is correct here, just slow and fast RAM are swapped.
|
Never really messed with REG_WAITCNT before, so I guess I was just used to seeing ROM being a little slower than EWRAM. OK, thanks all for the clarifications.
Miked0801 wrote: |
For Thumb, it will be quite a bit slower. The best I've seen compilers do is using 3 registers for the copy:
Code: |
ldmia r0!, {r2-r5}
stmia r1!, {r2-r5}
|
|
r2, r3, r4, r5. That's 4 registers :p (sorry, sorry). Compiled ARM code does use up to four but that's about it, yeah.
Miked0801 wrote: |
Check your assembly output and see exactly what sort of lameness it is doing for you. I bet it's doing something silly like:
Code: |
foo:
    ldr  r2, [r0], #4   @ 6 cycles
    str  r2, [r1], #4   @ 10 cycles
    sub  r3, r3, #1     @ 2 cycles
    cmp  r3, #0         @ 2 cycles
    bne  foo            @ 8 cycles the first 7 times, 2 the last
|
|
That's only moderately lame. Now this is lame:
Code: |
#define vid_mem ((u16*)0x06000000)

int ii;
u32 *src= (u32*)someBitmap;
u32 *dst= (u32*)vid_mem;
for(ii=0; ii<19200; ii++)
    dst[ii]= src[ii];

// is compiled (-O2 or -O3, -mthumb) into
@ r0= counter, r1= counter too, r2= dst, r3= data, r4= src
.L6:
    mov  r3, #192       @ - 0x06000000
    lsl  r3, r3, #19    @ /
    add  r2, r1, r3
    ldr  r3, [r1, r4]
    sub  r0, r0, #1
    str  r3, [r2]
    add  r1, r1, #4
    cmp  r0, #0
    bne  .L6
|
Apart from the superfluous adds and cmp, the retrieval of the VRAM address goes inside the loop. This will happen whenever you have an address as a #define or a constant pointer. In contrast, a while countdown gives much better results:
Code: |
int nn= 19200;
u32 *src= (u32*)someBitmap;
u32 *dst= (u32*)vid_mem;
while(nn--)
    *dst++ = *src++;

// is compiled (-O2 or -O3, -mthumb) into
@ r0= dst, r2= nn, r3= data, r4= src
.L17:
    ldmia r4!, {r3}
    stmia r0!, {r3}
    sub   r2, r2, #1
    bcs   .L17
|
I think this is the best you could hope for where u32 copies are concerned, and it's roughly 50% faster than the awful for-loop.
#38776 - Miked0801 - Thu Mar 31, 2005 6:37 pm
Quote: |
That's 4 registers :p (sorry, sorry).
|
This is what I get for seat of the pants assembly coding :)
And yeah, I've found for loops to not be as efficient as while() loops. For whatever reason it likes to do a sub and a cmp. Your example is typical GCC lameness. Most of the time, it doesn't hurt (much), but in key areas, it really helps to review the assembler just to be sure. Great discussion!
#38785 - poslundc - Thu Mar 31, 2005 8:24 pm
Using decrementing while loops instead of for loops is an optimization technique that's both ancient and misunderstood, because it usually ignores the Pareto principle. Use the loop structure that makes your code the easiest to read/understand/verify, and use decrementing while loops for speed in the 20% of your code where it matters - but since loops are usually just the container for that 20%, rather than the bottleneck themselves, it isn't often.
Dan.
#38786 - ymalik - Thu Mar 31, 2005 8:36 pm
I still do not understand why people say that decrementing/incrementing pointers in a while loop is better than going through an array with a for loop. I know little assembly, but when moving pointers you are still incrementing a pointer, which takes time, and then taking the value at that pointer. However, I remember from programming the MC68HC11 that there was a single instruction that allowed you to add an offset to an array base and load that value into a register.
#38787 - Mucca - Thu Mar 31, 2005 8:46 pm
Who said anything about incrementing/decrementing pointers?
The point is that with count-down whiles, the comparison with zero can be optimized away as part of the sub instruction (http://www.arm.com/pdfs/DAI0034A_efficient_c.pdf). Whether GCC does this, I'm not sure.
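For illustration, here's what the folded form looks like - a hypothetical word-fill routine, built as ARM:
Code: |
typedef unsigned int u32;

// Countdown loop with no cmp: subs both decrements and sets the flags,
// so bne can branch directly on the Z flag from the subtract.
void fill32(u32* dst, u32 value, int count)
{
    asm volatile(
        "1:                  \n"
        "str  %2, [%0], #4   \n"   // store, post-increment pointer
        "subs %1, %1, #1     \n"   // decrement AND set flags
        "bne  1b             \n"   // loop until count hits zero
        : "+r"(dst), "+r"(count)
        : "r"(value)
        : "cc", "memory");
}
|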
edit: added link, the one on devrs is broken
#38796 - poslundc - Thu Mar 31, 2005 9:29 pm
ymalik wrote: |
I still do not understand why people say that decrementing/incrementing pointers in a while loop is better than going through an array using a for loop. |
As Mucca pointed out, this isn't what's actually being discussed. But in general, the compiler is more efficient at dealing with pointers than arrays both because it doesn't have to worry about indexing into the array every time a reference is made, and because the compiler is more likely to recognize and assign a register if you are working with a single pointer variable.
Generally, when iterating over an array of a complicated struct, assigning a pointer is both a helpful hint to the compiler and can make your code easier to read, e.g.:
Code: |
typedef struct SomeStruct
{
    int a, b, c;
} SomeStruct;

void myFunction()
{
    SomeStruct myArray[20];
    int i;

    for (i = 0; i < 20; i++)
    {
        SomeStruct *s = &myArray[i];

        s->a = i;
        s->b = i * 5;
        s->c = i * 32;
    }
} |
Dan.
#38820 - Miked0801 - Fri Apr 01, 2005 1:56 am
Count-down for loops 'Should' be optimized as well, but our version doesn't for some reason - instead it outputs
Code: |
...copy
    sub  r0, r0, #1
    cmp  r0, #0
    bne  foo
|
which is just plain stupid.
#38825 - poslundc - Fri Apr 01, 2005 2:02 am
Any chance it's outputting Thumb code? :D
Dan.
#38828 - Miked0801 - Fri Apr 01, 2005 2:03 am
Funny thing is it does it in both ARM and Thumb. That's what we get for using GCC 2.95 :)
#38905 - Mucca - Fri Apr 01, 2005 9:33 pm
By the way Mike, is there a particular reason why you haven't switched to a newer version of GCC, albeit unsupported by Ninti? I only tried once, very briefly, and wound up with a 16MB bin from a 1MB elf that didn't work, although under 2.95 everything was peachy. Just wondering if there's some compelling reason not to use the latest compiler before I spend time switching over.
#38926 - Miked0801 - Sat Apr 02, 2005 2:40 am
On one project, I got 3.3.2 up and running mid-stream as an experiment, but had to change too many things to do it safely. All zero-length arrays are now array[] instead of array[0], objcopy flags changed drastically, and all sorts of funny warnings came into existence that needed to be hunted and killed. I also didn't see that significant an improvement in code quality, though it did compile quicker (and the latest has support for precompiled headers).
#38931 - tepples - Sat Apr 02, 2005 3:50 am
Miked0801 wrote: |
I got 3.3.2 up and running mid-stream as an experiment, but had to change too many things to do it safely. All zero-length arrays are now array[] instead of array[0] |
This is an issue of compliance with newer versions of the ANSI C standard.
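For anyone hitting the same thing, the change looks like this (hypothetical struct):
Code: |
typedef struct
{
    int count;
    // int data[0];   // zero-length array: the old GCC extension
    int data[];       // flexible array member: the C99 spelling
} IntList;
|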
Quote: |
objcopy flags changed drastically |
Do you mean for the various bin2o hacks? Have you tried .incbin from assembly?
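A sketch of the .incbin route - it also works from C via a top-level asm statement (a GNU extension), so you can skip bin2o entirely; file and symbol names here are made up:
Code: |
// Pulls tiles.bin straight into .rodata at build time.
asm(
    ".section .rodata        \n"
    ".balign 4               \n"
    ".global tile_data       \n"
    "tile_data:              \n"
    ".incbin \"tiles.bin\"   \n"
    ".section .text          \n"
);

extern const unsigned int tile_data[];
|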
Quote: |
and all sorts of funny warnings came into existence that needed to be hunted and killed. |
Again, more ANSI C compliance issues.
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.
#38974 - Miked0801 - Sat Apr 02, 2005 6:45 pm
Yep - but there were so many things that I would have needed to change in the data build process at every level - not worth it for a platform that's starting to show its age...