gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback Machine copies. A new forum can be found here.

DS development > Which kind of number works the fastest on the DS?

#141252 - NeX - Sun Sep 23, 2007 4:10 pm

I am intending to do loads of calculations on small numbers (0-15). At the moment, I am using u8s. What is the fastest number the DS can calculate under LibNDS? I've heard INTs but I am not sure.
_________________
Strummer or Drummer?
Or maybe you would rather play with sand? Sandscape is for you in that case.

#141257 - Peter - Sun Sep 23, 2007 4:29 pm

NeX wrote:
I am intending to do loads of calculations on small numbers (0-15). At the moment, I am using u8s. What is the fastest number the DS can calculate under LibNDS? I've heard INTs but I am not sure.

Once the value is loaded into a register, it's 32bit anyway. The "problem" with datatypes which are not 32bit in size is that the compiler must generate extra code to enforce the value-range.

For example an 8bit value can be 0..255, but since registers are 32bit, the compiler generates code to clear all bits except the lower 8:
Code:

void foobar(unsigned char value)
{
  // the compiler generates something like this
  // to make sure the value is in range 0..255
  value &= 255;
}

This can make your program run a little slower. The same applies if you load a value from memory. To work around this behaviour, we can simply use a 32bit datatype as the parameter and the extra generated instruction won't be there.
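To illustrate, here is a minimal sketch (the function names are invented for this example; the exact instructions emitted depend on the compiler and optimization level):

```c
#include <stdint.h>

/* Illustrative sketch only. With an 8-bit parameter, the compiler
   may emit an extra instruction (e.g. AND with 255) to enforce the
   0..255 range before the value is used. */
uint32_t inc_u8(uint8_t value)
{
    return value + 1u;   /* value may be masked back to 8 bits first */
}

/* With a 32-bit parameter, no range-enforcing instruction is needed:
   the register already holds the full value. */
uint32_t inc_u32(uint32_t value)
{
    return value + 1u;
}
```

Both behave the same at the C level; the difference only shows up in the generated assembler.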

But it highly depends on where your data is located. For example on the AGB, where the ROM bus is only 16bits wide, I gained a lot of performance by reducing some lookup-tables from 32bit to 16bit. If your data is located in RAM, 32bit variables can be faster.

However, this highly depends on the situation and there is no general rule of thumb from my experience. I usually check the generated assembler code and use a profiler to make optimizations, everything else is just like poking around in the dark.

Hope it helps
_________________
Kind Regards,
Peter

#141419 - jonezer4 - Tue Sep 25, 2007 4:57 am

Is using short ints faster or at least less memory intensive than ints and long ints?

#141439 - Cearn - Tue Sep 25, 2007 3:39 pm

jonezer4 wrote:
Is using short ints faster or at least less memory intensive than ints and long ints?

No. And yes. And maybe. It's ... complicated. 'int' and 'long' are equivalent datatypes by the way.

In rough terms, you have two types of data: stuff in registers and stuff in memory. Global variables, arrays, structs live in memory. Here the smaller types obviously use less memory. However, any actual work is done in registers, so before you can use memory-based data, it has to be loaded into a register first, operated on, and then perhaps stored back into memory again. For example:
Code:
global++;

What the CPU sees is this:
Code:
register= global;
register++;
global= register;


As Peter says, these registers are always 32-bit. And because they're 32-bit, every time you work on non-32-bit stuff, it has to be cut down to that size again. This is usually done by two extra shift instructions. For example:
Code:
u8 sum(u8 a[], u16 count)
{
    u16 ii, sum=0;
    for(ii=0; ii<count; ii++)
        sum += a[ii];

    return sum;
}

What actually happens is this:
Code:
u8 sum(u8 a[], u16 count)
{
    u16 ii, sum=0;
    u32 tmp;

    for(ii=0; ii<count; )
    {
        tmp= a[ii];
        sum += tmp;
        sum <<= 16;
        sum >>= 16;

        ii++;
        ii <<= 16;
        ii >>= 16;
    }

    return sum;
}

The local variables ii and sum are registers; local variables usually are. This means that for them to be non-int (i.e., non-32-bit) variables, every action on them is followed by two shift instructions.

count is also already in a register, because the parameters of a function always are (technically only the first four are and later ones go on the stack (in memory), but before use, they'd have to be loaded back into registers anyway). As a side note, because the datatype of count is non-32-bit, you have two extra shifts for that as well before the function is called (unless it's a constant, in which case the compiler can tell if it fits u16 already; like I said, it's complicated).

Basically, using non-ints (bytes or halfwords, doesn't matter) for the local variables and the parameters means two extra instructions for every arithmetic operation done on them. I haven't tested this DS-wise, but I reckon the routine takes about 40%-50% longer purely because of not using ints for the local variables. In more complicated routines like color adjustments, that could very well go up to 100%. On the DS there's an extra problem: all these extra instructions clog up the cache, making things slower still. For locals, always use int or u32.
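For comparison, here is a sketch of the same routine rewritten with 32-bit locals and parameters (the data in memory can stay as bytes), which removes the range-enforcing shifts from the loop entirely:

```c
#include <stdint.h>

/* Same summing routine, but with 32-bit locals and parameters:
   the loop body is just a byte load and an add, with no extra
   shifts to truncate ii or sum back to 16 bits. */
uint32_t sum32(const uint8_t a[], uint32_t count)
{
    uint32_t ii, sum = 0;
    for (ii = 0; ii < count; ii++)
        sum += a[ii];
    return sum;
}
```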

As said, for memory-resident data, small types mean less memory. The only reason they may be faster is due to bus-sizes, but because the CPU has better support for loading/storing 32-bit stuff, you may lose that benefit again anyway.
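A quick sketch of the memory side of that trade-off (the table names are made up for this example): the same 256-entry table costs four times as much RAM stored as words than as bytes.

```c
#include <stdint.h>

/* Hypothetical 256-entry lookup tables, just to show the footprint. */
static uint8_t  lut8[256];    /* 256 bytes in memory  */
static uint32_t lut32[256];   /* 1024 bytes in memory */
```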

#141536 - strager - Wed Sep 26, 2007 7:02 pm

Cearn wrote:
As said, for memory-resident data, small types mean less memory. The only reason they may be faster is due to bus-sizes, but because the CPU has better support for loading/storing 32-bit stuff, you may lose that benefit again anyway.


In addition/clarification, the ARM processors used by the DS (and GBA) cannot perform indexed array accesses as quickly for halfword (short) arrays as for byte (char) or word (int) arrays. This is because for halfword accesses, the processor does not have the luxury of scaling the index as part of the load, while for byte and word transfers it does. For example, to access one element of a halfword array, you must first shift the index left by one bit and then read from memory. The shift takes just one 1S cycle; however, these cycles can really add up if a lot of indexes are used (e.g. in a loop).

Otherwise, for all three data types (not including double-word), there is no need to mask the data, since this is automatically done by the processor. The only limit is the data bus.
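To make the addressing point concrete, here is a sketch (the comments describe what a typical ARM compiler tends to emit; the exact instructions vary by compiler and settings):

```c
#include <stdint.h>

/* Sketch of the address arithmetic behind array indexing on ARM. */

uint16_t get16(const uint16_t *tab, uint32_t i)
{
    /* Halfword load: LDRH has no scaled register offset, so the
       compiler typically needs a separate instruction to compute
       (i << 1) before the load. */
    return tab[i];
}

uint32_t get32(const uint32_t *tab, uint32_t i)
{
    /* Word load: LDR can scale the index inside the load itself,
       e.g. LDR r0, [r0, r1, LSL #2], so no extra instruction. */
    return tab[i];
}
```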

For small LUTs in DTCM, I've noticed that changing "u16" to "u32" speeds up the functions that use the LUT. These LUTs were accessed many times (thousands per second) and were not cached. I didn't intensively profile the change, however, so it may have been some other strange condition that sped the code up.

(I'm probably wrong somewhere (or everywhere). Don't take my word as fact...)