gbadev.org forum archive

I was troling these forums and a lot of times people would suggest doing

Code:

int* ptr;
int i = 6;//Or what ever number

(ptr + i);

instead of

Code:

ptr[i];

because it is faster. How is this different when it is converted to assembly?
_________________
help me

Is it faster? I'm not so sure, but I'm not one of the people who checks the ASM output...

If it is faster, it's probably ptr[i] being computed to the size of ptr's type, where ptr + i is straight addition. Why a compiler wouldn't optimize that out is beyond me.
_________________
- nanou

ptr[i] is purely syntax sugar for *(ptr+i). There is absolutely no performance difference.

I thought just as well too.

Just curious about it because I remember seeing in different threads with some people suggesting to do it as I stated previously because it was faster.

I thought maybe there's something I don't know about there
_________________
help me

protip: Run GCC with the -S and -o somefile.s switches to see the ASM code.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

Going back, I think I misunderstood something.

This is what I meant:

Why is pointer indexing faster than array indexing?

I read this in the Graphics section of the forums about 2D bitmap drawing optimization.
_________________
help me

sajiimori wrote:

ptr[i] is purely syntax sugar for *(ptr+i). There is absolutely no performance difference.

I'm not sure, i think :

Code:

int ptr[10];

int j=ptr[4];
<=>
int j=*(ptr+sizeof(int)*4);

I'll take a shot at your second question:

Pointer indexing 'can' be faster than array indexing, depending on what you are doing. Our ARM compilers tend to make pointer indexing use stm/ldm combos more often and more efficient than array indexing. Tend to, not always, not even usually. As such, the following should not supersede readability of code. If you really, really need something to go faster, then you comment like crazy and/or use asm.

Here's 4 ways to copy an int foo[10] array to int newFoo[10]:

Code:

1.
for(int i=0;i<10;i++)
{
newFoo[i] = foo[i];
}

2.
for(int i=0;i<10;i++)
{
*newFoo++ = *foo++;
}

3.
memCopy(newFoo, foo, sizeof(foo));

4.
typedef struct
{
int pad[10];
} Generic40Bytes;

*((Generic40Bytes *)newFoo) = *((Generic40Bytes *)foo);

I'm willing to bet that way 4 would compile to the fastest code, most of the time, followed by a tie between 1 & 2, with 3 coming in last - for this very small amount of data to be copied.

If the same type of copy was done with 4000 bytes, memCopy would probably move up to first or second depending on just how bloated way 4 becomes (and the cache missing that causes.)

or use DMA for sizes in that range and larger :)

I would have expected someone to mention that [EDIT: premature] optimization is the root of all evil by now >_>.
Anyway ...

brave_orakio wrote:

Why is pointer indexing faster than array indexing?

I read this in the Graphics section of the forums about 2D bitmap drawing optimization.

What they may have been referring to is not just pointers vs arrays, but loop-reversal as well:

Code:

// incrementing for loop
for(i=0; i<size; i++)
dst[i] = src[i];

// decrementing while loop
while(size--)
*dst++ = *src++;

If this were converted directly into assembly, the second version would be faster because subtracting and comparing to zero can be done in one instruction rather than two. However, whether it actually compile to faster assembly will depend instruction set (ARM vs Thumb), data type and the exact formulation of your statements.

It also depends on what mood the compiler is in today. Sometimes it will convert the for,++ loop into the functional equivalent of the while,-- ; and sometimes it will do the reverse. It's nearly impossible to predict when it does what. In Miked0801's cases, for example, cases 1 and 2 may produce the same code, so may 3 and 4. Then again, maybe not.

Bottom line, don't worry about it too much.

What will make a difference, however, is how you handle memory data (global variables or member variables): whether your code uses that data directly or if you load them up into local variables first. For example:

Code:

typedef struct Surface
{
u16 *data;
u16 width;
u16 height;
} Surface;

// Two 16bit blits from src to dst.

// Dereference members in loops
void blitA(Surface *dst, int x, int y, Surface *src)
{
int ix, iy;

for(iy=0; iy<src->height; iy++)
{
   for(ix=0; ix<src->width; ix++)
      dst->data[(y+iy)*dst->width+(x+ix)]= src->data[iy*src->width+x];
}
}

// Load into locals first
void blitB(Surface *dst, int x, int y, Surface *src)
{
int ix, iy;
int srcWidth= src->width, srcHeight= src->height;
int dstWidth= dst->width;
u16 *srcData= src->data, *dstData= &dst->data[y*dstWidth+x];

for(iy=0; iy<srcHeight; iy++)
   for(ix=0; ix<srcWidth; ix++)
      dstData[iy*dstWidth+ix]= srcData[iy*srcWidth+ix];
}

In blitA(), all the members of src and dst are read inside the main loop, and y*width+x is calculated there too. In blitB(), that's all taken care of beforehand and the multiplies are even converted to increments that are only applied in the outer loop. As a result, blitB is roughly twice as fast as blitA. (Of course, for large transfers of contingent memory, you should be using a dedicated copier anyway (DMA or assembly routine)).

Ludo6431 wrote:

sajiimori wrote:

ptr[i] is purely syntax sugar for *(ptr+i). There is absolutely no performance difference.

I'm not sure, i think :

Code:

int ptr[10];

int j=ptr[4];
<=>
int j=*(ptr+sizeof(int)*4);

In pointer arithmetic, the unit for additions and subtractions is already the element-size. The sizeof(int) is not necessary.

Miked0801 wrote:

If the same type of copy was done with 4000 bytes, memCopy would probably move up to first or second depending on just how bloated way 4 becomes (and the cache missing that causes.)

It will probably depend on the compiler, but GCC will convert struct-copies larger than 13*4 bytes into memcpy calls.

Last edited by Cearn on Sat Dec 06, 2008 10:12 am; edited 1 time in total

If "pSomeStruct++" is setting the pointer to the next instance (or overflowing ;^p ) inside an array of the type of some structure or even a basic type, it doesn't really matter, with respect to the structure's size, then "pSomeStruct += 4" is doing the same, only skipping a few instances inbetween.

So, "pSomeData + 4" is equivalent to "pSomeData[4]".

If you cast your data to, let's say char, then those statements will skip the amount of bytes (4 in this example).
_________________
July 5th 08: "Volumetric Shadow Demo" 1.6.0 (final) source released
June 5th 08: "Zombie NDS" WIP released!
It's all on my page, just click WWW below.

Ludo6431, you're right that your examples aren't equivalent, but not for the reasons you think.

*(ptr+sizeof(int)*4) == *(ptr+16) == ptr[16].

Cearn, optimization isn't the root of all evil... but maybe we already agree about that.

silent_code, "pSomeData + 4" isn't equivalent to "pSomeData[4]". It's equivalent to "&pSomeData[4]".

sajiimori wrote:

Cearn, optimization isn't the root of all evil... but maybe we already agree about that.

Gah. Premature optimization. What I meant to say was "premature optimization is the root of all evil". And turning array-indexing into pointer-indexing is probably a example of that.

Wow a lot of replies! Thank you very much, as usual you guys are very helpful!
I guess the bottom line is, the compiler can be very moody and may mess with any c code optimization I do!
For large data copying, yes I do use a dedicated copier(DMA). I guess it is more useful for small data copying when DMA or any other dedicated copier would have too large an overhead.
_________________
help me

People generally don't like to use DMA as much on the NDS because DMA can not access the cache, preferring memcpy.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

That can actually be a good thing: a block transfer (like a memcpy) reads a range of memory only once, so caching generally doesn't help much... but large CPU transfers do end up kicking everything else out of the cache, which leads to additional cache misses after the transfer finishes.

A good strategy for substantial copies is to call DC_StoreAll (or the homebrew equivalent, whatever it's called) and use DMA. Then the existing cache contents won't be kicked out for nothing. The stores were probably going to happen anyway, eventually, so there's usually not much loss there.

(Oh, and you have to invalidate the destination range, if it's in main RAM. If you're doing a large DMA to main RAM, you may as well flush the whole cache anyway, since invalidating large ranges is slow, so the benefits of DMA are partially lost.)

Bah, stupid DMA. #1 source of funky, what the hell bugs
1. I changed to DMA, why is it slower? (Cache invalidating was painful.)
2. Every blue moon, my DMA goes crazy and copies stuff to our OAM buffer and beyond? (Same DMA channel was being used in regular and interrupt code.)
3. Multiplayer code is dropping packets like mad? (DMAs too long)

Bleh. Stupid, broken crap.

Oh yeah, here's another fun WTFDMA:

Step 1: Use DMA to upload a really long display list to the FIFO. (Perhaps you happen to be right around VCount=190...)

Step 2: In VBlank, use DMA to upload textures to VRAM. (Must wait for the existing DMA to finish...)

Step 3: Check out awesome black lines at the top of the screen from where the video mode is still in LCDC, because your display list stole a bunch of VBlank time away from your texture transfers.

Miked0801 wrote:

Multiplayer code is dropping packets like mad? (DMAs too long)

are you sure that DMA is the problem with dropping packets? Are you using FIFO for the wifi sync functions or are you using a shared memory mailslot?

1. He's an official dev so he uses what he's given.
2. In any case DMA stalls the CPU so it increases the chance of missing interrupts.
_________________
devkitPro - professional toolchains at amateur prices
devkitPro IRC support
Personal Blog

From your comments, it seems that DMA is not the preferred method for copying in the NDS.
I thought that interrupts have priority over DMA in the GBA though? or only for the lesser DMA channels?
_________________
help me

Nope, active DMA stalls the CPU on both GBA and DS.
_________________
devkitPro - professional toolchains at amateur prices
devkitPro IRC support
Personal Blog

I wouldn't say it stalls the CPU outright, but DMA can certainly block the CPU if it's constantly accessing main RAM and there's a cache miss.

If the DMA isn't constantly accessing main RAM (e.g. when uploading a display list, where it spends a lot of time waiting for the geometry hardware), or if there are no cache misses (e.g. if the surrounding code is all cached or in ITCM), the CPU can continue executing.

Anyway, there's no simple answer as far as which way is faster/better, but CPU copies are absolutely easier to get right.

Beginners should just use the CPU, no doubt. As for experts... well, they'll just eliminate the need for the copy entirely, won't they? ;)

(I'm only half-joking; we have very few substantial copy operations in our engine.)

But for GBA it stalls it outright because there is no cache on the GBA right?
Just curious though, for GBA development, professionals do use the DMA right? I remember mike saying that they used it for large sprite copying in Spyro.
For NDS, do you make your own copier? hand coded assembly probably?
_________________
help me

DMA is very useful if you're copying data that you aren't working on with the CPU, or when you need to copy at a certain timing, like at the start of HBlank.

On the GBA, you used DMA when:
1. You needed automated, short burst copies (ala HDMA) for scanline tricks.
2. You weren't doing multiplayer code and needed large copies as fast as possible.
3. Copying OAMs in VBlank.

We pretty much ignored general purpose DMAs on games with multiplayer due to late interrupt packet loss. Heck, even Digimon Racing, a Mode 7 game with 4-player multiplayer, experienced noticable slowdown due to packet loss from short HDMAs.

On DS (so far)
In theory, DMAs can help loading from the cart. In reality, it stalls nearly immediately unless you have it set up absolultely perfectly. OAMs still use it, and I used it to copy large movie BGs another recent game - with some small issues...

I guess I'm ok to use DMA since I wont be running multiplayer code and my primary development is for GBA.
I use it for copying images to VRAM and of course OAM copy in Vblank.
_________________
help me

To answer your other questions, brave_orakio:

DMA stalls the GBA unless you're running code from IWRAM.

For NDS, like I said, we mostly try to eliminate the need to copy anything in the first place, but we use DMA for uploading geometry to the FIFO, for OAMs, and for VRAM transfers.

Miked0801 wrote:

On DS (so far)
In theory, DMAs can help loading from the cart.

Only in official games. In homebrew, the DLDI spec requires drivers to use PIO, not DMA.
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.

And DMA in official land does next to no good on cartloads anyways. Bleh. Stupid slow interface.

Miked0801 wrote:

Bleh. Stupid slow interface.

Just be glad it's roughly an order of magnitude faster than the 300 KiB per second typical of a PlayStation 1 CD drive or a Games n' Music microSD interface.
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.

Yeah, I worked on PS1 titles as well back in the day. Stupid dev units were so poor, you had to turn them in their side's just so they wouldn't overheat. I expect slow loads from discs, not from 'cart' based systems.

We're still living a life of comparative luxury: we get to load models and sfx on the frame that we need them, and we get to continuously stream all character animations.

I know, but I sure enjoyed the relative firehose of a pipeline we got on the GBA.

But the memory on PS1 is much greater than on the GBA, so you could load plenty of frames.
Of course, maybe the images are 32-bits per pixel, so that eats a lot more memory?
_________________
help me

brave_orakio wrote:

But the memory on PS1 is much greater than on the GBA, so you could load plenty of frames.
Of course, maybe the images are 32-bits per pixel, so that eats a lot more memory?

True that the PS1 has more memory on the unit itself vs the GBA's memory on unit, but the cartridges practically act like memory themselves when comparing read speeds of the 2 pieces of hardware, so 32MBs?

Love the way off-topics come about. :P
_________________
DS - It's all about DiscoStew

About loading "plenty of frames" given larger memory:

Taking my latest DS project as an example, the amount of crap we load during the course of a typical level greatly exceeds RAM capacity.

If we had to deal with slow media (i.e. anything that spins), we would've had to choose between either cutting down the quantity of assets to a small fraction of what they are, or predicting when assets will be needed so we could start loading them well ahead of time (but not before we're able to unload something else).

DiscoStew wrote:

Love the way off-topics come about. :P

hahaha! indeed, I started the topic asking about array and pointer indexing, and it slowly evolved into data copies! I'm not complaining though, the insight that I get from each post is great!

Yikes! I'm pretty thankful that I don't have to think about things like that when developing for GBA.
_________________
help me

gbadev.org forum archive

C/C++ > Why is this considered faster?

#164982 - brave_orakio - Fri Dec 05, 2008 3:13 am

#164984 - nanou - Fri Dec 05, 2008 4:30 am

#164985 - sajiimori - Fri Dec 05, 2008 4:35 am

#164988 - brave_orakio - Fri Dec 05, 2008 5:39 am

#164990 - Dwedit - Fri Dec 05, 2008 8:49 am

#164992 - brave_orakio - Fri Dec 05, 2008 10:06 am

#164993 - Ludo6431 - Fri Dec 05, 2008 1:03 pm

#165001 - Miked0801 - Fri Dec 05, 2008 6:19 pm

#165004 - Cearn - Fri Dec 05, 2008 7:46 pm

#165006 - silent_code - Fri Dec 05, 2008 10:07 pm

#165012 - sajiimori - Sat Dec 06, 2008 1:17 am

#165018 - Cearn - Sat Dec 06, 2008 10:11 am

#165034 - brave_orakio - Sun Dec 07, 2008 1:51 am

#165038 - Dwedit - Sun Dec 07, 2008 10:22 am

#165050 - sajiimori - Sun Dec 07, 2008 9:46 pm

#165069 - Miked0801 - Mon Dec 08, 2008 8:24 pm

#165073 - sajiimori - Mon Dec 08, 2008 11:02 pm

#165074 - elhobbs - Tue Dec 09, 2008 1:47 am

#165075 - wintermute - Tue Dec 09, 2008 4:53 am

#165083 - brave_orakio - Tue Dec 09, 2008 9:14 am

#165089 - wintermute - Tue Dec 09, 2008 5:24 pm

#165106 - sajiimori - Tue Dec 09, 2008 10:54 pm

#165121 - brave_orakio - Wed Dec 10, 2008 8:59 am

#165123 - eKid - Wed Dec 10, 2008 9:25 am

#165131 - Miked0801 - Wed Dec 10, 2008 5:58 pm

#165156 - brave_orakio - Thu Dec 11, 2008 2:45 am

#165163 - sajiimori - Thu Dec 11, 2008 7:33 am

#165320 - tepples - Wed Dec 17, 2008 5:52 am

#165324 - Miked0801 - Wed Dec 17, 2008 6:10 am

#165328 - tepples - Wed Dec 17, 2008 6:19 am

#165352 - Miked0801 - Wed Dec 17, 2008 9:00 pm

#165361 - sajiimori - Thu Dec 18, 2008 1:25 am

#165362 - Miked0801 - Thu Dec 18, 2008 2:38 am

#165368 - brave_orakio - Thu Dec 18, 2008 7:48 am

#165369 - DiscoStew - Thu Dec 18, 2008 8:17 am

#165371 - sajiimori - Thu Dec 18, 2008 9:40 am

#165373 - brave_orakio - Thu Dec 18, 2008 10:15 am