gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback machine copies. A new forum can be found here.

ASM > short int vs. int inside ARM9

#175421 - sverx - Thu Nov 18, 2010 9:36 am

Not really ASM related but closely related to how ARM9/cache/NDS works.
Let me introduce the scenario: I have an array of, say, 100 structs. These structs contain mostly int variables, and some chars.
Now, I know that some of the former will contain values that could easily be stored in a short int instead... and some even in a s8/u8. Of course, doing so my struct will likely be 'smaller' and the array made up of them will require less RAM, and that's surely a pro, but I'm curious about speed now.

According to GBATek, accessing an int or a short int that's already in the ARM9 cache requires the same amount of time (1 66MHz cycle), and accessing an int in main RAM requires just 1 additional 33MHz cycle compared to a short int (because the main RAM bus is only 16 bits wide, so a sequential read is needed for the 2nd halfword of an int).

What I don't know, and I'd like to discuss with those of you who have a deep knowledge of how the ARM9 internals work, is: are there downsides to having the processor work on halfwords? In short: is there anything I should seriously consider before turning those ints into short ints? From what I've read there shouldn't be any serious cons, but I'm almost a complete newbie on these topics.

Thanks in advance :)

#175423 - Dwedit - Thu Nov 18, 2010 4:36 pm

Make sure your local variables are ints. Sometimes the compiler will do some bit masking to make sure that something assigned to a short is only 16 bits long. So shorts are okay for memory, since there is no masking done when loading/storing them, but variables which live in registers may be unnecessarily clipped to make sure they are smaller than ints.

In ARM mode, halfwords and signed bytes do have a slightly more crippled way of accessing them than ints or unsigned bytes. Immediate offsets from a base register must be within +/- 255 bytes, and when using an offset register, you can't specify a shift amount. When loading/storing words and unsigned bytes, you can use 12-bit immediate offsets (+/- 4095), and can specify shift values for offset registers. Shift values for offset registers mean that you can have an array of ints (4 bytes each), and use the index number directly (shift left 2 (*4) and add to the array base).

But in THUMB mode, there is almost no difference in performance between loading/storing halfwords vs. ints; they both get crippled equally. The exception is signed halfwords: there is no instruction for loading them with an immediate offset, as there is for unsigned halfwords.

Most of the time, your C compiler is set to generate THUMB code, unless you've asked otherwise.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

#175434 - sverx - Fri Nov 19, 2010 11:16 am

Thanks for your qualified answer :) So, if I understand correctly, there should be almost no slowdown switching that global array of structs to short ints as long as I keep local variables of int type.

What I'm thinking now is: should I also expect a speed increase? I mean... switching to short ints will result in a struct of smaller size; won't that have a chance to increase the cache hit rate? The variables inside the struct would be 'closer' to each other, so they'll more frequently end up in the same cache line (which is 32 bytes, afaik). Of course, that's if the ARM9 cache works the way I think it works, which has of course some degree of uncertainty :rolleyes: ... well, if somebody knows where I can read how the ARM9 cache works, I mean, what triggers a cache line read and what doesn't... well, I'm interested :)

Thanks! :)

#175440 - Miked0801 - Fri Nov 19, 2010 8:14 pm

Honestly, this level of optimization shouldn't even be on your radar for performance tuning. Use a profiler or some other sort of timing mechanism to see where your code is running slow. For a typical DS game, the biggest hits are:
1. 3D render/transform code.
2. Audio processing.
3. File System / Cartridge ROM reading access time.
4. Occasionally, some AI will hit the list if your game is very, very CPU intensive.
5. Decompression routines, if you heavily use compression.

A global array modification such as your example might help speed on the order of 0.0001% overall, unless you are doing nothing but accessing this field over and over again - in which case you should drop it into dtcm and have your code run in itcm.

#175444 - sverx - Mon Nov 22, 2010 11:40 am

Well, actually I'm not in the process of profiling any code... I'm simply discussing with another programmer whether a change like this would really have any cons, because we already know the pros.

In short: I don't need to speed up anything, just want to be sure that change doesn't slow down... and it seems to me it won't :)

edit: btw if the array is so big that it won't entirely fit in the cache (4KB afair) when the struct uses ints, but would fit when using shorts, then I think you'll see much more than a 0.0001% speed increase if you do a lot of work on that array...

#175509 - Miked0801 - Thu Dec 09, 2010 5:13 pm

Cache? Lol. I don't really know how much caching helps, as the system doesn't let you benchmark it too easily, but even when moving 80-90% of our execution-time code to tightly coupled memory, we didn't see too dramatic a speedup in execution. And for data, that's always a bit chancy with the internal threading the way it is.

#175510 - sverx - Thu Dec 09, 2010 5:28 pm

Well, I'm not an expert and I've never benchmarked the memory myself (I wouldn't even know where to start! Suggestions? ;) ) but, according to these tables on GBATek, there should be a lot of difference between TCM/caches and main memory... well, of course the effect will probably be amplified if both the code and the data are already in the cache or have been allocated in TCM...

#175516 - Miked0801 - Fri Dec 10, 2010 5:00 pm

With a profiler of course - the problem being that profilers corrupt the memory results just by being there :)

Well off topic now. Mostly, my comment was to make sure that you weren't wasting time optimizing the wrong bit of code. I've seen and done it many times and wanted to make sure you weren't doing so as well.

#175517 - sverx - Fri Dec 10, 2010 5:20 pm

Miked0801 wrote:
Mostly, my comment was to make sure that you weren't wasting time optimizing the wrong bit of code. I've seen and done it many times and wanted to make sure you weren't doing so as well.


Thanks :) No, I'm not wasting any time optimizing any code, because I was just discussing with a fellow programmer about pro and cons of switching some global variables to short ints... but then the whole topic became an open discussion and I just like these things ;)

For instance, now I'm wondering if having an array of structs that includes, say, an odd mix of variables of different sizes would be a bad idea compared to having a number of separate arrays of variables... hopefully you get what I mean. Shouldn't accessing somevar[index] be faster than accessing somestruct[index].somevar?

#175528 - Exophase - Tue Dec 14, 2010 8:39 pm

sverx wrote:
For instance, now I'm wondering if having an array of structs that includes, say, an odd mix of variables of different sizes would be a bad idea compared to having a number of separate arrays of variables... hopefully you get what I mean. Shouldn't accessing somevar[index] be faster than accessing somestruct[index].somevar?


What you're describing is "struct of arrays" instead of "array of structs." Which is faster depends on the underlying context. If you access several fields of the struct, then it's preferable to the separate-array approach.

That's for two reasons: only one array pointer needs to be maintained instead of several, and you're keeping everything closer together, which gives better spatial locality of reference for the data cache.