gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback machine copies. A new forum can be found here.

DS development > Thumb instructions on Nintendo DS?

#49244 - sajiimori - Wed Jul 27, 2005 7:23 pm

split from a topic about Lua

Don't use thumb at all on DS.

#49288 - ChronoDK - Thu Jul 28, 2005 7:27 am

I got it working without errors now, but it is still using thumb - why should I not compile for thumb-interworking, sajiimori?

I am only testing on iDeaS, so there is no guarantee it is really working...

#49310 - wintermute - Thu Jul 28, 2005 12:35 pm

sajiimori wrote:
Don't use thumb at all on DS.


There's absolutely no reason why you shouldn't use thumb on DS. The main RAM is 16-bit, which gives ARM code a slight disadvantage, as on the GBA.

#49326 - sajiimori - Thu Jul 28, 2005 6:15 pm

Inner loops will be cached on the CPU after a single iteration so ARM will be faster.

#49328 - Mighty Max - Thu Jul 28, 2005 6:33 pm

sajiimori wrote:
Inner loops will be cached on the CPU after a single iteration so ARM will be faster.


That's not quite the point of using thumb instructions.

The abilities of thumb are a bit limited (parameter- as well as opcode-wise) and it therefore requires fewer resources, which might result in fewer cycles per operation and/or fewer logic gates used.

Thumb has the advantage of:
- One cycle per opcode read in a 16-bit memory environment still matters when running code that has no loops that fit into the cache.
- Twice the number of instructions can be cached and predicted.
- Power saving
- Memory saving through shorter opcodes
It might as well increase the used pipes, but ... I don't know how N designed the cores.

The big disadvantage is mainly the reduced instruction set and parameter size, so that you probably need more instructions for the same computation.

#49355 - ector - Fri Jul 29, 2005 1:24 am

Mighty Max wrote:
sajiimori wrote:
Inner loops will be cached on the CPU after a single iteration so ARM will be faster.


That's not quite the point of using thumb instructions.

The abilities of thumb are a bit limited (parameter- as well as opcode-wise) and it therefore requires fewer resources, which might result in fewer cycles per operation and/or fewer logic gates used.


This is definitely wrong. THUMB is implemented as a very simple instruction unpacker that expands the THUMB instructions out to the corresponding full ARM instruction before feeding into the CPU.
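
To illustrate what such an unpacker does at the bit level, here is a minimal sketch in Python that expands one Thumb format (the MOV/CMP/ADD/SUB with 8-bit immediate group) into the equivalent 32-bit ARM encoding. The function name `expand_thumb_imm8` is made up for this example, and it only models the encoding mapping, not how the hardware decoder is actually built.

```python
# Hypothetical helper: expand a 16-bit Thumb "MOV/CMP/ADD/SUB Rd, #imm8"
# instruction into the equivalent 32-bit ARM data-processing encoding.

# Thumb 2-bit op field -> ARM 4-bit data-processing opcode.
ARM_OPCODE = {
    0b00: 0b1101,  # MOV
    0b01: 0b1010,  # CMP
    0b10: 0b0100,  # ADD
    0b11: 0b0010,  # SUB
}


def expand_thumb_imm8(thumb: int) -> int:
    """Return the ARM encoding equivalent to the given Thumb instruction."""
    if (thumb >> 13) != 0b001:
        raise ValueError("not a Thumb MOV/CMP/ADD/SUB immediate instruction")

    op = (thumb >> 11) & 0b11   # which of the four operations
    rd = (thumb >> 8) & 0b111   # low register r0-r7
    imm8 = thumb & 0xFF         # 8-bit immediate

    cond_always = 0b1110 << 28  # Thumb data ops are unconditional
    immediate_flag = 1 << 25    # second operand is an immediate
    set_flags = 1 << 20         # these Thumb ops always update the flags
    opcode = ARM_OPCODE[op] << 21
    rn = 0 if op == 0b00 else rd    # MOV has no first operand
    dest = 0 if op == 0b01 else rd  # CMP writes no register

    return (cond_always | immediate_flag | opcode | set_flags
            | (rn << 16) | (dest << 12) | imm8)


if __name__ == "__main__":
    print(hex(expand_thumb_imm8(0x2001)))  # Thumb: MOV r0, #1
    print(hex(expand_thumb_imm8(0x3105)))  # Thumb: ADD r1, #5
```

The result is an ordinary ARM instruction word, which is why the execute stage has no need to know which instruction set it came from.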

#49358 - sajiimori - Fri Jul 29, 2005 1:43 am

Yes. I must warn readers that Max's post consisted of speculation, backwards thinking, and misinformation.

Wintermute was overstating by saying there's no reason not to use Thumb, but at least he had his facts straight.

#49383 - Mighty Max - Fri Jul 29, 2005 7:11 am

I have to correct myself on the "It might" part, as that was real speculation; the rest stands as it is.

I might be misreading ARM's architecture documents, but afaik the translation of the 16-bit to 32-bit opcodes is not done before the cache, but in the pipeline. That leaves the cache and read arguments valid.
Arm.com wrote:
A "Thumb-aware" core is a standard ARM processor fitted with a Thumb decompressor in the instruction pipeline.


As thumb is still only a subset of ARM, the point about lower resource usage and power saving stays valid. May I quote ARM itself again:
Quote:
Thumb offers the designer
    [...]
  • Industry-leading MIPS/Watt for maximum battery life and RISC performance


And I don't think the ARM core designers failed to optimize this, such that an opcode that can't have the full range of variations (a translated thumb instruction) would still use all the components and steps a full-range instruction does. For example, the add within multiply instructions is generally missing.




PS: sajiimori: If you have doubts or problems with my posts, please address ME. I am not an imaginary third person here, and I can (unlike you?) make errors and have the ability to correct myself, as in this case. Thanks

#49423 - sajiimori - Fri Jul 29, 2005 6:05 pm

Yeah, you're right. I tried to sum up the things that were wrong with your post, but I couldn't find anything that really captured it. The whole thing seems to be based on this idea that since you can have twice the number of instructions, it must be faster. If you'd actually check some benchmarks, you'd know better.

#49432 - tepples - Fri Jul 29, 2005 6:50 pm

Each ARM instruction would have to do the work of at least two Thumb instructions in order for ARM to be faster than Thumb from the same 16-bit memory. This is likely only if you're doing a lot of shifting and/or a lot of conditional execution.
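
That break-even point can be put in rough numbers. A minimal sketch in Python, assuming every instruction fetch costs one bus access per bus-width chunk and ignoring wait states, execution cycles, and data accesses (the function names are made up for illustration):

```python
def fetch_accesses(instruction_count: int, instruction_bits: int,
                   bus_bits: int = 16) -> int:
    """Bus accesses needed just to fetch the given number of instructions."""
    # Ceiling division: a 32-bit opcode needs two accesses on a 16-bit bus.
    accesses_per_instruction = -(-instruction_bits // bus_bits)
    return instruction_count * accesses_per_instruction


def arm_fetches_less(arm_count: int, thumb_count: int,
                     bus_bits: int = 16) -> bool:
    """True if the ARM version needs strictly fewer fetch accesses."""
    return (fetch_accesses(arm_count, 32, bus_bits)
            < fetch_accesses(thumb_count, 16, bus_bits))


if __name__ == "__main__":
    # Same routine written as 10 ARM instructions vs. 15 or 21 Thumb ones.
    print(arm_fetches_less(10, 15))  # Thumb needs 1.5x the instructions
    print(arm_fetches_less(10, 21))  # Thumb needs 2.1x the instructions
```

Under this model the two tie when Thumb needs exactly twice as many instructions, so ARM only comes out ahead from 16-bit memory when the ratio exceeds two.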
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.

#49440 - sajiimori - Fri Jul 29, 2005 7:18 pm

Right, but you only run from 16-bit memory on the first pass through a loop (unless your inner loop is over 8K, in which case you might consider refactoring).
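
A rough way to see how the first pass gets amortized is the sketch below, in Python. It assumes a loop that fits in the 8K instruction cache is fetched from main RAM once and from the cache afterwards; the per-fetch costs passed in are placeholders, not measured DS timings, and real caches fill whole lines rather than single instructions.

```python
ICACHE_BYTES = 8 * 1024  # the 8K figure mentioned in the post


def loop_fetch_cost(instruction_count: int, instruction_bytes: int,
                    iterations: int, ram_cost_per_halfword: int,
                    cache_cost_per_fetch: int) -> int:
    """Total instruction-fetch cost of running a loop `iterations` times."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")

    loop_bytes = instruction_count * instruction_bytes
    # First pass: every instruction comes over the 16-bit bus.
    first_pass = (instruction_count * (instruction_bytes // 2)
                  * ram_cost_per_halfword)
    if loop_bytes <= ICACHE_BYTES:
        later_pass = instruction_count * cache_cost_per_fetch
    else:
        # Crude simplification: an oversized loop never benefits from the cache.
        later_pass = first_pass
    return first_pass + (iterations - 1) * later_pass


if __name__ == "__main__":
    # Hypothetical loop: 20 ARM instructions or 30 Thumb instructions,
    # with made-up costs of 4 per RAM halfword and 1 per cached fetch.
    for iterations in (1, 100):
        arm = loop_fetch_cost(20, 4, iterations, 4, 1)
        thumb = loop_fetch_cost(30, 2, iterations, 4, 1)
        print(iterations, arm, thumb)
```

With these placeholder numbers Thumb is cheaper for a single pass, while ARM pulls ahead once the loop repeats, because after the first pass only the instruction count matters.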

#49479 - dXtr - Sat Jul 30, 2005 1:19 am

I think I see a thumb war coming up ^^

#49481 - sajiimori - Sat Jul 30, 2005 1:50 am

Deku will save us all! I have faith. :P

#49484 - DekuTree64 - Sat Jul 30, 2005 2:58 am

sajiimori wrote:
Deku will save us all! I have faith. :P

*pops out from shadows*
Fine, fine :P
There's NO difference in speed between a decoded THUMB instruction and the equivalent ARM instruction.

I don't know about power consumption, but THUMB would probably win most of the time because of fewer memory reads on a 16-bit bus.

On the DS though, ARM might win because you can get more work done per instruction, and once they're in the cache, you have 32-bit reads. I don't know how cache compares to main RAM in terms of power, but I'd guess it takes less since it's part of the CPU itself.


THUMB is still good for non loop-heavy game code that won't get much benefit from the cache, and doesn't need to be fast anyway. No point in wasting main RAM on ARM code where it's not necessary.

Max's comment about fitting twice as many instructions into the cache is valid, but since it generally takes more THUMB instructions to do the same amount of work, and generally an inner loop will fit entirely in the cache even as ARM code, ARM will win.
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku

#49607 - torne - Sun Jul 31, 2005 9:40 pm

Just a few points to add:

1) Other than the reduced memory accesses, running in Thumb definitely takes just as much power as running in ARM, if you are doing the same operations. The execution phase of the pipeline is not aware that the original was a Thumb instruction, as the output from decode is the same whether the source is ARM or Thumb.

2) Thumb code generated by GCC is often slower than ARM code even taking 16-bit memory into account, even on platforms where the cache doesn't exist. This is because GCC is, well, crap at generating Thumb code. ARM's RVCT (RealView Compiler Toolkit) does substantially better, but is still not ideal - many segments of Thumb code produced by either GCC or RVCT can trivially be improved by anyone with a little assembly knowledge.

2a) Incidentally, yes, you can compile NDS or GBA stuff with RVCT if you have, well, paid the huge fees ARM demand for it. My work laptop has RVCT installed with a nodelock licence, so I tried it. You may need to write your own scatterload files as the GNU ld linkscript format is not compatible with armlink, but you can use the crt0.o from ndslib ok (though you can't *assemble* it with armasm because the syntax is different to GNU as). The resulting code is faster and smaller, in either ARM or Thumb mode. Oh, except where it's broken. My favourite RVCT bug is the way it ignores 'volatile' on long long quantities...

So, well, you might find it's faster to build everything as ARM anyway, and depend on the cache. Or, ideally, do explicit preloads into ITCM when you're not busy doing something else, say, during the vfill in between doing line params and mixing sounds.

#50614 - Ethos - Thu Aug 11, 2005 3:35 am

Hmmm....not buying the "ARM is better" argument.

I wrote memcpy assembly routines, in both ARM and thumb, that are called extensively over and over...and the thumb version always wins speed-wise.

Maybe I am missing the point...but I still see validity in programming in thumb.
_________________
Ethos' Homepage (Demos/NDS 3D Tutorial)