gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback Machine copies. A new forum can be found here.

C/C++ > ARM from ROM outperforming Thumb?

#13426 - poslundc - Sat Dec 13, 2003 5:03 am

Well, now that I have my hands on a PC to do my debugging stuff and can actually access the various tools available in VBA for Windows (memory viewer, disassembler, etc.) I've come across a couple of surprises.

First of all, apparently ALL of the C code I've ever compiled was compiled as ARM code, not Thumb code. It seems the -mthumb flag in my makefile was being ignored, probably because it came after the list of source files (I say this because the problem went away when I moved -mthumb to before the source file list).

This struck a bit of a surprising chord with me, as you might imagine it would if you'd found out all of the GBA code you'd ever written was compiled to the wrong target. What's interesting, though, is the difference once it's fixed: I have yet to do any actual profiling, but at first glance it doesn't seem that changing it to compile to Thumb code really makes much of a difference in terms of speed.

So now I'm thinking: yes, ARM instructions take two cycles to fetch, but the second fetch is always 0/1 waitstate (depending on prefetch); the compiler can operate much more intelligently with access to twice as many registers; and there is conditional execution for fewer branches (which cause non-sequential access to the ROM). Not to mention that ARM instructions are generally more powerful.
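
To illustrate the conditional-execution point with a contrived little function (not from my actual code):

Code:
/* In ARM mode, GCC can usually turn the if() into a conditionally
   executed instruction (roughly CMP followed by ADDNE), so there is
   no branch at all. In Thumb mode it needs a compare, a conditional
   branch and a separate ADD - and every taken branch costs a
   non-sequential ROM fetch. */
static int count_nonzero(const int *data, int len)
{
    int count = 0;
    int i;

    for (i = 0; i < len; i++)
    {
        if (data[i] != 0)
            count++;
    }
    return count;
}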

Also, because of the poor optimization and fewer registers I find that Thumb mode requires far more load/store instructions. These are not to ROM, however, so I don't know if they affect the sequential status of the system or not. (Anyone want to field this one?)

I'm not talking about hand-written assembly code here; I'm sure there would be no contest between carefully written Thumb code and ARM code when running from ROM. But in terms of what the compiler is capable of generating, has anyone experimented with how well one mode outperforms the other?

Dan.

#13437 - ampz - Sat Dec 13, 2003 12:42 pm

Are you testing this in VBA or on real hardware?
I'm pretty sure VBA is not cycle-exact, so waitstates and 16-bit buses don't matter when running in VBA.

Yes, ARM has advantages over Thumb, but you can execute twice the number of Thumb instructions from 16-bit ROM in a given period of time. In general, this gives Thumb the edge.

#13456 - torne - Sat Dec 13, 2003 7:45 pm

ARM themselves quote Thumb code as being something like 130% the speed of ARM code in 16-bit memory (and ARM code as something like 145% the speed of Thumb code in 32-bit memory). I forget the exact numbers, and the lecture notes I found them in are not available online (and seem to have been lost from my folder...).

The default wait state settings require a 2WS delay to fetch the second half, not 1WS, btw (though prefetch can still reduce it to 0). ampz, you can't quite run twice as many thumb instructions, because of the reduced penalty for sequential access.
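
For reference, the default timings stay in effect until something writes the wait state control register; most commercial games switch ROM to 3/1 waitstates and turn on prefetch early in their startup code. A rough sketch of that, using the register name and the value commonly quoted in the docs (check your own headers):

Code:
/* Wait state control register; most GBA headers call this
   REG_WAITCNT (address 0x04000204). */
#define REG_WAITCNT (*(volatile unsigned short *)0x04000204)

static void set_fast_rom_timing(void)
{
    /* 0x4317: WS0 = 3 waits non-sequential / 1 wait sequential,
       prefetch buffer on (plus SRAM/WS2 settings). The hardware
       default of 0x0000 gives the 4/2 timing described above,
       with prefetch off. */
    REG_WAITCNT = 0x4317;
}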

You're right that hand-written Thumb code blows ARM code away in 16-bit memory; I write the large majority of my code in Thumb by hand and it's very fast. In writing my OS I have yet to use a single local variable (i.e. the only memory accesses I make are to global state) - eight registers have been plenty for everything. =)

Load/store to IWRAM is 0WS (1 cycle) for 32 bits, so that's not a huge issue; in many cases using a local var in IWRAM (the default, as local vars are normally on the stack) is just as fast as running slower-to-fetch ARM code which needs no local vars.

The problem is that GCC's Thumb code generation seems to need work. I can beat it most of the time with relatively little effort. If you use a recent compiler from ARM, the performance advantage of Thumb should be more apparent.

ampz is correct that VBA is not cycle-accurate; if you've not benchmarked on real hardware, then do so, as this might make a difference.

#13466 - poslundc - Sun Dec 14, 2003 12:46 am

torne wrote:
The problem is that GCC's Thumb code generation seems to need work. I can beat it most of the time with relatively little effort. If you use a recent compiler from ARM, the performance advantage of Thumb should be more apparent.


I think this is what I was really noticing. Being limited to the bottom registers for most instructions means the compiler needs to be much more intelligent than it actually is to produce code that outperforms ARM.

Does anyone know whether or not memory accesses to non-ROM areas (IWRAM, EWRAM, I/O, etc.) affect the sequential access status for instructions being loaded in from ROM?

In that vein, if I moved Thumb code to EWRAM (which would be slower overall, but humour me here) or ran ARM code from IWRAM, would that make general sequential ROM accesses for data loading faster, since the ROM bus wouldn't keep losing its sequential position to instruction fetches?
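
(For the record, the mechanics I have in mind are just the usual section attributes - made-up function name, and assuming the stock crt0/linker script actually provides an .iwram section:)

Code:
/* Assumes the linker script defines an .iwram output section (the
   stock crt0/lnkscript setups do, I believe); long_call is needed
   because the function ends up far away from the ROM code that
   calls it. The file containing it would also have to be built as
   ARM (-marm) for "ARM code from IWRAM" to mean anything. */
#define CODE_IN_IWRAM __attribute__((section(".iwram"), long_call))

void fast_fill(unsigned int *dst, unsigned int value, int words) CODE_IN_IWRAM;

void fast_fill(unsigned int *dst, unsigned int value, int words)
{
    while (words-- > 0)
        *dst++ = value;
}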

Dan.

#13474 - ampz - Sun Dec 14, 2003 12:15 pm

poslundc wrote:
torne wrote:
The problem is that GCC's Thumb code generation seems to need work. I can beat it most of the time with relatively little effort. If you use a recent compiler from ARM, the performance advantage of Thumb should be more apparent.


I think this is what I was really noticing. Being limited to the bottom registers for most instructions means the compiler needs to be much more intelligent than it actually is to produce code that outperforms ARM.

Does anyone know whether or not memory accesses to non-ROM areas (IWRAM, EWRAM, I/O, etc.) affect the sequential access status for instructions being loaded in from ROM?

In that vein, if I moved Thumb code to EWRAM (which would be slower overall, but humour me here) or ran ARM code from IWRAM, would that make general sequential ROM accesses for data loading faster, since the ROM bus wouldn't keep losing its sequential position to instruction fetches?

Dan.


You have still not answered our question: do you run on hardware or on an emulator?

RAM accesses most probably affect the sequential status of the ROM bus. However, it might stay sequential if you enable prefetch...

#13476 - poslundc - Sun Dec 14, 2003 3:58 pm

ampz wrote:
You have still not answered our question: do you run on hardware or on an emulator?


Both. :P But like I said, I haven't done precision timing on either one.

Dan.

#13511 - Paul Shirley - Mon Dec 15, 2003 3:33 pm

removed

Last edited by Paul Shirley on Sun Mar 28, 2004 9:28 pm; edited 2 times in total

#13512 - ampz - Mon Dec 15, 2003 3:43 pm

I assume you have enabled full optimisation?
How about code size? Even if the compiler generates slow Thumb code, it should at least generate more compact Thumb code than ARM.

There are, of course, other ARM/Thumb compilers out there... Some of them will most likely generate better Thumb code than GCC.


Last edited by ampz on Mon Dec 15, 2003 4:28 pm; edited 1 time in total

#13513 - Paul Shirley - Mon Dec 15, 2003 4:07 pm

removed

Last edited by Paul Shirley on Sun Mar 28, 2004 9:28 pm; edited 1 time in total

#13514 - torne - Mon Dec 15, 2003 4:21 pm

Paul Shirley wrote:
Like I said, I noticed very little size change. Since the whole advantage of Thumb derives from fetching fewer words, it's obvious why there's little speed improvement ;)

The ARM mode compiler just optimises better. There is some advantage to Thumb, just not as much as expected (at least on my code with heavy inlining and maximised optimisation).


Don't inline code on the GBA for anything but the smallest, most critical functions, if at all. Inlining a lot balloons your code size, makes optimisation more difficult (due to there being more live registers) and doesn't increase speed significantly in most cases, as ARM branches are a very small penalty. If you disable all inlining you might find that Thumb is a size improvement.
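
Roughly the split I mean, as a sketch with invented names:

Code:
typedef unsigned short u16;

/* Worth inlining: one line, no register pressure to speak of, and
   the call overhead would rival the work being done. */
static inline u16 rgb15(u16 r, u16 g, u16 b)
{
    return (u16)(r | (g << 5) | (b << 10));
}

/* Not worth inlining: anything of real size stays a plain call.
   A BL/BX pair is only a few cycles, and the callee gets a fresh
   set of registers instead of adding to the caller's pressure. */
u16 blend_pixels(u16 a, u16 b)
{
    /* placeholder body - imagine 30-odd instructions of real work */
    return (u16)((a + b) >> 1);
}

(One related gotcha: -O3 turns on -finline-functions and will inline things behind your back, so -O2 plus explicit inlines keeps you in control of this.)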

The vast majority of GBA code that I've compiled is about 30% smaller in Thumb, but not really much faster, if at all.

#13515 - MumblyJoe - Mon Dec 15, 2003 4:48 pm

I am way under-qualified to comment on this, but my games run fine in ARM or Thumb (I don't use asm); however, ARM is far slower. Just thought you might want to know: devkitadv (and I assume GCC in general) seems far more Thumb-friendly.
_________________
www.hungrydeveloper.com
Version 2.0 now up - guaranteed at least 100% more pleasing!

#13516 - Paul Shirley - Mon Dec 15, 2003 5:21 pm

removed

Last edited by Paul Shirley on Sun Mar 28, 2004 9:29 pm; edited 1 time in total

#13517 - torne - Mon Dec 15, 2003 6:45 pm

Paul Shirley wrote:
'More live registers' is a non-problem; inlined code can have more aggressive live-range analysis performed on it, and removing call overhead cannot add to register use!


Paul, I must disagree. What you describe is an ideal world that rarely exists. While there are code samples that are both faster and smaller with aggressive inlining, there are far more that are neither. Removing function calls greatly affects register use; allowing registers to be pushed/popped at entry and exit is often faster than inlining the code and causing a spill to local variables.

Quote:
What might not be obvious is that even massive chunks of inlined code can have dramatic effects on the optimiser. The best I've seen was inlining 500 lines into a 500-line caller, which resulted in code 75% as long as the original and nearly 2x faster (not code I wrote, BTW). More visible context allows better optimisation, every time.


Such examples are extremely rare; I experimented with many different inlining setups on my current code, and inlining any function I have with more than a few instructions in it makes the code significantly slower. Also, quite a few functions are larger when inlined (e.g. 35 instructions in the caller, 22 in the callee; the inlined version has 60 instructions, a 3-instruction increase).

Quote:
That means you should go for aggressive inlining until space becomes an issue.


This is not borne out by experimentation. I suggest you compare them carefully.

This is, incidentally, why I write almost everything in asm, and the assembler I'm working on will be able to optimise across functions, giving you the majority of the speed benefits of inlining without increasing your code size.

#13518 - Miked0801 - Mon Dec 15, 2003 7:08 pm

Quote:

This is, incidentally, why I write almost everything in asm, and the assembler I'm working on will be able to optimise across functions, giving you the majority of the speed benefits of inlining without increasing your code size.


At the same time, it increases development time and the chance of introducing hard-to-find errors (YMMV). I agree that in an ideal world everything would be done in assembler. But in the world I live in, I just don't have time to write assembly for all my code. In my mind, it just doesn't make sense for menu code, score keeping, or other code that is rarely called, or that doesn't use more than 40% of the CPU when it is. In these cases, C is your friend.

BTW, I've found little difference myself between inlined and non-inlined code, since any piece of code where it would truly make a difference is already either compiled as ARM and running from RAM, or written in assembler.

Mike


Last edited by Miked0801 on Mon Dec 15, 2003 10:19 pm; edited 1 time in total

#13520 - torne - Mon Dec 15, 2003 9:12 pm

Miked0801 wrote:
At the same time, it increases development time and the chance of introducing hard-to-find errors (YMMV).


I write C and asm at about the same speed (as they are both low-level languages), and rarely introduce any errors (in any language) which are not immediately caught by unit tests.

My main project at present is to develop a high-level assembler, which should allow me to develop and debug asm faster than I currently can in C (due to the lack of suitable C development tools). It will also make it trivial to write self-documenting assembly code, eliminating what I consider to be C's only advantage over assembly (the ability to use arbitrary names for all 'things').

#13521 - Miked0801 - Mon Dec 15, 2003 10:25 pm

Self-documenting code to me means code that only the person that wrote it understands. :) I read self-documenting as code that is so clear that no comments are needed. Is this what you mean?

I've yet to see anything close to self-documenting assembly language. The Krawall stuff we use tries, by #define-ing the registers per function and then pre-processing the file with the C++ compiler before assembling (which is pretty clever). Still, with assembler I tend to write as much in comments as I do in opcodes; it's the only way to instantly remember exactly what I was trying to accomplish months later. With C code I'll leave the occasional clearing loop, init code, etc. as is, but when doing anything else I'll put a nice comment block in the code. I'm curious whether this is the norm here or not?

Mike

#13526 - poslundc - Mon Dec 15, 2003 11:14 pm

I was actually having a discussion similar to this one earlier today... when it comes to C, the ability to have meaningful variable names and constants goes a long way towards having code that is "self-documenting". Simple loops, straightforward algorithms, etc. document themselves by virtue of the fact that you can read the lines of code as English statements to see what they do. It's when the purpose of those statements goes beyond their literal interpretation that you start to need additional inline documentation.
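
A trivial made-up example of what I mean:

Code:
#define SCREEN_WIDTH 240

/* Invented names, purely for illustration: with meaningful names
   the code reads as an English statement, so it needs no comment. */
static int clamp_to_screen_x(int x)
{
    if (x < 0)
        return 0;
    if (x > SCREEN_WIDTH - 1)
        return SCREEN_WIDTH - 1;
    return x;
}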

Assembler is a different matter for me; I comment almost every single line in assembler and keep a register map on hand that translates between register numbers and the variables (or general purposes) they hold. This is just as much for myself reading my own code as for anyone else; I wouldn't have a clue what my code was doing otherwise. :)

BTW, #define macros are a feature of ANSI C and (as a result of that) of the gcc compiler; they're not a C++ thing.

Dan.

#13527 - torne - Mon Dec 15, 2003 11:55 pm

Miked0801 wrote:
Self-documenting code to me means code that only the person that wrote it understands. :) I read self-documenting as code that is so clear that no comments are needed. Is this what you mean?


Yes. All my code in any language attempts to be self-documenting; for assembly, this is currently not possible due to lack of language support (so I'm fixing the language).

It should be trivial to write self-documenting assembly for my high-level assembler, just as it is trivial in C(++), Java, Smalltalk, Forth, etc. The Krawall code you describe is likely to be poorly factored; it is unfortunate that the most comprehensible design (many functions with a single responsibility each) is not always the highest-performing in asm. The HLA will not have this problem, as it will be capable of cross-function optimisation, so you can split your code up as much as you like.

#13542 - ampz - Tue Dec 16, 2003 11:54 am

torne wrote:
I write C and asm at about the same speed (as they are both low-level languages), and rarely introduce any errors (in any language) which are not immediately caught by unit tests.

C is certainly not a low-level language.
torne wrote:
My main project at present is to develop a high-level assembler, which should allow me to develop and debug asm faster than I currently can in C (due to lack of suitable C development tools). It will also make it trivial to write selfdocumenting assembly code, eliminating what I consider to be C's only advantage over assembly (the ability to use arbitrary names for all 'things').

"high-level assembler" That's almost a contradiction in terms...

#13543 - tom - Tue Dec 16, 2003 12:08 pm

C certainly "operates" on a lower level than other programming languages. Some people even call C a high level assembler.

#13548 - poslundc - Tue Dec 16, 2003 4:26 pm

Come on now, at best that makes it a low-level language, but a high-level assembler? I doubt it. It's still a whole world of abstraction away from ASM.

I don't think any language that is machine-independent can be considered any form of assembly language. It's simply a contradiction in terms.

Dan.

#13549 - tom - Tue Dec 16, 2003 4:32 pm

dear dan, i didn't say *i* call it a high level assembler =)
but in my opinion c is only a thin (compared to other languages, anyway) abstraction over assembler, and not "whole worlds away".

#13553 - torne - Tue Dec 16, 2003 5:54 pm

I consider C to be a low-level language because its speed of development, and the size of its functional units, are comparable to good RISC assembly. It's a pretty small layer of abstraction. A clearly high-level language is something like Smalltalk (which has no primitive data types at all).

My high-level assembler is high-level only by comparison to ordinary assemblers; it's not intended to be a high-level language. It should end up with abstraction capabilities similar to C's, but without losing platform dependence or the ability to directly control the instruction sequence, and it will hopefully let me write HLA code as fast as I can write Java/Smalltalk (much faster than C) by giving me tool support for refactoring.