#930 - torne - Sun Jan 12, 2003 3:02 pm
Does anyone know anywhere online (gba related or not) where I can find tips for effective Thumb programming? I'm writing a lot of asm and am using Thumb in order to save copying stuff to iwram (as my code is pretty big and getting bigger). I keep running into things that I'm sure can be written more efficiently, however.. and was just curious if anyone had any general tips to achieve things that are nontrivial due to Thumb limitations (like the '8 bit constants, no free shifts' issue *grin*)
Thanks,
Torne
#945 - pulstar_3 - Sun Jan 12, 2003 6:23 pm
Hi,
If you're working in 32-bit areas of memory (IWRAM), then ARM code would be a lot more effective, and in any 16-bit area of memory Thumb code would be more effective. The difference is about 40%. Thumb instructions are half the size of ARM instructions, so if you want a small binary it would be best to use mainly Thumb code, but you need more instructions in Thumb code than you would using ARM code.
Hope it helped just a little bit.
#957 - Splam - Sun Jan 12, 2003 7:46 pm
I've found that it's usually just as good to do ARM code in ROM if you can't relocate it to IWRAM for memory reasons. The gains you get from the more complex instructions usually outweigh (or equal) the loss of having to read 32 bits instead of 16 per instruction. Thumb should certainly be used if what you want to write is possible using fewer instructions. It all depends on how much time you want to spend on it to get the fastest code possible under your RAM limitations.
#963 - torne - Sun Jan 12, 2003 8:45 pm
pulstar: I'm not worried about code size, I just need to reduce wram usage to the minimum. I'm only writing critical parts of the code in ARM, and that's fine.. I was looking for tips on how to get the most out of the rest, in Thumb.
splam: not for me. Static timing analysis on all the arm/thumb I've written gives me Thumb being about 35% or more faster when running direct from gamepak - I've written a lot of routines in both. I'm looking for tips on how to push that 35% up, since ARM claim I should be able to be 60% faster =)
I'm quite prepared to slave over it.. I've already spent ages making a static timing analyser (not ready for release yet) and come up with macros that switch between synthesizing register values using arithmetic and loading them from the pool depending on which is faster for that value. =)
T.
#965 - Splam - Sun Jan 12, 2003 9:26 pm
@torne
Was that code you've written in Thumb asm or compiled from C? If you've "hand crafted" it then yeah, you'll see an increase with Thumb. I presumed (for some unknown reason) you were compiling C hehe. Not sure about 60% though, I suppose it all depends on what the situation is. I reckon that's a best case scenario and therefore 35% is probably about right.
#995 - torne - Mon Jan 13, 2003 2:29 am
Yeah, I'm writing assembler myself (freestanding, not inline). There isn't enough info about asm programming available in the gba dev sites, that I can see, and almost no information about Thumb. I think I'm doing ok for now, though.. just fishing for more useful stuff.
T.
#1064 - tepples - Tue Jan 14, 2003 7:32 am
torne wrote: |
pulstar: I'm not worried about code size, I just need to reduce wram usage to the minimum. |
Make overlays. If you have several ARM routines you need to execute, but they don't depend on one another (e.g. an audio mixer and a polygon rasterizer), then copy one block of code from ROM to IWRAM, run it, copy another block into IWRAM, run it, etc. Programmers with experience from the Atari Jaguar console will be all too familiar with this technique.
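A minimal sketch of one overlay call made from Thumb, in GAS syntax. The symbols overlay_src, overlay_src_end and overlay_dst are hypothetical and would have to come from your own linker script. The block is assumed to be a non-empty multiple of 4 bytes of ARM code, linked to run at overlay_dst, entered at its first word and returning with bx lr:

```asm
    .thumb
    .align 2
    .global run_overlay
    .thumb_func
run_overlay:
    push  {lr}
    ldr   r0, =overlay_src        @ start of the block in ROM (hypothetical symbol)
    ldr   r1, =overlay_dst        @ where it should run in IWRAM (hypothetical symbol)
    ldr   r2, =overlay_src_end    @ end of the block in ROM (hypothetical symbol)
1:  ldmia r0!, {r3}               @ copy one word per iteration
    stmia r1!, {r3}
    cmp   r0, r2
    blo   1b                      @ loop until the source pointer reaches the end
    ldr   r3, =overlay_dst
    bl    2f                      @ lr = return address with the Thumb bit set
    pop   {r0}
    bx    r0                      @ return to the Thumb caller
2:  bx    r3                      @ bit 0 of r3 is clear, so this enters ARM state
    .pool
```

A DMA or a BIOS copy call could replace the hand-written loop; the plain loop is used here only to keep the sketch free of further assumptions.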
#1090 - torne - Tue Jan 14, 2003 2:44 pm
Thanks Tepples, but I'm already using IWRAM overlays for my ARM routines and it works nicely (crtls.zip is your friend). I'm writing a big system, though, and in order to leave space for the app threads I need to make as much code run direct from gamepak as possible.. hence wanting more thumb tips.
Can I presume that not very many people around here actually write thumb code? =)
T.
#1206 - dooby6 - Wed Jan 15, 2003 6:39 pm
Hello torne,
From what I have read in the ARM documents, each Thumb instruction is just an encoded version of an ARM equivalent, such as:
pop = ldmfd sp!,{reg list}
push = stmfd sp!,{reg list}
lsl = mov reg1,reg2,lsl const(reg3)
So wouldn't most of the programming tips from ARM convert to Thumb as well?
Dooby6
#1263 - torne - Thu Jan 16, 2003 1:02 am
Thumb has a whole lot of additional restrictions, Dooby.. like only 8 bit immediates can be specified as you cannot shift as a side effect of an instruction. Only registers 0-7 can be accessed by arithmetic instructions, and arguments of most instructions are constrained to be registers or very short (5 bit or less) immediates. There are
This leads to the first Thumb tip I've discovered: values like the GBA's IO registers can be encoded as mov r0, #0x40; lsl r0, #20; add r0, #0x04 (encodes REG_DISPSTAT in r0) which, although appearing verbose, is faster than ldr r0, =0x4000004 (if your constant pool is also in gamepak rom, since a 32 bit read from the gamepak takes 15 cycles). I ended up writing a macro that will pick the quickest way to encode any immediate under the Thumb restrictions; if anyone's interested I'll post it. (It detects the number of 8 bit immediates required to synthesize the value.)
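Laid out as a listing, the two alternatives look roughly like this. It is only a sketch in GAS Thumb syntax: the label load_io_address is made up, and 0x04000006 (REG_VCOUNT) is used purely as an illustrative address:

```asm
    .thumb
    .align 2
    .thumb_func
load_io_address:
    @ Option 1: build the address from 8-bit immediates (three 16-bit opcodes)
    mov   r0, #0x40
    lsl   r0, #20             @ r0 = 0x04000000
    add   r0, #0x06           @ r0 = 0x04000006

    @ Option 2: one 16-bit opcode plus a 32-bit word fetched from the literal pool
    ldr   r1, =0x04000006     @ r1 = 0x04000006
    bx    lr
    .pool                     @ the literal lands here, after the return
```

Both options occupy 48 bits of ROM; which one is quicker depends on where the pool lives and on the cart waitstates.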
T.
#1295 - zeuhl - Thu Jan 16, 2003 2:41 pm
In that case, I guess accessing REG_DISPSTAT would be even more effective with the following code :
Code: |
mov r0,#0x4
lsl r0,#24
strh r1,[r0,#4] @store something into REG_DISPSTAT
|
This way r0 will contain 0x04000000, and the strh instruction will store the contents of r1 into 0x04000004. Further accesses to registers at nearby addresses (0x04000000, 0x04000006 etc.) can be done easily by changing the #4 offset to whatever you need.
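A short sketch of the same idea with several registers sharing one base (GAS Thumb syntax; the label touch_display_regs and the choice of registers are only for illustration):

```asm
    .thumb
    .align 2
    .thumb_func
touch_display_regs:
    mov   r0, #0x4
    lsl   r0, #24             @ r0 = 0x04000000, the I/O base
    strh  r1, [r0, #0]        @ REG_DISPCNT at 0x04000000
    ldrh  r2, [r0, #6]        @ REG_VCOUNT  at 0x04000006
    strh  r3, [r0, #8]        @ REG_BG0CNT  at 0x04000008
    bx    lr
```

The immediate offset is a 5-bit field scaled by the access size, so halfword accesses reach base+0 to base+62 and word accesses base+0 to base+124. Registers beyond that need a second base or an extra add.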
#1304 - torne - Thu Jan 16, 2003 5:22 pm
Yep, that's what my macro does. Here it is in all its glory *snigger*:
Code: |
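.macro REGT r i @ automatically chooses fastest
    VALT \r, (0x4000000 + \i)
.endm

.macro VALT r i
    @ value fits in 8 bits: a single mov
    .if ( \i == (\i & 0xFF) )
        mov \r, #\i
    @ value fits in 16 bits: mov the high byte, shift, add the low byte
    .elseif ( \i == (\i & 0xFFFF) )
        mov \r, #(\i >> 8)
        lsl \r, #8
        add \r, #(\i & 0xFF)
    @ anything wider: look for the highest set bit first
    .else
        VALTLOOP \r, \i, 0x80000000, 24
    .endif
.endm

@ walks the mask t down from bit 31 to the highest set bit of i;
@ s is the shift that puts an 8-bit window with its top on that bit
.macro VALTLOOP r i t s
    .if ( \i & \t )
        VALTNEG \r, \i, (\t >> 8), \s
    .else
        VALTLOOP \r, \i, (\t >> 1), (\s-1)
    .endif
.endm

@ scans the bits between the window and the low byte; if any is set,
@ fall back to the literal pool, otherwise emit mov/lsl (and add if needed)
.macro VALTNEG r i t s
    .if ( \t == 0x80 )
        mov \r, #(\i >> \s)
        lsl \r, #\s
        .if ( \i & 0xFF )
            add \r, #(\i & 0xFF)
        .endif
    .elseif ( \i & \t )
        ldr \r, =\i
    .else
        VALTNEG \r, \i, (\t >> 1), \s
    .endif
.endm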
|
This is GAS assembler for Thumb only. The first macro, REGT, is just a shorthand for adding 0x4000000 to the address given and is used like this: (assuming you have suitable constants defined using cpp or similar)
REGT r0, REG_DMA0CNT_L @ loads address of dma0 control into r0
The real work is done by VALT, which can also be used directly to load any value. Example:
VALT r1, 0x8004 @ loads 0x8004 into r1, somehow
The other two macros are utility macros and should not be called directly.
VALT will:
* use a mov with an immediate if the value fits into the low 8 bits
* use a mov followed by a shift if the value has no more set bits than can be covered by an 8 bit value (this includes non-byte-aligned, like 0x1FE)
* use a mov followed by a shift followed by an add if the value is like the above, but with the low 8 bits also set to something - e.g. most IO registers. This takes 12 cycles (assuming it being entirely in rom) and uses 48 bits of ROM.
* For all other values, use a load from constant pool - which takes 15 cycles and uses 48 bits of ROM.
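Tracing the macros as posted by hand suggests expansions like the following. This is a reading of the macro source rather than assembler output, and the values and the label valt_examples are chosen only for illustration:

```asm
    .thumb
    .align 2
    .thumb_func
valt_examples:
    @ VALT r0, 0x3C           value fits in 8 bits
    mov   r0, #0x3C

    @ VALT r1, 0x8004         value fits in 16 bits
    mov   r1, #0x80
    lsl   r1, #8
    add   r1, #0x04

    @ VALT r2, 0xFF0000       one byte-wide run of bits, low byte clear
    mov   r2, #0xFF
    lsl   r2, #16

    @ REGT r3, 0x06           i.e. VALT r3, (0x4000000 + 0x06)
    mov   r3, #0x80
    lsl   r3, #19             @ 0x80 << 19 = 0x04000000
    add   r3, #0x06

    @ VALT r4, 0x12345678     bits set between the top window and the low byte
    ldr   r4, =0x12345678
    bx    lr
    .pool
```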
This all assumes that:
* you've not changed the cart waitstates (since that will change the timings and maybe change some decisions about 'best options')
* your code and your literal pool are all in cart ROM
* you're actually programming in thumb in the first place *grin*
I've tested this extensively so if it breaks please let me know.
Update: It broke. It failed for values >= 0x80000000 due to signedness in GAS comparisons. I fixed it above.
Torne
#1342 - torne - Thu Jan 16, 2003 9:19 pm
Actually I missed the point of zeuhl's post completely. My macro doesn't do that, just a lot of other clever tricks. Yes, that's pretty good, especially when several nearby registers are going to be accessed. I'd have to do it manually, though.. invoking the macro is quick and simple and is guaranteed to get the right result (now that I fixed it anyway). If I'm running out of vblank I might start writing store offsets...
T.