#154686 - Exophase - Sun Apr 20, 2008 8:19 am
Wondering if anyone knows this... posting here so it'll be seen/responded to, and you guys have an ARM7 to play with. This has actually been in the back of my mind for a really long time (if it belongs somewhere else more please move it though...)
If what I've read is to be believed, the ARM7 pipeline will convert Thumb instructions to ARM instructions during the second half of the decode stage of its pipeline. This is also something they supposedly changed in ARM9+ (so that Thumb instructions are decoded directly to whatever native internal formats the pipeline uses). Anyway, this should mean that for ARM7 every instruction in Thumb corresponds 1 to 1 to an ARM instruction.
Actually, there appear to be two instructions that do not have an ARM equivalent, although they're subtle. I'm referring to the instructions that I've seen called "bll" and "blh." You might not find them in proper documentation - they are the two halves of the 32-bit bl instruction. But they don't actually have to be used consecutively. Golden Sun 2 even uses blh, which is probably why it ends up at unaligned PCs. But they're useful instructions:
bll offset: lr = pc + 4 + (sign_extend_11(offset) << 12)
blh offset: _pc = lr + offset * 2; lr = ((pc + 2) | 1); pc = _pc
So there you have it, the second one is practically an implementation of the indirect blx instruction in Thumb (minus the mode switching), so long as you're okay with using lr implicitly. It even allows you to add an integer offset to it - if you use Thumb on ARM9 you could use this for a one instruction jump table lookup + call (if the code is in IWRAM located at the beginning somewhere). And they're not slow either: according to No$ bll takes 1S and blh takes 2S + 1N. That means blh is the same speed as all the usual branching instructions.
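To make the two formulas above concrete, here is a minimal Python model of the two halves. It is only a sketch: it assumes that `pc` in the formulas means the address of the half-instruction itself, it ignores memory and timing, and the names `sign_extend`, `bll`, `blh` and `bl_pair` are made up for illustration.

```python
MASK32 = 0xFFFFFFFF


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def bll(addr, lr, offset11):
    """First half: only lr changes, execution falls through to addr + 2."""
    new_lr = (addr + 4 + (sign_extend(offset11, 11) << 12)) & MASK32
    return addr + 2, new_lr


def blh(addr, lr, offset11):
    """Second half: jump to lr + offset * 2, return address (Thumb bit set) goes to lr."""
    target = (lr + ((offset11 & 0x7FF) << 1)) & MASK32
    new_lr = ((addr + 2) | 1) & MASK32
    return target, new_lr


def bl_pair(addr, hi11, lo11):
    """An ordinary bl is just the two halves executed back to back."""
    pc, lr = bll(addr, 0, hi11)
    return blh(pc, lr, lo11)


# Ordinary call from 0x08000000 to 0x08001234.
print([hex(x) for x in bl_pair(0x08000000, 0x001, 0x118)])
# blh on its own: an indirect call through lr plus a small offset.
print([hex(x) for x in blh(0x08000010, 0x03000100, 4)])
```

Used alone, `blh` behaves like a call through whatever is already in lr, which is the property the post is pointing at. Note that nothing here forces the target to be halfword aligned if lr is odd.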
But for the most part Thumb sucks. So is it at all possible that there are ARM instructions that correspond to these? Undefined/reserved ones. Or maybe ARM7 really converts most instructions but handles these in a special way.
#154696 - masscat - Sun Apr 20, 2008 1:19 pm
Exophase wrote: |
Anyway, this should mean that for ARM7 every instruction in Thumb corresponds 1 to 1 to an ARM instruction. |
There is not a 1 to 1 mapping of Thumb instructions to ARM instructions. It is actually impossible to achieve as the value read from the PC is different when in Thumb (current instruction address + 4) and ARM mode (current instruction address + 8). So any instruction that reads the PC has no single equivalent in the other instruction set.
The branch instructions are described in section A7.1.17 of the ARM Architecture Reference Manual.
The pair of branch instructions simply allow you to encode a branch of about +-4MiB using 16bit instructions. Normally, if you were writing assembler then you would write a single BL or BLX instruction with the desired PC relative offset and the assembler will form the correct branch instruction pair for you.
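As a rough illustration of what the assembler does when it forms the pair, the sketch below splits a PC-relative offset into the two 11-bit fields. The function name `encode_thumb_bl` is hypothetical, and the opcode prefixes 0xF000 and 0xF800 are the Thumb BL halfword encodings as I understand them from the architecture manual, so treat the output as a sketch rather than a reference.

```python
def encode_thumb_bl(instr_addr, target):
    """Return the two 16-bit halfwords of a Thumb BL from instr_addr to target."""
    # In Thumb state the PC reads as the instruction address + 4.
    offset = target - (instr_addr + 4)
    # 11 + 11 bits of halfword offset give a signed 23-bit byte range: about +-4MiB.
    if offset % 2 or not -(1 << 22) <= offset < (1 << 22):
        raise ValueError("target not reachable with a single BL pair")
    hi = (offset >> 12) & 0x7FF   # upper 11 bits, carried by the first half
    lo = (offset >> 1) & 0x7FF    # lower 11 bits, carried by the second half
    return 0xF000 | hi, 0xF800 | lo


first, second = encode_thumb_bl(0x08000000, 0x08001234)
print(hex(first), hex(second))
```

A backward branch works the same way because Python's `>>` is an arithmetic shift, so the sign ends up in the top bit of the first half's field.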
Why do you care how an ARM core handles instruction decoding internally? I am not sure that it is defined in the reference documentation.
Why Thumb:
ARM7TDMI Reference Manual wrote: |
Thumb code is typically 65% of the size of ARM code, and provides 160% of the performance of ARM code when running from a 16-bit memory system. Thumb, therefore, makes the ARM7TDMI core ideally suited to embedded applications with restricted memory bandwidth, where code density and footprint is important.
The availability of both 16-bit Thumb and 32-bit ARM instruction sets gives designers the flexibility to emphasize performance or code size on a subroutine level, according to the requirements of their applications. For example, critical loops for applications such as fast interrupts and DSP algorithms can be coded using the full ARM instruction set then linked with Thumb code. |
#154715 - Exophase - Sun Apr 20, 2008 9:17 pm
masscat wrote: |
There is not a 1 to 1 mapping of Thumb instructions to ARM instructions. It is actually impossible to achieve as the value read from the PC is different when in Thumb (current instruction address + 4) and ARM mode (current instruction address + 8). So any instruction that reads the PC has no single equivalent in the other instruction set. |
Although I forgot to mention that (I know there was something else that didn't map, couldn't remember what..) this is not really the same kind of difference as these constituent instructions. The PC being different does not mean that the instruction itself has to do something differently with that PC, it just means that the end result can't be achieved. As for the halfword aligned shifts on the offsets, I could see that being a byproduct of something mode dependent later in the pipeline. It still does not mean, at least not to me, that a different kind of instruction is being done. I can easily see Thumb to ARM translation still applying to what you said but I can't see it applying to the two "half instructions" I gave. There's no ARM instruction where you could fiddle with the operands a little and get this kind of functionality.
masscat wrote: |
The branch instructions are described in section A7.1.17 of the ARM Architecture Reference Manual.
The pair of branch instructions simply allow you to encode a branch of about +-4MiB using 16bit instructions. Normally, if you were writing assembler then you would write a single BL or BLX instruction with the desired PC relative offset and the assembler will form the correct branch instruction pair for you. |
You're missing the point. Of course that's why they were written, but the two constituent instructions in isolation are actually highly useful in their own right, not just as a pair. There's a reason why Golden Sun 2 uses the "blh" instruction.
Even if there's no way to get an equivalent instruction in ARM (which I'm leaning towards) this is still useful information for Thumb programmers.
masscat wrote: |
Why do you care how an ARM core handles instruction decoding internally? I am not sure that it is defined in the reference documentation. |
(now I'm just repeating myself...) It's because the documents I have read state that in ARM7, Thumb instructions are converted to ARM ones during the decode stage. That should mean, ignoring some modifications later done to the operands based on the Thumb bit, that there are hidden (reserved, or otherwise unavailable?) ARM instructions that do the equivalent of the bll/blh instructions. Instead, the truth is probably that not EVERY instruction is handled in the pipeline this way (but notice - no pipeline delays for bll/blh to be special cased for this), or my sources were just completely wrong.
And please give me a little credit, I know why Thumb was invented/used and never asked why -_-
#154726 - Maxxie - Mon Apr 21, 2008 12:26 am
For the bll the translation should be add lr,pc,#imm lsl #12
The blh, well i don't see a possible translation to an existing .arm opcode (although there are instructions that could be capable of it if the parameters would all fit into one opcode)
But the thing i am replying for:
How does blh or its possible translation allow a "one instruction jump table lookup + call"?
All i see there are arithmetic operations on the lr and pc registers, but no memory access beyond reading the instruction itself, which would be vital for a jump table lookup.
My choice in .s for that task would be the simple
[possibly push lr]
LDR lr,[(R)Base,ROffset]
BLX lr
[possibly pop lr]
In many cases (switch-case translation) the return address is known at compile time, so LDR pc,[Base,Offset] is enough: each target chunk then ends with a non-linking branch whose #imm points just behind the LDR, instead of the usual BX lr
For a call to a dynamic function table in .c/.s intermix i'd create a naked helper function JumpTable(base,offset) only containing
LDR pc,[r0,r1]
and let the compiler decide whether lr has to be backed up for the caller (could have been stored for another BL already)
#154730 - kusma - Mon Apr 21, 2008 1:00 am
Exophase wrote: |
If what I've read is to be believed, the ARM7 pipeline will convert Thumb instructions to ARM instructions during the second half of the decode stage of its pipeline. |
This sounds very unlikely to me, as this would most likely add another pipeline-step. Where did you read this?
#154732 - Exophase - Mon Apr 21, 2008 1:20 am
Maxxie wrote: |
For the bll the translation should be add lr,pc,#imm lsl #12 |
The imm here has 11 bits, so you can't do that.
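For context on why 11 bits is a problem: an ARM data-processing immediate is an 8-bit value rotated right by an even amount. The sketch below checks whether a constant fits that form; `arm_immediate_encoding` is a name invented for this example, not a toolchain function.

```python
def arm_immediate_encoding(value):
    """Return (rotate_field, imm8) if value fits an ARM immediate, else None."""
    value &= 0xFFFFFFFF
    for rot in range(16):
        shift = 2 * rot
        # Undo the rotate-right by rotating the value left by the same amount.
        imm8 = ((value << shift) | (value >> (32 - shift))) & 0xFFFFFFFF
        if imm8 < 0x100:
            return rot, imm8
    return None


print(arm_immediate_encoding(0x0FF << 12))  # 8 significant bits: fits
print(arm_immediate_encoding(0x7FF << 12))  # 11 significant bits: does not fit
```

Some individual bll offsets do fit (any value with at most 8 significant bits), but the full 11-bit range cannot be covered by a single add with an immediate.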
Maxxie wrote: |
The blh, well i don't see a possible translation to an existing .arm opcode (although there are instructions that could be capable of it if the parameters would all fit into one opcode) |
What instructions would be capable, what do you mean exactly?
Maxxie wrote: |
But the thing i am replying for:
How does blh or its possible translation allow a "one instruction jump table lookup + call"?
All i see there are arithmetic operations on the lr and pc registers, but no memory access beyond reading the instruction itself, which would be vital for a jump table lookup.
My choice in .s for that task would be the simple
[possibly push lr]
LDR lr,[(R)Base,ROffset]
BLX lr
[possibly pop lr] |
Remember, ARM7 doesn't have BLX. This is basically a replacement for it. But it allows an offset too, which I guess I was figuring could be used as a table offset. Naturally it wouldn't do what you're referring to in one instruction though (definitely not memory accesses).
kusma wrote: |
This sounds very unlikely to me, as this would most likely add another pipeline-step. Where did you read this? |
Multiple things can happen in a pipeline stage (especially the third stage for the classic ARM pipeline). I don't have a lot of references for it right now, mostly something I've read before (that it's translated). A diagram in this PDF shows it though (page 9):
http://www.iti.uni-stuttgart.de/~rainer/Lehre/SoC99/PDF/19991207bw.pdf
Also:
http://www.cs.ucr.edu/~gupta/research/Publications/Comp/L14-krishnaswamy.pdf
(section 2.2)
#154764 - masscat - Mon Apr 21, 2008 1:27 pm
Fire the ARM instruction encodings marked as UNPREDICTABLE at an ARM core and see what they do to the register set.
I still do not understand why you want to do this (anything more than just interest). Even if you find an instruction encoding that has the equivalent behaviour it is not of much use as ARM could happily change it between core versions and even implementations of a particular core, as the encoding should never be seen from an external source.
#154765 - simonjhall - Mon Apr 21, 2008 1:51 pm
I could understand if it's part of an emulator (exophase, isn't it you who did a gba emulator?) as you wouldn't need to code and test a second instruction set. You could just pipe a THUMB instruction through to the relevant ARM function to save yourself some time.
I can't think of any other use - if this THUMB->ARM translation is not programmer visible then it doesn't really matter if a given implementation does it or not!
Out of curiosity, how do you know that ARM instructions aren't converted to THUMB? ;-)
_________________
Big thanks to everyone who donated for Quake2
#154791 - Exophase - Mon Apr 21, 2008 7:27 pm
masscat wrote: |
Fire the ARM instruction encodings marked as UNPREDICTABLE at an ARM core and see what they do to the register set.
I still do not understand why you want to do this (anything more than just interest). Even if you find an instruction encoding that has the equivalent behaviour it is not of much use as ARM could happily change it between core versions and even implementations of a particular core, as the encoding should never be seen from an external source. |
It is mostly just curiosity. However, even if it's just available on some cores that doesn't make it useless. For instance, if such a thing could be accessed on the ARM7TDMI in the DS then it could be slightly useful. If it could be accessed in the ARM920T on the GP2X it'd be slightly useful for me too. Homebrew code is already usually pretty platform specific and these particular instructions can be replaced with safe sequences that'll work on any ARM w/o destroying any additional registers.
Unfortunately I don't really have time to play around with something that's so unlikely :( I was kinda hoping someone would just know somehow ;)
simonjhall wrote: |
I could understand if it's part of an emulator (exophase, isn't it you who did a gba emulator?) as you wouldn't need to code and test a second instruction set. You could just pipe a THUMB instruction through to the relevant ARM function to save yourself some time. |
Although I do emulate ARM/Thumb separately I've worked on some Thumb to ARM translation code, with the main goal being to optimize the code in the process. Here I've just used some reserved bit spaces (which is how I know it can fit, at least somewhere), so it's not really a big deal. But of course this is what got me thinking about all this.
simonjhall wrote: |
I can't think of any other use - if this THUMB->ARM translation is not programmer visible then it doesn't really matter if a given implementation does it or not! |
If it's not possible to trigger it externally then it would be useless. But if the second part of the decode pipeline expects a real ARM instruction then that's what the Thumb expand stage would have to give it. Of course I doubt this is how it actually works, that's just how it has been described.
I could be totally off base with this, but I kinda hope that it might be useful to someone doing Thumb on ARM7 at least (which should actually still be a legitimate thing to do on DS)
simonjhall wrote: |
Out of curiosity, how do you know that ARM instructions aren't converted to THUMB? ;-) |
ARM was there first, plus Thumb is a tiny subset of ARM. It'd be great if that were somehow possible though (it almost is with Thumb-2, but there'd be no point in compressing ARM code on the fly just to have to decompress it again)
#154805 - Dwedit - Mon Apr 21, 2008 9:00 pm
Does the GBA implement the forbidden "never" mode of all instructions as nops? Haven't tested this, but I bet it does.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."
#154882 - masscat - Tue Apr 22, 2008 10:05 am
Dwedit wrote: |
Does the GBA implement the forbidden "never" mode of all instructions as nops? Haven't tested this, but I bet it does. |
ARM Tech Ref wrote: |
In ARMv4, any instruction with a condition field of 0b1111 is UNPREDICTABLE. |
So what they do is undefined. Some encodings may do nothing, some might do stuff.
Another problem with using undefined instruction encodings is that you also need to hack the toolchain (compiler and assembler) to support them. A lot of work for very little gain.
#154885 - eKid - Tue Apr 22, 2008 11:47 am
Quote: |
Another problem with using undefined instruction encodings is that you also need to hack the toolchain (compiler and assembler) to support them. A lot of work for very little gain. |
Maybe you can use the assembler macros to define custom instructions.
#154997 - Maxxie - Wed Apr 23, 2008 9:08 pm
Could be interesting to test out how the CPUs react to the opcodes marked green or red in this map
I'm tempted to try them out and see which of them causes UND or anything else. They might be marked as unpredictable, but as long as N doesn't exchange the cores ...
#154999 - simonjhall - Wed Apr 23, 2008 9:19 pm
You could always write a program that automatically generated and ran instructions, and see which ones don't cause an undefined instruction exception?
50p says that there are some really juicy instructions in there...I'm hoping there are some fancy fp vector multiplying functions ;-)
#155005 - Maxxie - Wed Apr 23, 2008 10:04 pm
That's what i'm trying to do right now.
Automating this requires recovering from UND traps and undocumented register changes, as well as possible data aborts and so on. I don't feel like manually restarting for every n-th possible opcode.
#155010 - simonjhall - Wed Apr 23, 2008 10:21 pm
Shouldn't be too hard!
The simplest method I can think of off the top of my head would be: just install your own exception handler, and run your piece of code which runs your new generated instructions. If the exception handler is run the exact same number of times as 'new' instructions run, then that's a no-go! If it does differ (ie there are some instructions which don't generate an exception), binary chop your generated instructions until you find out what works.
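The counting-plus-binary-chop idea above can be sketched on the host side like this. `count_traps` is a hypothetical stand-in for "run this batch on the hardware and report how many times the exception handler fired"; the sketch also assumes each opcode traps or not independently of its neighbours and that a non-trapping opcode does not crash the run.

```python
def find_non_trapping(opcodes, count_traps):
    """Return the opcodes in `opcodes` that did not raise an undefined trap."""
    if not opcodes:
        return []
    if count_traps(opcodes) == len(opcodes):
        return []                  # every opcode trapped: a no-go for this batch
    if len(opcodes) == 1:
        return list(opcodes)       # narrowed down to a single working opcode
    mid = len(opcodes) // 2
    return (find_non_trapping(opcodes[:mid], count_traps)
            + find_non_trapping(opcodes[mid:], count_traps))


# Fake "hardware" for demonstration only: pretend exactly one opcode executes.
working = {0xE00F00D0}


def fake_count_traps(batch):
    return sum(1 for op in batch if op not in working)


batch = list(range(0xE00F00C0, 0xE00F00E0))
print([hex(op) for op in find_non_trapping(batch, fake_count_traps)])
```

With large batches that trap everywhere this needs only one run per batch, which is the whole point of counting first and chopping only when the counts differ.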
Stick a pointer to your exception handler in at 0x27ffd9c and make it look something like this: (no idea if it compiles)
Code: |
exception_handler:
//store your stuff (sp is reused as a scratch register below, so it is saved too)
ldr ip,=registers
stmia ip,{r0-r11,sp}
//increment your exception hit count
ldr ip,=exception_hit_count
ldr sp,[ip]
add sp, sp, #1
str sp,[ip]
/* if you want to call a real function, do this instead
ldr ip,=my_new_stack
ldr sp,[ip]
mov lr,pc
b my_exception_function
*/
//restore your registers
ldr ip,=registers
ldmia ip,{r0-r11,sp}
//now do the bios code (not my code btw)
ldmia sp!, {ip, lr}
mcr p15, 0, ip, c1, c0, 0
msr SPSR_fsxc, lr
ldmia sp!, {ip, lr}
//assuming lr is the raw undef-mode lr, lr - 4 re-runs the trapping opcode; movs pc, lr would skip it
subs pc, lr, #4 |
(this is effectively the core of my debugger btw, so should be legit...)
EDIT: missed out the store instruction
EDIT2: assigned a stack
Last edited by simonjhall on Thu Apr 24, 2008 7:44 am; edited 2 times in total
#155021 - Maxxie - Wed Apr 23, 2008 11:31 pm
On the first batch on the arm9:
The opcodes 0b1110 0000 01?? ???? ???? ???? 1001 ???? do all trap undef (256 KiInstructions)
Same with 0b1110 00001 1??? ???? ???? ???? 1001 ????
And 0b1110 0011 0000 ???? ???? ???? ???? ???? ???? (TEQ with s = 0)
Eureka!
There is one: an "ldrd" (a v5e instruction) works on the arm9. 0xE00F00D0 loaded two instructions into the register pair r0:r1
[edit]Just noticed that oppsilon's site is dedicated to the nds cores, wonder if he did check them all or just took them out of the docu[/edit]
Possibly one of the opcodes before did too, but was not detected, because i just noticed that it didn't check if the exception was really an undef or possibly a data abort.
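The wildcard patterns in this post can be checked mechanically. The sketch below treats `?` as a don't-care bit and turns a pattern into a mask/value pair; the helper names `parse_pattern`, `matches` and `pattern_size` are invented for illustration. For the first pattern it reproduces the 256 Ki figure quoted above.

```python
def parse_pattern(pattern):
    """Turn '1110 0000 01?? ...' into (mask, value); '?' marks a don't-care bit."""
    bits = pattern.replace(" ", "")
    if len(bits) != 32:
        raise ValueError("an ARM opcode pattern needs exactly 32 bits")
    mask = int("".join("0" if b == "?" else "1" for b in bits), 2)
    value = int(bits.replace("?", "0"), 2)
    return mask, value


def matches(opcode, pattern):
    """True if the 32-bit opcode agrees with every fixed bit of the pattern."""
    mask, value = parse_pattern(pattern)
    return (opcode & mask) == value


def pattern_size(pattern):
    """Number of distinct opcodes the pattern covers."""
    return 1 << pattern.count("?")


first_batch = "1110 0000 01?? ???? ???? ???? 1001 ????"
print(pattern_size(first_batch) // 1024, "Ki opcodes")
print(matches(0xE0400090, first_batch), matches(0xE00F00D0, first_batch))
```

The same mask/value pair is what a test generator would iterate over, filling the don't-care bits with a counter.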
Last edited by Maxxie on Thu Apr 24, 2008 1:36 am; edited 3 times in total
#155042 - masscat - Thu Apr 24, 2008 9:44 am
The ARM9 in the DS is an ARM946E-S which implements the ARMv5TE architecture. So it will have all the enhanced DSP instructions.
#155047 - simonjhall - Thu Apr 24, 2008 12:52 pm
So what's involved with "DSP instructions"? Anything cool?
#155048 - Maxxie - Thu Apr 24, 2008 1:29 pm
ldrd/strd - loading & storing doublewords (64 bit)
qadd/qdadd - saturated add
qsub/qdsub - saturated subtract
smul?? - 16&32 bit intermixed signed multiplications (with accumulation)
pld - preloading data
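For anyone who has not met saturating arithmetic, here is a small Python model of what the q* instructions in the list compute, as far as I understand the ARMv5TE definitions. It works on Python integers standing in for signed 32-bit register values and leaves out the sticky Q flag that the real instructions set on saturation.

```python
INT32_MIN, INT32_MAX = -(1 << 31), (1 << 31) - 1


def saturate(x):
    """Clamp x to the signed 32-bit range instead of wrapping around."""
    return max(INT32_MIN, min(INT32_MAX, x))


def qadd(rm, rn):
    return saturate(rm + rn)


def qsub(rm, rn):
    return saturate(rm - rn)


def qdadd(rm, rn):
    # The doubling saturates first, then the add saturates again.
    return saturate(rm + saturate(2 * rn))


def qdsub(rm, rn):
    return saturate(rm - saturate(2 * rn))


print(hex(qadd(0x7FFFFFFF, 1)))   # sticks at the maximum instead of wrapping
print(qsub(INT32_MIN, 1) == INT32_MIN)
```

A plain add of 0x7FFFFFFF and 1 would wrap to a negative number, which is why these are handy for audio mixing and similar fixed-point code.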
I thought the arm9 was only a subset of the arm946e-s (which seemed the nearest known core to me) well, now i know better :) thanks masscat
Think I'll just try v6 etc. instructions, maybe there is a glimmer somewhere, but I doubt that now.
#155049 - simonjhall - Thu Apr 24, 2008 1:46 pm
Do they function or just not cause exceptions? I never got any love from pld and I have doubts that it works...
#155050 - masscat - Thu Apr 24, 2008 1:53 pm
Maxxie wrote: |
I thought the arm9 was only a subset of the arm946e-s (which seemed the nearest known core to me) well, now i know better :) thanks masscat |
I do not know that as fact (was quoting gbatek and others). But I cannot see Nintendo/ARM having a core that does some of the ARMv5TE but not all as it would require custom tools and the like.
The Makefile used to build libnds and the one from the NDS examples from DKP sets the following gcc options:
Quote: |
-march=armv5te -mtune=arm946e-s |
The STRD and LDRD instructions are definitely supported (as you found), as I had to add support for them to desmume because the palib startup code uses them.
simonjhall:
It is perfectly valid for PLD to do bugger all:
Quote: |
It has no architecturally defined effect, and memory systems that do not support this optimization can ignore it. On such memory systems, PLD acts as a NOP. |
#155051 - eKid - Thu Apr 24, 2008 1:57 pm
I used pld once and it shaved off a couple % cpu :P. It's a bit hard to control though... I added it in another spot (which was a very similar operation) and it raised the cpu usage.
#155055 - simonjhall - Thu Apr 24, 2008 3:21 pm
Yeah that could be anything then I guess - better or worse usage of instruction cache?
@masscat, I think you're right about the tools and aww shucks about that quote from the ARM manual!