#129847 - Ant6n - Mon May 28, 2007 5:56 am
Hi,
I am currently exploring the possibility of the emulation of some old architecture on the DS. I have a couple of questions:
1) In some assemblers i have seen an org directive, i.e. saying org 0100h puts the statement following the directive at 0100h; does that exist in devkitPro's gas?
2) I intend to put the whole cpu interpreter in the itcm, put the emulated devices' memory in the main ram or vram, and keep the data cache for the memory. If i put ".section .itcm,"ax", %progbits", and compile with the newest devkitPro, will that accomplish that, or do i have to delve into the docs for that ominous cp15?
3) MIPS has some cool branch option where after a branch the following instruction(s) would still be executed. One could run some useful instruction in the branch delay and cut down the effective cycles it takes to branch; is there something similar on ARM, to cut down on the 3 cycle branch within itcm?
4) I understand the itcm has a waitstate of 0, meaning a 1 cycle load if there is no interlock. Does anybody know the cycles for the other memory areas (i.e. vram, shared wram)?
thx
anton
#129850 - keldon - Mon May 28, 2007 8:47 am
3: the arm has the same feature
4: check out gbatek; tells ye all that needeth be known in thee seven se..... (must watch pirates of the Caribbean)
#129854 - kusma - Mon May 28, 2007 9:43 am
keldon wrote: |
3: the arm has the same feature |
Uhm, what?! I'm pretty sure this is not true... Does the instruction you mean support executing the next couple of instructions (after the branch instruction, as in MIPS) while flushing the pipeline?
#129855 - keldon - Mon May 28, 2007 9:51 am
kusma wrote: |
keldon wrote: | 3: the arm has the same feature |
Uhm, what?! I'm pretty sure this is not true... Does the instruction you mean support executing the next couple of instructions (after the branch instruction, as in MIPS) while flushing the pipeline? |
Oh, I completely misread that and immediately thought of execute (with condition).
#129975 - Miked0801 - Tue May 29, 2007 5:30 pm
ARM has conditional execution - not branch delay.
Meaning quick if/else assign type checks don't branch per se.
cmp r1, #0
movgt r0, #0
movle r0, #1
So instead of having to branch around the assignments, it just executes straight through, with the instruction whose condition fails taking only 1 cycle instead of the 3 a taken branch costs. It makes for better caching as well.
#129999 - Ant6n - Tue May 29, 2007 11:56 pm
Quote: |
4: check out gbatek... |
yarr, gbatek only gives waitstates for the GBA, i have searched through the whole document and found nothing (except a note that vram gets an extra waitcycle every 6 cycles - does that mean vram has usually 0 waitstate?).
Quote: |
ARM has conditional execution... |
I am aware of these, but they might not be very useful to me if i 'switch' on 256 opcodes, and 256 "mod reg r/m"'s. also, i am trying to get a layout where the cpu flags don't get overwritten by the emulator so that the flags in cpsr correspond to the flags of my emulated cpu.
My idea was to split the itcm up into equal sized blocks, so that i can do a
Code: |
jump ((opcode << 6)+opcode start)
|
and
Code: |
branch'n'link ((modregrm << 6)+modregrm start)
|
In the end the branch operations would take a significant amount of time. I didn't think a branch delay would exist on ARM, but i wanted to ask.
This jumping model is also why I ask about the "org" directive, to be able to declare labels that start a specific point in itcm (and not having to count instructions off myself).
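
The fixed-size block layout described above can be sketched in C as plain address arithmetic. This is only an illustration under assumptions: the 64-byte slot size comes from the `<< 6` in the post, while the names `slot_address` and `region_bytes` and the idea of passing the region start as a number are made up for the example.

```c
#include <stdint.h>

/* Hypothetical layout: every handler gets one 64-byte slot,
 * i.e. room for 16 ARM instructions (the "<< 6" in the post). */
enum { SLOT_SHIFT = 6, SLOT_BYTES = 1 << SLOT_SHIFT };

/* Address of the slot for an 8-bit index (an opcode or a mod-reg-r/m byte),
 * given the start address of the region that holds the 256 slots. */
static uint32_t slot_address(uint32_t region_start, uint8_t index)
{
    return region_start + ((uint32_t)index << SLOT_SHIFT);
}

/* Bytes needed for one region of 256 slots. */
static uint32_t region_bytes(void)
{
    return 256u * SLOT_BYTES;
}
```

One region of 256 slots comes to 16 KB, so an opcode region plus a mod-reg-r/m region of this size would already take up 32 KB, which is the full size of the ITCM on the DS.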
#130002 - gladius - Wed May 30, 2007 12:57 am
Ant6n wrote: |
1) In some assemblers i have seen an org directive, i.e. saying org 0100h puts the statement following the directive at 0100h; does that exist in devkitPro's gas?
|
gas does support the .org directive, however it is quite limited. It operates within the current section only, and can only move the location counter forward. It'd probably be easier to use linker scripts and create new sections for the areas you want placed exactly at some memory address.
Ant6n wrote: |
2) I intend to put the whole cpu interpreter in the itcm, put the emulated devices' memory in the main ram or vram, and keep the data cache for the memory. If i put ".section .itcm,"ax", %progbits", and compile with the newest devkitPro, will that accomplish that, or do i have to delve into the docs for that ominous cp15?
|
Not sure, but CP15 isn't really that bad actually, especially if you are only playing with the basic caching mechanisms.
Ant6n wrote: |
4) I understand the itcm has a waitstate of 0, meaning a 1 cycle load if there is no interlock. Does anybody know the cycles for the other memory areas (i.e. vram, shared wram)?
|
shared iwram and vram are both waitstate 0 (vram does have cycle access penalties if being used by the gpu though).
Btw, trying to keep the emulated flags in CPSR will be very, very tricky. You will probably end up doing a bunch of msr,mrs combos unless the CPU architecture you are emulating maps very cleanly to ARM. Cycle counting/figuring out when to stop emulating is the one that usually causes a bunch of problems.
If you are intent on saving those cycles, I'd recommend going for a dynamic recompilation approach. I used this on the GBA to emulate the SPC chip from the SNES and it worked quite well. The source for that is up at http://forwardcoding.com/projects/spcemu.html.
Good luck :).
#130006 - Dwedit - Wed May 30, 2007 1:50 am
Which architecture is it? It's always good to see if it's been done already.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."
#130009 - Ant6n - Wed May 30, 2007 2:27 am
i am looking at 80186 real mode (/8086/80286). main memory is the main ram, and i am thinking of having the interrupt vector such that it can point to emulated code or arm code. Also, main ram should be protected, so that accesses to the i/o mapped areas cause an exception to be handled. My hope is that arm modules emulating devices can be added later; although to be honest i have not found sufficient documentation on how memory mapped i/o works on the 8086, and also how people accessed high mem.
I have seen some talk about "DOS emulation", but i have not found any results.
I am aware of dynamic recompilation, but i think it's too complex for me at this point. I think i can avoid using flags if all program flow is through switch statements.
Quote: |
You will probably end up doing a bunch of msr,mrs combos unless the CPU architecture you are emulating maps very cleanly to ARM. |
actually i only read the cpsr when all flags are needed (that's also when i calculate PF and AF).
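
Computing PF and AF only on demand, as described above, could look roughly like the following C sketch. It assumes the emulator keeps the two operands and the result of the last add or subtract around; the function names are hypothetical and not taken from any existing core.

```c
#include <stdint.h>

/* x86 parity flag: 1 when the low 8 bits of the result
 * contain an even number of set bits. */
static unsigned parity_flag(uint32_t result)
{
    uint32_t x = result & 0xFFu;
    x ^= x >> 4;          /* fold the byte down to 4 bits */
    x ^= x >> 2;          /* ... to 2 bits */
    x ^= x >> 1;          /* ... to 1 bit: 1 means an odd bit count */
    return (~x) & 1u;     /* PF is the inverse of that */
}

/* x86 auxiliary carry flag for add/sub: the carry (or borrow)
 * between bit 3 and bit 4, recovered from operands and result. */
static unsigned aux_carry_flag(uint32_t a, uint32_t b, uint32_t result)
{
    return ((a ^ b ^ result) >> 4) & 1u;
}
```

The cost of this approach is that the last operands have to stay available somewhere until the flags are actually read.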
#130010 - gladius - Wed May 30, 2007 2:55 am
Hm.. how do you do cycle counting, or stop the emulation loop then?
I suppose having a high frequency timer irq that checked the emulation state and stopped it if enough cycles had passed might work, but accuracy/overhead problems probably would rule this one out.
Another alternative would be to have a look-up table for the "remaining cycles" count, and execute the instruction there, which could be a stop emulation, or continue onwards. Something like this:
Code: |
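@ r11 - base of the cycles remaining lookup table (one branch instruction per entry)
@ r12 - cycles remaining
sub r12, r12, #cycles
add pc, r11, r12, lsl #2   @ jump into the table entry for the new count
@ --> after the jump, in the table
b nextOpcode
@ or
b stop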
|
But you are sacrificing a register, and a healthy pool of 0-waitstate ram to get this method to be fast enough. Or you could add an instruction or two to reduce the ram requirements, but at that point, it'd probably be worth it to do the standard NZ flag optimization and keep CPSR free for normal program flow.
Anyhow, I'll be interested to see what you come up with, sounds like a fun project :).
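
As a rough C model of the "remaining cycles" table idea, the decision to continue or stop can be looked up instead of compared. Everything here is an assumption made for illustration: the budget and cost limits, the function-pointer table (the assembly version would hold branch instructions instead), and all names.

```c
#include <stddef.h>

/* Hypothetical limits: the budget never exceeds MAX_BUDGET cycles and no
 * single instruction costs more than MAX_COST, so "remaining" stays
 * within -MAX_COST .. MAX_BUDGET. */
enum { MAX_BUDGET = 64, MAX_COST = 16, TABLE_SIZE = MAX_BUDGET + MAX_COST + 1 };

typedef int (*step_fn)(void);      /* returns 1 to keep running, 0 to stop */

static int keep_running(void) { return 1; }
static int stop_running(void) { return 0; }

static step_fn cycle_table[TABLE_SIZE];

/* Fill the table once: entries for a positive remaining count continue,
 * entries for zero or a negative count stop. */
static void init_cycle_table(void)
{
    for (int i = 0; i < TABLE_SIZE; i++) {
        int remaining = i - MAX_COST;
        cycle_table[i] = (remaining > 0) ? keep_running : stop_running;
    }
}

/* Charge "cost" cycles and dispatch through the table without a compare. */
static int charge_cycles(int *remaining, int cost)
{
    *remaining -= cost;
    return cycle_table[*remaining + MAX_COST]();
}
```

The table size grows with the cycle budget, which is the "healthy pool of 0-waitstate ram" trade-off mentioned above.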
#130018 - Ant6n - Wed May 30, 2007 7:41 am
i am not very concerned about accuracy. the programs of back in the day had to run on all cpus from 4.7 to 24 MHz (86-286), emulation should be somewhere in between, if at all. if it's too slow there is no point in counting. if it's too fast i might add some register that gets counted up and is checked by a timer interrupt or vblank or so. i'd have an approximate cycle count that counts based on every opcode, and if an instruction runs in 3/8 cycles based on addressing, i just add 5 cycles.
I am much more concerned about memory i/o and extended memory access and such
#130034 - Dwedit - Wed May 30, 2007 10:59 am
Usually memory mapped IO is done by having all memory accesses be a jump into a memory read/write routine. You use the upper bits of the address as a jump table.
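
A minimal C sketch of that dispatch, assuming a 1 MB real-mode address space split into sixteen 64 KB regions with the first 640 KB backed by plain RAM. The region size, the handlers and the value returned for unmapped areas are all assumptions for illustration, not how any particular emulator does it.

```c
#include <stdint.h>

enum { REGION_SHIFT = 16, REGION_COUNT = 16 };   /* 1 MB / 64 KB */

typedef uint8_t (*read_fn)(uint32_t addr);

static uint8_t ram[1u << 20];                    /* emulated 1 MB space */

static uint8_t read_ram(uint32_t addr)      { return ram[addr]; }
static uint8_t read_unmapped(uint32_t addr) { (void)addr; return 0xFF; }

/* One handler per 64 KB region: 10 regions (640 KB) of conventional RAM,
 * the rest left to device handlers (all unmapped in this sketch). */
static read_fn read_table[REGION_COUNT] = {
    read_ram, read_ram, read_ram, read_ram, read_ram,
    read_ram, read_ram, read_ram, read_ram, read_ram,
    read_unmapped, read_unmapped, read_unmapped,
    read_unmapped, read_unmapped, read_unmapped
};

/* Every emulated byte read goes through the table; the upper address
 * bits select the handler. The mask keeps the index in range. */
static uint8_t read8(uint32_t addr)
{
    addr &= 0xFFFFFu;
    return read_table[addr >> REGION_SHIFT](addr);
}
```

A device such as a graphics adapter would then be added by replacing the entry for its region with its own read handler, plus a matching write table.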
#130065 - Ant6n - Wed May 30, 2007 6:35 pm
I have an idea how to implement it, but i don't know exactly what regions are essential to implement. There is no gbatek equivalent for the 80286 pc ;)
#130119 - Dwedit - Thu May 31, 2007 1:42 am
Bochs source code?
#130153 - keldon - Thu May 31, 2007 10:39 am
There has got to be plenty of documentation on the 80286 processor.
#130234 - Ant6n - Fri Jun 01, 2007 9:42 am
i think i have found most of the general documentation needed, so i could plan the general layout and focus on the cpu emulation again.
i have some issues with emulating the arithmetic flags efficiently. I could either use the arm flags and for example for a 16 bit add do a
1) adds (a << 16), (b << 16)
or i could add them directly
2) add a,b
and then try to extract the flags.
If i do #1, which means i waste some time for moving bits back and forth (despite the shifts on arithmetic instructions), i can either load the cpsr into some other register, which wastes time, or i could try to maintain all flags in the cpsr and not use them until needed. this makes implementing segment overrides very difficult. BTW, do segment overrides on non-string instructions always override SS?
if i attempt #2, i can save some cycles because i have to shift around the data much less, but i also have to extract all the flags myself if needed, and that seems pretty messy business. the carry flag is easy, so is the sign flag but the overflow flag seems to have several conditions. are there simple/fast ways to check overflow?
thoughts?
it looks like i am asking pretty specific stuff here, but it does affect the general layout.
anton
#130238 - kusma - Fri Jun 01, 2007 10:53 am
I doubt it will make a big difference in reality, and I'd say method #2 sounds a lot easier to debug.
#130243 - tepples - Fri Jun 01, 2007 12:40 pm
keldon wrote: |
There has got to be plenty of documentation on the 80286 processor. |
But where is good documentation about the rest of the components on the bus?
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.
#130249 - keldon - Fri Jun 01, 2007 2:17 pm
Well much of what is generally true today was generally true then, well actually it would be the other way around. I doubt he would want to emulate every aspect of the bus as he would have immediate trouble in making any use of [say] an ISA bus with no ISA card; although that would be an even 'cooler' project if it did that.
Imagine that, a complete virtual computer complete with 'random peripherals' and upgrades.
But if anything the first step is getting the CPU core working.
#130259 - Dwedit - Fri Jun 01, 2007 5:11 pm
Ant6n wrote: |
1) adds (a << 16), (b << 16)
|
The emulators Goomba and SMSAdvance have the 16 bit registers always in a shifted state. Consider the pros and cons of that approach as well.
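
To illustrate why the always-shifted layout is attractive: if a 16-bit value lives in bits 31..16 of a 32-bit register with the low half zero, the native 32-bit flags of an add line up with the 16-bit flags. The C sketch below models that, with a hypothetical `flags_t` struct standing in for the ARM N, Z, C and V bits; it is not code from Goomba or SMSAdvance.

```c
#include <stdint.h>

/* Stand-in for the ARM condition flags after an "adds". */
typedef struct { unsigned n, z, c, v; } flags_t;

/* a_sh and b_sh hold 16-bit values in bits 31..16, low half zero.
 * The result is returned in the same shifted form. */
static uint32_t add16_shifted(uint32_t a_sh, uint32_t b_sh, flags_t *f)
{
    uint32_t r = a_sh + b_sh;                   /* wraps modulo 2^32 */
    f->n = r >> 31;                             /* sign = bit 15 of the 16-bit result */
    f->z = (r == 0);                            /* zero */
    f->c = (r < a_sh);                          /* carry out of bit 31 = 16-bit carry */
    f->v = ((a_sh ^ r) & (b_sh ^ r)) >> 31;     /* signed overflow */
    return r;
}
```

The downside is that every value crossing into memory or being used as an address has to be shifted back down by 16 first.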
#130263 - Ant6n - Fri Jun 01, 2007 5:39 pm
Quote: |
The emulators Goomba and SMSAdvance have the 16 bit registers always in a shifted state. Consider the pros and cons of that approach as well.
|
good idea.
as for documentation, i haven't found out how devices/drivers register their i/o with the system, but i found some general docs on some graphics, sound and xms, and they all seem to use either swi(/normal functions), ports or high memory areas (>640K) as interface.
the first two are straightforward to implement, irq's could even be made to either point to arm or x86 code, ports are always arm (i haven't seen anything on user applications registering new ports).
High memory areas can be done using the protection unit. some regions don't even have to be necessarily protected, like the cga area. i.e. the arm7 could just copy'n'translate the vid mem into vram a couple times a second and pretend to be a graphics adapter.
#130281 - gladius - Fri Jun 01, 2007 11:06 pm
The approach of keeping the registers shifted works very well. Loopy used that for his nes and snes arm cores. There is still some shifting necessary when doing 8 bit operations, but it usually works out quite well.
Manually computing the V flag is usually 3 or 4 ops. The NZ flags are easy of course, but C can be a bit tricky at times as well.
The snesadvance (this one is particularly good), pocketnes, goomba and smsadvance sources are all good places to look for hints on how to design an efficient 8/16 bit emulation architecture.
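
For the unshifted case, the manual carry and overflow computation mentioned above can be written with a few XOR/AND operations. This is a sketch assuming both operands are already in the range 0..0xFFFF; the function names are made up for the example.

```c
#include <stdint.h>

/* Carry out of a 16-bit add: bit 16 of the 32-bit sum. */
static unsigned add16_carry(uint32_t a, uint32_t b)
{
    return ((a + b) >> 16) & 1u;
}

/* Signed overflow of a 16-bit add: the operands have the same sign
 * and the result's sign differs from it. */
static unsigned add16_overflow(uint32_t a, uint32_t b)
{
    uint32_t r = a + b;
    return ((~(a ^ b) & (a ^ r)) >> 15) & 1u;
}

/* Signed overflow of a 16-bit subtract (a - b): the operands have
 * different signs and the result's sign differs from a's. */
static unsigned sub16_overflow(uint32_t a, uint32_t b)
{
    uint32_t r = a - b;
    return (((a ^ b) & (a ^ r)) >> 15) & 1u;
}
```

Each overflow check is two XORs, one AND (or BIC) and a bit extract, which is in line with the "3 or 4 ops" estimate.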
#130321 - Ant6n - Sat Jun 02, 2007 7:30 am
ok, technical question. I am trying to create a jump table. I have a bunch of labels (300-500), and i have a big array of 16bit values (~8000) that are supposed to hold the addresses of the labels. I'd just say something like
Quote: |
.section .itcm,"ax", %progbits
.align 2
Lo00_m0_r0-7:
...
Lo00_m1_r0-7:
...
.
.
LFF-3-7:
...
|
and then i'd like the array to look like
Code: |
JumpTable:
.hword **address of Lo00_m0_r0-7** @00-0-0
.hword **address of Lo00_m0_r0-7** @00-0-1
.hword **address of Lo00_m0_r0-7** @00-0-2
.hword **address of Lo00_m0_r0-7** @00-0-3
.hword **address of Lo00_m0_r0-7** @00-0-4
.hword **address of Lo00_m0_r0-7** @00-0-5
.hword **address of Lo00_m0_r0-7** @00-0-6
.hword **address of Lo00_m0_r0-7** @00-0-7
.hword **address of Lo00_m1_r0-7** @00-1-0
etc... |
how would i go about filling in the "**address of xxx**"?
thx, anton
#130330 - gladius - Sat Jun 02, 2007 9:19 am
It's just:
Code: |
.word Lo00_m0_r0-7 @ simple version
.word Lo00_m0_r0-7, Lo00_m0_r0-8, Lo00_m0_r0-9, Lo00_m0_r0-10 @ multiple per line
|
But, those addresses will be .word's by default. You might be able to do something like:
Code: |
BaseAddress:
... buncha ops ...
Lo00_m0_r0-7:
...
.hword Lo00_m0_r0-7 - BaseAddress
|
To get that method to work. But if you want to build a direct jump-table, it'd need to be .word.
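
The halfword variant boils down to storing offsets from a base and adding the base back at dispatch time. A small C model of that arithmetic, with hypothetical names and the table passed in as a plain array:

```c
#include <stdint.h>

/* Resolve a jump target from a table of 16-bit offsets. Every handler
 * must lie within 64 KB of "base" for the offset to fit in a halfword. */
static uint32_t resolve_target(uint32_t base, const uint16_t *offsets,
                               unsigned index)
{
    return base + offsets[index];
}

/* Memory used by a table with the given entry count and entry size. */
static uint32_t table_bytes(uint32_t entries, uint32_t entry_size)
{
    return entries * entry_size;
}
```

For the roughly 8000 entries mentioned above, halfword entries need about 16000 bytes instead of 32000 for full words, at the cost of one extra add per dispatch.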
#130381 - Ant6n - Sat Jun 02, 2007 7:08 pm
thanks, that's simple enough. actually i am using a little trick, trying to move the itcm to the first 64k, or adding an offset manually as you suggested to save the memory
#130567 - Ant6n - Tue Jun 05, 2007 9:17 am
ok, i have been fiddling more with the design and trying to learn more about 80186 assembly (it's really from before my time) and the ds innards. did you know the 186 has only about 14500 possible commands, plus segment overrides? think of the possibilities with some metaprogramming, if only there were about a meg or a little less of fast ram...
Anyhoo, since I don't have the real hardware yet (i am migrating from gba programming), i have more questions about the DS:
1) if i do a
ldr r0, [r1, -r2, lsl #5]
such that the effective address is actually negative, does the value wrap around and give a very high address?
2) I am thinking about putting the dtcm at FFFF8000/FFFFC000. Is the DS going to complain since its close to the bios? what happens if i actually choose F0000000 as dtcm base and mirror it all the way to the end (so that the dtcm overlaps with the bios)? Is the dtcm going to be 'in front' or 'behind' the bios?
3) gbatek says that everything is counted in 33MHz cycles, and that one achieves 66 MHz only in the tcm. What does that mean if I access vram, which supposedly has 0(/1) waitstates? is it going to take longer than a cycle to load a word (assuming no interlock)?
thx
anton
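
On question 1: ARM address arithmetic is ordinary 32-bit arithmetic, so a "negative" result wraps around modulo 2^32 to a high address. The C sketch below models the address calculation of the load above with unsigned integers; `effective_address` is a made-up helper, and whether the resulting address is actually mapped is a separate matter.

```c
#include <stdint.h>

/* Models the address of "ldr r0, [rn, -rm, lsl #shift]":
 * rn minus the shifted rm, wrapping modulo 2^32 (shift assumed < 32). */
static uint32_t effective_address(uint32_t rn, uint32_t rm, unsigned shift)
{
    return rn - (rm << shift);
}
```

For example, a base of 0x10 with an index of 1 shifted left by 5 yields 0xFFFFFFF0.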
Last edited by Ant6n on Tue Jun 05, 2007 6:09 pm; edited 1 time in total
#130572 - keldon - Tue Jun 05, 2007 10:56 am
1> maybe you can find out by seeing how the DOSBox source handles this?
#130643 - Ant6n - Wed Jun 06, 2007 2:37 am
i was actually asking for the ds behaviour, not the x86 behaviour.
people often seem to say 'just look in the source code', as if source code is a form of documentation. i find that puzzling; it takes me forever to see the structure of programs other people have written, and it's usually badly commented and badly documented; are there actually people who prefer some written program to good documentation?