#27900 - tepples - Sat Oct 23, 2004 5:41 am
Apex Audio System author James Daniels claims his mixer runs at 1.0% CPU per channel at 32 kHz mono. That seems to imply an output loop running at 5 cycles per output sample. But how the heck does it do that? Even using unsigned samples for the faster 'ldrb' addressing modes and special-casing samples that don't loop, I can't seem to get my own mixer under 10 cycles per output sample:
Code: |
; building block of inner unrolled loop takes 10 cycles
; rsamp: temporary register
; rbase: base address of all samples
; r0pos: position within sample, fixed-point
; r0frq: playback frequency, fixed-point
; r0vol: volume, fixed-point
; rmix: mix bus
ldrb rsamp, [rbase, r0pos, lsr #PITCHDEPTH] ; 6 (2 address, 3 wait, 1 data)
add r0pos, r0pos, r0frq ; 1
mla rmix, r0vol, rsamp, rNmix ; 3 (1 setup, 1 mul, 1 add)
|
Are there any obvious optimizations I'm missing?
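For reference, the per-sample step above can be sketched in C like this (not the actual IWRAM loop; the names mirror the register comments, and PITCHDEPTH is an arbitrary choice of fractional bits for illustration):

```c
#include <stdint.h>

#define PITCHDEPTH 12  /* fractional bits of the position; assumed value */

/* One channel's contribution to one output sample:
 * fetch (ldrb), advance (add), accumulate (mla). */
static inline uint32_t mix_step(const uint8_t *base, uint32_t *pos,
                                uint32_t freq, uint32_t vol, uint32_t mix)
{
    uint8_t samp = base[*pos >> PITCHDEPTH]; /* ldrb with lsr addressing */
    *pos += freq;                            /* fixed-point position advance */
    return mix + vol * samp;                 /* multiply-accumulate */
}
```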
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.
#27907 - ecurtz - Sat Oct 23, 2004 4:27 pm
Maybe JD's numbers are for the special case where the playback isn't pitch shifted (if that is the right term.) That would lead to obvious optimizations in the loading and reusing of the same sample data.
#27910 - jd - Sat Oct 23, 2004 4:58 pm
tepples wrote: |
Are there any obvious optimizations I'm missing? |
One optimisation would be to put two bytes of data in one register so you can use mla to do two multiply accumulates for the price of one:
Code: |
ldrb rsamp, [rbase, r0pos, lsr #PITCHDEPTH]
add r0pos, r0pos, r0frq
ldrb rsamp2, [rbase, r0pos, lsr #PITCHDEPTH]
add r0pos, r0pos, r0frq
add rsamp, rsamp, rsamp2, lsl #16
mla rmix, rsamp, r0vol, rNmix
|
Also, I'm pretty sure that an ldrb from ROM (assuming 3.1 timing) with this addressing mode takes 5 cycles, not 6. So in total the method above works out at 8 cycles per sample. (AAS does use this trick, but it also uses a lot of others which I'd prefer not to reveal just yet.)
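The arithmetic behind the packing trick, sketched in C (an illustration, not AAS code): with two 8-bit samples packed into one word, a single multiply produces both products at once, provided each product stays under 16 bits so the lanes don't collide:

```c
#include <stdint.h>

/* Two samples packed as s1 | (s2 << 16); one multiply yields both products:
 * packed * vol == s1*vol in the low half, s2*vol in the high half,
 * as long as each product is < 65536. */
static uint32_t dual_mac(uint8_t s1, uint8_t s2, uint32_t vol, uint32_t acc)
{
    uint32_t packed = (uint32_t)s1 | ((uint32_t)s2 << 16);
    return acc + packed * vol;
}

/* Unpack the two accumulated 16-bit lanes afterwards. */
static uint16_t lane_lo(uint32_t acc) { return (uint16_t)acc; }
static uint16_t lane_hi(uint32_t acc) { return (uint16_t)(acc >> 16); }
```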
ecurtz wrote: |
Maybe JD's numbers are for the special case where the playback isn't pitch shifted (if that is the right term.) That would lead to obvious optimizations in the loading and reusing of the same sample data. |
No, the figures are based on the average case and were calculated by playing a MOD (so they include the time taken to write to the buffer (which is in EWRAM to save IWRAM) and by the MOD player and other overheads) on actual hardware. The library is freely available if you want to check the figures.
#27911 - tepples - Sat Oct 23, 2004 5:12 pm
jd wrote: |
One optimisation would be to put two bytes of data in one register so you can use mla to do two multiply accumulates for the price of one: |
I guess I'll just have to weigh the advantages of hard panning vs. intensity panning. My current design stores volumes as 0x00100020 for L=16 R=32.
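The packed-volume layout works because one multiply spreads the sample into both channel lanes; a C sketch of the idea (again assuming sample*volume stays under 16 bits per lane):

```c
#include <stdint.h>

/* Volumes packed as (left << 16) | right, e.g. 0x00100020 for L=16, R=32.
 * One multiply-accumulate then feeds both stereo channels. */
static uint32_t mix_stereo(uint8_t samp, uint32_t packed_vol, uint32_t acc)
{
    return acc + samp * packed_vol; /* low 16 bits: samp*R, high 16: samp*L */
}
```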
Quote: |
Also, I'm pretty sure that an ldrb from ROM (assuming 3.1 timing) with this addressing mode takes 5 cycles, not 6. |
The 'ldr' instructions take 3 cycles. The wait state adds 3 more.
Quote: |
So in total the method above works out at 8 cycles per sample. (AAS does use this trick, but is also uses a lot of others which I'd prefer not to reveal just yet.) |
Do they involve self-modifying code?
Quote: |
the figures are based on the average case and were calculated by playing a MOD |
With how much channel silence?
Quote: |
The library is freely available if you want to check the figures. |
How long must the logo be displayed? And what exactly does "non-commercial" mean? Some publishers and developers have taken it to mean that I can't sell an old computer with a "non-commercial" program on the hard drive.
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.
#27912 - poslundc - Sat Oct 23, 2004 5:57 pm
tepples wrote: |
I guess I'll just have to weigh the advantages of hard panning vs. intensity panning. My current design stores volumes as 0x00100020 for L=16 R=32. |
It has been my observation that the low quality of the hardware (in my GBA anyway) makes intensity panning a wasted effort most of the time.
With every plug I've tried, there's always some signal leak between the two channels. So even if I have no sound playing in one channel I can still always hear the other channel not-so-faintly.
So, I do what James does and process two samples at once. My mixer is also only 8 channels (4 and 4 hard-panning) so I can scale rather than clip without significant volume loss. Saves a lot of cycles that way.
Dan.
#27914 - jd - Sat Oct 23, 2004 6:13 pm
Quote: |
Quote: | Also, I'm pretty sure that an ldrb from ROM (assuming 3.1 timing) with this addressing mode takes 5 cycles, not 6. |
The 'ldr' instructions take 3 cycles. The wait state adds 3 more.
|
IIRC it depends on the addressing mode. i.e.:
Code: |
ldrb rsamp, [rbase, r0pos, lsr pitchdepth] ; 6 cycles
ldrb rsamp, [rbase, r0pos, lsr #PITCHDEPTH] ; 5 cycles
|
Quote: |
Quote: | So in total the method above works out at 8 cycles per sample. (AAS does use this trick, but is also uses a lot of others which I'd prefer not to reveal just yet.) |
Do they involve self-modifying code?
|
Some of them might. :)
Quote: |
Quote: | the figures are based on the average case and were calculated by playing a MOD |
With how much channel silence?
|
None - it was taken into account when the figure was calculated. I promise I'm not trying to trick you with misleading figures.
Quote: |
How long must the logo be displayed? |
The logo is displayed using the AAS_ShowLogo() function. It displays it for 3.6 seconds (including transitions).
Quote: |
And what exactly does "non-commercial" mean? |
Anything where you're not making money from it. (Although with borderline cases (like the gbadev.org compo and shareware) it's best to check with me. (In the case of the compo, it counts as non-commercial as the returns are likely to be quite small.))
Quote: |
Some publishers and developers have taken it to mean that I can't sell an old computer with a "non-commercial" program on the hard drive. |
Well, that's daft of them and it certainly doesn't apply in this case.
#27918 - DekuTree64 - Sat Oct 23, 2004 8:17 pm
Another nice trick is to write special mixing loops for cases where the channel is playing at a lower frequency than the mixing rate. For example, playing at 8KHz when the main rate is at 16KHz, you're repeating the same sample twice and so wasting 3 cycles per sample. Then at 32KHz, you only have to load once for every 4 output samples, so even better savings.
Here's how I did it in my ldrb-add-mla mixer (although it does always use at least one cycle for the conditional load). You add the integer portion of pos onto the data pointer, and shift pos up to chop the integer portion off, so when the fractional portion overflows into the integer, it sets the carry bit. You know you'll never miss a sample, since you're always advancing less than one at a time, so you can use carry to decide whether to do the load:
Code: |
ldrb rTemp, [rData]
adds rPos, rPos, rInc
mla rDest1, rVol, rTemp, rDest1
ldrcsb rTemp, [rData, #1]!
adds rPos, rPos, rInc
mla rDest2, rVol, rTemp, rDest2
// ... repeats of the second one
// Handle the last potential overflow before returning
addcs rData, rData, #1 |
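In C terms, the carry test is just unsigned wraparound of the pre-shifted position; a sketch with hypothetical names (rereading the current byte stands in for keeping it in rTemp):

```c
#include <stdint.h>

/* pos holds only the fractional bits, shifted to the top of the 32-bit word,
 * so overflow into the integer part wraps the register: the C analogue of
 * the carry flag that adds/ldrcsb test in the ARM version. */
static uint8_t step(const uint8_t **data, uint32_t *pos, uint32_t inc)
{
    uint32_t old = *pos;
    *pos += inc;
    if (*pos < old)      /* "carry set": crossed into the next source sample */
        (*data)++;
    return **data;       /* otherwise the same byte is reused */
}
```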
A fun way to save a register is to shove rPos and rInc into the same one. Assuming PITCHDEPTH <= 16 bits, rPos will fit in the upper 16 bits of the register, and you know rInc < 1 so it only needs 16 bits too. Put inc in the bottom, pos in the top, and then
Code: |
adds rPosInc, rPosInc, rPosInc, lsl #16 |
No faster, but gives you one more register to save time elsewhere with.
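The combined register behaves like this in C (inc in the low 16 bits, the fractional position in the high 16; the carry out of the add signals that a whole source sample has elapsed):

```c
#include <stdint.h>

/* inc in bits 0-15, fractional pos in bits 16-31. Returns nonzero when the
 * position overflowed, i.e. the carry the ARM version tests with adds. */
static int advance(uint32_t *posinc)
{
    uint32_t before = *posinc;
    *posinc += *posinc << 16;    /* adds rPosInc, rPosInc, rPosInc, lsl #16 */
    return *posinc < before;     /* unsigned wrap == carry set */
}
```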
And for the record, I always read that ldrb takes 3 cycles+waits, so it would be 6, not 5. I know data processing instructions take one extra cycle for using a shift by register, but I don't think load does. Even if it did, that should bump it up to 7 cycles, not down to 5.
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku
#27928 - tepples - Sun Oct 24, 2004 3:50 am
Thanks DekuTree64 for the conditional load trick. It begins to explain the less-than-expected difference in CPU utilization between the 16 kHz and 32 kHz modes of AAS. Would it also help to split the mla into a mul and add (no cycle penalty) and then make the mul conditional?
The rPos/rInc combination thing was a bit tricky to figure out, as I had planned on using 22.10-bit offsets (#define PITCHDEPTH 10) with samples inside a 4 MB window. I'd need an rBase per voice rather than per mix-run, but it would free up one register.
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.
#27932 - DekuTree64 - Sun Oct 24, 2004 9:20 am
Nice idea splitting up the mul. I just ran some speed tests on several instructions, and here are the results:
mul: 2 cycles
mla: 3 cycles
mul, add: 3 cycles
mul, add+shift by immediate: 3 cycles
mul, add+shift by register: 4 cycles
ldrb, reg, [reg] (from IWRAM): 3 cycles
ldrb reg, [reg, reg] (from IWRAM): 3 cycles
ldrb reg, [reg] (from ROM): 6 cycles
ldrh reg, [reg] (from ROM) 6 cycles
ldr reg, [reg] (from IWRAM): 3 cycles
ldr reg, [reg] (from ROM): 8 cycles
All code is in IWRAM, ROM is set to 3:1 waitstates. This confirms that the docs are accurate, and that mla is only useful to save needing a temporary register to store the result before adding, and to save 4 bytes of code.
Also, I found out that ldrb reg, [reg, reg, lsr reg] is not a valid instruction, which is why I'd never considered the timing of it.
Another thing, sequential ROM access ONLY counts on ldr. I tried ldrb, incrementing by 0, 1, 2, 4, and 300, and it always took 6 cycles.
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku
#27934 - poslundc - Sun Oct 24, 2004 2:17 pm
DekuTree64 wrote: |
Also, I found out that ldrb reg, [reg, reg, lsr reg] is not a valid instruction, which is why I'd never considered the timing of it. |
You should be able to do a logical shift by a constant amount; you can't shift by a register, though.
Quote: |
Another thing, sequential ROM access ONLY counts on ldr. I tried ldrb, incrementing by 0, 1, 2, 4, and 300, and it always took 6 cycles. |
In these tests, where was the code running from? Was it IWRAM?
Dan.
#27936 - tepples - Sun Oct 24, 2004 3:17 pm
Quote: |
In these tests, where was the code running from? Was it IWRAM? |
DekuTree64: "All code is in IWRAM, ROM is set to 3:1 waitstates."
Anyway, 'mul rdest, ra, rb' or 'mla rdest, ra, rb, rplus' can take different numbers of cycles depending on the value of the rb operand, so take that into account in your timing. Luckily, Martin Korth's GBATEK points out that mul and mla with values of rb in the range -256 through 255 finish in the fastest possible time and that both signed and unsigned sample values are always in that range.
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.
#27945 - DekuTree64 - Sun Oct 24, 2004 7:05 pm
Ok, tried out those cases just to be sure, and mla is still one slower (and does take more cycles depending on magnitude of rb), and ldrb reg, [reg, reg, lsr #imm] still takes 3 cycles (from IWRAM).
Oh, and here's another nice assembly trick for mixing chip sounds that loop a lot (which also takes 2 regs, data and pos). Rather than comparing data+pos to the end of the sample, and then subtracting the loop length, you set data to the end of the sample, and recalculate pos to be a negative offset from there. Then when pos becomes >= 0, you know you've hit/passed the end, and need to subtract the loop length.
If I remember right, it was
Code: |
newPos = ((data-end)<<pitchdepth) + pos
newData = end |
Provided that you're close enough to ending that it won't overflow the integer portion of your pos, and that the loop length is short enough not to overflow it when you subtract it:
Code: |
adds rPos, rPos, rInc
subpl rPos, rPos, rLoopLength, lsl #pitchdepth |
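A C sketch of the same loop-end bookkeeping (hypothetical names; pos counts up toward zero as a negative fixed-point offset from the sample's end):

```c
#include <stdint.h>

#define PITCHDEPTH 10   /* fractional bits; assumed for illustration */

/* pos is a signed fixed-point offset from the END of the sample, so pos >= 0
 * means we hit/passed the loop end and must wrap back by the loop length. */
static int32_t advance_looped(int32_t pos, int32_t inc, int32_t loop_len)
{
    pos += inc;
    if (pos >= 0)                      /* the "pl" condition in subpl */
        pos -= loop_len << PITCHDEPTH;
    return pos;
}
```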
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku
#27951 - bertsnks - Sun Oct 24, 2004 8:55 pm
tepples wrote: |
Even using unsigned samples for the faster 'ldrb' addressing modes... <snip> |
Wait, when using unsigned samples, how do you get a signed sample back in the end?
One way I can think of is subtracting 128*nr_of_channels_mixed from the unsigned result.
If doing so, doesn't that mean that you are obligated to mix "silence" as well (add 128)? Whereas if you are using signed samples, you can just skip it?
#27953 - tepples - Sun Oct 24, 2004 11:27 pm
bertsnks wrote: |
tepples wrote: | Even using unsigned samples for the faster 'ldrb' addressing modes... <snip> |
Wait, when using unsigned samples, how do you get a signed sample back in the end?
One way I can think of is subtracting 128*nr_of_channels_mixed from the unsigned result. |
Actually, one must subtract 128*sum_of_volumes.
Quote: |
doesn't that mean that you are obligated to mix "silence" as well (add 128)? Whereas if you are using signed samples, you can just skip it? |
Yes, the mixer does have to mix silence for the remainder of stopping samples, but then again, I'd guess that a mixer optimized for unsigned samples mixes non-looping samples first and then uses a slower special-case path for looping or stopping samples.
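Numerically, the correction works because each unsigned sample is the signed one plus 128, so the accumulated bias is 128 times the sum of the volumes; a C sketch:

```c
#include <stdint.h>

/* Mix unsigned samples, then remove the DC bias of 128 per unit of volume.
 * Result equals the sum of signed_sample * volume over all channels. */
static int32_t mix_unsigned(const uint8_t *samples, const uint8_t *vols, int n)
{
    uint32_t mix = 0, vol_sum = 0;
    for (int i = 0; i < n; i++) {
        mix += (uint32_t)samples[i] * vols[i]; /* unsigned ldrb + mla path */
        vol_sum += vols[i];
    }
    return (int32_t)mix - 128 * (int32_t)vol_sum;
}
```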
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.
#27954 - jd - Sun Oct 24, 2004 11:47 pm
DekuTree64 wrote: |
Also, I found out that ldrb reg, [reg, reg, lsr reg] is not a valid instruction, which is why I'd never considered the timing of it.
|
Sorry, my mistake. I was thinking of bit-shifting by a register in a data processing instruction.
#27955 - DekuTree64 - Mon Oct 25, 2004 12:17 am
Ok, more speed testing, giving different results than I'd expected. Test code (ARM):
Code: |
mov r0, #65536
loop:
subs r0, r0, #1
bne loop |
Cycles here are full time / 65536.
Running from IWRAM: 4 cycles
Running from ROM: 18 cycles
Running from VRAM: 8 cycles
Similar loop in thumb:
IWRAM: 4 cycles
ROM: 10 cycles
VRAM: 4 cycles
So, VRAM is indeed 0 waitstate. However, I then placed the ARM code in a part of the screen base block I was using to display the output text and waited until VCOUNT was 0, and it took 0x8798d cycles total, so a little less than 8.5 per run. This confirms that conflicts between the graphics hardware and CPU will cause a stall.
After that, I tried running that same code during VBlank, then from an unused portion of VRAM that was never getting accessed, and it still stalled both times.
Next I put it at 0x6010000 (sprite VRAM), and it DIDN'T stall. I had sprites turned off though, so I turned them on, and then it took 0x62288 cycles, so a little bit of stalling but not as much. Most likely because all sprites were 8x8 in the top left corner, so after running the first 8 lines of the screen, it didn't stall anymore. Then tried with all sprites' SD flag set, no stall.
Conclusion: VRAM is the next best thing to IWRAM for code, but can be unpredictable.
Also, it seems there is some difference in how BG and sprite VRAM are accessed, which I hadn't expected since mode4 uses part of sprite VRAM on the screen.
Lastly, the SD flag does indeed stop all processing of a sprite.
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku
#27956 - poslundc - Mon Oct 25, 2004 1:03 am
Fascinating stuff, DekuTree64. Thanks for taking the time to actually answer these questions once and for all.
Dan (REALLY disappointed about the sequential loads from ROM...).
#27986 - Gene Ostrowski - Mon Oct 25, 2004 7:57 pm
A quick question--on code such as this that was posted in an earlier message:
Code: |
ldrb rsamp, [rbase, r0pos, lsr #PITCHDEPTH]
add r0pos, r0pos, r0frq
ldrb rsamp2, [rbase, r0pos, lsr #PITCHDEPTH]
add r0pos, r0pos, r0frq
add rsamp, rsamp, rsamp2, lsl #16
mla rmix, rsamp, r0vol, rNmix
|
... don't the sequential references to r0pos and rsamp cause pipeline stalls, thus adding cycles to the counts? Or is there something about the ARM architecture that addresses this and updates the register after it's loaded/modified?
A couple of thoughts about the optimization:
Has anyone tried using 0.32 fixed point and ignoring registers for POS and INC and just tracking INC?
Code: |
...
adds rInc, rInc, rFreq
ldrcsb rSamp, [rData], #1   @ post-indexed auto-increment
mov rMix, rSamp
adds rInc, rInc, rFreq
ldrcsb rSamp, [rData], #1
mov rMix2, rSamp
add rMix, rMix, rMix2, lsl #16
mla rMix, rSamp, rVol, ...
|
If we impose a restriction that all sample rates must be less than the mixing rate, we'll always have an INC < 1. If you don't want that restriction, check for the case >= 1 before mixing, and use a different mixer that also adds the integer portion with carry to rData.
Also, it seems that most of the effort goes into avoiding the six-cycle data load. Has anyone tried loading 32 bits of sample data into a temp register and bit-shifting 8 bits out at a time for the sample? Combined with the logic to avoid a load altogether if there's no overflow:
Code: |
load 32 bits from ROM into temp
add offset
if offset overflows // time for new sample
if temp==0 // time for next 4 bytes
load 32 bits into temp
else
sample=<shift 8 bits out of temp>
else
sample remains the same
process sample
|
For this to work, we'd need to impose a restriction that a sample could not contain a high-byte word-aligned zero. (Not a big restriction, though, as the sample could be written to ROM with a 1 in this byte instead of a zero, which the ear probably wouldn't detect anyway).
Even with a test fail, a compare, and a shift to load the new sample, we should save a few cycles for every 4 samples, where we'd normally need a guaranteed 6-cycle load each sample. I.E. we'd average less than 6-cycles for each ROM sample load.
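A sketch of the word-caching scheme in C (hypothetical names), with the stated restriction that the top byte of every source word is nonzero, so an empty shift register unambiguously means "refill":

```c
#include <stdint.h>

/* Shift-register sample reader: one 32-bit ROM load per four bytes consumed.
 * Because the top byte of each source word is required to be nonzero,
 * cache == 0 can only mean "all four bytes consumed". */
static uint8_t next_byte(const uint8_t **src, uint32_t *cache)
{
    if (*cache == 0) {                 /* refill: one word load per 4 samples */
        const uint8_t *p = *src;
        *cache = (uint32_t)p[0] | (uint32_t)p[1] << 8
               | (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
        *src += 4;
    }
    uint8_t s = (uint8_t)*cache;       /* shift 8 bits out for the sample */
    *cache >>= 8;
    return s;
}
```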
Finally:
One more drastic thought is to resample the source at "compile" time to match (i.e. be a multiple of) the desired mixing frequency. For example, if we know that we will be mixing at 26758 hz, resample all the samples to 8919 hz and store them in ROM at that freq when the app is compiled, instead of their original frequency. Then, each sample would be mixed (in this case) three times per destination byte, thus avoiding the increment/overflow check altogether. I doubt most folks modify the master mix rate at runtime anyway, so why bother to resample a 8535 hz sample into a 26758 hz sample every single time it's used... This might not help much for mod samples that need adjusted frequencies, but for static soundfx mixing it should help.
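With the source pre-resampled to an integer divisor of the mix rate, the inner loop needs no fractional position at all; a sketch for a 1:3 ratio (illustrative names, not anyone's actual mixer):

```c
#include <stdint.h>

/* Source stored at exactly 1/3 of the mix rate: each stored byte feeds three
 * output samples, with no fixed-point increment or overflow test at all. */
static void mix_ratio3(const uint8_t *src, uint32_t vol,
                       uint32_t *out, int out_samples)
{
    for (int i = 0; i < out_samples; i += 3) {
        uint32_t s = *src++ * vol;  /* one load, one multiply per 3 outputs */
        out[i]     += s;
        out[i + 1] += s;
        out[i + 2] += s;
    }
}
```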
_________________
------------------
Gene Ostrowski
#27990 - poslundc - Mon Oct 25, 2004 8:24 pm
Gene Ostrowski wrote: |
A quick question--on code such as this that was posted in an earlier message:
Code: |
ldrb rsamp, [rbase, r0pos, lsr #PITCHDEPTH]
add r0pos, r0pos, r0frq
ldrb rsamp2, [rbase, r0pos, lsr #PITCHDEPTH]
add r0pos, r0pos, r0frq
add rsamp, rsamp, rsamp2, lsl #16
mla rmix, rsamp, r0vol, rNmix
|
... don't the sequential references to r0pos and rsamp cause pipeline stalls, thus adding cycles to the counts? Or is there something about the ARM architecture that addresses this and updates the register after it's loaded/modified? |
I'm not certain which sequential references in particular you're referring to. But the pipeline is fetch-decode-execute, so the fetch and decode stages don't require the results of the execute to be determined unless it's a branch statement, which causes a pipeline flush. (Short answer: no.)
Quote: |
Has anyone tried using 0.32 fixed point and ignorning registers for POS and INC and just tracking INC? |
I think I did this with my mixer, although my memory is a bit fuzzy. There is a bit of a tradeoff, though, because I was also using the position-tracker register as my pointer to memory, so it meant an extra addition statement as well (to place it in ROM) that cancelled out most of the advantage.
Quote: |
Also, it seems that most of the effort is trying to avoid the six cycle data load. Has anyone tried to load 32 bits of sample data into a temp register and bit-shifting 8 bits out at a time for the sample?
...
For this to work, we'd need to impose a restriction that a sample could not contain a high-byte word-aligned zero. (Not a big restriction, though, as the sample could be written to ROM with a 1 in this byte instead of a zero, which the ear probably wouldn't detect anyway). |
I considered doing precisely this but didn't like the idea of having non-zero values. In hindsight I probably should've bit the bullet and done it for the sake of the speed increase.
Quote: |
One more drastic thought is to resample the source at "compile" time to match (i.e. be a multiple of) the desired mixing frequency. For example, if we know that we will be mixing at 26758 hz, resample all the samples to 8919 hz and store them in ROM at that freq when the app is compiled, instead of their original frequency. Then, each sample would be mixed (in this case) three times per destination byte, thus avoiding the increment/overflow check altogether. I doubt most folks modify the master mix rate at runtime anyway, so why bother to resample a 8535 hz sample into a 26758 hz sample every single time it's used... This might not help much for mod samples that need adjusted frequencies, but for static soundfx mixing it should help. |
Sound effects = cool idea
Sampled music = as you say...
Dan.
#28011 - Gene Ostrowski - Tue Oct 26, 2004 4:14 am
Quote: |
I'm not certain which sequential references in particular you're referring to. But the pipeline is fetch-decode-execute, so the fetch and decode stages don't require the results of the execute to be determined unless it's a branch statement, which causes a pipeline flush. (Short answer: no.) |
I'm specifically referring to instruction such as:
add r0pos, r0pos, r0frq
ldrb rsamp2, [rbase, r0pos, lsr #PITCHDEPTH]
The second instruction references a register (r0pos) that was loaded and modified during the first. Depending on how the pipeline is constructed, by the time the "add" is executed, the next instruction has already been fetched and has constructed the address using the "old" r0pos value. To keep this from happening, many pipelines will force a stall (but not a flush) in order to ensure that the writeback for the 1st instruction updates the register prior to its use in the address calculation of the 2nd-- thus at least one lost cycle. Some pipelines have fancy built-in writeback mechanisms to combat this problem, but some don't, and even those that do only do this under certain situations.
I'm wondering how the GBA's ARM is constructed. I did some searching back a while ago and didn't come across any documentation that went into the detail of the pipeline construction.
---
Another thought, assuming you have a little bit of memory to spare, is to copy frequently used sounds or instruments (small ones) to fast RAM and mix them from there instead of ROM. Many small samples that are used frequently (spaceship shooting, engine drone, explosions) could fit into a small (maybe 16K or 32K) buffer. This could help save a few scanlines if there are always a few of these going constantly, or could allow a few additional voices to your mixer without increasing CPU usage.
_________________
------------------
Gene Ostrowski
#28015 - tepples - Tue Oct 26, 2004 4:38 am
Computer Organization and Design, a textbook about the MIPS architecture, describes "data forwarding". A multiplexer at each pipeline stage after register-fetch compares the source register numbers to future stages' destination register numbers and pulls the data back if necessary. Otherwise, every data processing instruction would have delay slots (compare those of MIPS load and branch instructions) during which the old value would still be in effect, which would introduce unacceptable latency.
In practice, based on what I've seen of disassemblies of programs that run on both GBA and GB Player, ARM7 doesn't seem to have these delay slots, which means it must either perform writeback before register-fetch or have data forwarding.
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.
#28017 - DekuTree64 - Tue Oct 26, 2004 5:03 am
The GBA's ARM7 will never have any stalls due to instruction ordering. The second and third of the 3 cycles used by a load are actually sort of a 'required' stall.
Regular data processing instructions (add, mov, orr, etc.) always take 1 cycle to do their entire job, which is pretty impressive considering how many steps are involved. Check condition, fetch registers, run through shifter, operate, set CPU flags if needed, write back to a register. The amount of work you can get done in a single cycle is what makes writing clever ARM assembly so much fun.
As for sample caching: you only HAVE 32K of fast RAM, but if you're not using it for anything else, you could use half of it as a sample cache. I doubt you'd save even 1% of your total CPU on average though. Much better to fill that space with code. Just do a little profiling and shove in whatever functions your game spends the most time in.
I too have pondered a mixer based on loading whole words, but also decided not to. It may really be worth it though. I'll try writing an inner loop for it and see if it would work.
EDIT: I think Tepples is right that it does use data forwarding to get the result of one instruction into the next without having to wait for the register to write and then read back.
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku
Last edited by DekuTree64 on Tue Oct 26, 2004 5:23 am; edited 1 time in total
#28018 - Gene Ostrowski - Tue Oct 26, 2004 5:12 am
Quote: |
In practice, based on what I've seen of disassemblies of programs that run on both GBA and GB Player, ARM7 doesn't seem to have these delay slots, which means it must either perform writeback before register-fetch or have data forwarding. |
Unfortunately, any pipeline delay wouldn't ever show up in a disassembly, it would just show up in a performance timing test.
I suppose it would be fairly easy to test: just throw a NOP in between two instructions that reference a register/memory location that is altered. If the test runs at the same speed, then the CPU was already putting a NOP into the pipeline for you and writebacks and/or data forwards were not being performed.
Adding these capabilities to the pipeline is complex and adds a lot of circuitry to an architecture. I'd be surprised if the ARM has full support for all cases, all instructions, at any point in the pipeline. It would be interesting to find out though...
_________________
------------------
Gene Ostrowski
#28034 - poslundc - Tue Oct 26, 2004 2:01 pm
The ARM7TDMI specs I have on my computer at home cover the instruction timing and pipeline performance; they used to be available off of ARM's website but the doc formats seem to have changed and I can't find a link to the same document. In any case, the specs are very clear on what the timing for each instruction is and what causes a pipeline stall.
Since the execution of instructions happens sequentially and the results of the previous instruction are guaranteed to be calculated before the next instruction executes, it's reasonable to assume that some kind of data forwarding is taking place if registers are normally loaded in the decode stage.
Dan.
#28044 - Gene Ostrowski - Tue Oct 26, 2004 3:42 pm
Quote: |
And sample caching, you only HAVE 32K of fast RAM, but if you're not using it for anything else, you could use half of it as a sample cache. I doubt you'd save even 1% of your total CPU on average though. Much better to fill that space with code. Just do a little profiling and shove in whatever functions your game spends the most time in. |
I was referring to the 256K bank of RAM vs the slow-to-access ROM.
Quote: |
The ARM7TDMI specs I have on my computer at home cover the instruction timing and pipeline performance; they used to be available off of ARM's website but the doc formats seem to have changed and I can't find a link to the same document. In any case, the specs are very clear on what the timing for each instruction is and what causes a pipeline stall. |
Cool, if your docs address the pipeline performance then it seems to be settled. This is good to know. My ARM docs discuss just about everything except pipeline performance, so I've always wondered. And yeah, I searched for a while on ARM's site and couldn't find the docs either, though I thought they would be easily located.
_________________
------------------
Gene Ostrowski
#28049 - tepples - Tue Oct 26, 2004 4:12 pm
Gene Ostrowski wrote: |
Unfortunately, any pipeline delay wouldn't ever show up in a disassembly |
It would if instructions had delay slots, as load and branch do on MIPS.
Quote: |
it would just show up in a performance timing test. |
True, but in tests I've done based on switching the order of subsequent instructions after data processing instructions, it hasn't shown up there either.
Quote: |
Adding these capabilities to the pipeline is complex and adds a lot of circuitry to an architecture. |
Have you read CO&D? Data forwarding is not that hard.
Quote: |
I'd be surprised if the ARM has full support for all cases, all instructions, at any point in the pipeline. It would be interesting to find out though... |
Well happy birthday!
Quote: |
I was referring to the 256K bank of RAM vs the slow-to-access ROM. |
EWRAM takes 2 wait states for a random 8-bit or 16-bit access. ROM takes 3 wait states. It's not that much faster. Of course, one could cache samples in an unused area of VRAM (the second fastest RAM in the GBA, with about half a wait state per read).
Anyway, it's likely that ARM doesn't publish certain documents on the web site because it wants you to buy them in dead-tree form.
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.
#28097 - Gene Ostrowski - Tue Oct 26, 2004 9:57 pm
This is interesting. How extensive was your testing on switching instruction order?
I ask this because a while back I was writing a Mode 3 dual playfield scroller and was hand-coding it in ASM to get the best out of it. I was pulling the background layer from ROM and the (transparent) foreground layer from RAM. It was running at 28-30FPS, including my game logic, which I thought was pretty good, but wanted more.
I had all along been assuming that there were pipeline stalls due to instruction ordering issues, and started playing around with instruction order. There were many cases where the order actually changed the performance substantially. In some cases, I'd see a decrease in performance where I was expecting an increase, and an increase where I wasn't expecting to see anything. I was never able to nail down exactly what the "stalls" were caused by, but I was sure something was going on because of my performance data.
I wasn't sure if it was due to the timing of memory accesses, since I was pulling from both ROM and RAM (I also had ROM prefetch enabled), or if it was actual instruction ordering that was causing things to be different.
Question: were your instruction orderings more geared towards testing register-based instructions, or load/store instructions?
_________________
------------------
Gene Ostrowski
#28098 - poslundc - Tue Oct 26, 2004 10:04 pm
Gene Ostrowski wrote: |
I wasn't sure if it was due to the timing of memory accesses, since I was pulling from both ROM and RAM (I also had ROM prefetch enabled), or if it was actual instruction ordering that was causing things to be different. |
Where was your code running from? Prefetch apparently only affects the loading of executed code running from ROM, not the loading of generic data.
Switching around your load statements with prefetch enabled could have a huge impact on the speed of your code if it's running from ROM. Running from RAM, I'm inclined to think not so much.
Dan.
#28103 - Gene Ostrowski - Tue Oct 26, 2004 10:27 pm
All the code was running from IWRAM.
"Prefetch apparently only affects the loading of executed code running from ROM, not the loading of generic data. "
You are implying that the CPU can make the distinction between data that I am loading to execute vs. data I am loading to just load? Unless you are referring to the program-counter-based "next instruction"-type fetch, I don't see how the CPU will know what I plan on doing with the data.
Or is there an instruction fetch bus or a signal to the address decoder that tells it the destination of the loaded data?
I thought the architecture prefetched the next ROM memory location based upon the last fetched address, regardless of whether it was an instruction fetch or a register load.
_________________
------------------
Gene Ostrowski
#28108 - poslundc - Tue Oct 26, 2004 10:55 pm
Gene Ostrowski wrote: |
"Prefetch apparently only affects the loading of executed code running from ROM, not the loading of generic data. "
You are implying that the CPU can make the discinction about data that I am loading to execute vs. data I am loading to just load? Unless you are referring to the program-counter-based "next instruction"-type fetch, I don't see how the CPU will know what I plan on doing with the data.
Or is there an instruction fetch bus or a signal to the address decoder that tells it the destination of the loaded data?
I thought the architecture prefetched the next ROM memory location based upon the last fetched address, regardless of whether it was an instruction fetch or a register load. |
I have no idea; I'm just parroting what others on this forum have investigated, hence "apparently". If you search the forum you might find the threads in question.
If I were to make an educated guess (it's been years since I've studied the MIPS architecture), I'd say it's not unreasonable that the instruction-processor would have its own mechanism for caching of instructions.
Dan.
#28124 - tepples - Wed Oct 27, 2004 1:56 am
Gene Ostrowski wrote: |
Or is there an instruction fetch bus or a signal to the address decoder that tells it the destination of the loaded data? |
Yes, and the signal is called 'PROT[0]'.
The MOS Technology 6502 datasheet specifies a signal 'SYNC' distinguishing the first byte of an instruction fetch from any other read; this was initially designed to allow an external debugger to clock the CPU one instruction at a time. The WDC 65C816 datasheet (PDF) expands on this with two signals 'VDA' and 'VPA' used to distinguish among the first byte of an instruction fetch, subsequent bytes of an instruction fetch, data fetches, and internal operations.
The ARM architecture was designed by fans of the 6502 series; the ARM7TDMI datasheet specifies an analogous signal 'PROT[0]' distinguishing instruction fetches from data fetches, apparently designed to let an external memory controller implement W^X memory protection but also useful for implementing instruction prefetch.
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.