#158208 - Maxxie - Sat Jun 07, 2008 1:15 pm
Heya dudes,
I've been investigating memory access timings, to compare emulator accuracy against the real hardware. I have just run the first tests on hardware (an old v3 fat DS) and was somewhat surprised by the numbers.
Maybe some of you can explain to me why:
a) a 16 bit read from vram that is in use (as text bg) is as fast as or faster than the same access to wram (with the instructions cached)
b) a read from wram that is not yet in a cache line takes much more time than reading the same memory through its uncached mirror
The setup is:
- The ARM7 is in an endless loop that does not access wram, with IRQs disabled
- The ARM9 runs the tests
- instruction and data caches are cleared before the first iteration of each test
- no cache is cleared between iterations, unless stated otherwise
- IRQs and DMA are disabled
- time is taken with a cascaded f/1 timer, polled (no IRQ)
- non sequential accesses follow a predictable pattern: (iteration * 0x1234) & (memsize - 1)
- time is averaged per iteration over 100000 iterations
- all measurements use the same framing (GetTime();read=...;GetTime())
- optimization is turned off
- the memcnt registers are left unmodified
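For reference, the non sequential pattern and the per-iteration averaging from the setup list can be sketched in plain C as below. This is a host-side illustration, not the test program itself: the helper names `nonseq_offset` and `average_ticks` are made up here, and it assumes `memsize` is a power of two (otherwise the mask does not wrap cleanly).

```c
#include <stdint.h>

/* Hypothetical helper: offset used for the "non sequential" reads.
 * Assumes memsize is a power of two so that (memsize - 1) is a valid mask. */
uint32_t nonseq_offset(uint32_t iteration, uint32_t memsize)
{
    return (iteration * 0x1234u) & (memsize - 1u);
}

/* Hypothetical helper: average ticks per iteration, rounded to nearest.
 * total_ticks would be the sum of all (end - start) timer differences. */
uint32_t average_ticks(uint64_t total_ticks, uint32_t iterations)
{
    if (iterations == 0)
        return 0; /* avoid dividing by zero */
    return (uint32_t)((total_ticks + iterations / 2u) / iterations);
}
```

With a 4 KiB region (`memsize = 0x1000`) the first offsets are 0x000, 0x234, 0x468. Since 0x1234 is a multiple of 4, every offset stays word-aligned, so the same pattern can serve the 32, 16 and 8 bit tests.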
Timings:
Pure timer reads:
with IC cleared each iteration: 244 cycles
without clearing: 100 cycles
Wram (cached) reads:
32 bit sequential: 108c
32 bit non sequential: 126c
16 bit sequential: 106c
16 bit non sequential: 127c
8 bit sequential: 103c
8 bit non sequential: 123c
Wram (mirror) reads:
32 bit sequential: 112c
32 bit non sequential: 113c
16 bit sequential: 112c
16 bit non sequential: 111c
8 bit sequential: 112c
8 bit non sequential: 112c
Vram (textbg) reads:
32 bit sequential: 112c
32 bit non sequential: 111c
16 bit sequential: 108c
16 bit non sequential: 105c
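To make the comparison easier to read, one can subtract the timer-only figure (100 cycles without clearing the IC) from each measurement. The sketch below does just that. It assumes the framing overhead is the same constant in every test, which has not been verified here, and `net_cycles` is a name invented for this example.

```c
/* Hypothetical helper: cost of the read alone, assuming the
 * GetTime();read=...;GetTime() framing always costs `baseline` cycles. */
int net_cycles(int measured, int baseline)
{
    return measured - baseline;
}
```

Under that assumption, the cached wram 16 bit reads come out at 6 (sequential) and 27 (non sequential) cycles, the mirror at 12 and 11, and the vram at 8 and 5.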
( http://www.speedshare.org/download.php?id=1F9B99493 if you'd like to test it yourself )
PS: On the original comparison: desmume is way off on the timings here. Although I expected it to report fewer (internal) cycles per instruction, it reports ~340 cycles for any access.