gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback Machine copies. A new forum can be found here.

DS development > Some surprising timings

#158208 - Maxxie - Sat Jun 07, 2008 1:15 pm

Heya dudes,

I was doing some investigation on memory access timings, to compare emulator accuracy with the hardware here. Well, I have just run the first tests on the hardware (an old v3 fat DS) and I was somewhat surprised by the numbers.

Maybe some of you can explain to me why?

a) 16-bit read access to VRAM that is in use (as a text BG) is as fast as or faster than the same access to WRAM (with the instructions cached)

b) read access to WRAM not yet in a cache line takes much more time than accessing memory through its completely uncached mirror

The setup is:
- The ARM7 is in an endless loop, not accessing WRAM, with IRQs disabled
- The ARM9 is running the tests
- instruction and data caches are cleared before the first iteration of each test
- no cache is cleared between iterations, except where stated otherwise
- IRQ and DMA are disabled
- time is taken with a cascaded f/1 timer, polled (no IRQ)
- non-sequential accesses follow a predictable pattern: (iteration * 0x1234) & (memsize - 1)
- the time per access is averaged over 100000 iterations
- all measurements use the same framing (GetTime(); read = ...; GetTime()) - see the sketch after this list
- optimization is turned off
- none of the memcnt registers are modified
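
For illustration, a rough sketch of that measurement loop, using the libnds timer macros; this is not the actual test code, and the helper names are made up:

Code:
#include <nds.h>

static void timers_init(void)
{
    TIMER_DATA(0) = 0;
    TIMER_DATA(1) = 0;
    TIMER_CR(0) = TIMER_ENABLE | TIMER_DIV_1;    // count every 33 MHz cycle
    TIMER_CR(1) = TIMER_ENABLE | TIMER_CASCADE;  // count timer 0 overflows
}

static inline u32 get_time(void)
{
    // Note: two separate 16-bit reads can straddle a timer 0 overflow;
    // a later post in this thread mentions correcting for exactly that.
    return TIMER_DATA(0) | ((u32)TIMER_DATA(1) << 16);
}

// Average cost of one 32-bit read in timer ticks, including the fixed
// framing overhead (the "pure timer reads" baseline below).
u32 measure_read32(u32 base, u32 memsize)
{
    u32 total = 0;
    for (u32 i = 0; i < 100000; i++)
    {
        u32 offset = (i * 0x1234) & (memsize - 1);     // pattern from above
        u32 t0 = get_time();
        u32 value = *(vu32 *)(base + (offset & ~3u));  // the measured read
        u32 t1 = get_time();
        (void)value;
        total += t1 - t0;
    }
    return total / 100000;
}

timers_init() is called once before the runs; memsize has to be a power of two for the mask to work.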

Timings:
Pure timer reads:
with IC cleared each iteration: 244 cycles
without clearing: 100 cycles

WRAM (cached) reads
32 bit sequential: 108c
32 bit non sequential: 126c
16 bit sequential: 106c
16 bit non sequential: 127c
8 bit sequential: 103c
8 bit non sequential: 123c

WRAM (mirror) reads
32 bit sequential: 112c
32 bit non sequential: 113c
16 bit sequential: 112c
16 bit non sequential: 111c
8 bit sequential: 112c
8 bit non sequential: 112c

VRAM (text BG) reads
32 bit sequential: 112c
32 bit non sequential: 111c
16 bit sequential: 108c
16 bit non sequential: 105c

( http://www.speedshare.org/download.php?id=1F9B99493 if you'd like to test it yourself )

PS: On the original comparison: DeSmuME fails massively on these timings. Although I expected it to report fewer (internal) cycles per instruction, it reports ~340 cycles for any access.

#158209 - eKid - Sat Jun 07, 2008 1:34 pm

Quote:
a) 16-bit read access to VRAM that is in use (as a text BG) is as fast as or faster than the same access to WRAM (with the instructions cached)

The VRAM is a bit faster than the main RAM. If it's in use by the video hardware a small wait state may occur, though I would expect cached main RAM to be very much faster. :\

Quote:
b) read access to WRAM not yet in a cache line takes much more time than accessing memory through its completely uncached mirror

When you miss the cache (and the data is in a cached region), the cache has to load 8 words of data from that area (32 bytes being the size of each cache line). If the memory isn't cached, you may be able to access a single word faster. But if you access multiple words in the same [uncached] spot, you'll probably end up with a larger CPU load.
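
A minimal sketch of that tradeoff, assuming the memory layout used later in this thread (cached main RAM at 0x02000000, uncached mirror at 0x02400000); the function and numbers are purely illustrative:

Code:
#include <nds.h>

u32 access_pattern_demo(void)
{
    vu32 *cached   = (vu32 *)0x02000000;  // normal, cached view of main RAM
    vu32 *uncached = (vu32 *)0x02400000;  // uncached mirror of the same RAM

    // One isolated word: going through the mirror fetches just that word,
    // instead of pulling a whole 32-byte line (8 words) into the cache.
    u32 lone = uncached[1000];

    // Eight neighbouring words: the first cached read pays for the line
    // fill, the remaining seven then hit the cache.
    u32 sum = 0;
    for (int i = 0; i < 8; i++)
        sum += cached[i];

    return lone + sum;
}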

#158211 - Maxxie - Sat Jun 07, 2008 1:57 pm

eKid wrote:

Quote:
b) read access to WRAM not yet in a cache line takes much more time than accessing memory through its completely uncached mirror

When you miss the cache (and the data is in a cached region), the cache has to load 8 words of data from that area (32 bytes being the size of each cache line). If the memory isn't cached, you may be able to access a single word faster. But if you access multiple words in the same [uncached] spot, you'll probably end up with a larger CPU load.


This might well explain the "cached is not faster than VRAM" issue too. I expected the cache to operate more asynchronously (fetching the rest of the cache line while the CPU continues).

In sequential access, cache misses are issued every 32 iterations, as the cache is initially empty (flushed before the first iteration).

I will think of a more detailed test to measure pure cache access.
:edit: I have now checked different approaches; neither doing the access twice each iteration (the second measured, the first to get it into the cache) nor accessing the very same address for all iterations changed much. ~107c

Thanks for the explanation.

#158253 - Maxxie - Sun Jun 08, 2008 11:25 am

After looking at the assembly of the iterations, I noticed that even though the source code only used different constants and variable sizes and otherwise looked identical, it produced quite different code.

I have now changed this by doing the things that differ between the cases in .arm assembler, so I have full control over it.
Also, I fixed a measuring inaccuracy that was caused by possible timer wraparounds while reading the timer values.
A tare measurement is now used to correct the timings.

(The tare is measured with a dummy call to .arm code containing only "bx lr"; the read variants have an ldr/ldrh/ldrb just before this bx lr.)
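
For concreteness, the read stubs could look roughly like this (the original source is not shown here, so names and exact form are guesses; the functions are built as ARM code, e.g. with -marm):

Code:
#include <nds.h>

// Tare: just the call/return overhead, subtracted from every measurement.
__attribute__((naked)) u32 read_none(u32 addr, u32 offset)
{
    asm volatile("bx lr");
}

// Measured variants: a single load immediately before the return.
__attribute__((naked)) u32 read32(u32 addr, u32 offset)
{
    asm volatile("ldr r0, [r0, r1] \n bx lr");
}

__attribute__((naked)) u32 read16(u32 addr, u32 offset)
{
    asm volatile("ldrh r0, [r0, r1] \n bx lr");
}

__attribute__((naked)) u32 read8(u32 addr, u32 offset)
{
    asm volatile("ldrb r0, [r0, r1] \n bx lr");
}

Timing the empty stub gives the tare; subtracting it from the others leaves roughly the cost of the load itself.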

That led to MUCH more explainable timings:
Cached WRAM timings are now all 2 cycles per read (both sequential and non-sequential).

The mirror takes 10 cycles (for 32-bit access) or 9 cycles (for smaller accesses).

VRAM takes 6 cycles for 32-bit and 5 cycles for 16-bit access.

These timings are now constant across all runs, but I will investigate further whether there is any inconsistency left between measurements.

I am thinking of creating a detailed table of instruction timings on the DS, for hand-optimizing code.

#158261 - eKid - Sun Jun 08, 2008 1:10 pm

Have you seen this?
http://nocash.emubase.de/gbatek.htm#dsmemorytimings

Also, http://nocash.emubase.de/gbatek.htm#arminstructionsummary, for an overview of instruction timings.

#158262 - Maxxie - Sun Jun 08, 2008 1:22 pm

Yes, I have seen them, as well as the timings in the DDI documents from ARM.

However, I have never understood the I, S, N cycles. I really prefer plain numbers, and having a system that can compare the execution time of single instructions across the different devices. "It feels more real."

#158266 - Cearn - Sun Jun 08, 2008 4:01 pm

Maxxie wrote:
After looking at the assembly of the iterations, I noticed that even though the source code only used different constants and variable sizes and otherwise looked identical, it produced quite different code.
This could be because you had optimizations off. The assembly produced under -O0 is just awful. That said, the code you get for small copy loops with -O2/-O3 can be somewhat odd as well. This is especially true in Thumb code, where ldr, ldrh and ldrb don't have the same capabilities.
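
As an illustration of the kind of small copy loop this is about (not Cearn's code, purely illustrative):

Code:
#include <nds.h>

// At -O0, GCC typically keeps i, src and dst on the stack and reloads them
// every iteration; at -O2 this collapses into a tight load/store loop. In
// Thumb, ldrh/ldrb also offer fewer addressing modes than ldr, which changes
// what the compiler can do with the pointer arithmetic.
void copy16(u16 *dst, const u16 *src, int count)
{
    for (int i = 0; i < count; i++)
        dst[i] = src[i];
}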

eKid wrote:
Have you seen this?
http://nocash.emubase.de/gbatek.htm#dsmemorytimings

Also, http://nocash.emubase.de/gbatek.htm#arminstructionsummary, for an overview of instruction timings.

Note that the ARM instruction summary in GBATek still deals with the ARMv4 architecture. The ARM9 is an ARMv5 chip, which if I recall correctly has a different set of timings for loads and stores.

#158371 - edwdig - Mon Jun 09, 2008 8:28 pm

Got an accurate cycle count for accessing uncached WRAM?

I'm curious how much of a speed advantage there is to accessing the uncached mirror in cases where you know you're going to be missing the cache consistently.

#158373 - Maxxie - Mon Jun 09, 2008 8:45 pm

(.arm instructions used)
LDR r0,[r0,r1] takes 10 cycles on the 0x02400000-0x027FFFFF mirror
LDRH/LDRB r0,[r0,r1] takes 9 cycles on the 0x02400000-0x027FFFFF mirror

There is no difference between sequential and non-sequential access (the PSRAM is initialized as continuous/non-burst).
The same instructions on the cached mirror take 2 cycles, for 32-bit as well as 16-bit read access.
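
For edwdig's case, reaching the uncached mirror just means adding 0x00400000 to a normal main-RAM address. A hypothetical helper, assuming the default setup used in this thread (where that mirror stays uncached):

Code:
#include <nds.h>

// Alias a normal (cached) main-RAM pointer into the uncached
// 0x02400000-0x027FFFFF mirror discussed above.
static inline volatile u16 *uncached_alias16(const u16 *p)
{
    return (volatile u16 *)((u32)p + 0x00400000);
}

// Isolated reads that would miss the cache anyway: ~9 timer ticks through
// the mirror instead of a full 8-word line fill through the cached view.
u16 read_scattered(const u16 *table, u32 index)
{
    return uncached_alias16(table)[index];
}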

Cycles are measured as f/1 timer ticks, so at the 33 MHz clock.

Well yeah, uncached access is much slower than cached access, and this matches my subjective observations from when I disabled the cache completely while working on a software IPC FIFO.

:edit: I just noticed why my data cache clearing has no effect at all. ... Guess I should call an invalidate after the flush *grr*. I'll do measurements of cache misses tomorrow once I have fixed that.
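
(For reference, with the libnds cache helpers a full reset before a run could look roughly like this; whether the extra invalidate is needed depends on how the flush itself is implemented:)

Code:
#include <nds.h>

static void reset_caches(void)
{
    DC_FlushAll();       // write dirty data-cache lines back to main RAM
    DC_InvalidateAll();  // then drop all data-cache lines ("invalidate after the flush")
    IC_InvalidateAll();  // start with an empty instruction cache as well
}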