gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback machine copies. A new forum can be found here.

Hardware > ROM prefetch question

#1775 - col - Wed Jan 22, 2003 11:08 pm

When the ROM prefetch is switched on:

If I'm running code in IWRAM and accessing sequential data in ROM, will prefetch work even if the ROM reads are separated by instructions in IWRAM?

cheers

col

#1799 - ampz - Thu Jan 23, 2003 8:56 am

I'd say so, yes.
Makes little sense otherwise...

Oh, always do sequential ROM accesses with an _increasing_ address pointer. If you use a decreasing pointer you'll get the worst-case access time for every read.

#2127 - Paradroid - Wed Jan 29, 2003 10:46 pm

...sorry, but it won't go faster...

If you're reading data from ROM the prefetch setting won't make any difference. Only CPU instructions go through the prefetch. If you've got your code in WRAM then ROM data accesses will be the same...

#2129 - Splam - Wed Jan 29, 2003 11:06 pm

This from the Nintendo manual

When the Prefetch Buffer Flag is enabled and there is some free space, the Prefetch Buffer takes control of the Game Pak Bus during the time when the CPU is not using it, and reads Game Pak ROM data repeatedly. When the CPU tries to read instructions from the Game Pak and if it hits the Prefetch Buffer, the fetch is completed with no wait in respect to the CPU. If there is no hit, the fetch is done from the Game Pak ROM and there is a wait based on the set wait state.

If the Prefetch Buffer Flag is disabled, the fetch is done from the Game Pak ROM. There is a wait based on the wait state associated with the fetch instruction to the Game Pak ROM in respect to the CPU.

#2133 - tepples - Wed Jan 29, 2003 11:26 pm

Splam wrote:
This from the Nintendo manual

Isn't quoting Nintendo manuals an NDA violation?
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.

#2134 - ampz - Wed Jan 29, 2003 11:29 pm

Paradroid wrote:
...sorry, but it won't go faster...

If you're reading data from ROM the prefetch setting won't make any difference. Only CPU instructions go through the prefetch. If you've got your code in WRAM then ROM data accesses will be the same...


I think you are confusing the GBA cart bus prefetch buffer with the ARM7 pipeline fetch stage.
That's two totally different prefetch buffers.

As Splam so nicely quoted from the (illegal?) doc, it will speed up any sequential access from the cart bus.

EDIT
tepples: The unfortunate fact is that most of the info available probably comes from that doc from the beginning anyway. The dangerous thing is that there is no way of knowing where the info comes from..
I'm at least happy to know that the hardware info I have used to produce GBA cart designs comes from a 100% legal session with a logic analyser. :)

EDIT again
Oh, does anyone know how deep the prefetch buffer is? Some logic analyser data I have suggests it might be 4 words deep (4 * 16bit).

#2137 - Splam - Wed Jan 29, 2003 11:53 pm

tepples wrote:
Splam wrote:
This from the Nintendo manual

Isn't quoting Nintendo manuals an NDA violation?


Who said I signed an NDA? :P

#2152 - tepples - Thu Jan 30, 2003 5:40 am

Splam wrote:
Who said I signed an NDA?

Then how'd you come upon the manual?

I work only from the CowBite spec.

#2154 - Costis - Thu Jan 30, 2003 7:44 am

Hi,

Yes, we'd rather you don't post from any illegal\official documents or other stuff here, please. Safety and moral issues involved.

Costis

#2156 - ampz - Thu Jan 30, 2003 9:05 am

tepples wrote:
Splam wrote:
Who said I signed an NDA?

Then how'd you come upon the manual?
I work only from the CowBite spec.

Where do you think the cowbite spec. comes from, originally?

#2159 - Paradroid - Thu Jan 30, 2003 10:12 am

Ampz, Col's question was if the ROM prefetch is on will it help with sequential access when code is running in WRAM - and according to the Nintendo docs (above) it won't, as it is a CPU cache (of 3x16-bit words). The section just below the bit Splam copied explains the wait states for the ROM, and it's this that helps with data access (and naturally instructions as well, as it's a general ROM setting).

What you need is to set the wait-state settings for the ROM as the recommended 3:1:1:1 (ie 3 waits for the first access and then 1 wait for sequential accesses after that). By default it reads as 4:2:4:8 - your whole code goes a lot faster when this is enabled correctly. And in the games I wrote I never found any harm setting the ROM prefetch to on.

#2180 - tepples - Thu Jan 30, 2003 4:24 pm

ampz wrote:
Where do you think the cowbite spec. comes from, originally?

I'd assume it came from the same place the publicly available NES documents came from: by running tests on the hardware and by documenting what you did to get dumped ROMs emulated.

#2183 - col - Thu Jan 30, 2003 5:26 pm

Paradroid wrote:
Ampz, Col's question was if the ROM prefetch is on will it help with sequential access when code is running in WRAM...


er.. actually Col's original question was:

If the code is running in/from IWRAM, and is _reading_ sequential _data_ from ROM, will those ROM accesses be able to benefit from prefetch even though they are separated by chunks of IWRAM code?

Another question:
As (I now gather) the prefetch buffer is 8 bytes, can repeated reads from the *same* address benefit from prefetch, and can I skip a word or 2 words and still benefit from prefetch? (And how long IS a piece of string?!)

Is it better (pre-fetch wise) to put rom reads together, or is it better to space them out:

@ r0 = address of the data in ROM
ldr r1, [r0]
ldr r2, [r0, #4]
ldr r3, [r0, #8]

or

ldr r1, [r0]
some iwram code
...
ldr r2, [r0, #4]
some more iwram code
...
ldr r3, [r0, #8]
...


cheers

col.

#2185 - Paradroid - Thu Jan 30, 2003 6:18 pm

Spacing WRAM instructions out (as in your example) will be faster if you set register 0x04000204 to 0x14 - one instruction spacing (ie 1 cycle) should be enough. The wait states work with sequential accesses so if you skip 2 bytes it's the same as random access - which is a 3 cycle penalty.
Not sure on the prefetch buffer being 8 bytes - there is a CPU instruction prefetch from ROM of 3 instructions but if your code is in WRAM and you're reading data from ROM this won't get used.
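
For reference, the value 0x14 mentioned above can be built up field by field. This is only a sketch: the register address and the value 0x14 come from the post, but the bit positions, the `waitcnt_ws0` helper and the macro names are assumptions taken from community documentation rather than from this thread.

```c
#include <stdint.h>

/* Address quoted in the post above; only meaningful on GBA hardware. */
#define REG_WAITCNT_ADDR 0x04000204u

/* ASSUMPTION: field positions as described in community documentation,
   not stated anywhere in this thread. */
#define WS0_FIRST_SHIFT 2            /* bits 2-3: first (non-sequential) access */
#define WS0_SECOND_BIT  (1u << 4)    /* bit 4: sequential access */
#define PREFETCH_BIT    (1u << 14)   /* bit 14: prefetch buffer enable */

/* Hypothetical helper: build a wait-control value for ROM wait-state region 0.
   first_waits must be 4, 3, 2 or 8; second_waits must be 2 or 1.
   Returns 0xFFFF if either argument is not a selectable setting. */
static uint16_t waitcnt_ws0(int first_waits, int second_waits, int prefetch)
{
    static const int first_table[4] = { 4, 3, 2, 8 };
    uint16_t value = 0;
    int i;
    int found = 0;

    for (i = 0; i < 4; i++) {
        if (first_table[i] == first_waits) {
            value |= (uint16_t)(i << WS0_FIRST_SHIFT);
            found = 1;
        }
    }
    if (!found || (second_waits != 1 && second_waits != 2))
        return 0xFFFF;

    if (second_waits == 1)
        value |= WS0_SECOND_BIT;
    if (prefetch)
        value |= PREFETCH_BIT;
    return value;
}
```

Under those assumptions `waitcnt_ws0(3, 1, 0)` gives 0x14, and passing a non-zero `prefetch` sets the extra enable bit. On hardware the result would be written as a 16-bit value to the address above.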

Easiest thing is to try the variations out on real hardware and time them...?

#2190 - FluBBa - Thu Jan 30, 2003 10:48 pm

Just a little question..
Have they really made the prefetch buffer only buffer data when the CPU fetches instructions, but not for normal data reads? I'm not saying it's hard to do or stupid, but I would really like to hear one good reason to do so. And does anybody know how the hardware tells an instruction fetch from a data read from memory?
Or am I totally out of my mind here ;-)

/FluBBa

#2194 - ampz - Fri Jan 31, 2003 1:01 am

Don't confuse the CPU pipeline fetch stage with the ROM prefetch buffer. It's two totally different things.

Yes, the ROM prefetch will help all sequential ROM reads.

#2218 - Maddox - Fri Jan 31, 2003 6:04 am

ampz,
Sorry. The prefetch flag only affects instructions.

When the memory subsystem has time and access to the bus, it prefetches INSTRUCTIONS. "Prefetch" is even a word from the computer science lexicon and it refers to the pipeline stage of a processor. You see, ampz, processors used to just "fetch" instructions, but now they "prefetch" before they need it 'cause it's better that way. (This would have been easy for nintendo to add to the system as part of the ARM cpu core itself.) This idea is further strengthened by the fact that, as certain docs say, IF the prefetch buffer does not hit, the WAIT STATE is THEN imposed on the instruction fetch.

The sequential ROM accesses are shorter because the nintendo carts have address counters on them that autoincrement every time they are strobed to do so. When an address does not match, instruction or otherwise, the counters are relatched and the data must wait for the wait state amount before the data is valid on the bus. The cpu prefetch IS taking advantage of this to do the prefetching at all, otherwise there probably would never be enough free bus time to do the prefetch.

Maddox strikes again!
_________________
You probably suck. I hope you're is not a game programmer.

#2219 - tepples - Fri Jan 31, 2003 6:12 am

Maddox wrote:
When the memory subsystem has time and access to the bus, it prefetches INSTRUCTIONS.

Correct. My tests on hardware show that setting the prefetch bit does NOT speed DMA copying from ROM to RAM. Thus, the maximum data bandwidth of the cartridge interface is (16777216 cycles/s) / (3 cycles/transfer) * (2 bytes/transfer) = 10.66 MiB/s (11184810 bytes/s), not 16 MiB/s as I have previously erroneously claimed on this board.
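
The arithmetic in the post above can be checked with a few lines of C. This is a sketch only; the helper name `bus_bytes_per_second` is hypothetical, and the clock figure is the one used in the post.

```c
#include <stdint.h>

#define GBA_CLOCK_HZ 16777216u  /* cycles per second, as used in the post */

/* Hypothetical helper: bytes per second for a stream of transfers that each
   move bytes_per_transfer bytes and occupy cycles_per_transfer clock cycles.
   cycles_per_transfer must be non-zero. The result is truncated to a whole
   number of bytes. */
static uint32_t bus_bytes_per_second(uint32_t cycles_per_transfer,
                                     uint32_t bytes_per_transfer)
{
    /* 64-bit intermediate so the multiplication cannot overflow. */
    return (uint32_t)(((uint64_t)GBA_CLOCK_HZ * bytes_per_transfer)
                      / cycles_per_transfer);
}
```

With 3 cycles and 2 bytes per transfer this gives 11184810 bytes/s, matching the figure above; with 2 cycles per transfer it gives 16777216 bytes/s, the 16 MiB/s figure.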

#2223 - ampz - Fri Jan 31, 2003 10:51 am

tepples: The maximum bandwidth of the cart interface is 16777216 bytes/s (16MB/s) if the waitstate is set to 2, and that's the default waitstate setting for all commercial games.

Understand that the cart prefetch buffer would not speed up DMA transfers. The cart bus is separated from the internal bus, and thus the DMA transfers can utilize the full bandwidth of the cart bus when doing a transfer to internal RAM.
If you want to test if the buffer speeds up data transfers, then you have to make a loop running in internal RAM that copies data from the cart to internal RAM.
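
A minimal sketch of such a copy loop in C follows. The function name `copy_halfwords` is hypothetical. On real hardware the routine would also have to be placed in internal RAM (how to do that depends on the toolchain and is not shown) and timed with the prefetch bit on and off; the timing code is left out here as well.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy n halfwords (16 bits each, the width of the cart bus) from src to dst,
   always walking upwards through memory so that the source addresses are
   sequential. volatile keeps the compiler from merging or dropping accesses. */
static void copy_halfwords(volatile uint16_t *dst,
                           const volatile uint16_t *src,
                           size_t n)
{
    while (n--)
        *dst++ = *src++;
}
```

In a test, `src` would point into ROM and `dst` into internal RAM, and the same call would be timed once for each prefetch setting.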

Maddox: I know how GBA carts work. I build them, and I'm also the one who released the full cart bus spec on the net in the first place. I'm also very well aware of computer architectures.
You are still confusing the cart prefetch with the CPU pipeline prefetch stage. They are two totally different but coexisting prefetch buffers.

Just because it's called "fetch" doesn't mean it's an instruction fetch; it may just as well be a data fetch.

#2225 - zeuhl - Fri Jan 31, 2003 11:24 am

Just posting this message because nobody has mentioned the ldm**/stm** ARM instructions.

the instruction : "ldmia r0!,{r1-r12}", for example, loads 12 words (ie, 48 bytes) from the address pointed by r0 and stores them to r1,r2,.....,r12. then it adds 48 to r0, in order to enable another consecutive load.
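
In C terms, the effect of that instruction can be sketched roughly as below. The function name `ldmia_12` and the `regs` array standing in for r1 to r12 are hypothetical, and the sketch models only the data movement, not the timing.

```c
#include <stddef.h>
#include <stdint.h>

/* Models "ldmia r0!,{r1-r12}": read 12 consecutive 32-bit words starting at
   *src into regs[0..11] (standing in for r1..r12), then advance *src past
   them, which is what the "!" write-back does to r0. */
static void ldmia_12(const uint32_t **src, uint32_t regs[12])
{
    size_t i;

    for (i = 0; i < 12; i++)
        regs[i] = (*src)[i];
    *src += 12;  /* 12 words = 48 bytes */
}
```

After one call the pointer sits 48 bytes further on, ready for the next block of 12 words.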

taking the ROM prefetch buffer into account, will this instruction be faster/slower/equivalent to a bunch of twelve "ldr r1,[r0],#4" consecutive instructions ? (obviously, supposing each instruction would load data to a different register)

and which is the fastest ?
- a bunch of ldmia/stmia instructions for copying data from ROM to IWRAM, or
- a bunch of ldr/str instructions ?
thanks for your help.

#2235 - FluBBa - Fri Jan 31, 2003 2:44 pm

tepples wrote:
Maddox wrote:
When the memory subsystem has time and access to the bus, it prefetches INSTRUCTIONS.

Correct. My tests on hardware show that setting the prefetch bit does NOT speed DMA copying from ROM to RAM.

How could it have done? It's only useful when working with the CPU.
It only allows you to do things _between_ the "fetches" of data.
If this thing works for data too, and not only for instructions, it will give you a nice speed boost for things like...
ldr r0,[ROM]
add r0,r1,r2,lsr#16
sub r0,r0,r3
str r0,[RAM]
ldr r0,[ROM+1]
.....

If the prefetch is on you will get 0(?) waitstates on the second ldr...
Instead of letting the processor wait out the wait state, the prefetch buffer does the waiting for you, but if you do [buffer depth] consecutive reads the buffer will be empty (or soon to be empty) and then the wait states will kick in again.

#2243 - tepples - Fri Jan 31, 2003 3:29 pm

ampz wrote:
tepples: The maximum bandwidth of the cart interface is 16777216 bytes/s (16MB/s) if the waitstate is set to 2, and that's the default waitstate setting for all commercial games.

But on my hardware, a DMA copy still takes two cycles to read each 16 bits from ROM and one cycle to write them to IWRAM, even at 3/1 wait state timing. The cart prefetch hardware seems not to fetch data from the cart while the DMA hardware is writing to IWRAM.

Quote:
If you want to test if the buffer speeds up data transfers, then you have to make a loop running in internal RAM that copies data from the cart to internal RAM.

How would this look? A 'ldmia' instruction would read so much data from ROM that it would empty the cart prefetch buffer rather quickly, right?

#2249 - ampz - Fri Jan 31, 2003 6:30 pm

That's kinda strange..