gbadev.org forum archive

You know I'm basically a newbie at ARM asm...
I've read in the ARM946E-S tech sheets that it implements the PLD instruction, which tells the data cache that we will soon access data stored at a given location, so the memory unit should (if possible) load that cache line and the data will already be cached when we need it.
It looks quite simple, IMHO.
I have a quite long array of words (say 16000) in main RAM and I need to process each of them sequentially (not simply memcpy them). So my code looks like:

Code:

ldm <some of the words>
pld <some of the words I'll need later>
<process the data>
<process the data>
<process the data>
...
<process the data>
stm <the results>
(if !done) b <back>

In my tests I couldn't make this code run any faster than the same version without the pld instruction. I've also tried moving the code to ITCM, so that the code itself isn't in main RAM either and the ldm instruction gets main memory to itself. No luck.
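
For reference, the same loop shape can be sketched in C. This is only an illustration, assuming GCC (or Clang) and its `__builtin_prefetch` builtin, which is documented as a hint and compiles to nothing on targets without a prefetch instruction. The `process_word` body and the 8-word chunk size (one 32-byte cache line) are placeholders, not the real processing.

```c
#include <stddef.h>
#include <stdint.h>

#define WORDS_PER_LINE 8u  /* assumed: 32-byte cache line = 8 words */

/* Hypothetical stand-in for the real per-word processing. */
static uint32_t process_word(uint32_t w)
{
    return w * 3u + 1u;
}

/* Process `count` words from src into dst, one cache line at a time,
 * hinting at the next line before working on the current one. */
void process_array(const uint32_t *src, uint32_t *dst, size_t count)
{
    size_t i = 0;
    while (i < count) {
        size_t n = count - i;
        if (n > WORDS_PER_LINE)
            n = WORDS_PER_LINE;

        /* Only hint while a next line exists, so the address stays
         * inside the array. On cores without PLD this does nothing. */
        if (i + WORDS_PER_LINE < count)
            __builtin_prefetch(src + i + WORDS_PER_LINE, 0, 0);

        for (size_t j = 0; j < n; j++)
            dst[i + j] = process_word(src[i + j]);

        i += n;
    }
}
```

Timing this with and without the `__builtin_prefetch` line gives the same comparison as the asm versions.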

Any idea?

Thanks :)

Code in ITCM runs at the same speed as code in the cache, so there's no advantage to moving code there unless your loop exceeds the size of the instruction cache. I suspect there might be some advantage if the code is called from many places and has somehow been evicted from the cache between calls, but I haven't been able to write code to test that.

There are other ways to improve access speed for data processing - for instance if your data is a list of x, y and z co-ordinates then you'll get more improvement from having an array for each element instead of interleaving them in one structure. This way you make use of 3 cache lines during processing instead of just one (assuming each array is bigger than 32 bytes).
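
To make the two layouts concrete, here is a minimal C sketch. The struct and function names are made up for illustration, and whether the split layout actually wins depends on the access pattern, so it is worth measuring both.

```c
#include <stddef.h>
#include <stdint.h>

/* Interleaved layout: x, y and z of one point sit next to each other. */
struct point {
    int32_t x, y, z;
};

/* Split layout: one array per component (hypothetical names). */
struct point_arrays {
    const int32_t *x;
    const int32_t *y;
    const int32_t *z;
};

/* out[i] = x + y + z, reading a single interleaved stream. */
void sum_interleaved(const struct point *p, int32_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = p[i].x + p[i].y + p[i].z;
}

/* Same result, reading three separate streams, one per component. */
void sum_split(const struct point_arrays *a, int32_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a->x[i] + a->y[i] + a->z[i];
}
```

Both functions produce the same output; only the memory access pattern differs.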

The ARM Architecture Reference Manual says that the PLD instruction is a hint and acts as a NOP on memory systems that don't support the optimisation. Given that Nintendo didn't add the necessary interface to provide byte writes to GBA cart space & VRAM, I think we can probably safely assume PLD won't work if it needs some support from the memory architecture. Feel free to disprove this though :p
_________________
devkitPro - professional toolchains at amateur prices
devkitPro IRC support
Personal Blog

PLD is an optional instruction in ARM9, and it is very common for implementations to remove it. I suspect this is the case for the ARM9 in the DS as well.

When writing software that targets multiple ARM implementations, I've inserted PLDs where I'd get a register stall anyway (e.g. when I have to read a register in the instruction right after the one that writes it). That has helped on the chips that do implement PLD without causing any harm on those that don't.

wintermute: I tried moving the code to ITCM just to make sure that the opcode fetches wouldn't interfere with cache/main memory access, not to make the code faster, as it's just a few bytes and it will be cached easily, of course :)
Data interleaving is an interesting hint, even if I really have just a simple word array. I'll think about it.
kusma: I read that PLD is an optional instruction, but unfortunately the only way I know to check whether it has been implemented in the DS is to test it. I still can't make it work, but I can't assume from that alone that it hasn't been implemented...

PLD is vital to the 3DS on its ARM chips, but does nothing on the regular DS.

Miked0801: ... and I believe the DSi is just like the regular DS, or did they actually add PLD support there? I'm running my tests on a DS Lite because it's what I've got ATM.

I found this piece of code on the net (from the "Android Open Source Project"):

Code:

bic r12, r1, #0x1F
ldr r3, [r12], #32 /* cheap ARM9 preload */

which looks like it's aligning r12 down to a cache line boundary (8 words, 32 bytes) and loading the first word from there. It makes me think that the ARM9 waits only for the word it needs and that the next 7 words are loaded into the cache 'later' (thus building a kind of 'read ahead').

In my tests this doesn't seem to happen on the DS. Is it again a different ARM9 implementation?
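
For anyone not fluent in asm, this is roughly what those two instructions do, written as a C sketch. The helper names are hypothetical and the 32-byte line size is assumed; `bic r12, r1, #0x1F` clears the low 5 address bits, and the `ldr` then reads one word from the start of that line (the asm also post-increments r12 by 32, which is left out here).

```c
#include <stdint.h>

#define CACHE_LINE_BYTES 32u  /* assumed line size: 8 words */

/* Equivalent of `bic r12, r1, #0x1F`: round an address down to the
 * start of its cache line by clearing the low 5 bits. */
static inline uintptr_t line_base(uintptr_t addr)
{
    return addr & ~(uintptr_t)(CACHE_LINE_BYTES - 1u);
}

/* "Cheap preload": read the first word of the line containing p so
 * that a miss triggers a fill of the whole line. The caller must make
 * sure the whole line is readable memory. */
static inline uint32_t touch_line(const void *p)
{
    return *(const volatile uint32_t *)line_base((uintptr_t)p);
}
```

Whether the core keeps running while the rest of the line is fetched is exactly the implementation detail in question; the C version makes no promise either way.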

The DSi has the same memory controller/CPU as the DS, just clocked higher. No PLD needed there either. Again, the 3DS is the first Nintendo handheld that needs it.

What do you mean by "needing" PLD?

The 3DS actually has a real (and quite small) cache. And from personal experience, doing a memcpy without the cache being properly pipelined via PLD caused a large, nearly catastrophic slowdown in performance.

From what I've read around, it seems that the ARM926 can

Quote:

stall only until the requested word is fetched, but the linefill continues in the background

... but this doesn't seem to be the case for the ARM946.
According to the ARM website, the 926 has an MMU and the 946 has no MMU (there's an MPU, which is something different), so I guess I have to forget about any kind of preload. Pity. Oh well, ok. :|

edit: I finally found this:

Quote:

The ARM946E-S does not perform critical word first loading on cache line fills, [...]

(Source: the ARM website.) Just to let you know.

ASM > Anyone actually ever used PLD?

#176604 - sverx - Wed Aug 24, 2011 9:43 am

#176605 - wintermute - Wed Aug 24, 2011 12:20 pm

#176606 - kusma - Wed Aug 24, 2011 12:25 pm

#176607 - sverx - Wed Aug 24, 2011 12:47 pm

#176609 - Miked0801 - Wed Aug 24, 2011 9:28 pm

#176610 - sverx - Thu Aug 25, 2011 8:35 am

#176611 - sverx - Fri Aug 26, 2011 10:55 am

#176637 - Miked0801 - Sun Aug 28, 2011 1:47 am

#176640 - wintermute - Sun Aug 28, 2011 10:25 am

#176645 - Miked0801 - Mon Aug 29, 2011 4:05 am

#176646 - sverx - Tue Aug 30, 2011 9:49 am