#108761 - Mighty Max - Sat Nov 11, 2006 10:41 pm
Heya,
It's been a long time since my last real post, so here comes one again :D As some of you know, I often try out the things I hear about in my uni courses. This semester I'm taking parallel processing ;) [more exactly: I'm doing a parallel processing practical]
I have uploaded an early version of an MPI (Message Passing Interface) library for the DS here, together with a demo (binaries only).
Nothing spectacular, but a uniform way to utilize the power of both processors. Implemented so far are blocking transfers, nonblocking recv, barriers and of course the initialisation routines.
The demo demonstrates the basic use of the underlying functions as well as a Mandelbrot renderer, which first runs on the ARM9 only and then on both processors (weighted: the ARM9 does 2 lines while the ARM7 does one). It shows a nice speed gain.
The Mandelbrot routines are "naive": there was no speed optimization of the algorithm, so forgive the times; they even use floats *evil*. The speedup is the interesting thing *g*
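A rough sketch of how a 2:1 line weighting like this could be expressed in C. The rank numbers and the names `line_owner` and `lines_for_rank` are made up for illustration and are not taken from the library or the demo.

```c
#include <assert.h>

/* Hypothetical rank numbers; the library's real identifiers may differ. */
enum { RANK_ARM9 = 0, RANK_ARM7 = 1 };

/* Assign image line y to a CPU in a repeating pattern of
 * weight9 lines for the ARM9 followed by weight7 lines for the ARM7. */
int line_owner(int y, int weight9, int weight7)
{
    assert(y >= 0 && weight9 > 0 && weight7 > 0);
    return (y % (weight9 + weight7)) < weight9 ? RANK_ARM9 : RANK_ARM7;
}

/* Count how many of the first `height` lines a given rank renders. */
int lines_for_rank(int height, int weight9, int weight7, int rank)
{
    int count = 0;
    for (int y = 0; y < height; ++y)
        if (line_owner(y, weight9, weight7) == rank)
            ++count;
    return count;
}
```

With a 192-line DS screen and weights 2 and 1, this gives the ARM9 128 lines and the ARM7 64. Interleaving the lines instead of giving each CPU one contiguous block also spreads the expensive parts of the image over both processors.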
Greets
Mighty Max
[edit] You can find information about the MPI standard here
[edit2] The timer values shown for the Mandelbrot runs are the number of overflows of a 16-bit DIV_64 timer.
[edit3] Started the documentation (really only started, don't expect much) in this PDF.
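For reference, overflow counts like these can be converted to seconds. A minimal sketch in C, assuming the timer counts up from 0 (the post does not say which reload value is used) and is driven by the DS's 33.513982 MHz bus clock; `overflows_to_seconds` is a made-up helper name.

```c
#include <assert.h>

/* DS timers count from the 33.513982 MHz bus clock. */
#define DS_TIMER_HZ 33513982.0

/* Seconds represented by `overflows` wrap-arounds of a 16-bit timer that
 * starts from 0 (assumption) and uses the given prescaler. */
double overflows_to_seconds(unsigned overflows, unsigned prescaler)
{
    assert(prescaler == 1 || prescaler == 64 ||
           prescaler == 256 || prescaler == 1024);
    return overflows * (65536.0 * prescaler) / DS_TIMER_HZ;
}
```

Under those assumptions one overflow is about 0.125 s, so 271 overflows would be roughly 34 s and 183 overflows roughly 23 s. A different reload value would change these figures.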
_________________
GBAMP Multiboot
Last edited by Mighty Max on Wed Nov 15, 2006 2:31 pm; edited 4 times in total
#108778 - Payk - Sun Nov 12, 2006 1:03 am
Hey, nice thing Mighty :D
You are always interested in those abstract things, right?
So on arm9 only: 271 mu's
Using both cpus: 183 mu's
What's MU? Mighty Unit ;P
Anyway nice job, well done!
It's an interesting thing and I am wondering what else can be done by using both CPUs...
#108840 - OOPMan - Sun Nov 12, 2006 8:17 am
Very very cool. I always wondered whether someone would try using the DS's 2 processors for some parallel processing related purposes. It's nice to see that speed gains are very much possible :-)
Now all we need is to build a Beowulf cluster over WiFi ;-)
_________________
"My boot, your face..." - Attributed to OOPMan, Emperor of Eroticon VI
You can find my NDS homebrew projects here...
#108875 - tepples - Sun Nov 12, 2006 3:50 pm
Running video on one CPU and sound on the other is already parallel processing, and even MoonShell does this.
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.
#108881 - Mighty Max - Sun Nov 12, 2006 5:06 pm
Indeed, tepples. However, the level of parallelization I have found so far is pretty low.
Separating jobs is a pretty easy task; each CPU really just solves a standalone problem. MPI is usually the way to solve one problem on multiple systems.
I'm aiming at a standard rather than "look, it's possible". The latter is well known.
What is done atm if you need to add a new job to the ARM7? Atm every homebrew developer writes his own communication "add-on" if he needs to transfer something to the other CPU.
The IPC struct has its limits. Accesses to it are mainly non-atomic, therefore it might get corrupted. Sending sound commands is likely to cause timing problems, due to the only-on-vblank limitation.
I don't see why everyone has to invent IPC again every time something needs to be added, when it could be as simple as Send(this to there).
#108898 - tepples - Sun Nov 12, 2006 9:25 pm
Yes, a standard method of communication between the CPUs, embodied in a highly documented example, would be nice to have.
#108899 - OOPMan - Sun Nov 12, 2006 9:42 pm
libTalkingProcessors :-)
#108901 - Payk - Sun Nov 12, 2006 10:35 pm
^^Hehe...but not that fast...
First TalkingProcessorsLib...and later LibTalkingProcessors (like libfat, libnds...etc)
Seems to be the fashion :D
#108907 - Hermes - Sun Nov 12, 2006 11:28 pm
Hi
My multithread library has some methods for parallel execution using the two processors.
For example, in the ARM7 main() some templates have an infinite loop that does nothing. In my case, I can attach a function from the ARM9 code section so that it runs on the ARM7.
It is also possible to create a remote thread on the ARM7 or register an RPC server (the RPC method can work asynchronously, of course).
The mod player also works from the ARM9 section, in this case using an export table.
But the problem: the ARM7 is not a powerful processor and some other processes eat too much time. You can only count on using the dead time between processes.
Here some things such as the RTC (the clock) work from a thread in my case, and the interrupts can work well; with other libs, however, the interrupts are disabled (there it works from the interrupt handler).
Anyway, it eats time, and it is normal to see a parallel thread running with some timing irregularity.
In my opinion, the ARM7 is a subprocessor designed to work with the peripherals and other smaller things (and the smaller things cannot have priority, so they execute irregularly).
Here I post an example of my (new) library, but you can see this example in my old multithread library too.
http://mods.elotrolado.net/~hermes/mthread_example2.rar
In the example, the mod player and the fire generation run on the ARM7 from the ARM9 section. The ARM9 only displays the result on the upper screen, updates the time and little more.
My library has complete documentation.
#108910 - masscat - Mon Nov 13, 2006 12:10 am
I am getting timings of 271 for ARM9 only and 183 for both CPUs. So, assuming that the ARM9 is twice as fast as the ARM7 (clock-speed-wise), this example is getting pretty much 100% usage of both ARMs. Very impressive (or am I misunderstanding the timings?).
Note for any wmb for Linux users: wmbhost will moan about example.nds not being "a good .nds file". This is a bug in version 1.30 of ndstool, which makes .nds files that are shorter than they should be (see here for the first mention; it has been fixed in CVS).
Edit: as of release r19, the fix for ndstool has not been in a release of devkitARM, only in the SourceForge CVS.
Last edited by masscat on Mon Nov 13, 2006 1:32 am; edited 1 time in total
#108913 - Mighty Max - Mon Nov 13, 2006 12:39 am
I'm well aware of this, Hermes
(good work on the multithreading)
Actually, multithreading adds another 'need' for MPI, as there might be more than those 2 execution streams (from ARM7 or ARM9) that want to work together this way.
The two things work at different abstraction levels.
While a thread that is able to run on both CPUs needs to stick to the common base (opcodes and coprocessor features that exist on the ARM9 only can't be used when running the thread on the ARM7), in MPI the code for each CPU is compiled and optimized separately for that processor.
There it doesn't matter whether the other processing streams (be they threaded, interrupt-driven or continuous) are on the same CPU, on the same machine, or just in the same world.
Given a means of communication (here WiFi), it allows parallel processing on machines that are completely different, e.g. coprocessing with an x86 server, a PSP, a GP32 or whatever you can imagine.
In other words, the co-running streams are platform independent. One Linux node can run with a Windows node and an OS X node, or a 1936 Turing machine can work together with IBM's Blue Gene.
@masscat,
looking at it independently (ignoring what we know about the CPUs), the speedup is only around 1.48. So there is a noticeable overhead tradeoff (we would expect 1.5 as the perfect number). However, because we know this speed difference, the code was designed to optimize for it.
Once the non-blocking sends and the request-testing parts of the MPI protocol are implemented, it wouldn't even matter whether we know the clock ratio / processing power. It might even add a per mille to the speedup ;)
Btw, if you change the code of the example so that the line ratio is 3 (ARM9) : 1 (ARM7), the timings change to 271 vs 204.
(For the nds problem ... guess I gotta update my devkit again :D ... long time that I didn't do anything)
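These numbers fit a very simple model. The sketch below is an illustration rather than code from the demo. It assumes every line costs the same and that the ARM7 is exactly twice as slow as the ARM9; `predicted_time` and `speedup` are made-up names.

```c
#include <assert.h>

/* Predicted run time when the ARM9 renders `share9` of the lines and the
 * ARM7 renders the rest.
 * t_arm9_only:   measured time for the ARM9 doing everything.
 * arm7_slowdown: how many times slower the ARM7 is (assumed 2.0 here). */
double predicted_time(double t_arm9_only, double share9, double arm7_slowdown)
{
    assert(share9 >= 0.0 && share9 <= 1.0 && arm7_slowdown > 0.0);
    double t9 = t_arm9_only * share9;
    double t7 = t_arm9_only * (1.0 - share9) * arm7_slowdown;
    return t9 > t7 ? t9 : t7;   /* the run ends when the slower side is done */
}

/* Classic speedup: serial time divided by parallel time. */
double speedup(double t_serial, double t_parallel)
{
    assert(t_parallel > 0.0);
    return t_serial / t_parallel;
}
```

For a 2:1 split the model predicts about 180.7 timer units against the measured 183. For 3:1 it predicts 203.25 against the measured 204, because the ARM9 becomes the bottleneck while the ARM7 idles. The small gap in both cases would be the communication overhead.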
#108914 - HyperHacker - Mon Nov 13, 2006 12:50 am
For parallel processing I wrote up a simple FIFO messaging system. It only blocks while the FIFO is full, which should be a fairly rare occurrence since each CPU has the FIFO Not Empty interrupt enabled. They do a bit of passing function pointers back and forth at boot; it probably wouldn't be difficult to dynamically pass pointers and parameter info at runtime. If I ever finish the project you can grab a copy from the source.
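A software model of that idea, as a hedged sketch only. It is not HyperHacker's code and does not touch the real hardware registers. The struct and function names are made up; the only hardware fact used is that the DS inter-processor FIFO holds 16 32-bit words.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FIFO_WORDS 16u   /* the DS hardware send FIFO holds 16 32-bit words */

struct word_fifo {
    uint32_t data[FIFO_WORDS];
    unsigned head;   /* index of the oldest word */
    unsigned count;  /* words currently queued */
};

/* Queue one word; returns false instead of blocking when the FIFO is full. */
bool fifo_try_send(struct word_fifo *f, uint32_t word)
{
    assert(f != 0);
    if (f->count == FIFO_WORDS)
        return false;
    f->data[(f->head + f->count) % FIFO_WORDS] = word;
    f->count++;
    return true;
}

/* What a "FIFO not empty" handler on the receiver would do:
 * take the oldest word, or report that nothing is queued. */
bool fifo_try_recv(struct word_fifo *f, uint32_t *out)
{
    assert(f != 0 && out != 0);
    if (f->count == 0)
        return false;
    *out = f->data[f->head];
    f->head = (f->head + 1) % FIFO_WORDS;
    f->count--;
    return true;
}
```

A blocking send is then just a loop around `fifo_try_send`. As long as the receiver drains the queue from its interrupt handler, that loop rarely spins.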
_________________
I'm a PSP hacker now, but I still <3 DS.
#108917 - Mighty Max - Mon Nov 13, 2006 1:14 am
Thanks HyperHacker, but this part is done. (Just exchanging pointers is not an option, as the DS is a NUMA system: not all memory locations are accessible by all CPUs.)
Next I gotta implement the "send FIFO empty" IRQ for the nonblocking sends.
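Passing by value instead of by pointer means the payload itself has to travel through the channel. A minimal sketch of packing a byte buffer into 32-bit words for a word-based transport; the function names are hypothetical and this is not the library's wire format. Both ARM cores run little-endian here, so the byte order matches on both sides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Number of 32-bit words needed to carry len bytes. */
size_t words_needed(size_t len)
{
    return (len + 3) / 4;
}

/* Copy len bytes from src into 32-bit words; the last word is zero-padded.
 * Returns the number of words to transmit. */
size_t pack_words(const void *src, size_t len, uint32_t *words, size_t max_words)
{
    size_t n = words_needed(len);
    assert(src != 0 && words != 0 && n <= max_words);
    memset(words, 0, n * sizeof words[0]);
    memcpy(words, src, len);
    return n;
}

/* Receiver side: copy len bytes back out of the received words. */
void unpack_words(const uint32_t *words, void *dst, size_t len)
{
    assert(words != 0 && dst != 0);
    memcpy(dst, words, len);
}
```

The receiver needs to know `len`, so a real protocol would send a small header (length, tag, source) ahead of the payload words.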
Last edited by Mighty Max on Mon Nov 13, 2006 1:22 am; edited 1 time in total
#108918 - Mighty Max - Mon Nov 13, 2006 1:21 am
arg wrong button - delete me pls
#109141 - Mighty Max - Wed Nov 15, 2006 2:42 pm
Documentation started (see first post)
Lib updated. Too many changes to name them all. Many of them are to be more compliant with the MPI standard. The main change, however, is a way to add new means of communication other than FIFO (even at runtime, as emulators might not like FIFO IRQs).
Barriers are not yet included in the comm change. That's coming up.
[edit] There is a lil internal problem in the uploaded lib when multiple nonblocking receives are initiated. This will be fixed next time (it is already in the working version).
Nonblocking recvs speed the Mandelbrot up to 182 timer overflows in MPI mode. It is now visibly observable that the ARM7 finishes its lines faster than the ARM9 :D
Demo in nonblocking mode
[edit2] The demo was updated due to deadlocks. There was a problem when both CPUs were in IRQ mode trying to send. They would abort and try again after a while. As the last access to the FIFO was synchronous, they retried at exactly the same time, so they deadlocked in a loop of retry & abort. Fixed: the ARM9 now waits a very lil' bit before retrying.
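Why the tiny extra wait helps can be shown with a toy tick model. This is only an illustration, not the library's code, and `first_uncontended_tick` is a made-up name. Both sides try at tick 0 and collide; after every collision each side backs off by its own delay.

```c
#include <assert.h>

/* Returns the first tick at which exactly one side tries to send,
 * or -1 if the two sides still collide on every attempt up to max_tick. */
int first_uncontended_tick(int delay9, int delay7, int max_tick)
{
    assert(delay9 > 0 && delay7 > 0 && max_tick >= 0);
    int next9 = 0, next7 = 0;   /* tick of each side's next attempt */
    for (int tick = 0; tick <= max_tick; ++tick) {
        int try9 = (tick == next9);
        int try7 = (tick == next7);
        if (try9 && try7) {          /* collision: both abort and back off */
            next9 = tick + delay9;
            next7 = tick + delay7;
        } else if (try9 || try7) {
            return tick;             /* only one side sends: no conflict */
        }
    }
    return -1;
}
```

With equal delays the two sides stay in lockstep and the function returns -1. As soon as the delays differ by even one tick, one side gets through on its first retry.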
Last edited by Mighty Max on Wed Nov 15, 2006 11:08 pm; edited 2 times in total
#109143 - OOPMan - Wed Nov 15, 2006 3:59 pm
Nice to see this is moving forwards. I'm going to be updating the devkitPro liblist again today, so hopefully I'll reach the point that gets yours added in (Alphabetical progression does have its downsides... ;-)
#109227 - HyperHacker - Thu Nov 16, 2006 6:55 am
Mighty Max wrote: |
Lib updated. Too many changes to name them all. Many of them are to be more compliant with the MPI standard. The main change, however, is a way to add new means of communication other than FIFO (even at runtime, as emulators might not like FIFO IRQs). |
I really wish people would stop trying to make their apps work on broken emulators and start fixing the emulators.
#109241 - tepples - Thu Nov 16, 2006 8:37 am
For one thing, skills in application programming and skills in emulator programming don't really match, and for another, some popular emulators are proprietary software. I've made a tiny .nds test case that hangs NO$GBA, yet not being a paying customer, I haven't got a reply to the e-mail that I sent to Mr. Korth.
#109263 - masscat - Thu Nov 16, 2006 4:30 pm
HyperHacker wrote: |
Mighty Max wrote: | Lib updated. Too many changes to name them all. Many of them are to be more compliant with the MPI standard. The main change, however, is a way to add new means of communication other than FIFO (even at runtime, as emulators might not like FIFO IRQs). |
I really wish people would stop trying to make their apps work on broken emulators and start fixing the emulators. |
Separating the FIFO inter-ARM communications mechanism from the library is good practice, as it is not a requirement of the MPI implementation (the particular FIFO implementation, that is, not the need for a means for the ARMs to talk to each other). Defining the ARM comms interface (with a FIFO example implementation) makes the library more flexible and adaptable to different situations.
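One common way to express such a separation in C is a small table of function pointers. The sketch below is purely illustrative: `struct comm_backend`, the loopback backend and all function names are assumptions, not the interface the library actually defines.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical transport interface: not the library's actual one. */
struct comm_backend {
    /* Each returns the number of bytes transferred, or -1 on error. */
    int (*send)(void *ctx, const void *buf, size_t len);
    int (*recv)(void *ctx, void *buf, size_t max_len);
    void *ctx;   /* backend-private state */
};

/* Example backend: a one-message loopback buffer standing in for FIFO or WiFi. */
struct loopback {
    unsigned char data[64];
    size_t len;   /* 0 means "no message waiting" */
};

int loopback_send(void *ctx, const void *buf, size_t len)
{
    struct loopback *lb = ctx;
    if (len == 0 || len > sizeof lb->data || lb->len != 0)
        return -1;   /* empty, too big, or previous message still unread */
    memcpy(lb->data, buf, len);
    lb->len = len;
    return (int)len;
}

int loopback_recv(void *ctx, void *buf, size_t max_len)
{
    struct loopback *lb = ctx;
    if (lb->len == 0 || lb->len > max_len)
        return -1;
    size_t n = lb->len;
    memcpy(buf, lb->data, n);
    lb->len = 0;
    return (int)n;
}

/* Code above the transport layer only ever sees the interface. */
int comm_send(const struct comm_backend *be, const void *buf, size_t len)
{
    assert(be != 0 && be->send != 0);
    return be->send(be->ctx, buf, len);
}

int comm_recv(const struct comm_backend *be, void *buf, size_t max_len)
{
    assert(be != 0 && be->recv != 0);
    return be->recv(be->ctx, buf, max_len);
}
```

Swapping the transport at runtime then amounts to pointing at a different `struct comm_backend`, and the send/receive logic above it stays untouched.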
#109282 - Mighty Max - Thu Nov 16, 2006 8:47 pm
The libs have been updated, as well as the example's source code, to reflect the changes made so far to the mpi/mpic separation and, again, to error casting & naming. Nonblocking sends have been added.
The measuring in the example has changed to be more precise (one timer run is now ~1/503 secs).
Another example has been added: Sorting an array of floats.
Mandelbrot isn't something you normally use in your programs, so this should demonstrate that common problems benefit from this too.
Two identical arrays of 100k floats are sorted a) on one processor with quick_sort and b) on both via a weighted merge_split_sort. (Shared memory is used, but is not necessary.)
a) 2.54 secs b) 1.94 secs
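That is a speedup of about 1.31. The general shape of a weighted split-then-merge sort can be sketched on a single CPU as below. This is a stand-in for illustration, not the example's source: the two `qsort` calls mark the parts that would run on different processors, and `weighted_split_sort` is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

static int cmp_float(const void *a, const void *b)
{
    float x = *(const float *)a, y = *(const float *)b;
    return (x > y) - (x < y);
}

/* Sort v[0..n) by sorting a weight9/(weight9+weight7) share and the
 * remainder separately, then merging the two sorted runs.
 * Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int weighted_split_sort(float *v, size_t n, unsigned weight9, unsigned weight7)
{
    assert(v != 0 && weight9 + weight7 > 0);
    size_t split = n * weight9 / (weight9 + weight7);
    float *tmp = malloc((n ? n : 1) * sizeof *tmp);
    if (tmp == NULL)
        return -1;

    qsort(v, split, sizeof *v, cmp_float);              /* "ARM9" part */
    qsort(v + split, n - split, sizeof *v, cmp_float);  /* "ARM7" part */

    size_t i = 0, j = split, k = 0;                     /* merge step */
    while (i < split && j < n)
        tmp[k++] = (v[j] < v[i]) ? v[j++] : v[i++];
    while (i < split) tmp[k++] = v[i++];
    while (j < n)     tmp[k++] = v[j++];

    memcpy(v, tmp, n * sizeof *v);
    free(tmp);
    return 0;
}
```

The final merge is sequential, which is one reason a sort gains less from the second CPU than the Mandelbrot renderer does.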
#110910 - Mighty Max - Sat Dec 02, 2006 8:06 pm
Heya, just wanted to give a lil' Status update.
I was trying to get WiFi working together with the MPI library as another means of communication (to coprocess with multiple DSes or other devices) when I came across a major problem (when forwarding the messages received via FIFO).
The last released version of the lib performs sends in IRQ mode when no recv request was posted before; otherwise it just completes the transfer. However, when the timing is bad, both sides might get into the sending state with the send FIFO full. Both will then wait for the other to empty their receive buffer.
This wouldn't be much of a problem if there weren't messages that themselves cause a send on receive. One of them is the barrier message that returns the remote barrier level. This creates another send that will wait until the other side is done receiving ... and while we send, our receive FIFO has filled up again because the other side did the same ... not to mention that a mutex owned by the main program will prevent the IRQ from acquiring the mutex unless it returns.
This problem can only be hotfixed with recursion level controls and flags and ... but that doesn't really solve the problem, it just creates more, harder-to-track problems. So the problem is a design issue.
I decided to change the way receives/sends are called. This is a change to the underlying structure, so it will take some time. I'll keep everyone interested up to date.
#110962 - HyperHacker - Sun Dec 03, 2006 7:36 am
tepples wrote: |
some popular emulators are proprietary software. |
If they're not willing to fix bugs or let others fix them, then forget them. There are open-source emulators, or you could write your own.