#161072 - Atreides - Sun Jul 27, 2008 10:04 pm
Hi All,
I have been writing 3D engines in asm since the 386, but haven't had much time or opportunity to develop a from-scratch 3D engine for several years.
The opportunity to develop a 3D engine for the DS has come up, and I have been looking through the tech specs. I have a few questions that would help me in the design phase while I work out how I want to implement my engine.
Firstly, I am curious about the 3D core. The quality of the renderer is poor and I don't want to use it, but is it a separate chip, or is it firmware that utilises the ARM9?
Could I fill the data structs for the 3D core directly and force a render flush more than once per frame?
Is it possible to read back 3D transformation data for vectors and matrices from the 3D core?
I will probably use an incremental scanline renderer that implements Gouraud shading, Phong lighting and texture mapping, and I will be using 15-bit colour modes. Any suggestions for which internal memory is best for scanline raster buffers that get copied to VRAM, working data areas for span-converted triangles and the like?
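For reference, the inner loop of a Gouraud-shaded span in 15-bit colour is very small once the per-pixel colour deltas are computed up front. A minimal sketch, assuming the DS/GBA RGB555 layout (red in bits 0-4, green in 5-9, blue in 10-14) and 16.16 fixed-point interpolation; `gouraud_span` and `rgb555` are illustrative names, not from any library:

```c
#include <assert.h>
#include <stdint.h>

/* Pack 5-bit channels into RGB555: red bits 0-4, green 5-9, blue 10-14.
   Bit 15 is left clear here. */
static inline uint16_t rgb555(int r, int g, int b)
{
    return (uint16_t)((r & 31) | ((g & 31) << 5) | ((b & 31) << 10));
}

/* Fill pixels [x0, x1) of one scanline, interpolating the colour linearly
   from (r0,g0,b0) towards (r1,g1,b1). Channels are 0..31 and are carried in
   16.16 fixed point, so the inner loop is three adds and a pack. */
void gouraud_span(uint16_t *line, int x0, int x1,
                  int r0, int g0, int b0,
                  int r1, int g1, int b1)
{
    int len = x1 - x0;
    if (len <= 0)
        return;

    int32_t r = r0 * 65536, g = g0 * 65536, b = b0 * 65536;

    /* Per-pixel deltas: one divide per channel per span, none per pixel. */
    int32_t dr = (r1 - r0) * 65536 / len;
    int32_t dg = (g1 - g0) * 65536 / len;
    int32_t db = (b1 - b0) * 65536 / len;

    for (int x = x0; x < x1; x++) {
        line[x] = rgb555(r >> 16, g >> 16, b >> 16);
        r += dr;
        g += dg;
        b += db;
    }
}
```

For example, `gouraud_span(line, 0, 4, 0, 0, 0, 31, 0, 0)` writes red values 0, 7, 15 and 23: because x1 is exclusive the span stops one step short of the end colour, which lets adjacent spans share an edge without overdraw.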
For a VRAM scanline copy, can I set the scanline address that I want to block copy the video buffer to? This would mean that I could ignore the vertical refresh and not have to sync to it. Is this possible? (Please let me know if the question doesn't make sense.)
Thanks for taking the time to read this; I look forward to your replies.
regards
Leo
#161074 - kusma - Sun Jul 27, 2008 10:14 pm
Atreides wrote: |
Firstly, I am curious about the 3D core. The quality of the renderer is poor and I don't want to use it, but is it a separate chip, or is it firmware that utilises the ARM9?
|
Separate chip / gates. It does not affect the ARM9's performance. But I must say - not using the GPU ties up a LOT of CPU cycles that could be spared.
Quote: |
Could I fill the data structs for the 3D core directly and force a render flush more than once per frame ?
|
No.
Quote: |
Is it possible to read back 3D transformation data for vectors and matrices from the 3d core ?
|
Yes, to a certain degree. Check GBATek for details.
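When reading transformed results back (GBATek describes the position test for this), it helps to have a CPU-side reference to compare against. A minimal sketch, assuming 20.12 fixed-point values (4096 == 1.0) and a row vector [x y z 1] multiplied by a row-major matrix; these formats and the row/column convention are assumptions to check against GBATek, and `fx12_transform` is a hypothetical name:

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fx12;     /* 20.12 fixed point: 4096 == 1.0 */
#define FX12_ONE 4096

/* out = [x y z 1] * m, with m stored as m[row][col].
   64-bit intermediates avoid overflow before shifting back to 20.12. */
void fx12_transform(const fx12 m[4][4], fx12 x, fx12 y, fx12 z, fx12 out[4])
{
    for (int c = 0; c < 4; c++) {
        int64_t acc = (int64_t)x * m[0][c]
                    + (int64_t)y * m[1][c]
                    + (int64_t)z * m[2][c]
                    + (int64_t)FX12_ONE * m[3][c];
        out[c] = (fx12)(acc >> 12);
    }
}
```

With an identity matrix whose last row holds a translation of 3.0 on X, the input (1.0, 2.0, 0.0) comes back as (4.0, 2.0, 0.0, 1.0), i.e. `out[0] == 4 * 4096`.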
#161075 - silent_code - Sun Jul 27, 2008 10:38 pm
Saying that the quality of the images generated by the 3D hardware is poor is a bit over the top. I actually find the quality very good for the hardware's specs. It's all up to the developers to use it properly.
How come you think that way about the 3D hardware?
In addition to kusma's post:
The hardware is "sealed off", controllable through a command FIFO and a few mode and status registers. It is not part of the CPU-accessible bus.
You can write directly to VRAM (to offscreen lines), if you want. I read that memory is very fast.
But IIRC, VRAM can only be written to during VBlank.
Flushing 3D means waiting until the end of the frame (only for 3D rendering; everything else, e.g. the ARM program, will still run).
I don't think "not synchronizing" is doable (see above), and it is definitely not recommended, because the console (like nearly all consoles) is BASED around the 60 Hz display rate.
You also have to remember that features like anti-aliasing and fog (and the others as well) come at no cost when using the hardware. You merely switch them on or off, and the performance impact is always 0. (There are other implications, but I am referring to the rendering side alone.)
Anyways, I share your enthusiasm for software rendering and I am definitely interested in seeing what you will achieve. :^)
And welcome to the forum! :^D
Good luck!
#161079 - DensitY - Sun Jul 27, 2008 11:17 pm
In May I wrote a software model renderer (affine texture mapper, Z-buffered, backface culled, one directional light), and let me tell you, the ARM9 isn't overly fast for the task. Generally, the quickest I could get my test model (240 triangles) to render was about 4 ms. Transformation and projection to 2D screen space was pretty quick, under 1 ms, but the actual texture mapping process (building your texture-mapping gradients for the S,T coords) was not so quick.
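The gradient-building step described above is the classic affine mapper: compute the S,T deltas once per span, then step them linearly across it. A minimal sketch of that inner loop, assuming a hypothetical 64x64 RGB555 texture with wrap-around addressing and 16.16 fixed-point texel coordinates; `affine_span` and the `TEX_*` constants are illustrative, not DensitY's actual code:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical texture layout: square, power-of-two, row-major. */
#define TEX_SHIFT 6
#define TEX_SIZE  (1 << TEX_SHIFT)
#define TEX_MASK  (TEX_SIZE - 1)

/* Draw pixels [x0, x1) with affine (non-perspective) texture mapping.
   (s0,t0) and (s1,t1) are 16.16 texel coordinates at x0 and x1. The
   gradients are computed once, so the inner loop is two adds and a fetch. */
void affine_span(uint16_t *line, int x0, int x1,
                 int32_t s0, int32_t t0, int32_t s1, int32_t t1,
                 const uint16_t *tex)
{
    int len = x1 - x0;
    if (len <= 0)
        return;

    int32_t ds = (s1 - s0) / len;
    int32_t dt = (t1 - t0) / len;
    int32_t s = s0, t = t0;

    for (int x = x0; x < x1; x++) {
        int u = (s >> 16) & TEX_MASK;   /* wrap (repeat) horizontally */
        int v = (t >> 16) & TEX_MASK;   /* wrap (repeat) vertically */
        line[x] = tex[(v << TEX_SHIFT) | u];
        s += ds;
        t += dt;
    }
}
```

Affine mapping ignores perspective, so textures visibly swim on large polygons close to the camera; that is the trade-off for avoiding a per-pixel divide.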
I didn't go all out, however. While I initially had all my code software-based, I did end up using the DS's 3D hardware for matrix-by-vertex multiplication and the DS's divider hardware to do division. However, I didn't exploit the fact that the hardware runs asynchronously to the ARM9 (i.e. I could have moved on to transforming the next vertex instead of spending 15 cycles waiting for the current vertex to finish its matrix-by-vertex multiplication).
Considering the native screen size of the DS, the DS's 3D hardware is pretty good. The lack of additive blending is annoying, but it's reasonably good and considerably faster than a software renderer on the ARM9.
You could possibly do a Doom 1/2 style rendering system and get some reasonable frame rates, i.e. a fixed Z coord per column and a floodfill-like span renderer for floor/ceiling gaps, allowing you to do a quick DDA inner rendering loop. But yeah, polygonal rendering is best left to the hardware.
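In a Doom-style renderer each wall column has a single depth, so the texture coordinate advances by a constant amount per screen pixel and the inner loop is a plain DDA with no per-pixel divide. A minimal sketch, assuming a 256x192 RGB555 framebuffer, a power-of-two texture height and 16.16 fixed point; `draw_wall_column` and `wall_v_step` are hypothetical names:

```c
#include <assert.h>
#include <stdint.h>

#define SCREEN_W 256   /* DS screen width in pixels */
#define SCREEN_H 192   /* DS screen height in pixels */

/* Texture step per screen pixel (16.16) for a column that stretches
   tex_h texels over col_h screen pixels. */
int32_t wall_v_step(int tex_h, int col_h)
{
    assert(col_h > 0);
    return (int32_t)(((int64_t)tex_h << 16) / col_h);
}

/* Draw rows [y_top, y_bot) of screen column x. Depth is constant along the
   column, so v advances by the same v_step every pixel (a linear DDA).
   v0 lets a clipped column start part-way into the texture. */
void draw_wall_column(uint16_t *fb, int x, int y_top, int y_bot,
                      const uint16_t *tex_col, int tex_h,
                      int32_t v0, int32_t v_step)
{
    assert(y_top >= 0 && y_bot <= SCREEN_H && x >= 0 && x < SCREEN_W);
    assert(tex_h > 0 && (tex_h & (tex_h - 1)) == 0);   /* power of two */

    uint16_t *dst = fb + y_top * SCREEN_W + x;
    int32_t v = v0;
    for (int y = y_top; y < y_bot; y++) {
        *dst = tex_col[(v >> 16) & (tex_h - 1)];
        dst += SCREEN_W;   /* next row, same column */
        v += v_step;
    }
}
```

Only the per-column setup (projected height, `v_step`) needs a divide, which is the kind of work that could go to the DS divider hardware mentioned above.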
Quote: |
For a VRAM scanline copy, can I set the scanline address that I want to block copy the video buffer to? This would mean that I could ignore the vertical refresh and not have to sync to it. Is this possible? (Please let me know if the question doesn't make sense.) |
Like DOS, where you could render to 0xA000 (the PC mode 13h VRAM segment), on the DS you set a screen to framebuffer mode and then just write to the VRAM_A memory location. However, unless you write to that memory bank after the VBlank interrupt, you'll get flickering. Your best bet there is to simply double buffer: render to a back buffer in RAM (or DTCM if you want most of your renderer to sit in ITCM memory), then copy that over to VRAM in the VBlank interrupt to avoid flickering.
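The double-buffering scheme can be sketched on the host to show the order of operations. This is a minimal sketch under stated assumptions: `front` stands in for the displayed VRAM bank and `wait_for_vblank` is a hypothetical stub (on hardware you would likely use libnds's `swiWaitForVBlank()` and a DMA copy instead). Note that a full 256x192 16-bit frame is 98,304 bytes, so the whole back buffer has to live in main RAM; the 16 KB DTCM can only hold part of it, such as a band of scanlines. The `row` helper also answers the scanline-address question: in a linear framebuffer, scanline y starts at base + y * 256 pixels.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FB_W 256
#define FB_H 192   /* 256 * 192 * 2 bytes = 98,304 bytes (96 KB) per frame */

/* Host-side stand-ins (hypothetical): on hardware, `front` would be the
   VRAM bank being displayed and wait_for_vblank() would block until VBlank. */
static uint16_t front[FB_W * FB_H];
static uint16_t back[FB_W * FB_H];
static int vblanks_waited;

static void wait_for_vblank(void)
{
    vblanks_waited++;   /* stub: real code would wait for the interrupt */
}

/* Address of scanline y in a linear 16bpp framebuffer: base + y * width. */
static uint16_t *row(uint16_t *fb, int y)
{
    return fb + (size_t)y * FB_W;
}

/* Draw into the back buffer whenever you like... */
static void fill_rect(int x0, int y0, int x1, int y1, uint16_t colour)
{
    for (int y = y0; y < y1; y++) {
        uint16_t *p = row(back, y);
        for (int x = x0; x < x1; x++)
            p[x] = colour;
    }
}

/* ...and copy the finished frame to the displayed buffer only after VBlank
   has started, so a half-drawn frame is never scanned out. */
static void present(void)
{
    wait_for_vblank();
    memcpy(front, back, sizeof back);
}
```

A frame is then just `fill_rect(...)` (or your rasteriser) into `back`, followed by one `present()` call.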
#161082 - DekuTree64 - Sun Jul 27, 2008 11:58 pm
I've considered such a project myself in the past. It might not be too practical, but it would be loads of fun to write. And you do have a whole 33MHz ARM7 as well, so you could just run the main game code over there and use ARM9 as the "sub processor" :)
Texture filtering would be cool to see, and maybe some voxel terrain just because you can't mix it with hardware 3D easily. Also additive blending. And if you plan to use it in any demo competitions, throw in some bump mapping too.
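Additive blending is one effect the DS 3D hardware lacks, so it is an easy win for a software renderer. A minimal per-pixel sketch for RGB555, assuming the DS bit layout (red bits 0-4, green 5-9, blue 10-14); `add_rgb555` and `add_span` are hypothetical helpers:

```c
#include <assert.h>
#include <stdint.h>

/* Saturating additive blend of two RGB555 pixels: each channel is added
   separately and clamped to 31, so bright + bright stays white instead of
   wrapping around into another colour. */
static inline uint16_t add_rgb555(uint16_t dst, uint16_t src)
{
    int r = (dst & 31) + (src & 31);
    int g = ((dst >> 5) & 31) + ((src >> 5) & 31);
    int b = ((dst >> 10) & 31) + ((src >> 10) & 31);
    if (r > 31) r = 31;
    if (g > 31) g = 31;
    if (b > 31) b = 31;
    return (uint16_t)(r | (g << 5) | (b << 10));
}

/* Additively blend n source pixels onto the destination span in place. */
void add_span(uint16_t *dst, const uint16_t *src, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = add_rgb555(dst[i], src[i]);
}
```

The per-channel form is easy to verify; packed tricks that add all three channels in one operation exist, but they are harder to get right.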
You could use the 3D chip to do matrix multiplies and vertex transformations. The position test is a hardware matrix-vector multiply. You could probably do a nice 3-stage pipeline where the 3D chip transforms a vertex, the divider calculates the reciprocal of w, and the CPU multiplies by that reciprocal. Position test and division are independent of the CPU and each other, so all 3 stages would be running at once (on 3 different vertices).
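The three-stage idea is a software pipeline. A host-side sketch of just the scheduling, assuming 20.12 fixed-point vertices, a row-vector-times-matrix convention and a 16.16 reciprocal; the `stage_*` functions are hypothetical stand-ins for the position test, the divider and the CPU multiply, and here they run one after another rather than concurrently:

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int32_t x, y, z, w; } vec4;   /* 20.12 fixed point */
typedef struct { int32_t x, y; } vec2;         /* projected, 20.12 */

/* Stage 1 (position test on the DS): row vector times matrix. */
static vec4 stage_transform(const int32_t m[4][4], vec4 v)
{
    int32_t r[4];
    for (int c = 0; c < 4; c++) {
        int64_t acc = (int64_t)v.x * m[0][c] + (int64_t)v.y * m[1][c]
                    + (int64_t)v.z * m[2][c] + (int64_t)v.w * m[3][c];
        r[c] = (int32_t)(acc >> 12);
    }
    return (vec4){ r[0], r[1], r[2], r[3] };
}

/* Stage 2 (hardware divider on the DS): 1/w in 16.16.
   With w in 20.12, (1 << 28) / w == 65536 / w_real. */
static int32_t stage_recip(int32_t w)
{
    assert(w != 0);   /* caller must already have clipped against w > 0 */
    return (int32_t)(((int64_t)1 << 28) / w);
}

/* Stage 3 (CPU): multiply by the reciprocal; 20.12 * 16.16 >> 16 = 20.12. */
static vec2 stage_project(vec4 t, int32_t rw)
{
    vec2 p = { (int32_t)(((int64_t)t.x * rw) >> 16),
               (int32_t)(((int64_t)t.y * rw) >> 16) };
    return p;
}

/* Iteration i starts vertex i in stage 1, vertex i-1 in stage 2 and
   finishes vertex i-2 in stage 3. Stages run in reverse order so each one
   consumes its input before the earlier stage overwrites it. On the DS,
   stages 1 and 2 would be kicked off on their units and read back on the
   next iteration while the CPU does stage 3. */
void project_pipelined(const int32_t m[4][4], const vec4 *in, vec2 *out, int n)
{
    vec4 t1 = {0}, t2 = {0};   /* stage-1 and stage-2 results in flight */
    int32_t r2 = 0;            /* stage-2 reciprocal in flight */
    for (int i = 0; i < n + 2; i++) {
        if (i >= 2)
            out[i - 2] = stage_project(t2, r2);
        if (i >= 1 && i <= n) {
            r2 = stage_recip(t1.w);
            t2 = t1;
        }
        if (i < n)
            t1 = stage_transform(m, in[i]);
    }
}
```

The loop runs n + 2 times: two extra iterations drain the pipeline so the last two vertices still pass through stages 2 and 3.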
I'm not sure exactly what you mean by an "incremental scanline renderer" though? Maybe a span buffer renderer? That could probably reach higher speeds than a double buffered renderer with Z-buffer, but it also makes alpha blending harder, and you really have to keep your per-span info small for it to be effective, which makes it less flexible.
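As a rough illustration of the span-buffer idea, here is its core operation in the simplest form: each scanline keeps a sorted list of already-covered intervals, and a new span only produces work for the pixels not yet covered, so no Z-buffer is read or written. A minimal sketch, assuming opaque polygons submitted front to back and a fixed per-line limit; `span_insert` and `MAX_SPANS` are hypothetical. It also hints at why alpha blending is awkward here: a translucent span cannot simply mark its pixels as covered.

```c
#include <assert.h>

#define MAX_SPANS 64   /* hypothetical per-scanline limit */

typedef struct { int x0, x1; } span;                     /* [x0, x1) */
typedef struct { span s[MAX_SPANS]; int n; } span_line;  /* sorted, disjoint */

/* Insert [x0, x1) into one scanline's coverage list. The uncovered parts
   are written to `visible` (line->n + 1 entries is always enough) and their
   count is returned; the caller shades only those pixels. */
int span_insert(span_line *line, int x0, int x1, span *visible)
{
    assert(x0 < x1 && line->n < MAX_SPANS);

    /* 1. Emit the gaps between existing spans that fall inside [x0, x1). */
    int nvis = 0, cur = x0;
    for (int i = 0; i < line->n && cur < x1; i++) {
        const span *c = &line->s[i];
        if (c->x1 <= cur)
            continue;                  /* entirely left of what remains */
        if (c->x0 >= x1)
            break;                     /* entirely right of the new span */
        if (c->x0 > cur)
            visible[nvis++] = (span){ cur, c->x0 };
        cur = c->x1;                   /* skip over the covered part */
    }
    if (cur < x1)
        visible[nvis++] = (span){ cur, x1 };

    /* 2. Merge [x0, x1) with every span it overlaps or touches. */
    span out[MAX_SPANS];
    int n = 0, i = 0;
    while (i < line->n && line->s[i].x1 < x0)
        out[n++] = line->s[i++];
    span merged = { x0, x1 };
    while (i < line->n && line->s[i].x0 <= x1) {
        if (line->s[i].x0 < merged.x0) merged.x0 = line->s[i].x0;
        if (line->s[i].x1 > merged.x1) merged.x1 = line->s[i].x1;
        i++;
    }
    out[n++] = merged;
    while (i < line->n)
        out[n++] = line->s[i++];

    for (int k = 0; k < n; k++)
        line->s[k] = out[k];
    line->n = n;
    return nvis;
}
```

Merging touching spans keeps the per-line list short, which matters for the reason given above: the smaller the per-span bookkeeping, the more the saved fill rate pays off.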
DTCM (16KB) is your best bet for small pieces of data that need a lot of reading and writing. The 2 16KB shared work RAM banks are good too. They're slower than DTCM, but still very fast. Or you can use them to transfer data back and forth between processors, or just give them to ARM7 so it has more memory that won't interfere with ARM9.
#161084 - DensitY - Mon Jul 28, 2008 12:05 am
DekuTree64 wrote: |
You could use the 3D chip to do matrix multiplies and vertex transformations. The position test is a hardware matrix-vector multiply. You could probably do a nice 3-stage pipeline where the 3D chip transforms a vertex, the divider calculates the reciprocal of w, and the CPU multiplies by that reciprocal. Position test and division are independent of the CPU and each other, so all 3 stages would be running at once (on 3 different vertices). |
I actually wrote some matrix * vertex transformation code and posted it here: http://forum.gbadev.org/viewtopic.php?p=157517&highlight=#157517
It could be greatly improved by implementing it in stages, as you said; possibly something he could look into.
Quote: |
DTCM (16KB) is your best bet for small pieces of data that need a lot of reading and writing. The 2 16KB shared work RAM banks are good too. They're slower than DTCM, but still very fast. Or you can use them to transfer data back and forth between processors, or just give them to ARM7 so it has more memory that won't interfere with ARM9. |
This. (Although in my implementation I had data/functions statically stored in DTCM/ITCM, dynamic data storage is best.)
#161151 - TwentySeven - Tue Jul 29, 2008 12:36 pm
ITCM is the same speed right?
#161175 - DensitY - Tue Jul 29, 2008 7:43 pm
ITCM (instruction tightly coupled memory) and DTCM (data tightly coupled memory) run at 66 MHz and sit right next to the ARM9. The shared work RAM banks, as far as I remember, sit on the 33 MHz bus, so shifting data to that RAM is slow, but the data is accessible by both processors. I don't actually remember the speed of the work RAM; I assume it is slower, especially when it comes to communication between it and the ARM9's internal caches.