gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback Machine copies. A new forum can be found here.

DS development > glRestoreMatrix - Am I understanding this correctly?

#139182 - DiscoStew - Mon Sep 03, 2007 6:32 am

Currently in my code, I have some functions that have been relying on pushing and popping the matrix, with matrix manipulations in between, like so....

Code:

*** loop start ***

glPushMatrix();
*** do calculations ***
glPopMatrix(1);

*** loop end ***


But, I've been looking into optimizing my overall code for displaying 3D models, as the push takes 17 cycles and the pop takes 36 cycles (in 33 MHz units). While digging through the spec sheet, I came across glRestoreMatrix, which, if I understand it correctly, copies a matrix from the stack onto the current matrix. So, by moving some code around, I came up with this...

Code:

glPushMatrix();
LastMatrix = ((GFX_STATUS >> 8) & 0x001F) - 1;

*** loop start ***

glRestoreMatrix( LastMatrix );
*** do calculations ***

*** loop end ***

glPopMatrix(1);



On No$GBA, it works out just fine, but on hardware, everything is screwed up. Of course, emulation of the DS isn't perfect, and I don't know if this specific function is even implemented correctly in the emulator in the first place (it would be funny if a function not working correctly actually made things look correct). Is there something I'm missing?
_________________
DS - It's all about DiscoStew

#139183 - DekuTree64 - Mon Sep 03, 2007 6:48 am

Yeah, restore matrix is a lot like a pop, except that you specify the stack location, and it doesn't modify the "stack pointer".

Maybe you need to wait a bit for the push command to execute before reading the current stack location.
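
Untested, but something along these lines might do it (bit 27 of GFX_STATUS is the geometry engine busy flag, per GBATEK):

Code:

glPushMatrix();

// wait until the geometry engine has actually executed the push
// before reading the stack level back out of GFX_STATUS
while (GFX_STATUS & (1 << 27));   // bit 27 = geometry engine busy

LastMatrix = ((GFX_STATUS >> 8) & 0x001F) - 1;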
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku

#139184 - DiscoStew - Mon Sep 03, 2007 7:38 am

Thanks, that did it, though I was sure I had tried that before and it still ended up with the same corruption I've been having. Well anyways, thanks again.
_________________
DS - It's all about DiscoStew

#139194 - kusma - Mon Sep 03, 2007 10:40 am

DiscoStew wrote:

But, I've been looking into optimizing my overall code for displaying 3D models, as the push takes 17 cycles and the pop takes 36 cycles (in 33 MHz units).


Wait... It's not that I don't think it's weird that the push and the pop have such different timings, but... Are you SURE you need to optimize something that is a per-draw-batch operation and yet less than 50 cycles? It sounds to me like you're wasting your time on optimizations that won't give any noticeable performance boost...

#139225 - DiscoStew - Mon Sep 03, 2007 6:42 pm

kusma wrote:
DiscoStew wrote:

But, I've been looking into optimizing my overall code for displaying 3D models, as the push takes 17 cycles and the pop takes 36 cycles (in 33 MHz units).


Wait... It's not that I don't think it's weird that the push and the pop have such different timings, but... Are you SURE you need to optimize something that is a per-draw-batch operation and yet less than 50 cycles? It sounds to me like you're wasting your time on optimizations that won't give any noticeable performance boost...


I guess I'm just a sucker for seeing something that could be optimized, even if it only gives a little improvement, though it looked like I could get quite a bit from it. The push/pop were previously used for calculating vertices down a chain of bones, and I've been working with five 300-tri models, all smooth-skinned with normals calculated every frame. Results from hardware look like it's all happening at ~60 FPS (that's everything involved in calculating the models and getting them rendered), though I have no code to prove it is running at that speed, and the profiling code I have wasn't working until I added some code to get "something" out of it.

I figured that if I'm running this loop code around 900 times total per model (for my 300-tri models), saving even 17 cycles per loop would have helped.
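
Rough math, assuming those cycle figures count against the ~33.5 MHz geometry clock: 5 models x 900 loops x 17 cycles is about 76,500 cycles, out of roughly 560,000 geometry cycles per 60 FPS frame, so somewhere around 13-14% of the frame.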
_________________
DS - It's all about DiscoStew

#139234 - kusma - Mon Sep 03, 2007 8:40 pm

DiscoStew wrote:
I figured that if I'm running this loop code around 900 times total per model (for my 300-tri models), saving even 17 cycles per loop would have helped.

Well, you're optimizing at the wrong end. Doing matrix-matrix multiplications per vertex is not a very good idea ;)

#139236 - DiscoStew - Mon Sep 03, 2007 8:59 pm

kusma wrote:
DiscoStew wrote:
I figured that if I'm running this loop code around 900 times total per model (for my 300-tri models), saving even 17 cycles per loop would have helped.

Well, you're optimizing at the wrong end. Doing matrix-matrix multiplications per vertex is not a very good idea ;)


It's how my format is set up, unfortunately. This is how it goes...

Basically, every bone has a list of vertices that correspond to it, each with a weight and a difference translation from the bone to the vertex when the bone is in its BindOnPose position. Once the bone's BindOnPose matrix and user-defined matrix have been multiplied into the current matrix, I run through the loop: get the vertex index, weight, and difference translation for that vertex, do the translation, grab the resulting translation, multiply it by the weight, and store the value, where it can later be added to by other bones affecting the same vertex. The push/pop (or the newly altered version with restore) is there to keep the unaltered matrix available for each vertex.
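
In rough C terms (libnds types; just a sketch of the data flow, not my actual code, and TransformByCurrentMatrix is a stand-in for however the M * offset result actually gets computed and read back), each bone's inner loop amounts to:

Code:

// one influence per entry: output vertex index, weight (.12 fixed point),
// and the vertex's position relative to this bone in its BindOnPose pose
typedef struct { u16 index; s32 weight; s32 ofs[3]; } BoneInfluence;

void TransformByCurrentMatrix(const s32 ofs[3], s32 out[3]); // placeholder for the hardware path

// call once per bone, after its combined matrix has been made current
void AccumulateBone(const BoneInfluence *inf, int count, s32 outVerts[][3])
{
    for (int i = 0; i < count; i++)
    {
        s32 p[3];
        TransformByCurrentMatrix(inf[i].ofs, p);  // p = M * ofs

        for (int c = 0; c < 3; c++)
            outVerts[inf[i].index][c] += (p[c] * inf[i].weight) >> 12;
    }
}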

Now, this was all thought up by observing what Maya was doing, so I doubt Maya actually does it this way internally. I don't have any real knowledge about 3D beyond what I'm learning along the way while programming this, at least as far as how models are constructed and processed. I'm sure there are faster ways of doing this, but between hardware matrix multiplies and doing them manually, I chose the hardware route because it does the math faster.
_________________
DS - It's all about DiscoStew

#139249 - kusma - Mon Sep 03, 2007 11:23 pm

DiscoStew wrote:

Code:

*** loop start ***

glPushMatrix();
*** do calculations ***
glPopMatrix(1);

*** loop end ***



Having read your explanation below, are you sure you're not supposed to do something like:

Code:

glPushMatrix();

*** loop start ***

*** do calculations ***

*** loop end ***
glPopMatrix(1);


instead of pushing/popping the matrix per vertex? Skinning shouldn't require any per-vertex matrix operations, and if you've already arranged your data processing to have the per-bone vertices as the inner loop, you should have a pretty good data flow.

#139250 - kusma - Mon Sep 03, 2007 11:30 pm

Ideally, this is what a skinning-loop looks like:

Code:
for v in result:
   v = (0, 0, 0)

for b in bones:
   mat = b.getBoneMatrix()
   for v in b.vertices:
      result[v.index] += (mat * vertices[v.index]) * v.weight

#139256 - DiscoStew - Tue Sep 04, 2007 12:41 am

I probably should have also mentioned that the only place the finished bone matrices end up is the matrix stack, and that I've been trying to have everything involving matrices calculated by the hardware, as it is supposedly faster.

See, my code runs off a recursive function. It begins with the Identity matrix and then goes into the recursion, starting with the root bone, multiplying the current matrix by that bone's BindOnPose matrix and by the matrix created from user-defined data (such as rotation and scaling). Once that is done, I push that matrix, so the one below is preserved and can be retrieved when dealing with the vertices. Then the loop over all vertices associated with the bone runs, using the DS equivalent of glTranslate on the current matrix (the value being the translation of the vertex relative to the bone), multiplying the weight by the finalized values from the matrix, and storing them. After those are done, it loops through all the children of that bone and recurses into each one, doing everything for that bone, and so on until there are no children left to process, popping the matrix before exiting each level of the recursion.

Basically, the difference I see with how I'm doing it vs your method you showed, kusma, is that in yours, the bones are already calculated and stored for use, whereas I am calculating the bones alongside the vertices without actually storing the matrices other than in the matrix stack.

I'm just trying to use the hardware as best as I can, after finding out about using the hardware's division and sqrt functions alongside manual calculation (in my post about optimizing normals).

Am I making sense with my method?
_________________
DS - It's all about DiscoStew

#139259 - kusma - Tue Sep 04, 2007 1:00 am

DiscoStew wrote:
Basically, the difference I see with how I'm doing it vs your method you showed, kusma, is that in yours, the bones are already calculated and stored for use, whereas I am calculating the bones alongside the vertices without actually storing the matrices other than in the matrix stack.


The getBoneMatrix() method was intended to hide away exactly that. Anyway, if you traverse the bones in the order of the hierarchy in the mesh, the matrix pushing/popping should limit itself to one of each kind per bone.

something like this:

Code:
for v in result:
   v = (0, 0, 0)

traverse(n):
   for b in n.bones:
      pushMatrix()
      mat = multMatrix(b.getLocalMatrix())
      for v in b.vertices:
         result[v.index] += (mat * vertices[v.index]) * v.weight
      if (count(b.bones) > 0) traverse(b)
      popMatrix()

#139264 - DiscoStew - Tue Sep 04, 2007 2:30 am

kusma wrote:


something like this:

Code:
for v in result:
   v = (0, 0, 0)

traverse(n):
   for b in n.bones:
      pushMatrix()
      mat = multMatrix(b.getLocalMatrix())
      for v in b.vertices:
         result[v.index] += (mat * vertices[v.index]) * v.weight
      if (count(b.bones) > 0) traverse(b)
      popMatrix()


Now I'm just confused by your explanation, because what I see in your example is basically what I ended up with after making the glRestoreMatrix modification. The traverse() being the recursion, multiplying the vertices with the bone's finalized matrix, altering with the weight, etc.

The only real difference, if I understand it correctly, is how the actual vertex calculation is done, whether by hardware or software. My bone matrix is left in the hardware matrix stack, while yours is saved to the "mat" location. Mine is calculated within the hardware still, and yours outside.

I guess what I'm trying to get across is that the "mat" in your code is like my use of glRestoreMatrix, just without storing the matrix in another memory location, since the one I need to grab is right under the current matrix. With yours, however, you aren't reloading the matrix you need, as it is saved in "mat". Mine does require reloading, because the current matrix keeps getting altered after each vertex is processed, so I have to reload it from the matrix stack onto the current matrix with glRestoreMatrix.

Perhaps the question that needs asking is "is it better to do this via software than hardware?" Using the hardware requires more commands to be issued, but because it is supposedly faster at the actual calculations, wouldn't it at least be comparable overall to the software method?
_________________
DS - It's all about DiscoStew

#139296 - DiscoStew - Tue Sep 04, 2007 8:37 am

Well, I took a little bit of time to think about this, and I'm actually thinking that it is better to calculate the vertices via software for this reason....

Multiplying a 4x4 matrix with a 3D vector requires 9 multiplies, 9 additions, and either 3 or 9 shifts (multiply, add, shift... or multiply, shift, add). To get the translation values using the hardware, I have to...

1) Load the needed matrix with glRestoreMatrix (36 cycles)
2) Do the translation with glTranslate (22 + 30 cycles because of MODELVIEW mode)
3) Switch to PROJECTION mode (1 cycle)
4) Push that matrix (17 cycles)
5) Load the Identity matrix (19 cycles)
6) Retrieve the values from the clip array[12 to 14] (? cycles)
7) Pop that matrix (36 cycles)
8) And switch back to MODELVIEW mode (1 cycle)

Lots of stuff happening, and not counting the cycles needed to issue each of those commands, or to retrieve the translation values, that's about 162 cycles required by the hardware (at 33 MHz?). I get the feeling, though, that 9 multiplies, 9 additions, and even 9 shifts won't take up nearly that much when running from ITCM/DTCM (code and data). Thoughts?
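
As a sketch in code (function names roughly as libnds spells them, so adjust to whatever your version has; the CLIPMTX_RESULT address is from GBATEK):

Code:

// read-only clip matrix result registers, per GBATEK
static volatile s32 *const ClipMtx = (volatile s32 *)0x04000640;

// transform one bone-relative offset through the matrix parked in stack 'slot'
// and read the resulting translation back - the eight steps above
void HwTransformOffset(int slot, s32 ox, s32 oy, s32 oz, s32 out[3])
{
    glRestoreMatrix(slot);            // 1) reload the clean bone matrix
    glTranslate3f32(ox, oy, oz);      // 2) M * offset (whichever translate wrapper you use)

    glMatrixMode(GL_PROJECTION);      // 3) switch to the projection matrix
    glPushMatrix();                   // 4) save the real projection
    glLoadIdentity();                 // 5) identity, so clip matrix == position matrix

    while (GFX_STATUS & (1 << 27));   //    let the geometry engine finish first
    out[0] = ClipMtx[12];             // 6) translation column of the clip matrix
    out[1] = ClipMtx[13];
    out[2] = ClipMtx[14];

    glPopMatrix(1);                   // 7) put the projection matrix back
    glMatrixMode(GL_MODELVIEW);       // 8) and return to modelview
}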

Just my thought for the night, I'm dead tired and need sleep. Perhaps this week, I'll back up my project, and give it a face lift to test this out.
_________________
DS - It's all about DiscoStew

#139299 - DekuTree64 - Tue Sep 04, 2007 9:48 am

Why are you pushing and popping the projection matrix every time, rather than doing it outside the loop?

You could try using the "position test" registers. It's basically a hardware matrix multiply (though poorly named). Then you wouldn't need to do the matrix restore every time, since the translate would be gone.
Hey, you could even alternate, doing one vertex in software, while the hardware does another.
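
For reference, the position test boils down to this (register addresses and parameter packing per GBATEK; local defines, untested sketch):

Code:

#define POS_TEST_PORT    (*(volatile u32 *)0x040005C4)  // POS_TEST geometry command port
#define POS_TEST_RESULT  ((volatile s32 *)0x04000620)    // 20.12 x, y, z, w

// run one point through the current clip matrix without touching the matrix stack
void PositionTest(s16 x, s16 y, s16 z, s32 out[4])
{
    POS_TEST_PORT = ((u32)(u16)y << 16) | (u16)x;  // parameter 1: x, y (packed like a vertex)
    POS_TEST_PORT = (u16)z;                        // parameter 2: z

    while (GFX_STATUS & 1);                        // bit 0 = box/pos/vec test busy

    for (int i = 0; i < 4; i++)
        out[i] = POS_TEST_RESULT[i];
}

The point goes through PositionMatrix*ProjectionMatrix, so the identity-projection trick still applies if you only want the position matrix's result.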

But actually, I think you could get the most speed out of doing it entirely in software, using the ARM9 32x16 bit smla instruction (single cycle multiply-shift down-add, very awesome).
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku

#139318 - DiscoStew - Tue Sep 04, 2007 5:33 pm

DekuTree64 wrote:
Why are you pushing and popping the projection matrix every time, rather than doing it outside the loop?

You could try using the "position test" registers. It's basically a hardware matrix multiply (though poorly named). Then you wouldn't need to do the matrix restore every time, since the translate would be gone.
Hey, you could even alternate, doing one vertex in software, while the hardware does another.

But actually, I think you could get the most speed out of doing it entirely in software, using the ARM9 32x16 bit smla instruction (single cycle multiply-shift down-add, very awesome).


As far as the push/pop with the projection matrix goes, it was because that is supposedly part of the method for obtaining the Position Matrix, and without it, everything screws up....

GBATEK wrote:
The Clip Matrix is internally re-calculated anytime when changing the Position or Projection matrices: ClipMatrix=PositionMatrix*ProjectionMatrix, this matrix is internally used to convert vertices to screen coordinates.
To read only the Position Matrix, or only the Projection Matrix: Use Load Identity on the OTHER matrix, so the ClipMatrix becomes equal to the DESIRED matrix (multiplied by the Identity Matrix, which has no effect on the result).


I even tried removing them just for fun, and I ended up getting nothing on the screen. However, I am still learning how to work with the hardware, and gave no thought to using the "position test" method, as I was too focused on what I currently had. For now, I'm just gonna edit my code to use the position test and see how well it performs, then later try the software method.


EDIT:

I just used the position test, and while I'm glad such a function exists, I ended up having to negate the results so everything looked like it did prior to implementing it (using the function flipped the model along all axes, still keeping it formed, but mirrored in every direction, and it also screwed up the lighting).

The position test was simple to implement in the current code, which sped things up quite nicely, but for the software try, I'm planning on altering a bit more code to accommodate it. Most likely I'll change how and where the bone matrices are done.
_________________
DS - It's all about DiscoStew

#139531 - DiscoStew - Fri Sep 07, 2007 4:51 am

After some grueling hours trying to figure out the problem with the software implementation (not the vertices, but my new bone method), I finally got it working. Although it is slightly slower than my previous method (which I'll ask about optimizing further down), I gained a few nice features.

I'm still using the matrix hardware for calculating the bones, but before, I was limited by the number of available slots on the matrix stack. Now, only a max of 2 slots are needed, allowing an almost unlimited number of bones chained together (though I doubt there will be a need on the DS).
The bones now get stored for calculating vertices via software, but what could be a great advantage is that the entire list of bones won't have to be updated each time, just the bone chains that actually change.


DekuTree64:

About the smla instruction: my project isn't using any assembly code, and I'm sure that when it is compiled, the compiler isn't optimizing the vertex code to use the smla instruction.

Ex. of my current vertex code
Code:
VertCalc = (mat[0] * vert[0]) >> 12;
VertCalc += (mat[4] * vert[1]) >> 12;
VertCalc += (mat[8] * vert[2]) >> 12;
VertCalc += mat[12];
VertStore[VertIndex] += (VertCalc * VertWeight) >> 12;


Think you could help me out with implementing your idea?
_________________
DS - It's all about DiscoStew

#139538 - DekuTree64 - Fri Sep 07, 2007 10:10 am

DiscoStew wrote:
About the smla instruction. My project isn't using any assembly code, and I'm sure that when it is compiled, the compiler isn't optimizing the vertex code to use smla instruction.

Yeah, you'll have to do the vector x matrix multiply in assembly to use those instructions. The ones you'll be using are smulwx/smlawx, where x is either t or b (multiply by the top or bottom 16 bits of the third argument register). So it's basically a 32x16 bit signed multiply, which gives a 48-bit result. The bottom 16 bits are discarded, and the upper 32 are written to the result register, effectively shifting down 16 bits.
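
In C terms, smlawb Rd, Rm, Rs, Rn works out to roughly:

Code:

Rd = Rn + (s32)(((s64)Rm * (s16)Rs) >> 16);   // smulwb is the same without the "+ Rn"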

Because of the 16 bit shift, the matrix will need to be in 16.16 format. If you're using the hardware to do matrix multiplies, then it would probably be best to copy the matrix to RAM, and shift up 4 bits while you're at it. For that matter, re-order the elements of the matrix to make it faster too. In your example you're grabbing matrix elements 0, 4, 8, and 12. Change it so those are 0, 1, 2, and 3. I guess this is just transposing it.
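
The copy/convert step might look something like this (a sketch; it assumes the source matrix is indexed the way your C code does it, i.e. elements 0/4/8/12 feed the x result, and that it's in the hardware's 20.12 format):

Code:

// copy a 20.12 bone matrix into a transposed 16.16 copy in RAM, so that each
// output row (x/y/z coefficients plus translation) is contiguous for ldmia
void PrepBoneMatrix(const s32 *src, s32 *dst)
{
    for (int row = 0; row < 3; row++)
    {
        dst[row * 4 + 0] = src[0  + row] << 4;  // x coefficient
        dst[row * 4 + 1] = src[4  + row] << 4;  // y coefficient
        dst[row * 4 + 2] = src[8  + row] << 4;  // z coefficient
        dst[row * 4 + 3] = src[12 + row] << 4;  // translation, now 16.16
    }
}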

Then the actual multiply function might look something like this:
Code:
@ void BoneVertexMultiply(const s16 *vertex, const s32 *matrix, s32 weight, s16 *dest)
@ Multiply vertex by matrix, multiply by weight, and add into dest vertex.
@ Vertex is 4.12 fixed point.
@ Matrix is 16.16 fixed point.
@ Weight is 16.16 fixed point.
@ Dest is 4.12 fixed point.

#define rVertex   r0
#define rMatrix   r1
#define rWeight   r2
#define rDest     r3
#define rvx       r4 @ x of source vertex
#define rvy       r5 @ y of source vertex
#define rvz       r6 @ z of source vertex
#define rmx       r7 @ Current matrix row's x
#define rmy       r8 @ Current matrix row's y
#define rmz       r9 @ Current matrix row's z
#define rmTrans   r10 @ Current matrix row's translation
#define rDestVal  r11 @ Temp register, to accumulate to dest
@ r0 is free to use after loading the vertex at the start
#define rResult   r0

.global BoneVertexMultiply
.arm
.align 2
BoneVertexMultiply:
stmfd sp!, {r4-r11, lr}
ldrsh rvx, [rVertex]
ldrsh rvy, [rVertex, #2]
ldrsh rvz, [rVertex, #4]

@ Load x axis of the matrix, and dot product the vertex with it. Add the translation, multiply by the weight, and add into the destination vector's x component.
ldmia rMatrix!, {rmx, rmy, rmz, rmTrans}
ldrsh rDestVal, [rDest]
smlawb rResult, rmx, rvx, rmTrans
smlawb rResult, rmy, rvy, rResult
smlawb rResult, rmz, rvz, rResult
smlawb rResult, rWeight, result, rDestVal
strh rResult, [rDest]!

@ Same for y
ldmia rMatrix!, {rmx, rmy, rmz, rmTrans}
ldrsh rDestVal, [rDest]
smlawb rResult, rmx, rvx, rmTrans
smlawb rResult, rmy, rvy, rResult
smlawb rResult, rmz, rvz, rResult
smlawb rResult, rWeight, result, rDestVal
strh rResult, [rDest]!

@ Same for z
ldmia rMatrix!, {rmx, rmy, rmz, rmTrans}
ldrsh rDestVal, [rDest]
smlawb rResult, rmx, rvx, rmTrans
smlawb rResult, rmy, rvy, rResult
smlawb rResult, rmz, rvz, rResult
smlawb rResult, rWeight, result, rDestVal
strh rResult, [rDest]!

ldmfd sp!, {r4-r11, pc}


Hopefully that's not too scary looking... it's really not that much though, just a bunch of comments/register name #defines, and an unrolled loop.
Also, I just wrote it right here in the post window, so who knows what it actually does. But it's time for me to sleep.
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku

#139540 - kusma - Fri Sep 07, 2007 10:23 am

DekuTree64 wrote:

smlawb rResult, rmx, rvx, rmTrans
smlawb rResult, rmy, rvy, rResult
smlawb rResult, rmz, rvz, rResult
smlawb rResult, rWeight, result, rDestVal
strh rResult, [rDest]!

...and there you lost the single-cycle beauty of SMLAWx; it's only single-cycle if you don't use the result in the next instruction :)

#139682 - DiscoStew - Sat Sep 08, 2007 9:20 pm

I just gave the assembly route a try, and it's coming up with a bunch of errors: bad arguments, bad instruction, selected processor does not support, and missing `}'.

EDIT: Only the "selected processor does not support" error is left, as the rest were caused by the comments being on the same line as the defines.

Code:

c:/My_Projects/DS_Projects/Test_Project/arm9/src/BoneVertMult_asm.s(37): Error: selected processor does not support `smlawb r0,r7,r4,r10'
c:/My_Projects/DS_Projects/Test_Project/arm9/src/BoneVertMult_asm.s(38): Error: selected processor does not support `smlawb r0,r8,r5,r0'
c:/My_Projects/DS_Projects/Test_Project/arm9/src/BoneVertMult_asm.s(39): Error: selected processor does not support `smlawb r0,r9,r6,r0'
c:/My_Projects/DS_Projects/Test_Project/arm9/src/BoneVertMult_asm.s(40): Error: selected processor does not support `smlawb r0,r2,result,r11'
c:/My_Projects/DS_Projects/Test_Project/arm9/src/BoneVertMult_asm.s(46): Error: selected processor does not support `smlawb r0,r7,r4,r10'
c:/My_Projects/DS_Projects/Test_Project/arm9/src/BoneVertMult_asm.s(47): Error: selected processor does not support `smlawb r0,r8,r5,r0'
c:/My_Projects/DS_Projects/Test_Project/arm9/src/BoneVertMult_asm.s(48): Error: selected processor does not support `smlawb r0,r9,r6,r0'
c:/My_Projects/DS_Projects/Test_Project/arm9/src/BoneVertMult_asm.s(49): Error: selected processor does not support `smlawb r0,r2,result,r11'
c:/My_Projects/DS_Projects/Test_Project/arm9/src/BoneVertMult_asm.s(55): Error: selected processor does not support `smlawb r0,r7,r4,r10'
c:/My_Projects/DS_Projects/Test_Project/arm9/src/BoneVertMult_asm.s(56): Error: selected processor does not support `smlawb r0,r8,r5,r0'
c:/My_Projects/DS_Projects/Test_Project/arm9/src/BoneVertMult_asm.s(57): Error: selected processor does not support `smlawb r0,r9,r6,r0'
c:/My_Projects/DS_Projects/Test_Project/arm9/src/BoneVertMult_asm.s(58): Error: selected processor does not support `smlawb r0,r2,result,r11'



All that's left is the actual meat of the code, involving 'smlawb'. It says that the selected processor does not support that instruction. Do I need to make a change in my makefile for that?
_________________
DS - It's all about DiscoStew