gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback machine copies. A new forum can be found here.

ASM > Optimization ideas, mapping from 320x200 to 512x256 screen

#171165 - Pate - Wed Nov 04, 2009 6:04 am

Hi!

In the emulator I am working on, I need to map screen accesses from a 320x200 source screen to NDS VRAM organized as 512x256 pixels. The code I currently use is OK, but I don't think it is as optimal as it could be, so I thought, why not ask you gurus for help? Do you have ideas for optimizing this screen coordinate translation?

My current code is below. I especially don't like the extra lsl r1, #16, whose only purpose is to clear the high halfword of r1. Can I get rid of that using eor or some such?

The high halfword of the input "idx" register contains the offset into the 320x200x8bit screen, and the macro should return the corresponding VRAM address in r2.

Thanks!

Code:

.macro r2_MCGA_from_idx idx
   ldr      r1, =y320to512tbl
   mov      r0, \idx, lsr #(16+6)   @ r0 = idx/64 = table index
   ldr      r1,[r1, r0, lsl #2]    @ r1 high = 320*Y, r1 low = Y
   mov      r0, \idx, lsr #16
   sub      r0, r1, lsr #16         @ Now r0 = X coordinate on this screen row
   lsl      r1, #16
   add      r0, r1, lsr #(16-9)      @ Now r0 = X+512*Y = Screen address
   add      r2, r0, #0x06000000
.endm

   .data
   .align 2

.macro y320to512x1 val
   .word   ((320*(\val))<<16)+(\val), ((320*(\val))<<16)+(\val), ((320*(\val))<<16)+(\val), ((320*(\val))<<16)+(\val), ((320*(\val))<<16)+(\val)
.endm
.macro y320to512x10 val
   y320to512x1 \val
   y320to512x1 \val+1
   y320to512x1 \val+2
   y320to512x1 \val+3
   y320to512x1 \val+4
   y320to512x1 \val+5
   y320to512x1 \val+6
   y320to512x1 \val+7
   y320to512x1 \val+8
   y320to512x1 \val+9
.endm

y320to512tbl:
   y320to512x10 0
   y320to512x10 10
   y320to512x10 20
   y320to512x10 30
   y320to512x10 40
   y320to512x10 50
   y320to512x10 60
   y320to512x10 70
   y320to512x10 80
   y320to512x10 90
   y320to512x10 100
   y320to512x10 110
   y320to512x10 120
   y320to512x10 130
   y320to512x10 140
   y320to512x10 150
   y320to512x10 160
   y320to512x10 170
   y320to512x10 180
   y320to512x10 190
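
For readers less used to gas macros, here is a rough C model of what the table and the lookup macro above compute. It is only a sketch: the function names are made up for illustration, and the table is sized to 1024 entries so that any 16-bit offset indexes safely (the assembly table has 1000).

```c
#include <assert.h>
#include <stdint.h>

/* One entry per 64 source bytes; 320 = 5*64, so five entries share each row. */
static uint32_t y320to512tbl[1024];

/* Fill the table: high halfword = 320*Y, low halfword = Y. */
static void build_table(void)
{
    for (uint32_t i = 0; i < 1024; i++) {
        uint32_t y = i / 5;                 /* (offset/64)/5 == offset/320 */
        y320to512tbl[i] = ((320u * y) << 16) | y;
    }
}

/* Map a byte offset with pitch 320 to a byte offset with pitch 512. */
static uint32_t vram_offset_from_table(uint16_t offset)
{
    assert(offset < 64000u);                /* inside the 320x200 screen */
    uint32_t entry = y320to512tbl[offset >> 6];
    uint32_t x = offset - (entry >> 16);    /* column within the row */
    uint32_t y = entry & 0xFFFFu;           /* row number */
    return x + (y << 9);                    /* x + 512*y */
}
```

Call build_table() once, then add 0x06000000 to the result to get the address the macro returns in r2.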



Pate
_________________

#171166 - Ruben - Wed Nov 04, 2009 6:23 am

Well, if this macro is called more than once, then preload the table address into r3 or ip (r12) and pass that as an argument to the macro.

And I'm not entirely sure what it is you're doing exactly so I'm not really sure I can help much with the code, but it may help to look at your code visually, like so...
Code:
mov r0, \idx, lsr #16 @ 0000FEDC
sub r0,   r1, lsr #16 @ 0000FEDC - 0000BA98
mov r1,   r1, lsl #16 @ 44440000


EDIT:

It may also help to make this macro cleaner
Code:
.macro y320to512x1 val
  .word (320<<16) * (\val) + (\val)
  .word (320<<16) * (\val) + (\val)
  .word (320<<16) * (\val) + (\val)
  .word (320<<16) * (\val) + (\val)
  .word (320<<16) * (\val) + (\val)
.endm

@ since it's repeated, can be
@ brought down to this
@ (keep \val in parentheses: arguments such as 0+1 are substituted as text)

.macro y320to512x1 val
  .rept 5
    .word (320<<16) * (\val) + (\val)
  .endr
.endm

#171168 - Pate - Wed Nov 04, 2009 8:31 am

Thanks for the reply Ruben!

Ruben wrote:
Well, if this macro is called more than once, then preload the table address into r3 or ip (r12) and pass that as an argument to the macro.


Sadly, all other registers are in use for other things; only r0 and r1 (and r2 within the macro) are free to use. I plan to put the table on the stack eventually so I can get rid of one ldr, but I am currently more interested in a better algorithm.

Quote:

And I'm not entirely sure what it is you're doing exactly ...


Well, the macro just converts between 320 pixels per row and 512 pixels per row, so that for example input offset of 319 (meaning coordinates (319,0)) will map to output offset 319, but input 320 (0,1) will map to output offset 512 (again coordinates 0,1).

EDIT: In other words, I'm looking for the most efficient ASM solution for this C code:

Code:

r2 = (idx/320)*512 + idx%320;
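
As a plain reference to test faster versions against, the formula can be wrapped in a small C function. This is just a sketch; the function name and the VRAM_BASE constant name are illustrative, with the base value taken from the macro in the first post.

```c
#include <assert.h>
#include <stdint.h>

#define VRAM_BASE 0x06000000u   /* base address added by the macro */

/* Reference mapping: row = idx/320, column = idx%320, target pitch = 512. */
static uint32_t vram_address(uint32_t idx)
{
    assert(idx < 64000u);       /* 320*200 bytes in the source screen */
    return VRAM_BASE + (idx / 320u) * 512u + idx % 320u;
}
```

With this, offset 319 stays at 319 and offset 320 lands at 512, as described above.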


Quote:

It may also help to make this macro cleaner


Ah, I wasn't aware of ".rept", thanks for the info! The macro just builds the table, but in the future that rept will be useful!

Pate
_________________

#171169 - Ruben - Wed Nov 04, 2009 9:10 am

I'm not sure if I understand correctly, but if I do this should do it without look-up tables
Code:
.macro r2_MCGA_from_idx idx
  mov     r2, \idx, lsr #16  @ get offset
  ldr     r1, =0xCCCCCD      @ 1/320 (q32, rounded)
  umull   r1, r2, r1, r2     @ umull RdLo, RdHi: r2 = high word = idx/320 (this is y)
  mov     r1, r2, lsl #6     @ y*64
  add     r1, r1, r2, lsl #8 @ y*64 + y*256 = y*(64+256) = y*320
  rsb     r1, r1, \idx       @ idx-(idx/320*320) = idx%320
  add     r1, r1, r2, lsl #9 @ r1 = (idx%320) + (idx/320*512)
  mov     r1, r1, lsl #1     @ (u16*)r1
  add     r2, r1, #0x6000000 @ &VRAM[idx%320 + idx/320*512]
.endm

What it does is that, every 320 'tiles', it jumps by 512 tiles, and then advances by x with idx%320.

EDIT:
Note that this frees up r0 so you can preload the reciprocal here and multiply by r0 instead of r1.

EDIT 2:
Changed code a bit so you don't have to sacrifice the \idx register.

EDIT 3:
Note that this is about 4 cycles slower, so if you need speed, I may not be the best person to ask. If you need space, use mine.
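
The reciprocal trick is easy to check on a PC before trusting it in assembly. The sketch below models it in C with a 64-bit product; the function names are made up, and it works on a plain byte offset (8bpp, no halfword scaling).

```c
#include <assert.h>
#include <stdint.h>

/* floor(offset/320): multiply by ceil(2^32/320) = 0xCCCCCD, keep the high word. */
static uint32_t div320(uint16_t offset)
{
    return (uint32_t)(((uint64_t)offset * 0xCCCCCDu) >> 32);
}

/* Returns 1 if the reciprocal matches true division for every 16-bit offset. */
static int div320_is_exact(void)
{
    for (uint32_t n = 0; n <= 0xFFFFu; n++)
        if (div320((uint16_t)n) != n / 320u)
            return 0;
    return 1;
}

/* Pitch 320 -> pitch 512 remap built on the reciprocal divide. */
static uint32_t remap_offset(uint16_t offset)
{
    assert(offset < 64000u);
    uint32_t y = div320(offset);
    uint32_t x = offset - y * 320u;
    return x + (y << 9);
}
```

The rounding error of the constant is far too small to change the quotient for offsets below 65536, which is why the exhaustive check passes.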

#171171 - FluBBa - Wed Nov 04, 2009 9:40 am

I have a faint memory of this having been discussed already...
Anyway, always try to combine shifts with actual logical operations.
This:
Code:

  mov     r1, r2, lsl #6     @ y*64
  add     r1, r1, r2, lsl #8 @ y*64 + y*256 = y*(64+256) = y*320
  rsb     r1, r1, \idx       @ idx-(idx/320*320) = idx%320

can easily be turned into:
Code:

  add     r1, r2, r2, lsl #2     @ y*5
  sub     r1, \idx, r1, lsl #6   @ idx-((idx/320)*(5*64)) = idx%320

_________________
I probably suck, my not is a programmer.

#171172 - Cearn - Wed Nov 04, 2009 9:42 am

Is there any particular reason you're not using an affine background for this? You could use a 320x200 field on a 512x256 bitmap and let the hardware scale it for you.

Also, what sort of access do you really need? Do you have an offscreen buffer that you need to scale to VRAM( so that you can read the pixels sequentially), or do you need random access?

#171173 - Pate - Wed Nov 04, 2009 10:08 am

Thanks for the replies all! I am rather new to ARM ASM, so all tricks you can show are always very interesting. :-)

Cearn, I'm not quite sure I understand what you mean by an affine background? On input, I have just a pixel index into a 320x200 buffer, that is, an integer between 0 and 64000, and on output I need an index into VRAM, an integer between 0 and 102400 (plus 0x06000000).

I need random access, as the software I emulate may write any number of pixels wherever it wants. Obviously, when storing a number of pixels at once (like in REP MOVSW), I can just calculate this original offset once. I tried with an offscreen 320x200 buffer that I copy to VRAM at 30fps, but that turned out to be much slower than attempting to emulate screen writes directly to NDS VRAM.

Ruben, I need speed, I have a lot of RAM so space is not a concern (at least at the moment).

Pate
_________________

#171175 - Cearn - Wed Nov 04, 2009 3:04 pm

Pate wrote:
Cearn, I'm not quite sure I understand what you mean by an affine background?
Affine backgrounds are those that support hardware affine transformations like rotation and --more importantly in your case-- scaling (see tonc:affbg). If you're using a 512x256@8bpp bitmapped background, you're already using an affine BG.

The idea here is that you can set the matrix (REG_BGnPA - REG_BGnPD) to the right scalings so that the hardware stretches the 320x200 bitmap you want to the screen's dimensions. Think of it as a post-processing step: you create a 320x200 in VRAM (or, rather, a 320x200 area on a 512x256 BG) and use the hardware scaling to make sure it looks right on the screen.
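
To make the scaling concrete, here is a small sketch of how the two diagonal matrix entries could be computed. It assumes the affine parameters are 8.8 fixed-point steps (source pixels advanced per screen pixel), which matches the 0x100 scale factor used in code later in this thread; the function name is illustrative and nothing here touches hardware registers.

```c
#include <assert.h>
#include <stdint.h>

/* 8.8 fixed-point step: how far to move in the source per screen pixel. */
static int16_t affine_step_8_8(int src_size, int dst_size)
{
    assert(src_size > 0 && dst_size > 0);
    return (int16_t)((src_size << 8) / dst_size);
}
```

For a 320x200 image on the 256x192 screen this gives 0x140 (1.25) horizontally and 266 (about 1.04) vertically, which would go into PA and PD respectively.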

Pate wrote:

On input, I have just a pixel index into a 320x200 buffer, that is, an integer between 0 and 64000, and on output I need an index into VRAM, an integer between 0 and 102400 (plus 0x06000000).

Hmm. I'd have thought you'd start with pixel coordinates, not a pixel index. This would complicate matters a little.

Pate wrote:

I need random access, as the software I emulate may write any number of pixels wherever it wants. Obviously, when storing a number of pixels at once (like in REP MOVSW), I can just calculate this original offset once. I tried with an offscreen 320x200 buffer that I copy to VRAM at 30fps, but that turned out to be much slower than attempting to emulate screen writes directly to NDS VRAM.

Remember that VRAM does not allow byte-writes. The extra work needed to do that might end up being slower than a RAM or a DTCM blit.

#171187 - Pate - Thu Nov 05, 2009 5:50 am

Cearn wrote:
Pate wrote:
Cearn, I'm not quite sure I understand what you mean by an affine background?
Affine backgrounds are those that support hardware affine transformations like rotation and --more importantly in your case-- scaling (see tonc:affbg). If you're using a 512x256@8bpp bitmapped background, you're already using an affine BG.


Ah, okay. Yes, I'm using an affine background already. I don't currently scale the background, mainly so that I can see possible pixel errors, but I will certainly add scaling eventually. I plan to give the user choices whether to scale the screen or move a 256x192 window over the 320x200 screen area.

Quote:

Hmm. I'd have thought you'd start with pixel coordinates, not a pixel index. This would complicate matters a little.


Yeah.. Since the input is something like "mov es:[di],al" where I only have the di register value as input, I need to translate it to the 512x256 background bitmap.

Quote:

Remember that VRAM does not allow byte-writes. The extra work needed to do that might end up being slower than a RAM or a DTCM blit.


Yeah, I assumed it would be slower, too, but empirical testing proved me wrong. The software I am emulating already uses internally an offscreen buffer, and copies that buffer to the screen using "rep movsw", and then updates some extra screen pixels directly. If I use the blit I will in effect create an additional layer of offscreen buffers, which is probably what actually kills the performance.

Anyways, in the final product I will again give the user a choice between offscreen blit and direct screen hardware emulation.

Thanks for the replies, again!

Pate
_________________

#171190 - sverx - Thu Nov 05, 2009 9:15 am

Pate wrote:
Since the input is something like "mov es:[di],al" ...


... are you doing an 8086 emulator? 8|

#171192 - Pate - Thu Nov 05, 2009 11:06 am

sverx wrote:
Pate wrote:
Since the input is something like "mov es:[di],al" ...


... are you doing an 8086 emulator? 8|


Yep. I've been rather silent about it until I was sure it will work, but now it is beginning to look pretty good so I might as well admit it. :-)

Pate
_________________

#171194 - sverx - Thu Nov 05, 2009 4:28 pm

Wow! :) I have also been considering doing that for some time, but I lack experience... is it a port of DosBox or a complete rewrite? :)

#171195 - Pate - Thu Nov 05, 2009 8:15 pm

I started from scratch, right after I got LineWarsDS finished last July, and have been working on it since. I hope to announce it properly in a few days, I'm currently making a web site for it. It is still *very* far from finished, but I have been putting together a sort of technology demo for the last couple of weeks that is almost ready for release.

I trust you can wait a few days. :-)

Pate
_________________

#171208 - sverx - Fri Nov 06, 2009 10:48 am

I was wondering if it wouldn't be better to use a 15bpp 'direct color' bitmapped background instead of the 8bpp 'paletted' one...

... this way you won't have to read the halfword/change the byte/write the halfword again for every pixel you want to set to the screen in your VGA mode 13h ...

Of course that needs some memory to store the RGB values, but you could set it up as an array of 256 halfword values in fast memory (DTCM, for instance?), making your byte (the pixel!) the index of the array element that should be copied to VRAM.
You could also do gamma correction easily...
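
A rough C sketch of that lookup idea follows. The names are hypothetical, and it assumes 6-bit VGA DAC components, a red-low 15-bit colour layout, and that bit 15 must be set for the pixel to be visible on a direct-colour bitmap; check those against your hardware documentation before relying on them.

```c
#include <assert.h>
#include <stdint.h>

static uint16_t palette15[256];   /* hypothetical 15-bit copy of the VGA palette */

/* Pack 6-bit VGA DAC components into a 15-bit colour with bit 15 set. */
static uint16_t vga_to_rgb555(uint8_t r6, uint8_t g6, uint8_t b6)
{
    assert(r6 < 64 && g6 < 64 && b6 < 64);
    return (uint16_t)(0x8000u | (r6 >> 1) | ((g6 >> 1) << 5) | ((b6 >> 1) << 10));
}

/* One emulated byte write becomes one halfword write, with no read-modify-write. */
static void put_pixel15(uint16_t *fb, uint32_t index, uint8_t colour)
{
    fb[index] = palette15[colour];
}
```

The cost is that every palette change means rewriting the affected pixels, since the framebuffer no longer stores palette indices.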

Just a thought, btw I wanted to express it :)

[oh, I just noticed you wrote that it's using a rep movsw instruction... well, so maybe 8 bpp is faster in that case, since you just copy 4 pixels each time...]

#171211 - Pate - Fri Nov 06, 2009 12:05 pm

Yeah, I thought about that, too, but then palette animation (fade-in/fade-out) would be REALLY expensive, so I thought it would be better to emulate the graphics using palettes.

Pate
_________________

#171212 - sverx - Fri Nov 06, 2009 1:50 pm

Yeah, you're right, I didn't think that much before writing... :|

#171215 - Exophase - Fri Nov 06, 2009 6:40 pm

I assume that this is for capturing writes to the emulated framebuffer and converting them to the NDS framebuffer, correct? Actually, I'm surprised that the emulated framebuffer even has a pitch of 320, but I'll trust that you know what's going on with that.

It should be possible to get a 320x200 framebuffer in DS using sprites instead of a bitmap BG. It's a little complex though. What you'd have to do is have five 64 wide sprites laid out next to each other (probably scaled) and setup HDMA or horizontal IRQs so that their tile positions are updated every line. It'd look like this:

Line 0: 0, 1, 2, 3, 4
Line 1: 5, 6, 7, 8, 9
Line 2: 10, 11, 12, 13, 14

Etc. The offsets would have to be done in sprite mapping modes using 64 pixel offsets. I do believe this would cost less than translating the framebuffer writes in most cases - especially if removing the burden of having to capture those writes might relieve you of having to emulate memory at that level entirely. I don't know what else is mapped to the address space, but I do know x86 has a port space so it could work out perhaps.

#171222 - Pate - Sat Nov 07, 2009 4:48 pm

Quote:
I assume that this is for capturing writes to the emulated framebuffer and converting them to the NDS framebuffer, correct?


Yes, that is correct. Your idea for using sprites and IRQ sounds a bit scary, I have to admit! I haven't done anything with sprites yet, so I don't quite understand this method, but this is good to keep in mind.

Btw, I announced my emulator now "officially" in the thread http://forum.gbadev.org/viewtopic.php?t=16956

Pate
_________________

#171224 - Exophase - Sat Nov 07, 2009 9:32 pm

Pate wrote:
Quote:
I assume that this is for capturing writes to the emulated framebuffer and converting them to the NDS framebuffer, correct?


Yes, that is correct. Your idea for using sprites and IRQ sounds a bit scary, I have to admit! I haven't done anything with sprites yet, so I don't quite understand this method, but this is good to keep in mind.

Btw, I announced my emulator now "officially" in the thread http://forum.gbadev.org/viewtopic.php?t=16956

Pate


Congratulations on the release. If you ever feel like sharing source I can see if I have any ideas for optimizations. As far as the 320x200 sprite framebuffer idea goes, maybe if I finally install a DS toolchain I can throw together an example of it >_>

UPDATE: Sorry, was talking to someone else and realized my suggestion only works for bitmap modes. Since I'm sure your framebuffer isn't 15bpp that's out. You can make it work with 8bpp or 4bpp, but it means using 40 8x8 sprites. While you'd still have more than enough hblank time for it I think this will start to cost you a lot in bus contention.

There's another way to do it for 8bpp modes, I think. You can use two 512 wide bitmap layers. Basically, the problem you want to solve here is you want to draw 320 pixels starting at an arbitrary position in the layer. You can start at an arbitrary start position in the layer by using the horizontal and vertical offset registers, but the issue with using one layer is that you'll hit an edge every 512 pixels. To get around this, you can have another layer draw the rest of them by using negative horizontal offsets to make it start in the middle of the screen. You need to have wrapping off for both layers.

Let me try to demonstrate. You have two bitmap layers, A and B. Note that I'll be describing this pretending that DS has a 320 wide display: in reality you'd have to use scaling to make it so. But it should still work.

On scanline 0 you draw layer A using offset 0, 0 and layer B using something to make it fully offscreen like -320, 0.

Then on scanline 1, layer A's offset is 320, 0. This means that it draws 192 pixels starting from the left edge of the screen until it hits the edge at 511, 0. If you make layer B's offset -192, 1 then it'll draw 128 pixels starting at pixel 192.

This should repeat okay for all the scanlines.
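
The per-scanline arithmetic described above can be sketched in C as follows. This only models the offsets from the description (pretending a 320-wide display), with made-up type and function names; later posts in this thread report that the approach needs changes to work on real hardware.

```c
#include <assert.h>

typedef struct {
    int ax, ay;    /* layer A offset into the 512-wide bitmap */
    int bx, by;    /* layer B offset (negative x shifts it right) */
    int b_used;    /* nonzero if layer B has to finish the line */
} LineOffsets;

static LineOffsets layer_offsets(int line)
{
    const int pitch = 512, width = 320;
    assert(line >= 0);
    int start = line * width;          /* linear start of this scanline */
    LineOffsets o;
    o.ax = start % pitch;
    o.ay = start / pitch;
    int to_edge = pitch - o.ax;        /* pixels A can draw before the edge */
    o.b_used = to_edge < width;
    o.bx = o.b_used ? -to_edge : -width;   /* -width parks B fully offscreen */
    o.by = o.b_used ? o.ay + 1 : 0;
    return o;
}
```

For scanline 1 this yields A at (320, 0) and B at (-192, 1), the same numbers as in the walkthrough.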

I think you don't even need to use HDMA, but can use dmx/dmy to change the offsets per line. Basically it ends up looking like two sheared layers aligned against each other.

#171243 - Pate - Mon Nov 09, 2009 5:51 am

Okay, I think I understand the method you are describing. But isn't having an IRQ triggered 192*60 times a second, with all the additional overhead it causes, a high price to pay just to get rid of a table lookup when changing a pixel on the screen?

I am not very familiar with the IRQ overhead on ARM, but back in my x86 ASM days IRQs were pretty expensive...

Pate
_________________

#171259 - Exophase - Tue Nov 10, 2009 12:23 am

Pate wrote:
Okay, I think I understand the method you are describing. But isn't having an IRQ triggered 192*60 times a second, with all the additional overhead it causes, a high price to pay just to get rid of a table lookup when changing a pixel on the screen?

I am not very familiar with the IRQ overhead on ARM, but back in my x86 ASM days IRQs were pretty expensive...

Pate


I suggest you study HDMA, it doesn't involve IRQs. And the method I described with the BG shearing can probably be done fully automatically using the affine transform registers.

It's an awful lot more than just getting rid of a table lookup, it's getting rid of an entire class of memory accesses you have to trap. It might even simplify the memory emulation entirely, making it faster.

#171261 - Pate - Tue Nov 10, 2009 5:45 am

Ah okay, I think I need to look into HDMA. I still got a lot of things to learn about NDS, the good thing is that DSx86 is a project where I can most likely use a lot of the stuff I learn. :-)

Pate
_________________

#171297 - Exophase - Thu Nov 12, 2009 4:47 pm

I finally tried doing a demo of this. Unfortunately I found that my basic approach wouldn't work as-is because I incorrectly thought that you'd be able to offset into the middle of the DS scanline and the virtual framebuffer - you really only get one or the other, not both. What I said about 40 sprites would also never fly since you can't offset them in 8 pixel increments.

It also probably wouldn't have worked w/o HDMA because you need to increment by more than the +/- 128 pixels per line that the built-in affine transformation can handle. You can probably still do it for the Y coordinate; I haven't tried it yet. This might not work anyway because writing to the X offset might reload the Y offset. It's something I'll need to look into.

Fortunately, I did still find a way to make this work and I have actually verified it.

As I mentioned, the basic problem with the 512-wide framebuffer is that when you offset into it in 320-wide increments you start hitting the edge of the screen at various points and get a bunch of transparent lines coming off the right edge of the display. You can make it so when it hits the edge it wraps around to the start instead of turning transparent, but what you really want is for it to "wrap" around to the NEXT scanline.

After realizing this it was pretty obvious what you can do - have a non-wrapping layer on top and a wrapping layer on the bottom, where the wrapping layer is one scanline ahead of the non-wrapping one (achieved by setting the vertical offset one higher).

This, however, poses a new problem: since you're relying on the two layers overlapping, the top layer will no longer have its 0 pixels go to background but instead will show through to the (incorrect) pre-wrap data of the bottom layer.

In my demo I avoided this by getting rid of 0 pixels altogether, but obviously that won't work for something you're emulating. I can think of a couple ways to handle this:

- Use a window to clip the layer. This will need another HDMA channel, meaning you'd be using 3 every scanline with only one more available during the frame refresh. This isn't necessarily a big deal since you can use those HDMA channels for other things during vblank (so long as you set them back afterwards) but if you need significantly more HDMA or other things for DMA at any time you could find yourself pinched. It'll also use more bus cycles, but not an awful lot, just 16bits worth per line.

- Put another layer in between the two layers that masks off the parts of the bottom one you don't want to see. The only reasonable way to accomplish this would be to use extended palette BG or sprites. The extended palette only needs one color: the BG color (you can make it color 1, for instance). Copy the real BG color to it, and use it where you want to mask off the pre-wrap portions of the layer. Using a BG layer would probably be smarter than using sprites since the pattern for the mask repeats nicely and you should have enough space for the tilemap and tiles in the leftovers of the 512x256 framebuffer that you aren't using. You'd want to copy the BG color during vblank - this would cause problems if the game modified the color mid-frame, but if games do mid-frame effects you'll have bigger problems anyway, and I doubt an awful lot did. The only real downside to this approach is that it consumes a bank for extended palettes.

You can also do it with 16bpp sprites, but you'd have to update them every time color 0 changes so it'd be a really bad idea. Personally I'd go for the second option I gave, since it doesn't waste any more bus time. I'll try to implement it then upload the demo later if you still want it.

#171311 - Pate - Fri Nov 13, 2009 7:12 am

Yes, a demo of your technique would be very interesting! Much of the stuff you describe goes somewhat over my head, so if you can show source code that does this it would be very useful!

Thanks!

Pate
_________________

#171312 - Dwedit - Fri Nov 13, 2009 8:52 am

Here's some screenshots...

Start with the original image [image]
Reinterpret the 320x200 image as a 512x256 image [image]

So what we're doing is changing the origin at each scanline. Only thing is that the DS does not wrap to the next line, instead it wraps to the same line.
Anyway, you get two images that you need to put together:
Top Layer [image]
Bottom Layer [image]
But the images will always wrap, so they really look like this:
Top Layer with wrapping [image]
Bottom Layer with wrapping [image]

So you use the Window feature to hide the parts you don't want to see.
The portions of the 'top' layer which should be shown are:
320 (full row width)
192
320
64
256
320
128
320
and the pattern repeats every 8 scanlines.
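
The widths in that list fall straight out of the start offset of each line modulo 512. A small sketch that reproduces them (the function name is made up):

```c
#include <assert.h>

/* Pixels of a 320-wide source line that fit before the 512-wide bitmap edge. */
static int top_layer_width(int line)
{
    assert(line >= 0);
    int x = (line * 320) % 512;     /* where this line starts in its bitmap row */
    int to_edge = 512 - x;
    return to_edge < 320 ? to_edge : 320;
}
```

Since 8 * 320 = 5 * 512, the start offset returns to 0 every 8 lines, which is where the period of 8 comes from.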

The DS window feature has a 'glitch' where horizontal window range can only be up to 255 pixels wide, so in addition to the window range, you also need to change which background layers are affected by the windows each scanline.

Edit: Wait a minute, the Area Overflow feature might mean you don't need any windowing... I'll look more into this tomorrow.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

#171314 - Cydrak - Fri Nov 13, 2009 10:13 am

This is quite a nifty idea... I had a shot at it, turns out you can get it done with a single bitmap!
Code:
// HDMA list for affine settings
struct {
  s16 dx, ldx;  // ldx/ldy aren't used, since we're resetting
  s16 dy, ldy;  // the origin after each line.
  s32  x;
  s32  y;
} lineTable[192];

void buildVGALineList() {
  const int dsWidth = 256, dsHeight = 192, bmWidth = 512;
  const int vgaWidth = 320, vgaHeight = 200;
 
  for(int y = 0; y < dsHeight; y++) {
    // HDMA happens after each line, so the list needs to be ahead by one.
    int dsLine    = y == 0? dsHeight-1 : y-1;
   
    int vgaLine   = y * vgaHeight/dsHeight;
    int vgaOffset = vgaWidth * vgaLine;
    int y         = vgaOffset / bmWidth;
    int x         = vgaOffset % bmWidth;
   
    lineTable[dsLine].x  = 0x100 * x;
    lineTable[dsLine].y  = 0x100 * y;
    lineTable[dsLine].dx = 0x100 * vgaWidth/dsWidth;
    lineTable[dsLine].dy = 0x001;
   
    int pixelsToEdge = bmWidth - x;
    if(pixelsToEdge < vgaWidth) {
      lineTable[dsLine].y +=
        -( (pixelsToEdge*dsWidth + vgaWidth-1) / vgaWidth ) & 0xff;
    }
  }
}

The trick is to remember you're not forced to stay on the same line--after all, bitmaps are affine layers too. By setting DY to the smallest increment, you can use Y's fraction as an up-counter... and control the exact spot it will overflow. This code tracks the edge case and fills out suitable counter values for each line.
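
To see the up-counter at work, the edge case can be pulled out into two small helpers. This is a sketch with made-up names that mirrors the arithmetic in the listing above for a 320-wide line shown on the 256-pixel screen.

```c
#include <assert.h>

/* First screen pixel whose source x lies past the bitmap edge:
   ceil(pixels_to_edge * 256 / 320). */
static int edge_pixel(int pixels_to_edge)
{
    assert(pixels_to_edge > 0 && pixels_to_edge < 320);
    return (pixels_to_edge * 256 + 319) / 320;
}

/* Starting value for Y's 8-bit fraction. With DY = 1 it gains one per screen
   pixel, so it carries into the integer part exactly at edge_pixel(). */
static int y_fraction_seed(int pixels_to_edge)
{
    return (256 - edge_pixel(pixels_to_edge)) & 0xff;
}
```

For a line starting at x = 320 there are 192 source pixels left, the edge is reached at screen pixel 154, and the seed is 102; 102 + 154 = 256, so Y steps to the next row right there.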

I've set it up to fit the screen, though it should be possible to show any portion of the bitmap, scaled or not. Most fortunately, the screen is only 256 pixels wide, so there's no risk of Y rolling over twice. The only problem, really, is that pesky width of 512... (Hmm, but perhaps if you reeeally wanted 640, there's that large 1024x512 mode?)

After setting up the list (and making sure your bitmap wraps of course), all you need to do is point the HDMA there. This needs to happen every vblank, since repeat-triggered DMAs will more than happily march into oblivion:
Code:
  DMA_CR(0)   = 0;
  DMA_SRC(0)  = (u32) &lineTable[0];
  DMA_DEST(0) = (u32) &REG_BG3PA;
  DMA_CR(0)   = DMA_ENABLE | DMA_REPEAT | DMA_START_HBL
    | DMA_SRC_INC | DMA_DST_RESET | DMA_32_BIT | 4;

I got away with a single channel, since all the registers are contiguous. The DMA_DST_RESET is very handy: it's similar to DMA_DST_FIX, except that it writes a whole group of registers each time, rather than just one of them... that's 4 words in this case.

(Btw, the affine and window registers are all adjacent as well. I don't think Exo's solution needs more than one channel, either.)

#171321 - Exophase - Fri Nov 13, 2009 5:14 pm

Cydrak wrote:
The trick is to remember you're not forced to stay on the same line--after all, bitmaps are affine layers too. By setting DY to the smallest increment, you can use Y's fraction as an up-counter... and control the exact spot it will overflow. This code tracks the edge case and fills out suitable counter values for each line.


Very smart idea. This is definitely the best solution so far.

Cydrak wrote:
(Btw, the affine and window registers are all adjacent as well. I don't think Exo's solution needs more than one channel, either.)


I thought about that, but I figured it wasn't worth the extra bandwidth to march over the stuff you don't actually need to change. The non-sequential overhead of starting the new transfers should be factored in too, though.

All moot now since your solution only requires changing x and y of a single layer. In fact, it might be possible to strength reduce the changes of y into a fixed dmy. Sadly still can't get around changing x every line.

This does mean your HDMA only needs to be 2, not 4. Just have to set dx and dy permanently beforehand.

#171332 - Cydrak - Fri Nov 13, 2009 9:29 pm

Quote:
I thought about that, but I figured it wasn't worth the extra bandwidth to march over the stuff you don't actually need to change. The non-sequential overhead of starting the new transfers should be factored in too, though.

If you have 3 extra words, it breaks even I think. In my code I have a copy of the bitmap offset by 1/2 pixel and averaged in to try and get some smoothing, so it's faster to set both at once.

On a side note: I would love to know where that overhead comes from. It's so crazy that it hits everything, even VRAM--I honestly didn't believe GBATEK until I measured it myself. x_x (Seriously, if the GPU had to wait around that much, a pair of affine maps alone would eat up all the bandwidth!)

Quote:
This does mean your HDMA only needs to be 2, not 4.

Fair enough. :-)

Quote:
In fact, it might be possible to strength reduce the changes of y into a fixed dmy. Sadly still can't get around changing x every line.

I looked at this and came to the conclusion that the scaling throws a wrench in it, maybe I missed something.

#171340 - Cearn - Sat Nov 14, 2009 10:56 am

Very pretty, Cydrak (though it'd probably be better not to have two different variables called 'y' in the loop >_>)

Cydrak wrote:
On a side note: I would love to know where that overhead comes from. It's so crazy that it hits everything, even VRAM--I honestly didn't believe GBATEK until I measured it myself. x_x (Seriously, if the GPU had to wait around that much, a pair of affine maps alone would eat up all the bandwidth!)
Do you still have these tests somewhere? I've been considering doing these myself to get a better idea of all the instruction timings; this could save me some work.

Since we're going to write directly to VRAM, here's a little code to write a single byte into VRAM.

Code:
@ Memory in hwords looks like this:
@   address | 0 1 | 2 3 | 4 5 | etc
@   memory  | a b | a b | a b |
@ If dst is odd : put value v into b; even: put value in a.
@ if(addr & 1)
@   dst_h[addr>>1] = dst_b[addr-1]    | v<<8;
@ else
@   dst_h[addr>>1] = dst_b[addr+1]<<8 | v;

    @ r0 = address.
    @ r1 = value to write.
    @ r2,r3 : temps
    eor     r2, r0, #1              @ - read other byte.
    ldrb    r2, [r2]                @ /
    ands    r3, r0, #1
    orrne   r1, r2, r1, lsl #8      @ Prep for odd write.   v<<8 | a
    orreq   r1, r1, r2, lsl #8      @ Prep for even write.  b<<8 | v
    strh    r1, [r0, -r3]           @ Write back
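
For comparison, the same read-modify-write can be written in C. This sketch assumes little-endian halfwords (even address = low byte) and uses a made-up function name; unlike the assembly it reads the whole halfword rather than just the neighbouring byte.

```c
#include <assert.h>
#include <stdint.h>

/* Write one byte into memory that only accepts 16-bit stores. */
static void vram_write8(volatile uint16_t *base, uint32_t byte_offset, uint8_t value)
{
    assert(base != 0);
    volatile uint16_t *p = &base[byte_offset >> 1];
    uint16_t old = *p;
    if (byte_offset & 1)
        *p = (uint16_t)((old & 0x00FFu) | ((uint16_t)value << 8));  /* odd: high byte */
    else
        *p = (uint16_t)((old & 0xFF00u) | value);                   /* even: low byte */
}
```

On the DS, base would point at the start of the bitmap in VRAM; on a PC any halfword array works for testing.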

#171343 - ninjalj - Sat Nov 14, 2009 3:17 pm

Cearn wrote:

Since we're going to write directly to VRAM, here's a little code to write a single byte into VRAM.

Code:
@ Memory in hwords looks like this:
@   address | 0 1 | 2 3 | 4 5 | etc
@   memory  | a b | a b | a b |
@ If dst is odd : put value v into b; even: put value in a.
@ if(addr & 1)
@   dst_h[addr>>1] = dst_b[addr-1]    | v<<8;
@ else
@   dst_h[addr>>1] = dst_b[addr+1]<<8 | v;

    @ r0 = address.
    @ r1 = value to write.
    @ r2,r3 : temps
    eor     r2, r0, #1              @ - read other byte.
    ldrb    r2, [r2]                @ /
    ands    r3, r0, #1
    orrne   r1, r2, r1, lsl #8      @ Prep for odd write.   v<<8 | a
    orreq   r1, r1, r2, lsl #8      @ Prep for even write.  b<<8 | v
    strh    r1, [r0, -r3]           @ Write back


I may be missing something, but doesn't swpb solve this, as used on DSLinux for the GBA cartridge bus? Maybe it depends on buffer/cache bits on the MPU? (I see DSLinux apparently has both Bd and Cd set, maybe setting the VRAM as cachable is not such a hot idea).
Code:

+#ifdef CONFIG_NDS_ROM8BIT
+       add     r6, r6, r8, lsr #8
+       swpb    r5, r7, [r6]
+#else
        strb    r7, [r6, r8, lsr #8]            @ set appropriate used_cp[]
+#endif

#171346 - Ant6n - Sat Nov 14, 2009 10:31 pm

You could assume that 8bit writes into your video memory on the x86 almost never happen, but 8bit writes into main memory happen a lot.
So you could protect your VRAM via the memory protection unit, and do all 8bit writes in user mode. All 16/32bit writes could be done in some privileged mode (i.e. run your emulator in privileged mode). You can force all 8bit writes to be a non-privileged access via the T flag for LDR, STR.
Now you can access all memory with 16bit writes, and main memory with 8bit writes. If you do an 8bit write into VRAM (which is supposed to happen seldom), you can set up the exception handler to just do your 8bit write.
That way 8bit writes into VRAM are really slow, but they don't slow down 8bit writes into normal memory (unless you are using VRAM as your x86 main memory, anyway)

I think the protection unit could actually be used for all sorts of nifty things.

#171361 - Pate - Mon Nov 16, 2009 5:55 am

Thanks for the interesting ideas and discussion, guys!

Cydrak, I think I almost understand your code, I'll have to test this out when I have the time. Thanks!

Ant6n, 8bit writes to screen RAM on x86 are very common in 320x200 256-color mode (as 1 pixel is 1 byte), so I don't think making them really slow is a very good idea. Also, running in privileged mode sounds scary. :-)

Last weekend I worked on emulating CGA display, with 2 bits per pixel, and interlaced memory (first 8KB in RAM are the even lines on screen, the other 8KB the odd line numbers). If you happen to have ideas about how to do this fast, I would be interested in those as well. I currently just use the same 512x256 8bpp bitmap, and expand all bytes to words (2 bits to a byte) when writing the data.
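
The interlaced layout can be turned into screen coordinates with a few shifts. The sketch below assumes the standard 320x200 four-colour CGA layout of 80 bytes per line and 4 pixels per byte; the type and function names are made up.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int row; int col; } CgaPos;

/* Screen position of the first pixel stored in a given CGA memory byte. */
static CgaPos cga_byte_position(uint16_t offset)
{
    assert(offset < 0x4000);            /* 16KB of CGA graphics memory */
    CgaPos p;
    int bank = (offset >> 13) & 1;      /* 0 = even lines, 1 = odd lines */
    int within = offset & 0x1FFF;       /* offset inside the 8KB bank */
    p.row = (within / 80) * 2 + bank;
    p.col = (within % 80) * 4;          /* 4 pixels per byte */
    return p;
}
```

The last 192 bytes of each 8KB bank fall past row 199, so a real implementation would ignore or clip them.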

Thanks again!

Pate
_________________

#171373 - Exophase - Mon Nov 16, 2009 5:47 pm

CGA is effectively packed, right? I don't think that you'll get much leverage using any sub-8bpp graphics on DS, since you can't represent a contiguous framebuffer using them. And you only get 4bpp anyway, when you want 2bpp or 1bpp.

I would use a 1KB LUT to expand 8bit writes to 32bit ones, and possibly a 2KB one for the high resolution mode, although I doubt it's that heavily used to begin with.
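
A sketch of what such a 1KB table could look like for the 2bpp mode is below. The names are made up, and it assumes the leftmost pixel sits in the top two bits of the CGA byte and that the target is little-endian, so the leftmost pixel goes into the lowest byte of the word.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t cga_expand[256];    /* 256 entries * 4 bytes = 1KB */
static int cga_expand_built;

/* Precompute the four 8bpp pixels for every possible 2bpp byte. */
static void build_cga_expand(void)
{
    for (uint32_t b = 0; b < 256; b++) {
        uint32_t p0 = (b >> 6) & 3;     /* leftmost pixel */
        uint32_t p1 = (b >> 4) & 3;
        uint32_t p2 = (b >> 2) & 3;
        uint32_t p3 = b & 3;            /* rightmost pixel */
        cga_expand[b] = p0 | (p1 << 8) | (p2 << 16) | (p3 << 24);
    }
    cga_expand_built = 1;
}

/* One emulated byte write becomes a single aligned 32-bit store. */
static void cga_write_byte(uint32_t *dst_word, uint8_t value)
{
    assert(cga_expand_built);
    *dst_word = cga_expand[value];
}
```

Because each CGA byte expands to exactly one aligned word, this also sidesteps the byte-write restriction on VRAM.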

Nice thing that the text modes use 8x8 tiles, those should be easy enough to emulate more directly.

#171469 - Ant6n - Sat Nov 21, 2009 10:00 pm

Btw, in order to crunch the 320x240 display into the DS display without having to use nearest neighbor resizing, you can add some slight high frequency jitter. I vaguely remember somebody used that in some demo to make textures look less blocky.
Basically you can render every other frame with its base position something like .25 or .5 pixel off, so displayed pixels that should really be the average of two original pixels display one and the other on alternating frames. At 60 Hz your eye averages that out.
I wonder what the optimal jitter values are to get an image that most closely approximates useful resizing.
(sorry if this was already mentioned)

#171472 - Lazy1 - Sun Nov 22, 2009 12:35 am

Ant6n wrote:
Btw, in order to crunch the 320x240 display into the DS display without having to use nearest neighbor resizing, you can add some slight high frequency jitter. I vaguely remember somebody used that in some demo to make textures look less blocky.
Basically you can render every other frame with its base position something like .25 or .5 pixel off, so displayed pixels that should really be the average of two original pixels display one and the other on alternating frames. At 60 Hz your eye averages that out.
I wonder what the optimal jitter values are to get an image that most closely approximates useful resizing.
(sorry if this was already mentioned)


Yes, this method works fairly well and the thread with more info can be found here.

#171489 - Pate - Mon Nov 23, 2009 5:53 am

Exophase: Thanks for the idea, I hadn't thought of using a LUT. Strange that it hadn't occurred to me, as I use a LUT for a lot of other things! :-)

Ant6n & Lazy1: Yes, I'm familiar with that method, I found the thread you linked to a while ago. I plan to at least try that method when adding scaling support.

Thanks again for all your ideas!

Pate
_________________