gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback Machine copies. A new forum can be found here.

ASM > "reverse" memcpy32()

#174987 - Kensai - Fri Aug 13, 2010 8:27 pm

Hi,

I'm looking for a "reverse" version of memcpy32(), which could be used to flip an image horizontally.

#174988 - Dwedit - Fri Aug 13, 2010 9:14 pm

That wouldn't flip an image though, since pixels are 16-bit, not 32-bit.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

#174989 - Kensai - Fri Aug 13, 2010 9:23 pm

Yes, the function would have to flip the 4 bytes (I need it for mode 4).

#174990 - Dwedit - Fri Aug 13, 2010 9:55 pm

Just use DMA, with Source and Destination address going in reverse directions.

#174991 - Kensai - Fri Aug 13, 2010 11:13 pm

Thank you for the advice. But how do I flip the bytes? You can't set "Chunk Size" to 8 bit. The only option is DMA_16 and it only flips the halfwords:

[1][2][3][4] -> [3][4][1][2]

What I need is:

[1][2][3][4] -> [4][3][2][1]
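
To make the difference concrete, here is a small host-side sketch in plain C. It is not GBA register code; the function names are made up for illustration. The first routine models what a 16-bit copy with opposite address directions produces, the second adds the byte swap inside each halfword that 8-bit pixels need.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical model of a 16-bit copy where the source walks backwards and
   the destination walks forwards. The two bytes inside each halfword keep
   their order, so 8-bit pixels end up pairwise swapped, not fully reversed. */
void reverse_halfwords(const uint16_t *src, uint16_t *dst, size_t count)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = src[count - 1 - i];
}

/* Same walk, but the two bytes of every halfword are swapped as well,
   which gives a true pixel-by-pixel reversal for 8-bit pixels. */
void reverse_bytes16(const uint16_t *src, uint16_t *dst, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        uint16_t h = src[count - 1 - i];
        dst[i] = (uint16_t)((h >> 8) | (h << 8));
    }
}
```

On a little-endian machine (the GBA is one), the pixels [1][2][3][4] are the halfwords 0x0201 and 0x0403. reverse_halfwords() yields 0x0403, 0x0201, which is [3][4][1][2]; reverse_bytes16() yields 0x0304, 0x0102, which is [4][3][2][1].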

#174992 - Dwedit - Sat Aug 14, 2010 1:13 am

Sorry, forgot that was an 8-bit mode, thought you were using 16 bit pixels...

Off the top of my head, untested, check it for bugs...

Code:

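void copybackwards(u16 *src, u16 *dest, int size) //size is in halfwords
{
   src += size-1;                       //start at the last halfword of the source
   while (size > 0)
   {
      int a = *src--;                   //walk the source backwards
      a = (a >> 8) | ((a & 0xFF)<<8);   //swap the two 8-bit pixels in the halfword
      *dest++ = a;                      //walk the destination forwards
      size--;
   }
}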


post edited, forgot a left shift
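
Flipping a whole image means reversing every scanline on its own, not the buffer as one block. A rough sketch of that, assuming 8-bit pixels, an even width, and halfword-aligned buffers: flip_image_h() is a hypothetical helper, and copybackwards() is restated with indices (same behaviour as above) so the block compiles on its own.

```c
#include <stdint.h>

typedef uint16_t u16;

/* Restatement of copybackwards() from the post above; size is in halfwords. */
void copybackwards(u16 *src, u16 *dest, int size)
{
    for (int i = 0; i < size; i++) {
        u16 a = src[size - 1 - i];                      /* source read backwards */
        dest[i] = (u16)((a >> 8) | ((a & 0xFF) << 8));  /* swap the two pixels   */
    }
}

/* Hypothetical helper: mirror an 8-bit-per-pixel image left to right,
   one scanline at a time. width is in pixels and assumed to be even. */
void flip_image_h(u16 *src, u16 *dest, int width, int height)
{
    int halfwords_per_row = width / 2;   /* two 8-bit pixels per halfword */
    for (int y = 0; y < height; y++)
        copybackwards(src + y * halfwords_per_row,
                      dest + y * halfwords_per_row,
                      halfwords_per_row);
}
```

For a full mode 4 frame that would be flip_image_h(src, dest, 240, 160), i.e. 120 halfwords per row.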

#174998 - Kensai - Sat Aug 14, 2010 5:59 pm

Thank you. Do you think the code can be made faster using ASM?

#174999 - Dwedit - Sat Aug 14, 2010 9:13 pm

There's the 32-bit algorithm to swap bytes within a word:
Code:

M1 = 00FF00FF
M2 = 0000FFFF

A = src[xxxx]

B = M1 & (A >> 8) //00FF00FF
C = A & M1
A = B | (C << 8)
B = M2 & (A >> 16) //0000FFFF
C = (A & M2)
A = B | (C << 16)

dest[xxxx] = A


And the 16 bit algorithm:

Code:

A = src[xxxx]
B = (A << 8)
A = (A >> 8) | B
dest[xxxx] = A
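
For reference, the two pseudocode blocks above translate to C roughly as follows. This is only a sketch for checking the bit logic on a PC; the function names are made up, and it says nothing about cycle counts.

```c
#include <stdint.h>

/* 32-bit version: two mask-and-shift passes, abcd -> badc -> dcba. */
uint32_t swap_bytes32(uint32_t a)
{
    const uint32_t m1 = 0x00FF00FFu;
    const uint32_t m2 = 0x0000FFFFu;

    a = (m1 & (a >> 8))  | ((a & m1) << 8);    /* swap bytes within each halfword */
    a = (m2 & (a >> 16)) | ((a & m2) << 16);   /* swap the two halfwords          */
    return a;
}

/* 16-bit version: the cast plays the role of the halfword store,
   which throws away everything above bit 15. */
uint16_t swap_bytes16(uint16_t a)
{
    return (uint16_t)((a >> 8) | (a << 8));
}
```

For example, swap_bytes32(0x11223344) gives 0x44332211 and swap_bytes16(0x1234) gives 0x3412.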


Let's assume source is the cartridge and dest is VRAM, and this is a GBA.

32 bit takes 20 cycles to copy and swap 4 bytes. (4 more when you make it loop)
16 bit takes 11 cycles to copy and swap 2 bytes. (4 more when you make it loop)

Note: I never did get the hang of timing ARM instructions correctly, what with the waitstates and all that, so there might be mistakes in the timing.

#175009 - Ruben - Sun Aug 15, 2010 12:37 pm

Building on Dwedit's post...

Code:
@ Reverse memcpy.
@ Stores 16-bits
@ not very optimized, but
@ gets the job done AFAIK.
@ Cost is roughly 3 + 8x
@ where x is the number of
@ bytes to be copied.
@ ---
@ r0: dst
@ r1: src
@ r2: cnt [in bytes]

memcpyr16:
    subs   r2, #2             @  1 ( 1)
    ldrcsh r3, [r1, r2]       @ ~6 ( 7)
    movcs  ip, r3, lsr #8     @  1 ( 8)
    orrcs  r3, ip, r3, lsl #8 @  1 ( 9)
    strcsh r3, [r0], #2       @ ~4 (13)
    bhi    memcpyr16          @  3 (16) bhi: also exits when cnt is 0 or odd
    bx     lr                 @  3 (19)


EDIT: Fixed formatting and added timing + typo

#175020 - Miked0801 - Mon Aug 16, 2010 9:37 pm

Any chance this image is a sprite and you could use the X-Flip bit? Or perhaps a tile-based BG, in which case you could use the same kind of bit? If it's a 3D layer or something that allows scaling, you could use a negative X scale to get this effect as well.
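
A minimal sketch of what "using the flip bit" amounts to, working on plain 16-bit values so it can be tried off-target. The macro and function names are illustrative, not taken from a particular header; the bit positions are the usual GBA ones (bit 12 of OBJ attribute 1 for regular, non-affine sprites, and bit 10 of a text-mode BG screen entry).

```c
#include <stdint.h>

/* Illustrative names; check your own headers for the real ones. */
#define OBJ_ATTR1_HFLIP  (1u << 12)   /* regular (non-affine) sprites only */
#define SE_HFLIP         (1u << 10)   /* text-mode BG screen entry         */

/* Toggle horizontal flip in a sprite's attribute 1 value. */
uint16_t obj_attr1_toggle_hflip(uint16_t attr1)
{
    return (uint16_t)(attr1 ^ OBJ_ATTR1_HFLIP);
}

/* Toggle horizontal flip in a tile map (screen) entry. */
uint16_t se_toggle_hflip(uint16_t se)
{
    return (uint16_t)(se ^ SE_HFLIP);
}
```

Either way the hardware does the mirroring while drawing, so no pixel data has to be copied at all.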

#175027 - Cearn - Wed Aug 18, 2010 8:36 pm

32bit-based version using some ROR <3.
Code:
/*!
    @function void memrcpy32(const void *src, void *dst, uint size);
    Byte-reverse copies \a size/4 words from \a src to \a dst.
    @param  src     Source pointer.
    @param  dst     Destination pointer. Points to the START of the buffer.
    @param  size    number of bytes to copy.
    @note   Kinda expects word-alignment for everything (for now).
*/
    .section .iwram, "ax", %progbits
    .arm
    .align
    .global memrcpy32
memrcpy32:
    bics    r2, r2, #3              @ word-align size,
    bxeq    lr                      @ and perhaps quick escape.
    stmfd   sp!, {r4}
   
    add     r1, r1, r2              @ point dst to its (not it's) tail.
    ldr     ip,=0x00FF00FF
.LrcpyLoop:
        ldmia   r0!, {r3}               @ r3: abcd  ; r3= *src++;
        and     r4, ip, r3, ror #16     @ r4: 0d0b
        and     r3, ip, r3, ror #24     @ r3: 0c0a
        orr     r3, r3, r4, lsl #8      @ r3: dcba
        stmdb   r1!, {r3}               @           ; *--dst= r3;
        subs    r2, r2, #4
        bne     .LrcpyLoop

    ldmfd   sp!, {r4}
    bx      lr

Of course, this only works if everything is word aligned. If not, you'll have to account for the misalignments, as well as deal with the head and tail ending in the middle of a word.
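
If you want something to check the assembly against on a PC, a C model of the same loop could look like this. It is only a sketch under the same assumptions (word-aligned pointers, size rounded down to a multiple of 4); memrcpy32_c and ror32 are names made up for this example.

```c
#include <stdint.h>
#include <stddef.h>

/* Rotate right; n must be in 1..31. */
static uint32_t ror32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32u - n));
}

/* Byte-reverse copy of size/4 words from src to dst.
   dst points to the START of the destination buffer. */
void memrcpy32_c(const uint32_t *src, uint32_t *dst, size_t size)
{
    const uint32_t mask = 0x00FF00FFu;
    size_t words = size / 4;                  /* word-align size         */

    dst += words;                             /* point dst to its tail   */
    while (words--) {
        uint32_t w  = *src++;                 /* w : abcd                */
        uint32_t lo = mask & ror32(w, 16);    /* lo: 0d0b                */
        uint32_t hi = mask & ror32(w, 24);    /* hi: 0c0a                */
        *--dst = hi | (lo << 8);              /*     dcba                */
    }
}
```

Feeding it the words 0x04030201, 0x08070605 should produce 0x05060708, 0x01020304, i.e. the bytes 1..8 come out as 8..1 on a little-endian machine.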

If possible, use sprites for image flipping.

#175030 - Ruben - Thu Aug 19, 2010 9:10 am

Can't you avoid the "add r1, r1, r2" by instead changing the loop to...
Code:
.LrcpyLoop:
        subs    r2, r2, #4
        ldrcs   r3, [r0, r2]            @ r3: abcd ; r3= src[end--]
        andcs   r4, ip, r3, ror #16     @ r4: 0d0b
        andcs   r3, ip, r3, ror #24     @ r3: 0c0a
        orrcs   r3, r3, r4, lsl #8      @ r3: dcba
        stmcsia r1!, {r3}               @          ; *dst++ = r3;
        bhi     .LrcpyLoop
Or does that screw up the sequential access timing?