gbadev.org forum archive

Hi,

I'm looking for a "reverse" version of memcpy32(), which could be used to flip an image horizontally.

That wouldn't flip an image though, since pixels are 16-bit, not 32-bit.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

Yes, the function would have to flip the 4 bytes (I need it for mode 4).

Just use DMA, with Source and Destination address going in reverse directions.

Thank you for the advice. But how do I flip the bytes? You can't set "Chunk Size" to 8 bit. The only option is DMA_16 and it only flips the halfwords:

[1][2][3][4] -> [3][4][1][2]

What I need is:

[1][2][3][4] -> [4][3][2][1]
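To see why a 16-bit reversed copy ends up with that first ordering, here is a small host-side model in C. It is only a sketch, not GBA DMA code, and `rcopy16` is a made-up name. It assumes the transfer walks the source backwards one halfword at a time while the destination advances, and it never touches the byte order inside a halfword.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical host-side model of a 16-bit transfer whose source address
 * decrements while the destination increments. Each halfword is moved as
 * an opaque 2-byte unit, so the bytes inside it keep their order. */
static void rcopy16(uint8_t *dst, const uint8_t *src, size_t halfwords)
{
    for (size_t i = 0; i < halfwords; i++) {
        /* Take the i-th halfword counted from the end of src. */
        memcpy(dst + 2 * i, src + 2 * (halfwords - 1 - i), 2);
    }
}

/* Usage: the four bytes from the post above. */
static void rcopy16_example(void)
{
    const uint8_t src[4] = {1, 2, 3, 4};
    uint8_t dst[4] = {0};

    rcopy16(dst, src, 2);

    /* The halfwords traded places, the bytes inside them did not. */
    assert(dst[0] == 3 && dst[1] == 4 && dst[2] == 1 && dst[3] == 2);
}
```

Getting all the way to [4][3][2][1] needs an extra byte swap inside every halfword, which a plain 16-bit transfer cannot do.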

Sorry, forgot that was an 8-bit mode, thought you were using 16 bit pixels...

Off the top of my head, untested, check it for bugs...

Code:

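void copybackwards(const u16 *src, u16 *dest, int size) // size is in halfwords
{
    src += size - 1;                        // start at the last halfword
    while (size > 0)
    {
        int a = *src--;                     // walk the source backwards
        a = (a >> 8) | ((a & 0xFF) << 8);   // swap the two bytes (pixels)
        *dest++ = a;                        // write the destination forwards
        size--;
    }
}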

post edited, forgot a left shift

Thank you. Do you think the code can be made faster using ASM?

There's the 32-bit algorithm to swap bytes within a word:

Code:

M1 = 00FF00FF
M2 = 0000FFFF

A = src[xxxx]

B = M1 & (A >> 8) //00FF00FF
C = A & M1
A = B | (C << 8)
B = M2 & (A >> 16) //0000FFFF
C = (A & M2)
A = B | (C << 16)

dest[xxxx] = A

And the 16 bit algorithm:

Code:

A = src[xxxx]
B = (A << 8)
A = (A >> 8) | B
dest[xxxx] = A
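For anyone who wants to check the two routines above on a PC first, they might look roughly like this in C. This is a sketch: the function names `swap_bytes32` and `swap_bytes16` are made up, and the fixed-width types come from `<stdint.h>` rather than from any GBA header.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the four bytes of a word: abcd -> dcba. */
static uint32_t swap_bytes32(uint32_t a)
{
    const uint32_t m1 = 0x00FF00FFu;
    const uint32_t m2 = 0x0000FFFFu;

    /* Swap neighbouring bytes: abcd -> badc. */
    a = ((a >> 8) & m1) | ((a & m1) << 8);
    /* Swap the two halfwords: badc -> dcba. */
    a = ((a >> 16) & m2) | ((a & m2) << 16);
    return a;
}

/* Swap the two bytes of a halfword: ab -> ba. */
static uint16_t swap_bytes16(uint16_t a)
{
    /* The cast drops the bits shifted above bit 15, like a halfword store. */
    return (uint16_t)((a >> 8) | (a << 8));
}

/* Usage example. */
static void swap_bytes_example(void)
{
    assert(swap_bytes32(0x11223344u) == 0x44332211u);
    assert(swap_bytes16(0x1234u) == 0x3412u);
}
```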

Let's assume source is the cartridge and dest is VRAM, and this is a GBA.

32 bit takes 20 cycles to copy and swap 4 bytes. (4 more when you make it loop)
16 bit takes 11 cycles to copy and swap 2 bytes. (4 more when you make it loop)

Note: I never did get the hang of timing ARM instructions correctly, with the wait states and all that, so there might be mistakes in the timing.

Building on Dwedit's post...

Code:

@ Reverse memcpy.
@ Stores 16-bits
@ not very optimized, but
@ gets the job done AFAIK.
@ Cost is roughly 3 + 8x
@ where x is the number of
@ bytes to be copied.
@ ---
@ r0: dst
@ r1: src
@ r2: cnt [in bytes]

memcpyr16:
subs r2, #2 @ 1 ( 1)
ldrcsh r3, [r1, r2] @ ~6 ( 7)
movcs ip, r3, lsr #8 @ 1 ( 8)
orrcs r3, ip, r3, lsl #8 @ 1 ( 9)
strcsh r3, [r0], #2 @ ~4 (13)
bhi memcpyr16 @ 3 (16) ; bhi rather than bne, so a zero or odd cnt exits instead of underflowing
bx lr @ 3 (19)

EDIT: Fixed formatting and added timing + typo

Any chance this image is a sprite and you could use the X-Flip bit? Or perhaps a tile based BG in which you could also use the same bit? If it's a 3D layer or something that allows scaling, you could negative X scale to get this effect as well.
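If the hardware flip is an option, it comes down to toggling one bit. The sketch below assumes the usual GBA layouts: horizontal flip is bit 12 of OBJ attribute 1 for regular (non-affine) sprites and bit 10 of a regular tiled background's map entry. The macro and function names are made up for illustration and do not come from any particular library.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions assume the standard GBA layouts; names are hypothetical. */
#define OBJ_ATTR1_HFLIP (1u << 12) /* regular (non-affine) sprites only */
#define BG_SE_HFLIP     (1u << 10) /* regular tiled BG map entry */

/* Toggle the horizontal flip of a sprite's attribute 1 value. */
static uint16_t obj_attr1_toggle_hflip(uint16_t attr1)
{
    return (uint16_t)(attr1 ^ OBJ_ATTR1_HFLIP);
}

/* Toggle the horizontal flip of a background map entry. */
static uint16_t bg_se_toggle_hflip(uint16_t se)
{
    return (uint16_t)(se ^ BG_SE_HFLIP);
}

/* Usage example on plain values (no hardware access). */
static void flip_example(void)
{
    uint16_t attr1 = 0x4050; /* size field = 1, X = 80 */
    uint16_t se    = 0x2005; /* palette 2, tile 5 */

    assert(obj_attr1_toggle_hflip(attr1) == 0x5050);
    assert(bg_se_toggle_hflip(se) == 0x2405);
}
```

On real hardware the results would be written back to OAM or to the map in VRAM. None of this applies to a mode 4 bitmap, which has no per-tile flip bits.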

32bit-based version using some ROR <3.

Code:

/*!
@function void memrcpy32(const void *src, void *dst, uint size);
Byte-reverse copies \a size/4 words from \a src to \a dst.
@param src Source pointer.
@param dst Destination pointer. Points to the START of the buffer.
@param size number of bytes to copy.
@note Kinda expects word-alignment for everything (for now).
*/
.section .iwram, "ax", %progbits
.arm
.align
.global memrcpy32
memrcpy32:
bics r2, r2, #3 @ word-align size,
bxeq lr @ and perhaps quick escape.
stmfd sp!, {r4}

add r1, r1, r2 @ point dst to its (not it's) tail.
ldr ip,=0x00FF00FF
.LrcpyLoop:
ldmia r0!, {r3} @ r3: abcd ; r3= *src++;
and r4, ip, r3, ror #16 @ r4: 0d0b
and r3, ip, r3, ror #24 @ r3: 0c0a
orr r3, r3, r4, lsl #8 @ r3: dcba
stmdb r1!, {r3} @ ; *--dst= r3;
subs r2, r2, #4
bne .LrcpyLoop

ldmfd sp!, {r4}
bx lr

Of course, this only works if everything is word aligned. If not, you'll have to account for the misalignments, as well as deal with the head and tail ending in the middle of a word.
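For the unaligned cases, the simplest fallback is probably a plain byte loop. The sketch below is a hypothetical helper named `memrcpy8`; it keeps the (src, dst, size) argument order of memrcpy32 above and assumes the two buffers do not overlap. For word-aligned buffers and a size that is a multiple of 4 it should produce the same bytes as memrcpy32, namely dst[i] = src[size-1-i].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fallback: byte-reversed copy with no alignment or size
 * requirements. Argument order mirrors memrcpy32 (src first).
 * src and dst must not overlap. */
static void memrcpy8(const void *src, void *dst, size_t size)
{
    const uint8_t *s = (const uint8_t *)src + size; /* one past the end */
    uint8_t *d = (uint8_t *)dst;

    while (size--) {
        *d++ = *--s; /* read backwards, write forwards */
    }
}

/* Usage: odd start address and odd length. */
static void memrcpy8_example(void)
{
    const uint8_t src[6] = {0, 1, 2, 3, 4, 5};
    uint8_t dst[5] = {0};

    memrcpy8(src + 1, dst, 5);

    assert(dst[0] == 5 && dst[1] == 4 && dst[2] == 3);
    assert(dst[3] == 2 && dst[4] == 1);
}
```

One caveat: on the GBA, byte stores to VRAM do not behave like normal byte stores, so a loop like this only suits a destination buffer in work RAM. A mode 4 framebuffer still has to be written in halfwords or words.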

If possible, use sprites for image flipping.

Can't you avoid the "add r1, r1, r2" by instead changing the loop to...

Code:

.LrcpyLoop:
subs r2, r2, #4
ldrcs r3, [r0, r2] @ r3: abcd ; r3= src[end--]
andcs r4, ip, r3, ror #16 @ r4: 0d0b
andcs r3, ip, r3, ror #24 @ r3: 0c0a
orrcs r3, r3, r4, lsl #8 @ r3: dcba
stmcsia r1!, {r3} @ ; *dst++ = r3;
bhi .LrcpyLoop

Or does that screw up the sequential access timing?

ASM > "reverse" memcpy32()

#174987 - Kensai - Fri Aug 13, 2010 8:27 pm

#174988 - Dwedit - Fri Aug 13, 2010 9:14 pm

#174989 - Kensai - Fri Aug 13, 2010 9:23 pm

#174990 - Dwedit - Fri Aug 13, 2010 9:55 pm

#174991 - Kensai - Fri Aug 13, 2010 11:13 pm

#174992 - Dwedit - Sat Aug 14, 2010 1:13 am

#174998 - Kensai - Sat Aug 14, 2010 5:59 pm

#174999 - Dwedit - Sat Aug 14, 2010 9:13 pm

#175009 - Ruben - Sun Aug 15, 2010 12:37 pm

#175020 - Miked0801 - Mon Aug 16, 2010 9:37 pm

#175027 - Cearn - Wed Aug 18, 2010 8:36 pm

#175030 - Ruben - Thu Aug 19, 2010 9:10 am