The purpose of this document is to present several optimization techniques that can be employed to speed up code generated by gcc for the m68k.
Our tool of choice is going to be the `tools/disass.py` script, which produces a disassembly listing along with a cycle count for each instruction and for each function.
- Less code = fewer cycles. This may sound trivial, but it will be our guiding principle.
- Shorter instructions are faster (and consume fewer bus cycles) because the CPU doesn't have to fetch as many bytes.
We want tight loops to have as few instructions as possible, preferably short ones. If you notice instructions in a loop that are 3-4 bytes long and contain a constant, place that constant in a variable before the loop (which should effectively put it in a register) and use that variable inside the loop instead, as shown in the sketch below.
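A hedged illustration of this idea (the function, its arguments and the magic constant are made up; gcc may sometimes hoist the constant by itself, so always verify with `tools/disass.py`):

/* Before: the constant is encoded as immediate data in the store
 * instruction on every iteration. */
void FillBefore(short *dst, short len) {
  short i;
  for (i = 0; i < len; i++)
    dst[i] = 0x4e71;
}

/* After: the constant is loaded into a local variable (ideally a data
 * register) once, so the loop body uses a shorter register-to-memory move. */
void FillAfter(short *dst, short len) {
  short c = 0x4e71;
  short i;
  for (i = 0; i < len; i++)
    dst[i] = c;
}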
The compiler likes `short`s. Try to avoid using `u_short`s when you don't absolutely have to - `short`s tend to generate fewer instructions because the compiler doesn't have to work around the sign bit in some cases. Always compare the generated assembly before and after the change though - in rare cases the generated assembly will be faster with `u_short`s.
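As a hedged, made-up experiment, the two accessors below differ only in the index type; compiling both and comparing them with `tools/disass.py` shows whether the unsigned variant really costs extra instructions:

typedef unsigned short u_short; /* normally provided by the project's headers */

/* Signed word index: the compiler can feed it to the sign-extending
 * (aN,dN.w) addressing mode more or less directly. */
short GetSigned(short *t, short i) {
  return t[i];
}

/* Unsigned word index: the compiler may have to zero-extend it to a long
 * first, which adds instructions. In rare cases the opposite happens. */
short GetUnsigned(short *t, u_short i) {
  return t[i];
}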
Loops in the form of `for (i = n - 1; i >= 0; i--)` (or any other that can be rewritten into this form) can take advantage of the dedicated `dbf` (decrement and branch) instruction. The general form is:
short n = upper_limit - 1;
do {
  // ...
} while (--n != -1);
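For instance, a hypothetical array-clearing loop (`t` and `n` are made-up names) rewritten into this shape, so that gcc can turn the countdown into a single `dbf`:

/* equivalent to: for (i = n - 1; i >= 0; i--) t[i] = 0; */
short i = n - 1;
do {
  t[i] = 0;
} while (--i != -1);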
Accessing memory through global variables makes the instruction longer and therefore slower. If you need to use a global symbol in your function, pass it as a parameter - you can pass up to 4 arguments in registers before the compiler starts putting them on the stack. They end up in registers d0, d1, a0 and a1. If there are more than 2 data arguments, the compiler will place the extra ones in address registers, and vice versa.
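A hedged sketch of that idea (the global table, the function and its signature are all made up):

short counters[16]; /* hypothetical global table */

/* Instead of addressing 'counters' through its absolute address inside the
 * function, receive a pointer to it as an argument - it should arrive in an
 * address register, so every access becomes a short register-indirect
 * instruction. */
static void BumpCounters(short *tab, short n) {
  short i = n - 1;
  do {
    tab[i]++;
  } while (--i != -1);
}

/* call site: BumpCounters(counters, 16); */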
Alternatively, you can force a specific variable into a specific register, e.g.:
register volatile void *ptr asm("a5") = &some_global;
By accessing consecutive elements in memory you can take advantage of a dedicated m68k addressing mode (post-increment) that performs both the memory access and the pointer increment in a single instruction. The following example demonstrates how to make the compiler generate it, combined with the previous technique for optimizing loops:
short *pts = (short *)some_array;
short n = some_array_length - 1;
do {
  short x = *pts++;
  short y = *pts++;
  // ...
} while (--n != -1);
This also works with structs and arrays of structs, as in the sketch below.
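A hedged sketch with an array of 2D points (the struct and all names are made up):

typedef struct Point {
  short x, y;
} PointT;

/* The struct pointer is advanced once per iteration; the compiler may fold
 * the increment into post-incremented field accesses. */
void AddOffset(PointT *pts, short n, short dx, short dy) {
  short i = n - 1;
  do {
    pts->x += dx;
    pts->y += dy;
    pts++;
  } while (--i != -1);
}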
Here's a more complicated example for fast blitter configuration. In this case we take advantage of the fact that the blitter's registers are laid out consecutively in memory. Starting at `bltcon1`'s address, we fill consecutive memory addresses which correspond to some of the blitter registers. The order of all blitter registers (and more) can be checked in `include/custom_regdef.h`.
register volatile void *ptr asm("a5") = &custom_->bltcon1;
/* Comment it out if you're feeling lucky! */
_WaitBlitter(custom_);
*((short *)ptr)++ = shift; // bltcon1
*((int *)ptr)++ = mask; // bltafwm & bltalwm
*((int *)ptr)++ = (int)dstpt; // bltcpt
*((int *)ptr)++ = (int)srcpt; // bltbpt
ptr += 4; // bltapt
*((int *)ptr)++ = (int)dstpt; // bltdpt
*((short *)ptr)++ = BLTSIZE; // bltsize
Consider a simple example, where the same calculation takes place in every iteration of a loop:
for (i = 0; i < n; i++) {
  short x = t[i] + SOME_CONST;
  // ...
}
If `SOME_CONST` is always the same, the values in `t` can be modified to already have `SOME_CONST` added from the beginning, so that it doesn't have to be added in every loop iteration, as in the sketch below.
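A hedged sketch of that transformation (`t`, `n` and `SOME_CONST` as above, `InitTable` is a made-up name); the addition is paid once during setup instead of once per iteration of the hot loop:

/* Run once during setup: bake SOME_CONST into the table. */
void InitTable(short *t, short n) {
  short i = n - 1;
  do {
    t[i] += SOME_CONST;
  } while (--i != -1);
}

/* The hot loop then reduces to: short x = t[i]; */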