Skip to content

Latest commit

 

History

History
84 lines (58 loc) · 4.01 KB

optimization.md

File metadata and controls

84 lines (58 loc) · 4.01 KB

Hand-optimizing C code by looking at the generated assembly

Purpose of this document is to present several optimization techniques which can be employed to speed up code generated by gcc for the m68k.

Our tool of choice is going to be tools/disass.py script, which produces a disassembly output along with cycle count per each instruction and per each function.

General rules of thumb:

  1. Less code = less cycles. This may sound trivial, but will be our guiding principle.
  2. Shorter instructions are faster (and consume less bus cycles) because the CPU doesn't have to fetch as many bytes

Put constants used in loops in registers

We want tight loops to have as little instructions as possible, preferably short ones. If you notice long instructions in a loop that are 3-4 bytes long and contain a constant, you should place that constant in a variable before the loop (that should effectively place it in a register) and use that variable inside the loop instead.

Use shorts instead of u_shorts

Compiler likes shorts. Try to avoid using u_shorts when you don't absolutely have to - shorts tend to generate fewer instructions because the compiler doesn't have to work around the sign bit in some cases. Always compare the generated assembly before and after the change though - in rare cases the generated assembly will be faster with u_shorts.

Optimizing loops - force use of dedicated looping instruction

Loops in the form of for (i = n - 1; i >= 0; i--) (or any other that can be rewritten into this form) can take advantage of a dedicated dbf - decrement and branch instruction. General form is:

short n = upper_limit - 1;
do {
    // ...
} while (--n != -1);

Avoid using globals in loops

Accessing memory with global variables makes the instruction longer and therefore slower. If you need to use a global symbol in your function, pass it as a parameter - you can pass up to 4 arguments in registers before the compiler will start putting them on the stack. They end up in registers d0, d1, a0, a1. If there are more than 2 data arguments, the compiler will put them in the address registers and vice versa. Alternatively you can force putting a specific variable in a specific register, e.g: register volatile void *ptr asm("a5") = &some_global;

Data structure design

Place data in memory in the order of access

By accessing consecutive elements in memory you can take advantage of a dedicated instruction on m68k which does both the memory access and pointer incrementing. Following example demonstrates how to make the compiler generate it, used with the previous technique for optimizing loops:

short *pts = (short *)some_array;
short n = some_array_length - 1;
do {
  short x = (*pts++);
  short y = (*pts++);
  // ...
} while (--n != -1)

This also works with structs and arrays of structs.

Here's a more complicated example for fast blitter configuration. In this case we take advantage of the fact that blitter's registers are laid out consecutively in memory. Starting at bltcon1's address we start filling consecutive memory addresses which correspond to some blitter registers. Order of all blitter registers (and more) can be checked in include/custom_regdef.h.

register volatile void *ptr asm("a5") = &custom_->bltcon1;

/* Comment it out if you're feeling lucky! */
_WaitBlitter(custom_);

*((short *)ptr)++ = shift;    // bltcon1
*((int *)ptr)++ = mask;       // bltaltwm & bltafwm
*((int *)ptr)++ = (int)dstpt; // bltcpt
*((int *)ptr)++ = (int)srcpt; // bltbpt
ptr += 4;                     // bltapt
*((int *)ptr)++ = (int)dstpt; // bltdpt
*((short *)ptr)++ = BLTSIZE;  // bltsize

Move as much calculations to the data as you (reasonably) can

Consider a simple example, where there's always some calculations taking place in a loop:

for (i = 0; i < n; i++) {
  short x = t[i] + SOME_CONST;
  // ...
}

If SOME_CONST is always the same, values in t can be modified to already have SOME_CONST added from the beginning, so that it doesn't have to be added in every loop iteration.