Programming the Falcon030

Many years ago I started writing a document on how to write efficient code for the Atari Falcon030. Before I could finish it, my university studies began to demand more and more of my time, and eventually I had no time left for the Atari. I forgot about the document, but the other day I found it on my hard disk.

As more and more people seem to be interested in programming the Falcon030 again, I have decided to spend some time finishing the document, or at least parts of it. The first section is about the caches of the Motorola 68030, as I consider a good understanding of the caches to be the single most important factor in writing efficient code for the Falcon030.

If you read this and find any of the information useful, please let me know! It will help me stay motivated to finish and publish more parts. You can find my e-mail address here.

Thanks,
Daniel Hedberg

68030 Caches

Memory access is very expensive on the Falcon030 due to its narrow 16-bit data bus, which the CPU also has to share with various other chips. Maximizing performance requires good use of the internal CPU caches.

The Motorola 68030 CPU of the Falcon030 has two on-chip caches, an instruction cache and a data cache. The caches improve performance by reducing external bus activity.

Unfortunately, the caches are somewhat crippled by the 16-bit bus and the fact that the architecture of the Falcon030 does not support the burst modes of the caches, but even with a 16-bit bus and without burst modes the caches are extremely useful.

The caches are controlled by the Cache Control Register (CACR).

CACR: Cache Control Register
Bit:    13   12   11   10   9    8    |  4    3    2    1    0
Field:  WA   DBE  CD   CED  FD   ED   |  IBE  CI   CEI  FI   EI

WA:	Write Allocate
DBE:	Data Cache Burst Enable - Not supported by the Falcon030!
CD:	Clear Data Cache
CED:	Clear Entry in Data Cache
FD:	Freeze Data Cache
ED:	Enable Data Cache

IBE:	Instruction Cache Burst Enable - Not supported by the Falcon030!
CI:	Clear Instruction Cache
CEI:	Clear Entry in Instruction Cache
FI:	Freeze Instruction Cache
EI:	Enable Instruction Cache

The CACR can be read and written using the movec instruction. Note that movec is a privileged instruction, so the code must run in supervisor mode. For example, to enable and clear the data cache you would do the following:

        movec   cacr,d0
        bset    #8,d0           ; Enable data cache
        bset    #11,d0          ; Clear data cache
        movec   d0,cacr
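If you prefer names over bare bit numbers, the same operation can be sketched as follows. This assumes a Devpac-style assembler with Motorola syntax; the CACR_* names are my own invention for this example, not standard symbols, and the code must run in supervisor mode.

```asm
; CACR bit numbers, taken from the table above
; (the CACR_* names are made up for this example)
CACR_EI equ     0               ; Enable Instruction Cache
CACR_FI equ     1               ; Freeze Instruction Cache
CACR_CI equ     3               ; Clear Instruction Cache
CACR_ED equ     8               ; Enable Data Cache
CACR_FD equ     9               ; Freeze Data Cache
CACR_CD equ     11              ; Clear Data Cache
CACR_WA equ     13              ; Write Allocate

; Enable and clear both caches in one go (mask = $0909)
        movec   cacr,d0
        or.w    #(1<<CACR_EI)+(1<<CACR_CI)+(1<<CACR_ED)+(1<<CACR_CD),d0
        movec   d0,cacr
```

The clear bits (CI and CD) always read back as zero, so OR-ing them into the value read from the CACR does not disturb the other settings.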

Instruction Cache

The instruction cache has 256 bytes with 16 lines (4 longwords each).
All longwords/entries of a cache line share the same tag value (the 24 most significant address bits).
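To see which cache line an address ends up in, split it the way the cache does: bits 31-8 form the tag, bits 7-4 select one of the 16 lines, and bits 3-2 select the longword within the line. A minimal sketch, assuming Devpac-style Motorola syntax and ignoring the function code bits that are also part of the tag:

```asm
; In:  a0 = address to examine
; Out: d0 = cache line index (0-15), d1 = tag (address bits 31-8)
        move.l  a0,d0
        move.l  d0,d1
        lsr.l   #8,d1           ; d1 = A31-A8, the tag
        lsr.w   #4,d0           ; move A7-A4 down to bits 3-0
        and.w   #$000f,d0       ; d0 = line index
```

Two addresses collide in the cache when they give the same line index but different tags.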

The instruction cache is fairly simple. Instructions are loaded into the cache one longword at a time as they are fetched from memory. Whenever an instruction is executed again and its memory address is still present in the cache, the instruction is fetched from the cache rather than from memory.

Even without taking the instruction cache into consideration when writing code you will benefit from it, but with some extra care you can easily boost the performance of your code even more:

UNWIND LOOPS
Unwind (unroll) loops so that more work is done per branch while the code still runs from the cache, but make sure the size of the unwound loop does not exceed 256 bytes.
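As an illustration, here is a sketch of a memory clear loop unwound eight times with the rept directive. It assumes Devpac-style Motorola syntax; the buffer size of 32 KB and the register usage are just choices made for this example.

```asm
; Clear 32 KB starting at the address in a0, 8 longwords per iteration.
; Loop size: 8 x 2 bytes (move.l) + 4 bytes (dbra) = 20 bytes,
; far below the 256 byte limit.
        moveq   #0,d0
        move.w  #(32768/32)-1,d7        ; 1024 iterations of 32 bytes each
clear_loop:
        rept    8
        move.l  d0,(a0)+
        endr
        dbra    d7,clear_loop
```

When choosing the repeat count, add up the instruction sizes of the unwound body to check that the whole loop stays within 256 bytes.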

INLINE SMALL SUBROUTINES (FUNCTIONS)
If you call a subroutine from a loop that fits in the cache, be aware that the instructions of the subroutine may overwrite the instructions of the loop in the cache. This happens when code in the loop and code in the subroutine map to the same cache line, that is, when their addresses have the same bits 4-7 (part of the least significant byte) but differ in the upper 24 bits. If your subroutine is small enough to fit in the cache along with the loop, you should inline the subroutine to prevent this. Small subroutines can be rewritten as parameterized macros, making inlining easy.
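A small subroutine turned into a parameterized macro could look like the sketch below. It assumes the Devpac-style macro syntax with \1 and \2 as parameters; FIXMUL and the surrounding loop are hypothetical and only serve as an example.

```asm
; Hypothetical helper: multiply \2 by the 8.8 fixed-point factor in \1
; Usage: FIXMUL factor,Dn
FIXMUL  macro
        muls.w  \1,\2
        asr.l   #8,\2
        endm

scale_loop:
        move.w  (a0)+,d0
        FIXMUL  d1,d0                   ; expands in place, no bsr/rts
        move.w  d0,(a1)+
        dbra    d7,scale_loop
```

Because the macro body is assembled directly into the loop, loop and "subroutine" occupy consecutive addresses and cannot evict each other as long as the total stays within 256 bytes.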

FREEZE BEFORE BRANCHING OUT OF LOOPS
If you branch out to subroutines from a loop that fits in the cache, and the subroutines cannot be inlined without exceeding the size of the cache, you should freeze the cache before branching out and unfreeze it when you return. This prevents the loop from being reloaded into the cache over and over again due to collisions.
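One way to keep the cost of freezing low is to prepare both CACR values in registers beforehand, so that each switch is a single movec. This is only a sketch under a few assumptions: Devpac-style syntax, supervisor mode, a hypothetical big_subroutine that preserves d5-d7, and a loop body that fits in the cache.

```asm
        movec   cacr,d7
        bclr    #1,d7                   ; d7 = normal setting (FI clear)
        move.l  d7,d6
        bset    #1,d6                   ; d6 = same setting with FI set

main_loop:
        ; first part of the loop body goes here
        movec   d6,cacr                 ; freeze the instruction cache
        bsr     big_subroutine          ; too large to inline
        movec   d7,cacr                 ; unfreeze
        ; rest of the loop body goes here
        dbra    d5,main_loop
```

While the cache is frozen, the subroutine runs from memory, but the loop code already in the cache stays there for the next iteration.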

MARK INTERRUPT SERVICE CODE AS NON-CACHEABLE
If you make use of timers or other interrupts occurring many times during a VBL, the performance of the instruction cache will suffer, as the interrupt service code will overwrite entries in the cache every time it is executed. The best way to keep the interrupt service code from disturbing the contents of the cache is to reprogram the PMMU to make the address space used by the interrupt handlers non-cacheable. To some extent this advice also applies to the data cache, but only under very specific circumstances.

Data Cache

The data cache has 256 bytes with 16 lines (4 longwords each).
All longwords/entries of a cache line share the same tag value (the 24 most significant address bits).

The data cache is a write-through cache, which means that on memory writes, the CPU writes to memory even if a cache hit occurs. In the event of a cache hit, the cache is also updated, even if the cache is frozen. On a cache miss, the cache is only updated if the WA (Write Allocate) bit is set.

On memory reads, the data is fetched from the cache rather than from memory if the memory address of the data is present in the cache. If a cache miss occurs, the data is read from memory and the cache is updated.

On average, enabling the data cache should boost performance, but on the Falcon030 that is not always true due to its 16-bit bus. It may very well be that enabling the data cache causes a drop in performance. The path to success lies in identifying the parts of your code where the data cache causes overhead and the parts where it can be beneficial, and adapting your code accordingly.

Whether you should keep the data cache on by default and disable/freeze it when needed, or keep it off by default and enable/freeze it when needed will depend on what your code looks like. If you take the time to analyze your code and follow the advice below it should not make much of a difference. Personally I usually keep the data cache enabled by default.

CLEAR WHEN ENABLING
Always clear the data cache when enabling it to avoid possible memory/cache inconsistencies.

DISABLE BEFORE COPYING LARGE AMOUNTS OF DATA
When copying large amounts of data using instructions with read-write memory accesses in a loop, for example move.l (a0)+,(a1)+, disable the data cache before the loop to avoid the overhead of the cache being updated in between reads and writes. When enabling the cache again, remember to clear it if you plan to read from the memory written to, as the cache will not be up to date.
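The following sketch shows the idea. It assumes Devpac-style syntax and supervisor mode; COUNT is a hypothetical constant holding the number of bytes to copy (a multiple of 4, at most 256 KB because of the 16-bit dbra counter).

```asm
        movec   cacr,d0
        move.l  d0,d1                   ; d1 = original setting
        bclr    #8,d0                   ; ED = 0: disable the data cache
        movec   d0,cacr

        move.w  #(COUNT/4)-1,d7
copy_loop:
        move.l  (a0)+,(a1)+
        dbra    d7,copy_loop

        bset    #11,d1                  ; CD = 1: clear the data cache
        movec   d1,cacr                 ; restore the original setting
```

Restoring the saved value with the CD bit set re-enables the cache only if it was enabled before, and in either case leaves no old entries behind.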

PRELOAD AND FREEZE
If you have a loop where you frequently access some addresses in memory, such as entries in a small look-up table, preload the data cache with the data by reading from those memory addresses and then freeze the cache before entering the loop. Avoid collisions in the cache by ensuring that the preloaded data is sequential (a single block of memory) and that its size does not exceed 256 bytes. A block that uses the full 256 bytes must also start on a 16-byte boundary, otherwise its first and last bytes map to the same cache line. By freezing the cache you prevent the preloaded data from being overwritten when other memory addresses are accessed during the loop iterations. Do not forget to unfreeze the cache when the loop is finished (clearing the cache when unfreezing it is not needed).

Please note that access to the cached memory addresses is not restricted to reads. You may write to them as well, as any cached data is updated in the cache even when the cache is frozen! The memory addresses in the cache are frozen but the data is not. Writes will not be cheaper though, as the data cache is a write-through cache.
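Put together, preloading and freezing could be sketched like this. The code assumes Devpac-style syntax, supervisor mode and a hypothetical 256-byte look-up table at the label table that starts on a 16-byte boundary.

```asm
        movec   cacr,d0
        bset    #8,d0                   ; ED: enable the data cache
        bset    #11,d0                  ; CD: clear it, all entries invalid
        bclr    #9,d0                   ; FD clear, so that read misses fill the cache
        movec   d0,cacr

        lea     table,a0
        moveq   #64-1,d1                ; 64 longwords = 256 bytes
preload:
        tst.l   (a0)+                   ; each read miss loads one longword entry
        dbra    d1,preload

        movec   cacr,d0
        bset    #9,d0                   ; FD: freeze the data cache
        movec   d0,cacr

        ; the loop using the table goes here

        movec   cacr,d0
        bclr    #9,d0                   ; unfreeze when the loop is finished
        movec   d0,cacr
```

Any other data read between the clear and the freeze, for example by an interrupt handler, may end up in the cache as well, so keep that window as short as possible.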

AVOID STALE DATA
Stale data means that the data in the cache no longer matches the data in memory. Stale data conditions can arise if the data cache is enabled while DMA devices (such as the Blitter) are writing to memory, effectively bypassing the data cache. This can easily be avoided by disabling the data cache while DMA devices are operating, and clearing the data cache before enabling it again.

If the data cache is configured to use the no write allocation mode (WA=0), stale data conditions can also arise when no device other than the CPU is accessing the memory. These situations are fairly rare and involve either accessing the same physical memory address using two or more logical addresses, or accessing the same physical address using different address space encodings.

When reading the chapter on the caches in the MC68030 User's Manual, at least I got the impression that unless you are messing with the MMU or switching back and forth between user and supervisor mode while accessing the same physical memory addresses, you have nothing to worry about. Not true. There is actually one situation which is not too uncommon and which is not described explicitly in the manual: PC-relative addressing.

PC-relative addressing modes can only be used for reading data and are often used as a form of optimization. However, for PC-relative addressing modes, the reference is a PROGRAM space reference, while for any other addressing mode, the reference is a DATA space reference. This is of importance as the data cache stores data references to any address space (except CPU space), and the address space is part of the tag of each line in the data cache.
So, when reading from a specific memory location using PC-relative addressing, the data cache will be updated with a PROGRAM space reference. If you later write to that memory location using any other addressing mode (remember that PC-relative addressing cannot be used for writes), the write is a DATA space reference, and the cache entry will not be updated because the address spaces differ. The data in the cache no longer matches that in memory and is stale.
As a consequence, the next time you read from the memory location using a PC-relative addressing mode, there will be a cache hit and the value you obtain is the stale data in the data cache.

If you plan to use PC-relative addressing with the data cache enabled, either restrict it to constant data (data that is never modified), or enable the write allocation mode (WA=1) of the data cache.
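The sequence below sketches how the stale read described above can come about, followed by the write allocation remedy. It assumes Devpac-style syntax, supervisor mode and an enabled data cache with WA=0; counter is a hypothetical variable placed next to the code so that it can be reached PC-relative.

```asm
        move.w  counter(pc),d0          ; PROGRAM space read, value gets cached
        addq.w  #1,d0
        move.w  d0,counter              ; DATA space write: cache miss, entry not updated
        move.w  counter(pc),d1          ; cache hit: d1 may hold the old value

; Remedy: enable write allocation
        movec   cacr,d0
        bset    #13,d0                  ; WA = 1
        movec   d0,cacr
        rts

counter:
        dc.w    0
```

With WA=1 the data space write replaces the cache entry left behind by the PC-relative read, so the next PC-relative read has to go to memory and returns the new value.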