Many years ago I started writing a document on how to write efficient code for the Atari
Falcon030. Before I was able to finish the document, university studies started requiring more
and more of my time and eventually I found myself with no time left for the Atari. I forgot about
the document but the other day I found it on my hard disk.
As more and more people seem to be interested in programming the Falcon030 again, I have decided
to spend some time on finishing the document, or at least parts of it. The first section is about
the caches of the Motorola 68030, as I consider good knowledge of the caches to be the single most
important factor in being able to write efficient code for the Falcon030.
If you read this and find any of the information useful, please let me know! It will help
me stay motivated to finish and publish more parts. You can find my e-mail address
here.
Thanks,
Daniel Hedberg
68030 Caches
Memory access is very expensive on the Falcon030 due to its poor 16-bit data bus, which is also
used by various other chips besides the CPU. Maximizing the performance requires good use of the
internal CPU caches.
The Motorola 68030 CPU of the Falcon030 has two on-chip caches, an instruction cache and a data
cache. The caches improve performance by reducing external bus activity.
Unfortunately, the caches are somewhat crippled by the 16-bit bus and the fact that the
architecture of the Falcon030 does not support the burst modes of the caches, but even with a
16-bit bus and without burst modes the caches are extremely useful.
The caches are controlled by the Cache Control Register (CACR).
CACR: Cache Control Register

Bit | Name | Description
----|------|------------------------------------------------------------------
 13 | WA   | Write Allocate
 12 | DBE  | Data Cache Burst Enable - Not supported by the Falcon030!
 11 | CD   | Clear Data Cache
 10 | CED  | Clear Entry in Data Cache
  9 | FD   | Freeze Data Cache
  8 | ED   | Enable Data Cache
  4 | IBE  | Instruction Cache Burst Enable - Not supported by the Falcon030!
  3 | CI   | Clear Instruction Cache
  2 | CEI  | Clear Entry in Instruction Cache
  1 | FI   | Freeze Instruction Cache
  0 | EI   | Enable Instruction Cache
The CACR register is read and written using the (privileged) movec instruction. For example,
to enable and clear the data cache you would do the following:
movec cacr,d0 ; Read the current CACR value
bset #8,d0 ; Enable data cache
bset #11,d0 ; Clear data cache
movec d0,cacr ; Write the new value back to CACR
Instruction Cache
The instruction cache has 256 bytes with 16 lines (4 longwords each).
All longwords/entries of a cache line share the same tag value (the 24 most significant address
bits).
The instruction cache is fairly simple. Instructions are loaded into the cache one longword at a
time as they are fetched from memory. Whenever an instruction is executed multiple times and its
memory address is present in the cache, the instruction is fetched from the cache rather than
from memory.
Even without taking the instruction cache into consideration when writing code you will benefit
from it, but with some extra care you can easily boost the performance of your code even more:
- UNROLL LOOPS
Unroll loops to maximize the use of the cache, but make sure the size of the loop does not exceed
256 bytes.
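As a minimal sketch (the registers and the block size are arbitrary), here is a longword clearing
loop unrolled four times; the whole loop is far below 256 bytes, so after the first iteration
every instruction is fetched from the cache:

; Clear (d0.w + 1) * 16 bytes starting at (a0), unrolled four times
clear_loop:
clr.l (a0)+
clr.l (a0)+
clr.l (a0)+
clr.l (a0)+
dbra d0,clear_loop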
- INLINE SMALL SUBROUTINES (FUNCTIONS)
If you call a subroutine from a loop that fits in the cache, be aware that the instructions of the
subroutine may overwrite the instructions of the loop in the cache. This happens if any of the
addresses of the code in the loop and the subroutine overlap with respect to their least
significant byte. If your subroutine is small enough to fit in the cache along with the
loop, you should inline the subroutine to prevent this from happening. Small subroutines can be
rewritten as parameterized macros, making inlining easy, as in the sketch below.
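A minimal sketch of the idea, using Devpac-style macro syntax (the name ADDSCALE and the registers
are made up; the exact macro syntax depends on your assembler):

; Subroutine version - every bsr risks evicting loop code from the cache
addscale:
add.w d1,d0 ; add offset
asl.w #2,d0 ; scale by four
rts

; Macro version - expands inline wherever it is used, no bsr/rts needed
ADDSCALE macro
add.w \2,\1 ; add offset
asl.w #2,\1 ; scale by four
endm

; Inside the loop:
ADDSCALE d0,d1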
- FREEZE BEFORE BRANCHING OUT OF LOOPS
If you branch out to subroutines from a loop that fits in the cache, and the subroutines cannot be
inlined without exceeding the size of the cache, you should freeze the cache before branching out
and unfreeze it when you return. This prevents the cache from being reloaded multiple times due to
collisions in the cache.
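A minimal sketch of the idea (big_routine is a placeholder, and d1 is assumed to be free; FI is
bit 1 of CACR):

movec cacr,d1
bset #1,d1 ; FI: freeze instruction cache so the loop code stays cached
movec d1,cacr
bsr big_routine ; subroutine too large to be inlined
movec cacr,d1
bclr #1,d1 ; unfreeze the instruction cache
movec d1,cacr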
- MARK INTERRUPT SERVICE CODE AS NON-CACHEABLE
If you make use of timers or other interrupts occurring many times during a VBL, the performance
of the instruction cache will suffer, as the interrupt service code will overwrite entries in the
cache every time it is executed. The best way to prevent the interrupt service code from messing
with the contents of the cache is to reprogram the PMMU to make the address space used by the
interrupt handlers non-cacheable. To some extent this advice also applies to the data cache, but
only under very specific circumstances.
Data Cache
The data cache has 256 bytes with 16 lines (4 longwords each).
All longwords/entries of a cache line share the same tag value (the 24 most significant address
bits).
The data cache is a write-through cache, which means that on memory writes, the CPU writes to
memory even if a cache hit occurs. In the event of a cache hit, the cache is also updated, even
if the cache is frozen. On a cache miss, the cache is only updated if the WA (Write Allocate) bit
is set.
On memory reads, the data is fetched from the cache rather than from memory if the memory
address of the data is present in the cache. If a cache miss occurs, the data is read from memory
and the cache is updated.
On average, enabling the data cache should boost performance, but on the Falcon030 that is
not always true due to its 16-bit bus. It may very well be that enabling the data cache causes a
drop in performance. The path to success lies in identifying the parts of your code where the data
cache causes overhead, and the parts of your code where the data cache can be beneficial, and
adapting your code accordingly.
Whether you should keep the data cache on by default and disable/freeze it when needed, or
keep it off by default and enable/unfreeze it when needed, will depend on what your code looks like.
If you take the time to analyze your code and follow the advice below, it should not make much of
a difference. Personally, I usually keep the data cache enabled by default.
- CLEAR WHEN ENABLING
Always clear the data cache when enabling it to avoid possible memory/cache inconsistencies.
- DISABLE BEFORE COPYING LARGE AMOUNTS OF DATA
When copying large amounts of data using instructions with read-write memory accesses in a loop,
for example move.l (a0)+,(a1)+ , disable the data cache before the loop to avoid
the overhead of the cache being updated between the reads and writes. When enabling the
cache again, remember to clear it if you plan to read from the memory written to, as the cache will
not be up to date.
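A minimal sketch (register choices and the loop count are arbitrary):

movec cacr,d0
bclr #8,d0 ; ED=0: disable the data cache before the copy loop
movec d0,cacr
copy:
move.l (a0)+,(a1)+
dbra d7,copy ; copy (d7.w + 1) longwords from (a0) to (a1)
movec cacr,d0
bset #8,d0 ; ED=1: enable the data cache again
bset #11,d0 ; CD=1: and clear it, in case the copied data is read later
movec d0,cacr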
- PRELOAD AND FREEZE
If you have a loop where you frequently access some addresses in memory, such as entries in a
small look-up table, preload the data cache with the data by reading from those memory addresses,
and then freeze the cache before entering the loop. Avoid collisions in the cache by ensuring that
the preloaded data is sequential (a single block of memory) and that its size does not exceed
256 bytes. By freezing the cache you prevent the preloaded data from being overwritten when other
memory addresses are accessed during the loop iterations. Do not forget to unfreeze the cache when
the loop is finished (clearing the cache when unfreezing it is not needed).
Please note that access to the cached memory addresses is not restricted to reads; you may write
to them as well, since cached data is updated in the cache even when the cache is frozen! The
memory addresses in the cache are frozen, but the data is not. Writes will not be any cheaper
though, as the data cache is a write-through cache.
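A minimal sketch (the label table, its size and the registers are placeholders):

; Preload a small (at most 256 bytes) look-up table into the data cache
lea table,a0
moveq #16-1,d0 ; e.g. a 64-byte table = 16 longwords
preload:
move.l (a0)+,d1 ; touch every longword so it is loaded into the cache
dbra d0,preload
movec cacr,d0
bset #9,d0 ; FD: freeze the data cache, the table stays cached
movec d0,cacr
; ... main loop reading from (and possibly writing to) the table ...
movec cacr,d0
bclr #9,d0 ; unfreeze the data cache when the loop is done
movec d0,cacr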
- AVOID STALE DATA
Stale data means that the data in the cache no longer matches the data in memory. Stale data
conditions can arise if the data cache is enabled while other DMA devices (such as the Blitter)
are writing to memory, effectively bypassing the data cache. This can easily be avoided by
disabling the data cache while other DMA devices are operating, and clearing the data cache before
enabling it again.
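As a sketch (start_blitter_op is a hypothetical routine that starts a Blitter operation and waits
for it to complete):

movec cacr,d0
bclr #8,d0 ; disable the data cache while the Blitter writes to memory
movec d0,cacr
bsr start_blitter_op ; hypothetical: start the Blitter and wait for it to finish
movec cacr,d0
bset #11,d0 ; clear the data cache...
bset #8,d0 ; ...and enable it again
movec d0,cacr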
If the data cache is configured to use the no write allocation mode (WA=0), stale data conditions
can also arise when no device other than the CPU is accessing the memory. These situations are
fairly rare and involve either accessing the same physical memory address through two or more
logical addresses, or accessing the same physical address using different address space
encodings.
When I read the chapter on the caches in the MC68030 User's Manual, I got the impression
that unless you are messing with the MMU or switching back and forth between user and supervisor
mode while accessing the same physical memory addresses, you have nothing to worry about. That is
not true. There is actually one situation which is not too uncommon and which is not described
explicitly in the manual: PC-relative addressing.
PC-relative addressing modes can only be used for reading data and are often used as a form of
optimization. However, for PC-relative addressing modes the reference is a PROGRAM space
reference, while for any other addressing mode the reference is a DATA space reference. This
matters because the data cache stores data references from any address space (except CPU space),
and the address space is part of the tag of each line in the data cache.
So, when you read from a memory location using PC-relative addressing, the data cache is
updated with a PROGRAM space reference. If you later write to that memory location
using any other addressing mode (remember that PC-relative addressing cannot be used for writes),
the cached entry will not be updated due to the difference in address space. The data in the cache
no longer matches that in memory and is stale.
As a consequence, the next time you read from that memory location using a PC-relative addressing
mode, there will be a cache hit and the value you obtain is the stale data from the data cache.
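A minimal sketch of the problem, assuming WA=0 and that the cache entry is not evicted in between
(value is just a label somewhere in your program):

move.w value(pc),d0 ; read: the cache is loaded with a PROGRAM space reference, d0 = 0
move.w #1234,value ; write: a DATA space reference, the cached entry is not updated
move.w value(pc),d0 ; read: cache hit on the stale entry, d0 is still 0 instead of 1234
rts

value: dc.w 0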
If you plan to use PC-relative addressing with the data cache enabled, either restrict it and use
it only for const data (data that is never modified), or enable the write allocation mode (WA=1)
of the data cache.
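Enabling write allocation mode is simply a matter of setting the WA bit (bit 13) of CACR, e.g.:

movec cacr,d0
bset #13,d0 ; WA: enable write allocation mode of the data cache
movec d0,cacr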