Literature study
xdebu@enea.se
2000-09-19
Abstract
Introduction to OSE
Processes
Communication
Interrupts
PowerPC
Locking
Evaluation
Prediction
Background
Explanations
References
Abstract
The
time it takes from that the external unit signals an interrupt to the time it’s
handled by the software is called interrupt latency. It’s of major importance
that this latency should be as short as possible. The CPU, it’s cache
organization, the operating system, the device drivers and even the
applications being run strongly affect the interrupt latency of the system.
Research on how to use the cache memory to improve the maximum interrupt
latency is almost missing completely today. Despite the fact that it’s of major
importance for many users of real-time operating systems. The research that
exists has no relation to and cannot be applied on the real systems that are
being built by the industry. Today’s modern processor architectures make this
problem much more difficult then it has been in earlier processor generations.
It is
clear that the interrupt latency in a system can be shortened by locking parts
of code into the cache. But this will probably have negative effects on the
performance on the rest of the system, but how much? How much code needs to be
looked into the cache to get an improvement that is noticeable and guaranteed?
How do you identify what code should be locked? Can locking also be done in the
data cache to improve performance of the system? And could the cache locality
be done better in the interrupt handling of the operating system.
Another
way to improve the interrupt latency is to reduce the longest region in the
operating system, device drivers, memory protection system and the application
in which interrupts are disabled. The longest region probably contains a
complex behavior with a lot of iterations and a great number of conditional
branches. A thorough analysis of what causes a certain code sequence to have a
long execution time is needed. An example of such a region is the scheduler of
the operating system.
Processes
OSE is a so-called real-time operating system (RTOS) and to be more precise a “soft” RTOS. Soft meaning that the process with the highest priority is allowed to run until it is either finished or the process enters a waiting state. A process can be in three different states, waiting, running or ready. In the waiting state the process waits for some event to occur, until then it doesn’t need any system resources and processes with lower priority are allowed to run. When in running state the process has the highest priority of all ready processes. When in ready state the process is ready to run but must wait until all other processes with higher priority has finished their execution or entered a waiting stats. When more than one process has the same priority and both are ready to run OSE uses a round robin scheme. A prioritized process is always placed in the end of the queue when it suspends it self.
There are five types of processes in OSE. Interrupt, Timer interrupt, Prioritized, Background and Phantom processes. Interrupt processes are called in the event of a hardware interrupt or a software event like a signal or triggering of a semaphore. Timer interrupt processes are called in response to changes in the system timer. Prioritized processes are the most common type in OSE and are written as infinite loops. A prioritized process will run as long as no other process with higher priority wants to run. A background process always has a lower priority than a prioritized process. They run only when no prioritized process wants to run and strict time-sharing is used by OSE to run background processes. When a background process has completed it’s time slice given to it by OSE the process is pre-empted by the system and the next background process in the ready queue is allowed to run. Phantom processes are special processes, they don’t contain any executable code and signals sent to a phantom process are lost. The primary use of a Phantom process is together with a redirection table form a logical channel for communicating across target boundaries. All processes can either be static or dynamic. They can be created and destroyed during run-time.

The common way of communicating between processes in OSE is by using signals. A signal is a message sent from one process to another process. For a process being able to send a message to another process it must know the identity of the receiving process. Static processes can be declared public and then other processes can find out their identity by declaring them external. To find out the identity of a dynamic process is a process can talk to the parent process that created the dynamic process. When a process sends a signal to another process with higher priority the sending process is immediately pre-empted and the receiving process is allowed to run and handle the signal. Each process has a queue for signals sent to it and the process can choose which signal it wants to receive and in what order the signals will be processed. OSE also supports redirection; a process receiving a signal it can forward it to another process. Each process has it’s own redirection table for redirecting signals.
Another way of communicating between processes is by using semaphores. Semaphores are faster than signals but they can’t carry any data. OSE supports two types of semaphores, “fast” and normal. Each process has its own fast semaphore and only the process that owns the fast semaphore can issue a wait for it. Normal semaphores don’t have an owner, any prioritized or background process can wait on a semaphore. If a process with lower priority signals a semaphore that a process with higher priority is waiting for then the lower priority process is immediately preempted and the higher priority process is allowed to run.
When an event occurs it must be dealt with and some interrupt process is called to handle the event. There are three ways an interrupt can be triggered, by a real hardware interrupt, software event or a timer event. A software event can be receiving a signal from another process or some other process triggered the fast semaphore the process was waiting for. A timer interrupt happens on some event in the system tick counter, for each tick the kernel checks if it supposed to start a timer interrupt. The sum of all timer interrupt processes must be less than one tick else the system looses a tick. To keep interrupt processes small you can move part of the code to a prioritized process and let the interrupt process only handle the most crucial work and let the prioritized process take over and to the rest of the work at a more suitable level of priority.
An interrupt process is a normal process with a few important exceptions. All interrupt processes have a higher priority than any prioritized process. A prioritized process can therefore never preempt an interrupt process. An interrupt process runs from beginning to end and then gives the control back to the system. Only when a interrupt process with higher priority is called a interrupt process can be preempted but all interrupt processes must finish before any prioritized process is allowed to run. An interrupt process may never enter a waiting state, it may never suspend it self. An interrupt process may on the other side issue a receive call but the receive call will never suspend the interrupt process. The CPU hardware interrupt levels must match the logical interrupt priorities in an OSE system. Interrupt processes respond faster to events and are ideal for sporadic but response-time critical tasks.

Locking
Actual locking of a memory block into the cache is not hard at all. The address of the memory block to be locked must be known and then it’s only a matter of a couple of assembler instructions to lock the memory block in cache. To lock a function the location of the function in the memory must be known and how long the function is, i.e. the number of instructions that make up the function. Finding out the memory address is simple; just send a pointer to the function you want to lock. But what function or functions should be locked. A nice hack would be to have the locking function find the end of the function we want to lock and then lock the entire function. But it’s hard to find the end of the function simply by looking at the assembler code; it’s not as simple as just looking for a branch return. Compiler optimizations and inline functions can mess things up and you may end up with a slower system. A not so pretty way of locking the a function is to first compile then in a debugger count the number of instructions and then recompile again. It’s not pretty but that’s how it works at the moment. But maybe an entire function doesn’t need to be locked. What if the execution spends the most of the time in a small part of the function, then it would be a waste of cache memory if the entire function is locked.
Unfortunately there’s more to it, remember we only have a 2-way associative cache. If two memory blocks maps to the same cache block and they are both locked then all other memory blocks mapping to that cache block wont be cached at all. The good news is that it’s simple to keep track of what blocks are locked in the cache. One mode can be used were locking a memory block in cache is prohibited if one of the two cache blocks that the memory block maps to is already locked. A more aggressive mode can also be used were previous locking is ignored.
Unlocking is also interesting. Both locking and unlocking of the cache can be done at run-time therefore opening the door for a dynamic approach. For instance if the interrupt handler is locked in the cache but an interrupt isn’t called for a certain amount of time then maybe the interrupt handler should be unlocked from the cache and the cached freed up for other processes.
The PowerPC also has a “touch” instruction for the Data cache; it gives a hint to the system that the block will soon be used. But to be able to use this you must have a predictable system and the problem dealing with interrupts is that they can occur at any time. Another way of optimizing would be by cache-inhibit memory addresses. Cache inhabitation can be used when you want to keep certain instructions or data out of the cache. For instance code that isn’t time critical and running in low priority can be kept out of the cache and keeping entries in the cache for processes that maybe are in a waiting state. (At the moment I’m not sure if the PowerPC supports cache-inhibition, the MPC860 manual hint that it does but there is no description on how to do it.)
Evaluation
Measuring the effects of cache locking is not trivial. By locking for instance
the interrupt handler into the cache should lower the interrupt latency should
lower but this will leave a smaller cache to the rest of the system and affect
the overall performance negatively. So
measuring the interrupt latency isn’t enough, the effects of the entire system
must be measured. And the effects can be different depending on the system, in
a really time critical system you’re prepared to accept a slower overall
performance in order to have faster interrupt handling. The best way of
measuring must be by using real life examples and compare performance with
cache locking enabled and disabled.
Predictions
When locking part of the cache you decrease the size of the cache for other processes. Even though the instructions you lock will always be available in the cache and can be executed at short notice the rest of the system generally runs slower. And the more you lock the more performance loss in the entire system. And the cache only being two-way set-associative complicates things even further. When locking one of the blocks in the set the other memory locations sharing that block will see a direct-mapped cache. And worse, if both blocks are locked, then the memory locations sharing that block won’t be cached at all. It’s certainly not the ultimate optimization to lock blocks in the cache but can it be effective in certain scenarios? OSE being a RTOS means that reduction of all latencies is important and handling interrupts is crucial because interrupts are unpredictable. Interrupts can occur at any given time and must be handled as soon as possible to return control back to the other processes. If we can lower the interrupt latency by locking part of or the entire interrupt handler into the cache it can be worth decreasing the speed of the rest of the system. The other scenario is if we have a lot of interrupts and constantly having the interrupt handler in the memory prevents constant fetching of it from the slower main memory. Different scenarios will show different results; in certain cases we can have really bad overall performance, though locking part of the interrupt handler will never result in worse interrupt latency. The conclusion must be that using locking must be optional to the application designer and should not be default.
After searching the Internet for work related to cache locking I’m disappointed to say that I couldn’t find that much. The only sign that cache locking is being used is in the eCos real-rime operating system developed and distributed by RedHat Inc were you have functions similar to those I’m trying to write for locking cache blocks in OSE.
What’s a real-time operating system anyway: “A Real-time System is a program that must respond to external events within a limited time. A real-time operating system is a platform for supporting real-time applications.” (OSE Documentation 4.2, Volume 1, Page 1.9)
Interrupts are events from hardware, it can be the serial port receiving something or it can be a hard drive that finished fetching a file to the memory. Interrupts were created to let the CPU perform other work while waiting for something to be completed.
Cache is a smaller but faster memory physically located much closer to the CPU than the main memory. The idea is when the CPU reads something from the slower main memory a copy is saved in the cache memory for quicker access if the processor wants the same information again. Since loops are very common in programming the chances are great that the next time the CPU wants to execute the same code again or needs the same data again it already exists in the cache thus the CPU does not need to fetch the data from the main memory. Modern processors usually have two levels of cache and some even more. The setup of the cache is different on different architectures and models.
By two-way associative we mean that each address in the memory can save in two different locations in the cache. The two extreme cases are direct mapped cache where a memory location can only be saved in on place in the cache and full-way associatively where each memory block can be located on any place in the cache. Usually a modern CPU has 4 or 8-way associatively, the higher associatively the more complex hardware and more components needed on the CPU.
Instruction and data cache, why have separate caches? First of all, instructions don’t change during run-time while data change constantly. If a block of data is changed in the CPU not only must the cache entry be updated but also the main memory, There are two ways of solving this, either write-through were the main memory is updated instantly or by write-back were the main memory isn’t updated until the block is swapped out.
LRU (Least-Recently-Used) is a scheme for cache block replacement. When a read from main memory is to be written into the cache some block in the cache has to be thrown out. The LRU specifies how to choose which block to throw out. In our case with two-way associatively it’s pretty simple. The last block to be used stays and the other gets thrown out.
OSE Documentation 4.2
MPC860 User’s Manual
PowerPC Microprocessor Family: The Programming Environments For 32-bit Microprocessors