XMOS event-driven 32-bit, embedded processors significantly lower product development time and bill of materials cost. The multithreaded processor combines the code efficiency of a RISC processor, the...
Read More
Usually embedded systems programmers and boxers do not draw many comparisons. However, in one aspect they do. Boxers need to fight within a particular weight limit—so their challenge is to maximize their performance (build muscle) within their specified allowance. Embedded systems are similar; they only have a limited amount of memory and you need to maximize performance within those limits.
Often, code optimizations that improve performance also reduce memory size (by reducing the amount of code that needs to be stored). With risk of overstretching a rather tortuous analogy, perhaps this is similar to boxers shedding fat. However, some optimizations require a trade-off; it can go faster or it can be smaller. Making this decision is hard and depends on the application we are writing, but it is worth being aware of what optimizations fall into this category. This article covers some of the major space vs. speed trade-off optimizations we are likely to come across.
Suppose we want to calculate sin(x/256) for an unsigned 8-bit value x (i.e., the input represents values in the range 0.0 to 1.0). One method would be to calculate a polynomial approximation of the function. This is likely to be in the order of 5-20 instructions, depending on the architecture.
Another option would be to have a lookup table. The size of this table will depend on the required output precision, but say we wanted 16 bits of output, then the table would be 256 * 16 bits = 512 bytes. This could reduce the calculation to 1 or 2 instructions but at the expense of memory.
In the above case it may be worth the extra memory if the speed is required for those sin calculations (remember – profile first, optimize later). However, in some sense this case is fortunate in that the input is only 8-bits. If the input were 16-bits then you would require 128KB of memory (more than on many microcontrollers).
The size of a lookup table goes up linearly with the number of bits in the output and exponentially with the number of bits in the input. So the number of bits in the input is the key factor.
For some algorithms, it is also possible to have hybrid techniques where we do both a lookup and some calculation. In these cases we can sometimes trade off the number of bits we use for lookup against the amount of calculation we do afterwards. An example of this is using a lookup to get an initial estimate of a function (e.g. 1/x) and then using iterative refinement to get an accurate solution.
Loop unrolling is the transformation of a loop to a new loop whose body executes several iterations of the original loop. Consider the following code:
It can be rewritten as a loop that iterates only five times but each iteration does two operations:
You could also fully unroll the loop to get this code:
Why is this a good thing? There are a couple reasons why an unrolled loop (and particularly a fully unrolled loop) may go faster:
Generally, compilers can do the unrolling for you. In the XMOS compiler you can control the unrolling of loops in XC programs by using pragmas. The following example will cause the compiler to unroll the loop 10 times:
The following is also equivalent:
The unroll pragma with an argument will cause the compiler to fully unroll the loop if it can. This is particularly useful if the number of times a loop iterates varies with a #define but you always want to fully unroll the loop. Following is an example:
The definition of g could be replaced with
by taking the body of f and replicating it inside g.
This method might cause code size increase if the function being inlined is used more than once since the body of the function can be much larger than the call. The advantages of inlining a function are as follows:
Of these three, it is arguable that in most cases the third is the most important. Sometimes inlining is seen as a magic wand that can automatically make your code go faster, but often it can be marginal unless fitting the callee’s function code into the caller will allow a compiler to optimize more. Where it is often a big win is when the function being inlined is very small (in this case better register usage is likely).
C and similar languages (such as the XMOS derivative XC) have the inline keyword that can be added to function definitions. This function qualifier acts as a hint to the compiler that you would like this function to be inlined. It is just a hint though; compilers can inline functions without the keyword and can choose not to inline functions that have the keyword.
This article shows three of the major examples of code optimizations that increase code size: using lookup tables, loop unrolling, and function inlining. Deciding when to trade off performance against code-size is one of the tricky parts of code development but it is important to be aware of when you can do it, how you can do it, and what the benefits are.
Wire-to-board interconnection options from Sullins feature a wide range of sizes and applications
MCC’s TVS series high-power suppressors protect sensitive components from voltage spikes and transients
Evaluation boards that streamline evaluating circuit protection on RS-485 serial device ports
There are currently no comments.