Though getting the software to work correctly seems like the logical last step for a project, this is not always the case in embedded systems development. The need for low cost versions of our products drives hardware designers to provide just barely enough memory and processing power to get the job done. Of course, during the software development phase of the project it is more important to get the program to work correctly. And toward that end there are usually one or more “development” boards around, each with additional memory and/or a faster processor. These boards are used to get the software working correctly, and then the final phase of the project becomes code optimization. The goal of this final step is to make the working program run on the lower-cost “production” version of the hardware.

Increasing Code Efficiency

Some degree of code optimization is provided by all modern C and C++ compilers. However, most of the optimization techniques that are performed by a compiler involve a tradeoff between execution speed and code size. Your program can either be made faster or smaller, but not both. In fact, an improvement in one of these areas may have a negative impact on the other. It is up to the programmer to decide which of these improvements is most important to her. Given that single piece of information, the compiler’s optimization phase can make the appropriate choice whenever a speed vs. size tradeoff is encountered.

Since you can’t have the compiler perform both types of optimization for you, I recommend letting it do what it can to reduce the size of your program. Execution speed is usually important only within certain time-critical and/or frequently-executed sections of the code. And there are many things you can do to improve the efficiency of those sections by hand. But code size is a difficult thing to influence manually, and the compiler is in a much better position to make this change across all of your software modules.

By the time your program is working you may already know, or have a pretty good idea, which subroutines and modules are the most critical for overall code efficiency. Interrupt service routines, high-priority tasks, calculations with real-time deadlines, and functions that are either compute-intensive or frequently-called are all likely candidates. A tool called a profiler, included with some software development suites, can be used to narrow your focus to those routines in which the program spends most (or too much) of its time.

Once you’ve identified the routines that require greater code efficiency, one or more of the following techniques can be used to reduce their execution time.

Inline Functions

In C++, the keyword inline can be added to any function declaration. This keyword makes a request to the compiler to replace all calls to the indicated function with copies of the code that is inside. This eliminates the run-time overhead associated with the actual function call and is most effective when the inline function is called frequently but contains only a few lines of code.

Inline functions provide a perfect example of how execution speed and code size are sometimes inversely linked. The repetitive addition of the inline code will increase the size of your program in direct proportion to the number of times the function is called. And, obviously, the larger the function, the more significant the size increase will be. The resulting program runs faster, but now requires more ROM in which to be stored.
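The tradeoff can be seen in a small sketch. The function below is a typical inline candidate: tiny, frequently called, and cheaper to expand in place than to call. (The `static inline` form shown here is valid C99 as well as C++; the function name is illustrative.)

```c
#include <stdint.h>

/* A small, frequently called helper: an ideal inline candidate.
 * 'static inline' asks the compiler to expand the body at each
 * call site instead of emitting call/return instructions. */
static inline uint16_t swap_bytes(uint16_t value)
{
    return (uint16_t)((value << 8) | (value >> 8));
}
```

Each call site now carries its own copy of the shift-and-or sequence, so a program with many callers grows by a few bytes per call while avoiding the call overhead entirely.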

Table Lookups

A switch statement is one common programming technique to be used with care. Each test and jump that makes up the machine language implementation uses up valuable processor time simply deciding what work should be done next. To speed things up, try to put the individual cases in order by their relative frequency of occurrence. In other words, put the most likely cases first and the least likely cases last. This will reduce the average execution time, though it will not improve at all upon the worst-case time.

If there is a lot of work to be done within each case, it may be more efficient to replace the entire switch statement with a table of pointers to functions. For example, the following block of code is a candidate for this improvement.

enum NodeType { NodeA, NodeB, NodeC };

switch (getNodeType())
{
    case NodeA:
        /* Handle a node of type A. */
        break;

    case NodeB:
        /* Handle a node of type B. */
        break;

    case NodeC:
        /* Handle a node of type C. */
        break;
}

To speed things up, we would replace the switch statement above with the following alternative. The first part of this is the setup: the creation of an array of function pointers. The second is a one-line replacement for the switch statement that executes more efficiently.

int processNodeA(void);
int processNodeB(void);
int processNodeC(void);

/*
 * Establishment of a table of pointers to functions.
 */
int (*nodeFunctions[])() = { processNodeA, processNodeB, processNodeC };


/*
 * The entire switch statement is replaced by the next line.
 */
status = nodeFunctions[getNodeType()]();

Hand-Coded Assembly

Some software modules are best written in assembly language. This gives the programmer an opportunity to make them as efficient as possible. Though most C/C++ compilers produce much better machine code than the average programmer, a good programmer can still do better than the average compiler for a given function. For example, early in my career I implemented a digital filtering algorithm in C and targeted it to a TI TMS320C30 DSP. The compiler we had back then was either unaware or unable to take advantage of a special instruction that performed exactly the mathematical operations I needed. By manually replacing one loop of the C program with inline assembly instructions that did the same thing, I was able to decrease the overall computation time by more than a factor of ten.

Register Variables

The keyword register can be used when declaring local variables. This asks the compiler to place the variable into a general-purpose register, rather than on the stack. Used judiciously, this technique provides hints to the compiler about the most frequently accessed variables and will somewhat enhance the performance of the function. The more frequently the function is called, the more likely such a change is to improve the code’s performance.
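A typical use is a tight loop over a buffer, where the counter and accumulator are the most frequently accessed variables. The routine below is a hypothetical example; modern compilers usually allocate registers well on their own, so treat the keyword strictly as a hint.

```c
/* Hypothetical checksum routine.  'register' hints that the
 * accumulator and loop counter are the hottest variables here,
 * so the compiler should try to keep them out of memory. */
unsigned checksum(unsigned char const *buf, int len)
{
    register unsigned sum = 0;
    register int i;

    for (i = 0; i < len; i++)
        sum += buf[i];

    return sum;
}
```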

Global Variables

It is more efficient to use a global variable than to pass a parameter to a function. This eliminates the need to push the parameter onto the stack before the function call and pop it back off once the function is completed. In fact, the most efficient implementation of any subroutine would have no parameters at all. However, the decision to use a global variable may also have some negative effects on the program. The software engineering community generally discourages the use of global variables, in an effort to promote the goals of modularity and reentrancy, which are also important considerations.

Polling

Interrupt service routines are often used to improve program efficiency. However, there are some rare cases in which the overhead associated with the interrupts actually causes an inefficiency. These are cases in which the average time between interrupts is of the same order of magnitude as the interrupt latency. In such cases it may be better to use polling to communicate with the hardware device. Of course, this too leads to a less modular software design.

Fixed-Point Arithmetic

Unless your target platform includes a floating-point coprocessor, you’ll pay a very large penalty for manipulating float data in your program. The compiler-supplied floating-point library contains a set of software subroutines that emulate the instruction set of a floating-point coprocessor. Many of these functions take a very long time to execute relative to their integer counterparts, and may also not be reentrant.

If you are only using floating-point for a few calculations, it may be better to reimplement the calculations themselves using fixed-point arithmetic only. Although it may be difficult to see just how this might be done, it is theoretically possible to perform any floating-point calculation with fixed-point arithmetic. (After all, that’s how the floating-point software library does it, right?) Your biggest advantage is that you probably don’t need to implement the entire IEEE 754 standard just to perform one or two calculations. If you do need that kind of complete functionality, stick with the compiler’s floating-point library and look for other ways to speed up your program.
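As a minimal sketch of the idea, the fragment below uses a 16.16 fixed-point format: a real number x is stored as the integer x * 65536. The type and macro names are illustrative, not from any standard library.

```c
#include <stdint.h>

/* Minimal 16.16 fixed-point sketch: the upper 16 bits hold the
 * integer part, the lower 16 bits the fraction. */
typedef int32_t fixed_t;

#define FIXED_SHIFT 16
#define TO_FIXED(n) ((fixed_t)((n) << FIXED_SHIFT))

fixed_t fixed_mul(fixed_t a, fixed_t b)
{
    /* Widen to 64 bits so the intermediate product can't overflow,
     * then shift back down into 16.16 format. */
    return (fixed_t)(((int64_t)a * b) >> FIXED_SHIFT);
}
```

Every operation here compiles to a handful of integer instructions, as compared to a call into the floating-point emulation library.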

Decreasing Code Size

As I said earlier, when it comes to reducing code size your best bet is to let the compiler do the work for you. However, if the resulting program is still too large for your available ROM, there are several programming techniques you can use to further reduce the size of your program. In this section we’ll discuss both automatic and manual code size optimizations.

Of course, Murphy’s Law dictates that the first time you enable the compiler’s optimization feature your previously working program will suddenly fail. Perhaps the most notorious of the automatic optimizations is “dead code elimination.” This optimization eliminates code that the compiler believes to be either redundant or irrelevant. For example, adding zero to a variable requires no run-time calculation whatsoever. But you may still want the compiler to generate those “irrelevant” instructions if they perform some function that the compiler doesn’t know about.

For example, given the following block of code, most optimizing compilers would remove the first statement because the value of *pControl is not used before it is overwritten (on the third line).

*pControl = DISABLE;
*pData     = 'a';
*pControl = ENABLE;

But what if pControl and pData are actually pointers to memory-mapped device registers? In that case, the peripheral device would not receive the DISABLE command before the byte of data is written. This could potentially wreak havoc on all future interactions between the processor and this peripheral. To protect yourself from such problems, you must declare all pointers to memory-mapped registers and global variables that are shared between threads (or a thread and an ISR) with the keyword volatile. And if you miss just one of them, Murphy’s Law will come back to haunt you in the final days of your project. I guarantee it.
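A hedged sketch of the fix: the routine below takes volatile-qualified pointers, which forces the compiler to emit all three stores, in order, even though *ctrl is overwritten two lines after it is first written. In a real system the pointers would hold register addresses taken from the chip's memory map; the function and parameter names here are illustrative.

```c
#include <stdint.h>

#define DISABLE 0u
#define ENABLE  1u

/* Because ctrl and data are pointers to volatile data, the
 * optimizer must perform every store exactly as written; the
 * "dead" DISABLE store can no longer be eliminated. */
void write_byte(volatile uint8_t *ctrl, volatile uint8_t *data, uint8_t byte)
{
    *ctrl = DISABLE;
    *data = byte;
    *ctrl = ENABLE;
}
```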

Never make the mistake of assuming that the optimized program will behave the same as the unoptimized one. You must completely retest your software at each new optimization level to be sure its behavior hasn’t changed.

To make matters worse, debugging an optimized program is challenging, to say the least. With the compiler’s optimization enabled, the correlation between a line of source code and the set of processor instructions that implements that line is much weaker. Those particular instructions may have moved or been split up, or two similar code blocks may now share a common implementation. In fact, some lines of the high-level language program may have been removed from the program altogether (as they were in the previous example)! As a result, you may be unable to set a breakpoint on a particular line of the program or examine the value of a variable of interest.

Once you’ve got the automatic optimizations working, here are some tips for further reducing the size of your code by hand.

Avoid Standard Library Functions

One of the best things you can do to reduce the size of your program is to avoid using large standard library routines. Many of the largest are expensive only because they try to handle all possible cases. It might be possible to implement a subset of the functionality yourself with significantly less code. For example, the standard C library’s sprintf routine is notoriously large. Much of this bulk is located within the floating-point manipulation routines on which it depends. But if you don’t need to format and display floating-point values (%f, %e, or %g), you could write your own integer-only version of sprintf and save several kilobytes of code space. In fact, a few implementations of the standard C library (Cygnus’ newlib comes to mind) include just such a function, called siprintf.
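To make the "write a subset yourself" idea concrete, here is a sketch of a minimal unsigned-decimal formatter, the kind of helper that can stand in for sprintf when only integer output is needed. The function name and interface are illustrative, not from any library.

```c
#include <stddef.h>

/* Format an unsigned integer as a decimal string into buf.
 * Returns the number of characters written (excluding the '\0'). */
size_t format_uint(char *buf, unsigned value)
{
    char tmp[10];               /* enough digits for a 32-bit unsigned */
    size_t i = 0;
    size_t len;

    do {                        /* extract digits, least significant first */
        tmp[i++] = (char)('0' + value % 10);
        value /= 10;
    } while (value != 0);

    len = i;
    while (i > 0)               /* copy them out in the correct order */
        *buf++ = tmp[--i];
    *buf = '\0';

    return len;
}
```

A routine like this occupies a few dozen bytes of code, versus the kilobytes pulled in by a full-featured sprintf.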

Native Word Size

Every processor has a native word size, and the C and C++ standards specify that the int data type should have the natural size suggested by the target architecture. Manipulation of smaller and larger data types may sometimes require the use of additional machine-language instructions. By consistently using int whenever possible in your program, you might be able to shave a precious few hundred bytes from your program.

Goto Statements

Like global variables, good software engineering practice dictates against the use of goto statements. But in a pinch, a goto can be used to remove complicated control structures or to share a block of oft-repeated code.
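One widely used pattern of this kind is sharing a single error-cleanup block among several failure points, rather than repeating the release sequence at each one. The sketch below is illustrative; the function name and behavior are hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read the first byte of a file into *out.  All failure paths
 * jump to one shared cleanup block instead of duplicating the
 * free/close sequence at each early exit. */
int copy_first_byte(const char *src_path, char *out)
{
    FILE *fp = NULL;
    char *buf = NULL;
    int status = -1;

    fp = fopen(src_path, "rb");
    if (fp == NULL)
        goto cleanup;

    buf = malloc(1);
    if (buf == NULL)
        goto cleanup;

    if (fread(buf, 1, 1, fp) != 1)
        goto cleanup;

    *out = buf[0];
    status = 0;

cleanup:
    free(buf);              /* free(NULL) is defined to be a no-op */
    if (fp != NULL)
        fclose(fp);
    return status;
}
```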

In addition to these techniques, several of the ones described in the previous section may be helpful, specifically table lookups, hand-coded assembly, register variables, and global variables. Of these, the use of hand-coded assembly will usually yield the largest decrease in code size.

Reducing Memory Usage

In some cases, it may be RAM rather than ROM that is the limiting factor for your application. In these cases, you’ll want to reduce your dependence on global data, the stack, and the heap. These are all optimizations better made by the programmer than the compiler.

Since ROM is usually cheaper than RAM (on a per-byte basis), one acceptable strategy for reducing the amount of global data may be to move constant data into ROM. This can be done automatically by the compiler if you declare all of your constant data with the keyword const. Most C/C++ compilers place all of the constant global data they encounter into a special data segment that is recognizable to the locator as ROM-able. This technique is most valuable if there are lots of strings or table-oriented data that does not change at run-time.
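For example, a table of message strings declared this way consumes no RAM at all. The data below is made up for illustration; what matters is the const qualifier on the declarations.

```c
/* Declared const, this table of strings can be placed into the
 * ROM-able data segment by the compiler and locator, so it
 * occupies no RAM at run-time. */
const char * const errorStrings[] = {
    "OK",
    "Timeout",
    "Checksum failure",
};

const unsigned errorStringCount =
    sizeof(errorStrings) / sizeof(errorStrings[0]);
```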

If some of the data is fixed once the program is running but not necessarily constant, the constant data segment could be placed in a hybrid memory device instead. This memory device could then be updated over a network or by a technician assigned to make the change. An example of such data is the sales tax rate for each locale in which your product will be deployed. If a tax rate changes, the memory device can be updated, but additional RAM can be saved in the meantime.

Stack size reductions can also lower your program’s RAM requirement. One way to figure out exactly how much stack you need is to fill the entire memory area reserved for the stack with a special data pattern. Then, after the software has been running for a while—preferably under both normal and stressful conditions—use a debugger to examine the modified stack. The portion of the stack memory area that still contains your special data pattern has never been overwritten. So it is safe to reduce the size of the stack area by that amount.
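A rough sketch of this high-water-mark technique follows. It assumes the stack grows downward from the top of the reserved area toward index 0, so any words at the low end that still hold the fill pattern were never used; all names and sizes here are illustrative.

```c
#include <stdint.h>
#include <stddef.h>

#define STACK_WORDS  256
#define FILL_PATTERN 0xDEADBEEFu

/* Stand-in for the region the linker would reserve for a stack. */
static uint32_t stack_area[STACK_WORDS];

/* Fill the whole stack area with the pattern before starting. */
void stack_fill(void)
{
    size_t i;
    for (i = 0; i < STACK_WORDS; i++)
        stack_area[i] = FILL_PATTERN;
}

/* Count words at the low end (where a downward-growing stack
 * would reach last) that still hold the pattern: this is the
 * amount of stack space that was never used. */
size_t stack_unused_words(void)
{
    size_t count = 0;
    while (count < STACK_WORDS && stack_area[count] == FILL_PATTERN)
        count++;
    return count;
}
```

After a long run under stress, the value returned by the counting routine tells you how much the stack reservation can safely shrink.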

Be especially conscious of stack space if you are using a real-time operating system. Most operating systems create a separate stack for each task. These stacks are used for function calls and interrupt service routines that occur within the context of that task. You can determine the amount of stack required for each task in the manner described above. You might also try to reduce the number of tasks or switch to an operating system that has a separate “interrupt stack” for execution of all interrupt service routines. The latter can significantly reduce the stack size requirement of each task.

The size of the heap is limited to the amount of RAM left over after all of the global data and stack space has been allocated. If the heap is too small, your program will not be able to allocate memory when it is needed. So always be sure to compare the result of malloc or new with NULL before dereferencing it. If you’ve tried all of these suggestions and your program is still requiring too much memory, you may have no choice but to eliminate the heap altogether. Unfortunately, this is possible only if you are using C; the new and delete operators are built-in features of the C++ language, rather than part of the standard library like malloc and free.
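The NULL check looks like this in practice. The wrapper below is a sketch with an illustrative name; the point is simply that a failed allocation is reported to the caller rather than dereferenced.

```c
#include <stdlib.h>
#include <string.h>

/* Always compare the result of malloc with NULL before using it.
 * Here heap exhaustion is reported to the caller instead of
 * causing a NULL-pointer dereference. */
char *duplicate_string(const char *src)
{
    size_t len = strlen(src) + 1;
    char *copy = malloc(len);

    if (copy == NULL)
        return NULL;        /* out of heap: let the caller decide */

    memcpy(copy, src, len);
    return copy;
}
```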

Limiting the Impact of C++

One of the biggest issues I faced upon deciding to write this book was whether or not to include C++ in the discussion. Despite my familiarity with C++, I had written almost all of my embedded software in C and assembly. In addition, there has been much debate within the embedded software community about whether C++ is worth the performance penalty. It is generally agreed that C++ programs produce larger executables that run more slowly than programs written entirely in C. However, C++ also has many benefits for the programmer. And I wanted to talk about some of those benefits in the book. So I ultimately decided to include C++ in the discussion, but to use only those features with the least performance penalty in the examples.

I believe that many readers will face the very same issue in their own embedded systems programming. So before ending the book, I wanted to briefly justify each of the C++ features I have used and to warn you about some of the more expensive features that I did not.

The Embedded C++ Standard

You might be wondering why the creators of the C++ language included so many expensive—in terms of execution time and code size—features. You are not alone; people around the world have wondered the same thing—especially the users of C++ for embedded programming. Many of these expensive features are recent additions that are neither strictly necessary nor part of the original C++ specification. These features have been added one by one as part of the ongoing “standardization” process.

In 1996, a group of Japanese processor vendors joined together to define a subset of the C++ language and libraries that is better suited for embedded software development. They call their new industry standard Embedded C++. Despite its young age, it has already generated a great deal of interest and excitement within the C++ user community.

A proper subset of the draft C++ standard, Embedded C++ omits pretty much anything that can be left out without limiting the expressiveness of the underlying language. This includes not only expensive features like multiple inheritance, virtual base classes, run-time type identification, and exception handling, but also some of the newest additions like templates, namespaces, and new-style casts. What’s left is a simpler version of C++ that is still object-oriented and a superset of C, but with significantly less run-time overhead and smaller run-time libraries.

A number of commercial C++ compilers already support the Embedded C++ standard specifically. Several others allow you to manually disable individual language features, thus enabling you to emulate Embedded C++ or create your very own flavor of the C++ language.

Of course, not everything introduced in C++ is expensive. In fact, many older C++ compilers incorporate a technology called cfront that turns C++ programs into C and feeds the result into a standard C compiler. The mere fact that this is possible should suggest that the syntactical differences between the languages have little or no run-time cost associated with them. It is only the newest C++ features, like templates, that cannot be handled in this manner.

For example, the definition of a class is completely benign. The list of public and private member data and functions are not much different than a struct and a list of function prototypes. However, the C++ compiler is able to use the public and private keywords to determine which method calls and data accesses are allowed and disallowed. Since this determination is made at compile-time there is no penalty paid at run-time. The addition of classes alone does not significantly affect either the code size or efficiency of your programs.

Default parameter values are also penalty-free. The compiler simply inserts code to pass the default value whenever the function is called without an argument in that position. Similarly, function name overloading is a compile-time modification. Functions with the same names but different parameters are each assigned unique names during the compilation process. The compiler alters the function name each time it appears in your program and the linker matches them up appropriately. I haven’t used this feature of C++ in any of my examples, but I could have done so without affecting performance.

Operator overloading is another feature I could have used but didn’t. Whenever the compiler sees such an operator it simply replaces it with the appropriate function call. So in the code listing below, the last two lines of C++ code are equivalent once the compiler is done with them.

Complex a, b, c;

c = operator+(a, b);    // The traditional way: Function Call
c = a + b;              // The C++ way: Operator Overloading

Unlike the features mentioned so far, constructors and destructors do have a slight penalty associated with them. These special methods are guaranteed to be called each time an object of the type is created or goes out of scope, respectively. However, this small amount of overhead is a reasonable price to pay for fewer bugs. Constructors eliminate an entire class of C programming errors having to do with uninitialized data structures. This feature has also proven useful for hiding the awkward initialization sequences associated with complex classes like Timer and Task.

Virtual functions also have a reasonable cost/benefit ratio. Without going into too much detail about what virtual functions are, let’s just say that polymorphism would be impossible without them. And without polymorphism, C++ would not be a true object-oriented language. The only significant cost of virtual functions is one additional memory lookup before a virtual function can be called. Ordinary function and method calls are not affected.

The features of C++ that are too expensive for my taste are templates, exceptions, and run-time type identification. All three of these negatively impact code size, and exceptions and run-time type identification also increase execution time. Before deciding whether to use these features you might want to do some experiments to see how they will affect the size and speed of your own application.